This post contains the latest paper listings retrieved from arXiv.org on 2025-04-21. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.

Note: Paper data is fetched from arXiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-04-21)

A total of 366 papers were updated today, including:

  • Natural Language Processing: 53 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 104 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 91 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 100 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

[Quick Read]: This paper revisits the assumptions behind Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning capabilities of large language models (LLMs), in particular whether it equips models with genuinely new reasoning patterns beyond the base model. The key idea is to measure the pass@k metric (especially at large values of k) across model families and benchmarks to probe the difference in reasoning-capability boundaries between RLVR-trained models and their base models. The study finds that although RLVR improves performance at small k (e.g., k=1), base models can match or even surpass their RLVR-trained counterparts at large k, indicating that RLVR does not introduce fundamentally new reasoning patterns. Instead, RLVR mainly reshapes the model's output distribution to sample correct answers more efficiently, which improves performance on specific tasks but also narrows the model's overall reasoning-capability range. The paper further notes that distillation, unlike RLVR, can introduce genuinely new knowledge. These results expose a key limitation of RLVR for advancing LLM reasoning and call for rethinking the role of reinforcement learning in reasoning tasks and for better training paradigms.

Link: https://arxiv.org/abs/2504.13837
Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 19 figures

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score compared to their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts the performance by biasing the model's output distribution toward paths that are more likely to yield rewards, therefore sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, different from RLVR. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm. Project Page: this https URL
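
As a concrete reference for the metric driving this analysis, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); a minimal sketch in Python (standard practice, not code released with this paper):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn,
    c of them correct, evaluation budget k."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # pass@k = 1 - C(n - c, k) / C(n, k), computed as a stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

At k = 1 this reduces to the fraction of correct samples, while large k probes the boundary of what the model can ever sample, which is exactly the regime where base models catch up with their RLVR-trained counterparts.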

[NLP-1] MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

[Quick Read]: This paper tackles automatic selection of high-quality and diverse subsets from large open-source instruction-tuning datasets. Existing methods typically prioritize instance quality and rely on heuristic rules to maintain diversity, but the lack of a holistic view of the whole collection often yields suboptimal results; moreover, heuristic rules generally focus on distance or clustering in the embedding space and fail to capture the intent of complex instructions in the semantic space. To bridge this gap, the paper proposes a unified way to quantify the information content of a dataset: it models the semantic space by constructing a label graph and quantifies diversity by the distribution of information over that graph. On top of this measure, it introduces an efficient sampling method that iteratively selects samples to Maximize the Information Gain (MIG) in semantic space. Experiments show MIG consistently outperforms state-of-the-art methods across datasets and base models; for example, a model fine-tuned on 5% of the Tulu3 data sampled by MIG improves AlpacaEval by +5.73% and Wildbench by +6.89%, performing comparably to the official SFT model trained on the full dataset. The key contribution is thus an efficient sampling method based on maximizing information gain in semantic space.

Link: https://arxiv.org/abs/2504.13835
Authors: Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen
Affiliations: Shanghai AI Laboratory; Fudan University; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
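
The abstract does not give the exact scoring rule, but the iterative selection it describes can be pictured as greedy gain maximization over a label graph. The following is a hypothetical sketch, not the authors' algorithm: the `label_mass` representation and the diminishing-returns (log) utility are assumptions made for illustration.

```python
import numpy as np

def greedy_info_gain(label_mass: np.ndarray, budget: int) -> list[int]:
    """label_mass: (n_samples, n_labels) weight each sample places on
    label-graph nodes. Returns indices of the selected subset."""
    acquired = np.zeros(label_mass.shape[1])
    selected: list[int] = []
    for _ in range(budget):
        # marginal gain of each sample under a concave (saturating) utility
        gains = (np.log1p(acquired + label_mass)
                 - np.log1p(acquired)).sum(axis=1)
        gains[selected] = -np.inf  # never reselect a chosen sample
        best = int(np.argmax(gains))
        selected.append(best)
        acquired += label_mass[best]
    return selected
```

The concave utility makes repeatedly hitting the same labels unrewarding, so the greedy loop naturally trades off quality-weighted coverage against redundancy.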

[NLP-2] Science Hierarchography: Hierarchical Organization of Science Literature

[Quick Read]: With scientific knowledge growing rapidly, it is hard to track progress and high-level conceptual links across disciplines. Existing tools such as citation networks and search engines make it easy to retrieve a few related papers, but they lack the flexible abstraction needed to represent the density of activity across scientific subfields. To address this, the paper proposes the goal of SCIENCE HIERARCHOGRAPHY: building a high-quality hierarchy that categorizes scientific work at multiple levels of abstraction, from broad fields to specific studies, revealing which areas are well explored and which remain under-explored. The key to the solution is combining fast embedding-based clustering with LLM prompting, balancing the computational efficiency of embedding methods against the semantic precision of LLM prompting, while capturing multiple dimensions of categorization beyond simple topic labels to better reflect the interdisciplinary nature of research papers.

Link: https://arxiv.org/abs/2504.13834
Authors: Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi
Affiliations: Department of Computer Science, Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Scientific knowledge is growing rapidly, making it challenging to track progress and high-level conceptual links across broad disciplines. While existing tools like citation networks and search engines make it easy to access a few related papers, they fundamentally lack the flexible abstraction needed to represent the density of activity in various scientific subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that allows for the categorization of scientific work across varying levels of abstraction, from very broad fields to very specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve the goals of SCIENCE HIERARCHOGRAPHY, we develop a range of algorithms. Our primary approach combines fast embedding-based clustering with LLM-based prompting to balance the computational efficiency of embedding methods with the semantic precision offered by LLM prompting. We demonstrate that this approach offers the best trade-off between quality and speed compared to methods that heavily rely on LLM prompting, such as iterative tree construction with LLMs. To better reflect the interdisciplinary and multifaceted nature of research papers, our hierarchy captures multiple dimensions of categorization beyond simple topic labels. We evaluate the utility of our framework by assessing how effectively an LLM-based agent can locate target papers using the hierarchy. Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo: this https URL

[NLP-3] Generative AI Act II: Test Time Scaling Drives Cognition Engineering

[Quick Read]: This paper addresses fundamental limitations of first-generation large language models in knowledge latency, shallow reasoning, and constrained cognitive processes. It argues that test-time scaling techniques are turning models from latent-space knowledge-retrieval systems into thought-construction engines built on language-based thoughts, establishing a mind-level connection with AI. The key lies in the new paradigm of cognition engineering, which enables higher-level human-AI interaction through natural language; via comprehensive tutorials and optimized implementations, the paper aims to democratize cognition engineering so that every practitioner can take part in Act II of generative AI. It also provides a regularly updated repository of relevant literature to support this research direction.

Link: https://arxiv.org/abs/2504.13828
Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
Affiliations: Shanghai Jiao Tong University; SII; Generative AI Research Lab (GAIR)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The first generation of Large Language Models - what might be called “Act I” of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of “Act II” (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI’s second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: this https URL

[NLP-4] Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

[Quick Read]: This paper surveys recent advances in knowledge distillation (KD) for AI and machine learning, focusing on its role in improving model efficiency and accuracy. It pays particular attention to recent KD innovations such as attention-based approaches, block-wise logit distillation, and decoupled distillation, which markedly improve student models by optimizing stimulus complexity, attention mechanisms, and global information capture. The key contribution is a synthesis of these techniques and an analysis of KD's ability to compress large language models while preserving accuracy, reducing computational overhead, and accelerating inference, offering researchers and practitioners insights into KD's evolving role in AI and machine learning.

Link: https://arxiv.org/abs/2504.13825
Authors: Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu
Affiliations: Pingtan Research Institute of Xiamen University; Imperial College London; University of Sussex; Purdue University; University of Liverpool; JTB Technology Corp.; Emory University; The University of Texas at Dallas; Kyoto University; AppCubic; Georgia Institute of Technology; AI Agent Lab
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.

[NLP-5] Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

[Quick Read]: This paper addresses the asymmetry between the computation and memory demands of reinforcement learning (RL) for improving large language model reasoning: inference parallelizes easily with a small memory footprint, while policy updates require heavy synchronization and are memory-intensive. To address this, the paper proposes PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples the inference and policy-update phases by generating many rollouts in parallel but updating only on the most informative subset. The key innovation is max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals and admits an efficient algorithmic solution. Experiments show that GRPO combined with PODS and max-variance down-sampling significantly outperforms standard GRPO on the GSM8K benchmark.

Link: https://arxiv.org/abs/2504.13818
Authors: Yixuan Even Xu, Yash Savani, Fei Fang, Zico Kolter
Affiliations: Carnegie Mellon University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 1 figure

Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models, but faces a fundamental asymmetry in computation and memory requirements: inference is embarrassingly parallel with a minimal memory footprint, while policy updates require extensive synchronization and are memory-intensive. To address this asymmetry, we introduce PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples these phases by generating numerous rollouts in parallel but updating only on an informative subset. Within this framework, we develop max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals. We prove that this approach has an efficient algorithmic solution, and empirically demonstrate that GRPO with PODS using max-variance down-sampling achieves superior performance over standard GRPO on the GSM8K benchmark.
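
The abstract only alludes to the efficient solution; one natural reading of "maximally diverse reward signals" with a closed-form answer is to keep the reward extremes. A sketch under that assumption (the even low/high split is illustrative, not the paper's stated rule):

```python
import numpy as np

def max_variance_downsample(rewards: np.ndarray, m: int) -> np.ndarray:
    """Return indices of m rollouts chosen to (approximately) maximize
    reward variance, by keeping the lowest- and highest-reward rollouts."""
    order = np.argsort(rewards)
    low = m // 2
    return np.concatenate([order[:low], order[len(rewards) - (m - low):]])
```

Keeping both tails preserves a strong contrastive signal for the policy update while discarding the mid-reward rollouts that contribute little gradient information.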

[NLP-6] Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations

[Quick Read]: This paper fills a research gap on how LLMs perceive their knowledge boundaries across languages, and on how to analyze and exploit the differences in that perception. The key to the solution is a training-free alignment method that mines cross-lingual knowledge-boundary perception from middle to middle-upper layer representations, thereby reducing hallucination risk in low-resource languages. The paper further shows that fine-tuning on bilingual question-pair translation strengthens LLMs' cross-lingual recognition of knowledge boundaries, and it constructs a multilingual evaluation suite with three representative types of knowledge-boundary data to fill the lack of standard testbeds.

Link: https://arxiv.org/abs/2504.13816
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
Affiliations: DAMO Academy, Alibaba Group; Department of Computer Science, Durham University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at this https URL.

[NLP-7] BadApex: Backdoor Attack Based on Adaptive Optimization Mechanism of Black-box Large Language Models

[Quick Read]: This paper observes that existing insertion- and paraphrase-based backdoor attacks neglect the semantic consistency and text quality between poisoned and clean text, and that recent work using large language models (LLMs) to generate poisoned text, while improving stealthiness, semantic consistency, and quality, relies on hand-crafted prompts built from expert experience, limiting prompt adaptability and post-defense attack performance. The key solution is BadApex, a new backdoor attack based on an adaptive optimization mechanism over black-box LLMs: a refined prompt drives a black-box LLM to generate poisoned text, and an adaptive optimization loop iteratively improves the initial prompt, with a generation agent producing poisoned text and a modification agent evaluating its quality and refining a new prompt. After several iterations, the refined prompt is used to generate poisoned text with LLMs. Experiments show BadApex significantly outperforms state-of-the-art methods across multiple datasets, improving prompt adaptability, semantic consistency, and text quality; even with two defense methods applied, the average attack success rate (ASR) still reaches 96.75%.

Link: https://arxiv.org/abs/2504.13775
Authors: Zhengxian Wu, Juan Wen, Wanli Peng, Ziwei Zhang, Yinghan Zhou, Yiming Xue
Affiliations: China Agricultural University
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 16 pages, 6 figures

Abstract: Previous insertion-based and paraphrase-based backdoors have achieved great success in attack efficacy, but they ignore the text quality and semantic consistency between poisoned and clean texts. Although recent studies introduce LLMs to generate poisoned texts and improve the stealthiness, semantic consistency, and text quality, their hand-crafted prompts rely on expert experiences, facing significant challenges in prompt adaptability and attack performance after defenses. In this paper, we propose a novel backdoor attack based on the adaptive optimization mechanism of black-box large language models (BadApex), which leverages a black-box LLM to generate poisoned text through a refined prompt. Specifically, an Adaptive Optimization Mechanism is designed to refine an initial prompt iteratively using the generation and modification agents. The generation agent generates the poisoned text based on the initial prompt. Then the modification agent evaluates the quality of the poisoned text and refines a new prompt. After several iterations of the above process, the refined prompt is used to generate poisoned texts through LLMs. We conduct extensive experiments on three datasets with six backdoor attacks and two defenses. Extensive experimental results demonstrate that BadApex significantly outperforms state-of-the-art attacks. It improves prompt adaptability, semantic consistency, and text quality. Furthermore, when two defense methods are applied, the average attack success rate (ASR) still reaches up to 96.75%.

[NLP-8] Scaling sparse feature circuit finding for in-context learning

[Quick Read]: This paper aims to understand the mechanism of in-context learning (ICL) in large language models. The key is to use sparse autoencoders (SAEs) to expose the latent vector representations the model uses when executing tasks and their causal roles. The paper identifies abstract SAE features that both encode the model's knowledge of which task to execute and causally induce zero-shot task execution. It further shows that these task vectors are well approximated by a sparse sum of SAE latents, and scales an improved sparse-feature-circuits methodology to the larger Gemma-1 2B model, discovering task-detecting features that are causally linked to task-execution features through the attention and MLP sublayers.

Link: https://arxiv.org/abs/2504.13756
Authors: Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt, that detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers.

[NLP-9] Learning to Attribute with Attention

[Quick Read]: This paper addresses the problem of identifying which preceding tokens most influence a language model to generate a given sequence. The conventional approach ablates preceding tokens and measures their effect directly, which is expensive. The key solution is Attribution with Attention (AT2), which treats the attention weights of different attention heads as features and learns, using signal from ablation experiments, how to leverage those weights effectively for attribution. AT2 performs on par with approaches requiring many ablations while being significantly more efficient. The paper also demonstrates AT2's utility by pruning unimportant parts of the context in question answering, improving answer quality.

Link: https://arxiv.org/abs/2504.13752
Authors: Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token’s influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at this https URL .
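
A hypothetical sketch of the core idea: treat per-head attention as features and fit them against measured ablation effects. The least-squares fit and the feature construction below are stand-ins for the paper's actual learning setup.

```python
import numpy as np

def fit_head_weights(attn_features: np.ndarray,
                     ablation_effects: np.ndarray) -> np.ndarray:
    """attn_features: (n_pairs, n_heads) attention a source token receives,
    aggregated per head; ablation_effects: (n_pairs,) measured drop in the
    target sequence's likelihood when that token is ablated."""
    weights, *_ = np.linalg.lstsq(attn_features, ablation_effects, rcond=None)
    return weights

def attribution_scores(attn_features: np.ndarray,
                       weights: np.ndarray) -> np.ndarray:
    # a learned combination of per-head attention acts as the attribution
    return attn_features @ weights
```

The expensive ablations are paid once, at training time; afterwards, attribution for new sequences costs only a weighted read of attention weights the model already computes.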

[NLP-10] Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence

[Quick Read]: This paper tackles predicting territorial control from unstructured open-source intelligence (OSINT) text. The key is CONTACT, a framework that combines large language models (LLMs) with minimal supervision for territorial-control prediction. Two approaches are evaluated: SetFit, an embedding-based few-shot classifier, and prompt tuning applied to BLOOMZ-560m, a multilingual generative LLM. Models are trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. Results show the BLOOMZ-based model outperforms the SetFit baseline and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned with few-shot methods can reduce annotation burden and support structured inference from open-ended OSINT streams.

Link: https://arxiv.org/abs/2504.13730
Authors: Paul K. Mandal, Cole Leo, Connor Hurley
Affiliations: Neurint LLC; The University of Texas at Austin; T2S Solutions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure, 1 table

Abstract:Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at this https URL.

[NLP-11] OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

[Quick Read]: This paper addresses the insufficient evaluation of deception risks when large language models (LLMs) are deployed in practice. Existing evaluations rely on simulated games or limited-choice settings and fail to reflect the complexity of real interactions. The key is OpenDeception, a novel deception-evaluation framework built on an open-ended scenario dataset that jointly assesses the deception intention and capability of LLM-based agents by inspecting their internal reasoning process, while avoiding high-risk interactions with human testers through multi-turn dialogue simulation. The study finds that mainstream LLMs show high deception-intention ratios (over 80%) and deception success rates (over 50%), and that more capable models exhibit higher deception risk, underscoring the urgency of suppressing deceptive behavior.

Link: https://arxiv.org/abs/2504.13707
Authors: Yichen Wu, Xudong Pan, Geng Hong, Min Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.

[NLP-12] Deep literature reviews: an application of fine-tuned language models to migration research

[Quick Read]: This paper targets the efficiency and consistency limits of traditional literature reviews at scale. The key is a hybrid framework combining traditional bibliometric methods with domain-adapted, fine-tuned large language models (LLMs). Having LLMs generate initial labels and human reviewers correct misclassifications in an error-focused validation process improves annotation efficiency and consistency while broadening and deepening knowledge synthesis. The paper also shows that a domain-adapted LLM can act as a "specialist" tool, accurately selecting relevant studies, detecting emerging trends, and identifying critical research gaps in cross-disciplinary reviews.

Link: https://arxiv.org/abs/2504.13685
Authors: Stefano M. Iacus, Haodong Qi, Jiyoung Han
Affiliations: Harvard University, USA; Malmö University, Sweden; Stockholm University Demography Unit, Sweden; Malmö University, Sweden
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments:

Abstract:This paper presents a hybrid framework for literature reviews that augments traditional bibliometric methods with large language models (LLMs). By fine-tuning open-source LLMs, our approach enables scalable extraction of qualitative insights from large volumes of research content, enhancing both the breadth and depth of knowledge synthesis. To improve annotation efficiency and consistency, we introduce an error-focused validation process in which LLMs generate initial labels and human reviewers correct misclassifications. Applying this framework to over 20000 scientific articles about human migration, we demonstrate that a domain-adapted LLM can serve as a “specialist” model - capable of accurately selecting relevant studies, detecting emerging trends, and identifying critical research gaps. Notably, the LLM-assisted review reveals a growing scholarly interest in climate-induced migration. However, existing literature disproportionately centers on a narrow set of environmental hazards (e.g., floods, droughts, sea-level rise, and land degradation), while overlooking others that more directly affect human health and well-being, such as air and water pollution or infectious diseases. This imbalance highlights the need for more comprehensive research that goes beyond physical environmental changes to examine their ecological and societal consequences, particularly in shaping migration as an adaptive response. Overall, our proposed framework demonstrates the potential of fine-tuned LLMs to conduct more efficient, consistent, and insightful literature reviews across disciplines, ultimately accelerating knowledge synthesis and scientific discovery.

[NLP-13] Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

[Quick Read]: This paper addresses biases in evaluating uncertainty quantification (UQ) for language models (LMs). It finds that commonly used correctness functions (such as ROUGE-L) inflate the measured performance of certain UQ methods, distorting evaluations. The key finding is that LLM-as-a-judge approaches are among the least length-biased correctness functions compared with lexical- and embedding-based alternatives, and hence a potential way to mitigate the distortion of UQ evaluation caused by length biases.

Link: https://arxiv.org/abs/2504.13677
Authors: Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions – from lexical-based and embedding-based metrics to LLM-as-a-judge approaches – across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
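
The evaluation protocol under study pairs a UQ score with a correctness function and reports AUROC; a minimal sketch of that protocol (the 0.5 binarization threshold is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def uq_auroc(uq_scores: np.ndarray, correctness: np.ndarray,
             threshold: float = 0.5) -> float:
    """Higher uq_scores should indicate higher confidence; correctness
    comes from a correctness function such as ROUGE-L, in [0, 1]."""
    labels = (correctness >= threshold).astype(int)  # binarize correctness
    return roc_auc_score(labels, uq_scores)
```

The paper's point is that the `correctness` argument is not neutral: if its errors correlate with response length, and the UQ score also depends on length (as sequence probabilities do), the AUROC partly measures that shared length bias rather than calibration.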

[NLP-14] Large Language Models Will Change The Way Children Think About Technology And Impact Every Interaction Paradigm

[Quick Read]: This paper explores the potentially profound impact of large language models (LLMs) on how children learn and what they will expect from technology, and anticipates the key challenges that future interactive-system design must address. Although LLMs' effect on education has so far been modest, the paper argues the coming changes will be far more significant. It illustrates these effects with a short scenario and a self-ethnographic study, and its core contribution is five significant considerations that interactive-systems designers will need to accommodate in the future.

Link: https://arxiv.org/abs/2504.13667
Authors: Russell Beale
Affiliations: School of Computer Science, University of Birmingham
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for IDC 2025. Citation: Russell Beale. 2025. Large Language Models Will Change The Way Children Think About Technology And Impact Every Interaction Paradigm. In Proceedings of Interaction Design and Children Conference (IDC2025). ACM, New York, NY, USA

Abstract: This paper presents a hopeful perspective on the potentially dramatic impacts of Large Language Models on how children learn and how they will expect to interact with technology. We review the effects of LLMs on education so far, and make the case that these effects are minor compared to the upcoming changes that are occurring. We present a small scenario and self-ethnographic study demonstrating the effects of these changes, and define five significant considerations that interactive systems designers will have to accommodate in the future.

[NLP-15] Multi-Type Context-Aware Conversational Recommender Systems via Mixture-of-Experts

[Quick Read]: This paper addresses the challenge of exploiting multiple types of contextual information in conversational recommender systems, in particular how to fuse structured sources (such as knowledge graphs) with unstructured ones (such as conversation history and item reviews). The key innovation is MCCRS, a multi-type context-aware conversational recommender system that fuses contextual information via a mixture-of-experts: several experts each specialize in one domain (one specific type of context), and a coordinator, ChairBot, integrates their outputs to produce the final recommendation. This design breaks the bottleneck of relying on a single type of context and exploits both the specialization of different experts and the diversity of contextual information. Experiments show MCCRS significantly outperforms existing baselines.

Link: https://arxiv.org/abs/2504.13655
Authors: Jie Zou, Cheng Lin, Weikang Guo, Zheng Wang, Jiwei Wei, Yang Yang, Hengtao Shen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 30 pages

Abstract: Conversational recommender systems enable natural language conversations and thus lead to a more engaging and effective recommendation scenario. As the conversations for recommender systems usually contain limited contextual information, many existing conversational recommender systems incorporate external sources to enrich the contextual information. However, how to combine different types of contextual information is still a challenge. In this paper, we propose a multi-type context-aware conversational recommender system, called MCCRS, effectively fusing multi-type contextual information via mixture-of-experts to improve conversational recommender systems. MCCRS incorporates both structured information and unstructured information, including the structured knowledge graph, unstructured conversation history, and unstructured item reviews. It consists of several experts, with each expert specialized in a particular domain (i.e., one specific contextual information). Multiple experts are then coordinated by a ChairBot to generate the final results. Our proposed MCCRS model takes advantage of different types of contextual information and the specialization of different experts, while the coordinating ChairBot breaks the model bottleneck of relying on a single type of contextual information. Experimental results demonstrate that our proposed MCCRS method achieves significantly higher performance compared to existing baselines.

[NLP-16] Word Embedding Techniques for Classification of Star Ratings

[Quick Read]: This paper studies how different word embedding algorithms affect text classification on customer reviews, and explores the role of feature engineering and dimensionality reduction in improving classification. The core question is how several state-of-the-art word embedding techniques (such as BERT, Word2Vec, and Doc2Vec) perform when coupled with classification algorithms, measured by precision, recall, and F1-score. The paper also examines the energy consumption of these embedding methods.

The key to the solution is an extensive study on a novel dataset, optimizing embedding representations with dimensionality-reduction strategies such as principal component analysis (PCA), and comparing the traditional mean-vector aggregation with PCA-based combination. The findings show that for the more challenging classification tasks, BERT combined with PCA achieves the highest performance metrics, and the proposed approach of combining word vectors via the first principal component clearly outperforms the traditional averaging approach.

Link: https://arxiv.org/abs/2504.13653
Authors: Hesham Abdelmotaleb, Craig McNeile, Malgorzata Wojtys
Affiliations: Centre for Mathematical Sciences, University of Plymouth
Subjects: Computation and Language (cs.CL); Applications (stat.AP)
Comments: 40 pages

Abstract: Telecom services are at the core of today’s societies’ everyday needs. The availability of numerous online forums and discussion platforms enables telecom providers to improve their services by exploring the views of their customers to learn about common issues that the customers face. Natural Language Processing (NLP) tools can be used to process the free text collected. One way of working with such data is to represent text as numerical vectors using one of many word embedding models based on neural networks. This research uses a novel dataset of telecom customers’ reviews to perform an extensive study showing how different word embedding algorithms can affect the text classification process. Several state-of-the-art word embedding techniques are considered, including BERT, Word2Vec and Doc2Vec, coupled with several classification algorithms. The important issue of feature engineering and dimensionality reduction is addressed and several PCA-based approaches are explored. Moreover, the energy consumption used by the different word embeddings is investigated. The findings show that some word embedding models can lead to consistently better text classifiers in terms of precision, recall and F1-Score. In particular, for the more challenging classification tasks, BERT combined with PCA stood out with the highest performance metrics. Moreover, our proposed PCA approach of combining word vectors using the first principal component shows clear advantages in performance over the traditional approach of taking the average.
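
One plausible reading of "combining word vectors using the first principal component", shown next to the traditional mean, is sketched below; the exact construction in the paper may differ.

```python
import numpy as np

def doc_vector_first_pc(word_vectors: np.ndarray) -> np.ndarray:
    """Represent a document by the first principal component of its
    word-vector matrix (n_words, dim), computed via SVD on centered rows."""
    centered = word_vectors - word_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # direction of maximal variance across the words

def doc_vector_mean(word_vectors: np.ndarray) -> np.ndarray:
    return word_vectors.mean(axis=0)  # the traditional averaging baseline
```

Unlike the mean, the first principal component is insensitive to how often near-duplicate words repeat and captures the dominant direction of variation among the document's words, which may explain the reported performance gap.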

[NLP-17] Exploring the Potential for Large Language Models to Demonstrate Rational Probabilistic Beliefs

[Quick Read]: This paper examines a deficiency of large language models (LLMs) in representing probabilistic reasoning: current versions of this class of model cannot provide rational and coherent expressions of probabilistic beliefs. The key to the solution is introducing a novel dataset of claims with indeterminate truth values and applying a number of well-established uncertainty-quantification techniques to measure whether LLMs adhere to fundamental properties of probabilistic reasoning.

Link: https://arxiv.org/abs/2504.13644
Authors: Gabriel Freedman, Francesca Toni
Affiliations: Department of Computing, Imperial College London
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures

Abstract: Advances in the general capabilities of large language models (LLMs) have led to their use for information retrieval, and as components in automated decision systems. A faithful representation of probabilistic reasoning in these models may be essential to ensure trustworthy, explainable and effective performance in these tasks. Despite previous work suggesting that LLMs can perform complex reasoning and well-calibrated uncertainty quantification, we find that current versions of this class of model lack the ability to provide rational and coherent representations of probabilistic beliefs. To demonstrate this, we introduce a novel dataset of claims with indeterminate truth values and apply a number of well-established techniques for uncertainty quantification to measure the ability of LLMs to adhere to fundamental properties of probabilistic reasoning.

[NLP-18] Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning SIGIR2025

[Quick Read]: This paper addresses the neglect of user characteristics in existing dialogue policy planning, especially in real-world applications such as conversational search and recommendation that must adapt to individual traits (personality, preferences, goals). The key is the User-Tailored Dialogue Policy Planning (UDP) framework, which models user traits and feedback with an Intrinsic User World Model. UDP operates in three stages: (1) user persona portraying, dynamically inferring user profiles with a diffusion model; (2) user feedback anticipating, predicting user reactions with a Brownian Bridge-inspired anticipator; and (3) user-tailored policy planning, integrating these insights to optimize response strategies. To ensure robust performance, an active learning approach additionally prioritizes challenging user personas during training. Experiments on benchmarks covering collaborative and non-collaborative settings validate UDP's effectiveness in learning user-specific dialogue strategies as well as its robustness and adaptability.

Link: https://arxiv.org/abs/2504.13643
Authors: Tao He, Lizi Liao, Ming Liu, Bing Qin
Affiliations: Harbin Institute of Technology; Singapore Management University
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures, SIGIR 2025

Abstract:Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real-world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task-specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user-tailored dialogue policy planning. Building on this foundation, we present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non-collaborative settings, demonstrate the effectiveness of UDP in learning user-specific dialogue strategies. Results validate the protocol’s utility and highlight UDP’s robustness, adaptability, and potential to advance user-centric dialogue systems.

[NLP-19] ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

[Quick Read]: This paper tackles the inherent noise and inconsistency of human ratings in machine translation (MT) evaluation. Regression-based neural metrics struggle with this noise, and prompting large language models (LLMs) works well at the system level but poorly at the segment level. The key is ReMedy, a new framework that reformulates translation evaluation as a reward-modeling task: rather than regressing directly on imperfect ratings, ReMedy learns relative translation quality from pairwise preference data, yielding more reliable evaluation. Across the WMT22-24 shared tasks, ReMedy achieves state-of-the-art performance at both the segment and system levels, with markedly better detection of translation errors and assessment of low-quality translations.

Link: https://arxiv.org/abs/2504.13630
Authors: Shaomu Tan, Christof Monz
Affiliations: University of Amsterdam
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
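
The reward-modeling reformulation the abstract describes is commonly trained with a Bradley-Terry style pairwise loss; a minimal sketch under that assumption (not necessarily the authors' exact objective):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar quality scores for the preferred and dispreferred translation
    of the same source. Minimizing this loss trains the model to rank the
    preferred translation higher, learning relative quality."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```

Because only the score difference matters, consistent per-annotator offsets in the raw ratings cancel out, which is exactly the property that makes pairwise training more robust to noisy human scores than direct regression.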

[NLP-20] Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing

[Quick Read]: This paper investigates the impact of AI-assisted generative revisions on academic writing, focusing on heterogeneous adoption of large language models (LLMs) across disciplines, gender, native-language status, and career stage, and how that adoption drives convergence in scholarly writing style. The key is a novel classification framework, built by fine-tuning prompt- and discipline-specific LLMs, that detects the style of ChatGPT-revised text. Using a dataset of over 627,000 arXiv papers, the study reveals disparities in LLM adoption across academic groups, shows that LLM usage improves clarity, conciseness, and adherence to formal writing conventions (with gains varying by revision type), and, via a difference-in-differences analysis, shows that LLMs drive convergence in academic writing, with early adopters, male researchers, non-native speakers, and junior scholars exhibiting the most pronounced stylistic shifts.

Link: https://arxiv.org/abs/2504.13629
Authors: Cong William Lin, Wu Zhu
Affiliations: Cornell University SC Johnson College of Business; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:Large Language Models (LLMs), such as ChatGPT, are reshaping content creation and academic writing. This study investigates the impact of AI-assisted generative revisions on research manuscripts, focusing on heterogeneous adoption patterns and their influence on writing convergence. Leveraging a dataset of over 627,000 academic papers from arXiv, we develop a novel classification framework by fine-tuning prompt- and discipline-specific large language models to detect the style of ChatGPT-revised texts. Our findings reveal substantial disparities in LLM adoption across academic disciplines, gender, native language status, and career stage, alongside a rapid evolution in scholarly writing styles. Moreover, LLM usage enhances clarity, conciseness, and adherence to formal writing conventions, with improvements varying by revision type. Finally, a difference-in-differences analysis shows that while LLMs drive convergence in academic writing, early adopters, male researchers, non-native speakers, and junior scholars exhibit the most pronounced stylistic shifts, aligning their writing more closely with that of established researchers.

[NLP-21] Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models

[Quick Read]: This paper addresses the "overthinking" problem of large reasoning models (LRMs): redundant reasoning steps inflate computational cost while yielding limited performance gains. Conventional mitigations rely on fine-tuning, which requires extra data and unconventional training setups, risks safety misalignment, and generalizes poorly. The key empirical finding is that inserting external chains of thought (CoTs) generated by smaller models between the thinking tokens (<think> and </think>) effectively manipulates the model into generating fewer thoughts. Building on this insight, the paper proposes ThoughtMani, a simple and efficient pipeline that lets LRMs bypass redundant intermediate steps, significantly reducing computational cost while preserving performance and improving safety alignment by about 10% on average. Since vendors typically serve models at multiple scales, the approach offers a practical way to build more efficient and accessible LRMs for real-world applications.

Link: https://arxiv.org/abs/2504.13626
Authors: Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, Xinlei He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Recent advancements in large reasoning models (LRMs) have demonstrated the effectiveness of scaling test-time computation to enhance reasoning capabilities in multiple tasks. However, LRMs typically suffer from “overthinking” problems, where models generate significantly redundant reasoning steps while bringing limited performance gains. Existing work relies on fine-tuning to mitigate overthinking, which requires additional data, unconventional training setups, risky safety misalignment, and poor generalization. Through empirical analysis, we reveal an important characteristic of LRM behaviors that placing external CoTs generated by smaller models between the thinking tokens (<think> and </think>) can effectively manipulate the model to generate fewer thoughts. Building on these insights, we propose a simple yet efficient pipeline, ThoughtMani, to enable LRMs to bypass unnecessary intermediate steps and reduce computational costs significantly. We conduct extensive experiments to validate the utility and efficiency of ThoughtMani. For instance, when applied to QwQ-32B on the LiveBench/Code dataset, ThoughtMani keeps the original performance and reduces output token counts by approximately 30%, with little overhead from the CoT generator. Furthermore, we find that ThoughtMani enhances safety alignment by an average of 10%. Since model vendors typically serve models of different sizes simultaneously, ThoughtMani provides an effective way to construct more efficient and accessible LRMs for real-world applications.
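
The manipulation itself is simple to picture: splice a small model's chain of thought between the reasoning delimiters so the large model treats it as its own thinking. A minimal sketch, where the delimiter strings and function name are illustrative assumptions that should be matched to the target model's chat template:

```python
def splice_external_cot(question: str, external_cot: str) -> str:
    """Place a smaller model's chain of thought between the reasoning
    delimiters; the large reasoning model then continues from it rather
    than generating its own redundant intermediate steps."""
    return f"{question}\n<think>\n{external_cot}\n</think>\n"
```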

[NLP-22] Long-context Non-factoid Question Answering in Indic Languages

[Quick Read]: This paper addresses the performance degradation of long-context question answering (QA) caused by the quadratic complexity of self-attention, a challenge compounded in low-resource Indic languages. The key solution is to explore several context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations. Experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) show that, without fine-tuning, these techniques improve semantic scores by 4% and token-level scores by 47% on average across three popular LLMs, and with fine-tuning add a further 2% to both. Context shortening also reduces computational overhead, and explainability methods reveal that when the APS model confidently identifies the answer-bearing paragraph, nearly all tokens in the selected text receive high relevance scores. The study also notes the limitations of LLM-based QA systems on non-factoid questions, such as those requiring reasoning or debate, and finds that verbalizing OIE-generated triples does not improve performance. Overall, the work highlights the potential of context shortening to improve the efficiency and effectiveness of LLM-based QA, especially for low-resource languages.

Link: https://arxiv.org/abs/2504.13615
Authors: Ritwik Mishra, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Affiliations: Indraprastha Institute of Information Technology, Delhi; International Institute of Information Technology, Hyderabad
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at this https URL.
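
Among the shortening techniques, APS is the easiest to picture as a retrieval step. A simplified sketch that ranks paragraphs by cosine similarity to the question; the paper's APS is a trained model, so this similarity ranking is only a stand-in:

```python
import numpy as np

def select_answer_paragraph(question_emb: np.ndarray,
                            paragraph_embs: np.ndarray) -> int:
    """question_emb: (dim,); paragraph_embs: (n_paragraphs, dim).
    Returns the index of the paragraph most similar to the question,
    which then serves as the shortened context for the LLM."""
    sims = paragraph_embs @ question_emb / (
        np.linalg.norm(paragraph_embs, axis=1) * np.linalg.norm(question_emb))
    return int(np.argmax(sims))
```

Passing only the selected paragraph instead of the full document is what turns the quadratic attention cost over a long context into a cost over a few hundred tokens.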

[NLP-23] Continual Pre-Training is (not) What You Need in Domain Adaption

[Quick Read]: This paper addresses the challenge of effectively adapting large language models (LLMs) to the legal domain, where the difficulties include the complexity of legal reasoning, the need for precise interpretation of specialized language, and the risk of hallucination. The key is an examination of how well Domain-Adaptive Continual Pre-Training (DACP) improves LLMs' legal reasoning. Through experiments on legal reasoning tasks within the Taiwanese legal framework, the study finds that while DACP enhances domain-specific knowledge, its gains are not uniform across legal tasks. The paper further discusses the trade-offs of DACP, particularly its impact on model generalization and on prompt-based tasks, and proposes directions for optimizing domain-adaptation strategies in legal AI.

Link: https://arxiv.org/abs/2504.13603
Authors: Pin-Er Chen, Da-Chen Lian, Shu-Kai Hsieh, Sieh-Chuen Huang, Hsuan-Lei Shao, Jun-Wei Chiu, Yang-Hsien Lin, Zih-Ching Chen, Cheng-Kuang, Eddie TC Huang, Simon See
Affiliations: National Taiwan University; Taipei Medical University; NVIDIA AI Technology Center, NVIDIA Corporation
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures

Abstract:The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.

[NLP-24] Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

[Quick Read]: This paper addresses the adaptability challenge of intent detection in task-oriented dialogue (TOD) systems as the number of integrable tools with complex interrelationships grows rapidly. Existing approaches such as zero-shot reformulations and LLM-based dynamic recognition degrade on unseen intents, leading to erroneous task routing. The key solution is combining Reinforcement Learning (RL) with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training for intent detection. Experiments show RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization, and RCS markedly strengthens RL by focusing training on challenging cases. Incorporating Chain-of-Thought (COT) processes into RL further improves generalization on complex intent-detection tasks, underscoring the value of explicit thought in hard scenarios. The work advances generalization in intent detection and offers practical insights for deploying adaptable dialogue systems.

Link: https://arxiv.org/abs/2504.13592
Authors: Zihao Feng, Xiaoxue Wang, Ziwei Bai, Donghang Su, Bowen Wu, Qun Yu, Baoxun Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Intent detection, a critical component in task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, struggle with performance degradation when encountering unseen intents, leading to erroneous task routing. To enhance the model’s generalization performance on unseen tasks, we employ Reinforcement Learning (RL) combined with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training in intent detection tasks. Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. Besides, the introduction of RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (COT) processes in RL notably improves generalization in complex intent detection tasks, underscoring the importance of thought in challenging scenarios. This work advances the generalization of intent detection tasks, offering practical insights for deploying adaptable dialogue systems.
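
Reward-based curriculum sampling can be pictured as reweighting training prompts by recent difficulty; a hypothetical sketch (the weighting rule is an assumption, not the paper's schedule):

```python
import numpy as np

def curriculum_batch(mean_rewards: np.ndarray, batch_size: int,
                     rng: np.random.Generator) -> np.ndarray:
    """mean_rewards: per-prompt average reward from recent GRPO rollouts.
    Low reward marks a hard case, which receives a higher sampling weight."""
    difficulty = (1.0 - mean_rewards) + 1e-6  # epsilon keeps probs valid
    probs = difficulty / difficulty.sum()
    return rng.choice(mean_rewards.size, size=batch_size,
                      replace=False, p=probs)
```

Concentrating updates on prompts the policy currently fails is what drives the reported generalization gains on unseen intents, since easy, already-solved cases contribute little learning signal.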

[NLP-25] DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification

[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks: although safety-aligned LLMs defend well against ordinary harmful queries, they remain susceptible to such attacks, and existing defenses based on fine-tuning or input modification suffer from limited generalization and reduced utility. The key is DETAM, a fine-tuning-free defense that improves robustness via targeted attention modification. Concretely, the authors analyze differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks, then reallocate attention at inference time to emphasize the user's core intention and reduce interference from attack tokens. Experiments show DETAM outperforms various baselines in jailbreak defense, generalizes robustly across attacks and models, and remains effective on in-the-wild jailbreak data; evaluation on over-defense datasets further validates the approach's utility.

Link: https://arxiv.org/abs/2504.13562
Authors: Yu Li, Han Jiang, Zhihua Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:With the widespread adoption of Large Language Models (LLMs), jailbreak attacks have become an increasingly pressing safety concern. While safety-aligned LLMs can effectively defend against normal harmful queries, they remain vulnerable to such attacks. Existing defense methods primarily rely on fine-tuning or input modification, which often suffer from limited generalization and reduced utility. To address this, we introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities against jailbreak attacks of LLMs via targeted attention modification. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user’s core intention, minimizing interference from attack tokens. Our experimental results demonstrate that DETAM outperforms various baselines in jailbreak defense and exhibits robust generalization across different attacks and models, maintaining its effectiveness even on in-the-wild jailbreak data. Furthermore, in evaluating the model’s utility, we incorporated over-defense datasets, which further validate the superior performance of our approach. The code will be released immediately upon acceptance.
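
The head-selection step the abstract describes can be sketched as ranking heads by the attention-score gap between defended and bypassed prompts; a hypothetical illustration in which the choice of statistic and of `top_k` are assumptions:

```python
import numpy as np

def jailbreak_sensitive_heads(attn_defended: np.ndarray,
                              attn_bypassed: np.ndarray,
                              top_k: int) -> np.ndarray:
    """attn_*: (n_examples, n_heads) attention mass on user-intent tokens,
    collected from prompts the model defended vs. ones that bypassed it.
    Heads with the largest mean gap are flagged as jailbreak-sensitive."""
    gap = attn_defended.mean(axis=0) - attn_bypassed.mean(axis=0)
    return np.argsort(-np.abs(gap))[:top_k]
```

At inference time, attention in the flagged heads would then be reweighted toward the tokens carrying the user's core intention, which is the "targeted modification" part of the method.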

[NLP-26] Q-FAKER: Query-free Hard Black-box Attack via Controlled Generation NAACL2025

[Quick Read]: This paper addresses the limits of existing adversarial attack methods in realistic hard black-box settings, where the target model is inaccessible and queries are costly, making it impractical to generate adversarial examples. The key innovation is Q-faker (Query-free Hard Black-box Attacker), which generates adversarial examples without accessing the target model: a surrogate model performs target-agnostic adversarial generation using controlled generation techniques, avoiding high query costs and any dependence on target-model information while producing highly transferable, high-quality adversarial examples.

Link: https://arxiv.org/abs/2504.13551
Authors: CheolWon Na, YunSeok Choi, Jee-Hyong Lee
Affiliations: Sungkyunkwan University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NAACL 2025 Findings

Abstract: Many adversarial attack approaches are proposed to verify the vulnerability of language models. However, they require numerous queries and information on the target model. Even black-box attack methods also require the target model’s output information. They are not applicable in real-world scenarios, as in hard black-box settings where the target model is closed and inaccessible. Even the recently proposed hard black-box attacks still require many queries and demand extremely high costs for training adversarial generators. To address these challenges, we propose Q-faker (Query-free Hard Black-box Attacker), a novel and efficient method that generates adversarial examples without accessing the target model. To avoid accessing the target model, we use a surrogate model instead. The surrogate model generates adversarial sentences for a target-agnostic attack. During this process, we leverage controlled generation techniques. We evaluate our proposed method on eight datasets. Experimental results demonstrate our method’s effectiveness, including high transferability and the high quality of the generated adversarial examples, and prove that it is practical in hard black-box settings.

[NLP-27] Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala English and Code-Mixed Content

[Quick Read]: This paper targets sentiment analysis of cross-lingual customer feedback for brand-reputation management in banking, where low-resource languages (such as Sinhala) and code-mixed text are challenging and existing models fall short in both performance and interpretability. The key solution is a hybrid aspect-based sentiment-analysis framework with enhanced multilingual capability and explainable outputs: XLM-RoBERTa is fine-tuned on cleaned banking customer reviews for Sinhala and code-mixed text with domain-specific lexicon correction, BERT-base-uncased handles English, and SHAP and LIME improve interpretability by providing real-time sentiment explanations. The approach significantly improves sentiment classification, reaching 92.3% accuracy with an F1-score of 0.89 for English and 88.4% accuracy for Sinhala and code-mixed text, and a user-friendly interface presents fine-grained, aspect-wise sentiment insights, improving usability and transparency for business applications.

Link: https://arxiv.org/abs/2504.13545
Authors: Azmarah Rizvi, Navojith Thamindu, A.M.N.H. Adhikari, W.P.U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, 4 tables

Abstract:Sentiment analysis is crucial for brand reputation management in the banking sector, where customer feedback spans English, Sinhala, Singlish, and code-mixed text. Existing models struggle with low-resource languages like Sinhala and lack interpretability for practical use. This research develops a hybrid aspect-based sentiment analysis framework that enhances multilingual capabilities with explainable outputs. Using cleaned banking customer reviews, we fine-tune XLM-RoBERTa for Sinhala and code-mixed text, integrate domain-specific lexicon correction, and employ BERT-base-uncased for English. The system classifies sentiment (positive, neutral, negative) with confidence scores, while SHAP and LIME improve interpretability by providing real-time sentiment explanations. Experimental results show that our approaches outperform traditional transformer-based classifiers, achieving 92.3 percent accuracy and an F1-score of 0.89 in English and 88.4 percent in Sinhala and code-mixed content. An explainability analysis reveals key sentiment drivers, improving trust and transparency. A user-friendly interface delivers aspect-wise sentiment insights, ensuring accessibility for businesses. This research contributes to robust, transparent sentiment analysis for financial applications by bridging gaps in multilingual, low-resource NLP and explainability.
zh
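
下面给出一个示意性的 Python 片段,对应上文"微调分类器 + LIME 词级解释"的思路:用一个 transformers 文本分类检查点(此处以 xlm-roberta-base 作占位,实际应换成在银行评论上微调后的情感模型)配合 LIME 输出每个词对情感预测的贡献。这只是依据公开库 API 写的草图,并非论文官方实现。

```python
# 草图:多语言情感分类 + LIME 词级解释(模型名为占位,非论文官方检查点)
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

clf = pipeline("text-classification",
               model="xlm-roberta-base",   # 占位:实际应为微调后的情感模型
               top_k=None)                 # 返回全部类别的分数
labels = [clf.model.config.id2label[i]
          for i in range(clf.model.config.num_labels)]

def predict_proba(texts):
    """LIME 所需接口:输入文本列表,返回 (n, n_classes) 概率矩阵。"""
    probs = []
    for scores in clf(list(texts)):
        by_label = {s["label"]: s["score"] for s in scores}
        probs.append([by_label[l] for l in labels])
    return np.array(probs)

explainer = LimeTextExplainer(class_names=labels)
exp = explainer.explain_instance("Service eka hodai but app is slow",  # 代码混合示例
                                 predict_proba, num_features=6)
print(exp.as_list())   # 每个词对预测类别的贡献权重
```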

[NLP-28] CoT-RAG : Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务中基于链式思维(Chain-of-Thought, CoT)推理存在的两个主要挑战:一是仅依赖LLMs生成推理链的可靠性较低;二是自然语言推理链对LLMs推理逻辑的潜在干扰。为了解决这些问题,论文提出了CoT-RAG框架,其关键设计包括:(i) 知识图谱驱动的CoT生成,通过知识图谱调节LLMs的推理链生成以增强可信度;(ii) 可学习的知识案例感知的RAG,将检索增强生成(Retrieval-Augmented Generation, RAG)与知识图谱结合,检索相关子案例和描述,为LLMs提供可学习的信息;(iii) 伪程序提示执行,鼓励LLMs以更高的逻辑严谨性在伪程序中完成推理任务。实验结果表明,相比现有最先进方法,CoT-RAG在九个公开数据集上的准确率提升了4.0%到23.0%,并在四个领域特定数据集上展现了卓越的准确性和高效执行能力。

链接: https://arxiv.org/abs/2504.13534
作者: Feiyang Li,Peng Fang,Zhan Shi,Arijit Khan,Fang Wang,Dan Feng,Weihao Wang,Xin Zhang,Yongjian Cui
机构: School of Computer Science and Technology, Huazhong University of Science and Technology (华中科技大学); Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology (华中科技大学); Department of Computer Science, Aalborg University (奥尔堡大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still has two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to execute reasoning tasks in pseudo-programs with greater logical rigor. We conduct a comprehensive evaluation on nine public datasets, covering three reasoning problems. Compared with the state-of-the-art methods, CoT-RAG exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%. Furthermore, testing on four domain-specific datasets, CoT-RAG shows remarkable accuracy and efficient execution, highlighting its strong practical applicability and scalability.
zh
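
为直观理解上述三个设计如何协同,下面是一个极简的提示拼装草图:由(假设的)知识图谱步骤驱动推理链,为每一步检索子案例,并以伪程序形式组织提示。其中 KGNode、retrieve_sub_cases 等名称与结构均为虚构占位,仅示意流程,并非论文官方实现。

```python
# 草图:CoT-RAG 风格的提示拼装(检索器与图谱步骤均为虚构占位)
from dataclasses import dataclass

@dataclass
class KGNode:
    step: str            # 知识图谱中的一个推理步骤
    sub_cases: list      # 为该步骤检索到的相关子案例

def retrieve_sub_cases(step: str, k: int = 2) -> list:
    # 占位:实际应做向量检索,从案例库取回与该步骤相关的子案例/子描述
    return [f"[case] evidence relevant to: {step}"][:k]

def build_prompt(question: str, kg_steps: list) -> str:
    nodes = [KGNode(s, retrieve_sub_cases(s)) for s in kg_steps]
    lines = [f"Question: {question}",
             "Execute the following pseudo-program step by step:"]
    for i, node in enumerate(nodes, 1):
        lines.append(f"step_{i} = solve('{node.step}')  # evidence: {node.sub_cases}")
    lines.append("answer = combine(" +
                 ", ".join(f"step_{i}" for i in range(1, len(nodes) + 1)) + ")")
    return "\n".join(lines)

print(build_prompt("Is the contract clause enforceable?",
                   ["identify governing law", "check clause validity", "apply precedent"]))
```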

[NLP-29] Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning

【速读】: 本文旨在解决大型语言模型(LLMs)在复杂推理任务中容易出现错误累积的问题,提出了一种基于推理过程预判的新策略。传统方法往往依赖试错机制,而本文通过引入预判节点(prejudge node),使模型能够在推理过程中主动识别可能导致错误的路径,并提前调整推理方向,类似于人类在解决问题时常有的反思行为。关键在于设计了一个结合动态树搜索的自动化推理框架,该框架仅需一个LLM即可完成答案判断、回应批评、预判生成及思维补全等任务。此外,通过监督微调(SFT)与强化学习(RL)相结合的两阶段训练机制进一步优化模型性能。实验结果表明,此方法显著提升了LLMs的推理能力。

链接: https://arxiv.org/abs/2504.13500
作者: Jianing Wang,Jin Jiang,Yang Liu,Mengdi Zhang,Xunliang Cai
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we introduce a new \emphprocess prejudge strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step, with at least one step that follows the prejudge node that has no paths toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data are released at this https URL.
zh
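
下面用一个不依赖具体模型的 Python 草图示意"先预判、再推理"的循环:每步先让模型预判可能的错误,再在该预判约束下生成下一步。llm() 为假设的占位接口,提示词措辞也是本文虚构,并非论文官方框架。

```python
# 草图:"先预判、再推理"的循环;llm() 为假设的占位接口
def llm(prompt: str) -> str:
    return "..."   # 占位:接入任意 LLM 即可

def reason_with_prejudge(question: str, max_steps: int = 5) -> str:
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        context = "\n".join(trace)
        # 预判:这一步最可能出什么错、如何规避(对应论文中的 prejudge 节点)
        prejudge = llm(context + "\nBefore the next step, what mistake is "
                                 "likely here and how can it be avoided?")
        trace.append(f"[prejudge] {prejudge}")
        # 在预判的约束下生成下一步推理
        step = llm("\n".join(trace) + "\nNext reasoning step:")
        trace.append(f"[step] {step}")
        if "FINAL ANSWER" in step:
            break
    return "\n".join(trace)

print(reason_with_prejudge("27 * 14 = ?"))
```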

[NLP-30] Integrating Locality-Aware Attention with Transformers for General Geometry PDEs IJCNN2025

【速读】: 本文旨在解决神经算子(Neural Operators)在处理复杂几何形状和不规则网格时受限于均匀网格的问题,以及现有基于Transformer的方法在捕捉细粒度动力学和局部偏微分方程(PDE)行为方面存在的不足。论文的关键在于提出了一种名为Locality-Aware Attention Transformer (LA2Former) 的新模型,它通过K近邻动态划分局部区域,并结合全局-局部注意力机制来增强PDE建模能力。LA2Former利用线性注意力实现高效全局上下文编码,同时采用成对注意力捕获复杂的局部交互,从而在计算效率与预测准确性之间达到了理想平衡。这一创新显著提升了在六个基准数据集上的预测精度,相比现有的线性注意力方法提高了超过50%,并且在最优条件下优于全成对注意力方法。因此,论文强调了在复杂不规则域上求解PDE时局部特征学习的重要性。

链接: https://arxiv.org/abs/2504.13480
作者: Minsu Koh,Beom-Chul Park,Heejo Kong,Seong-Whan Lee
机构: Korea University, Seoul, South Korea (韩国高丽大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by IJCNN 2025

点击查看摘要

Abstract:Neural operators have emerged as promising frameworks for learning mappings governed by partial differential equations (PDEs), serving as data-driven alternatives to traditional numerical methods. While methods such as the Fourier neural operator (FNO) have demonstrated notable performance, their reliance on uniform grids restricts their applicability to complex geometries and irregular meshes. Recently, Transformer-based neural operators with linear attention mechanisms have shown potential in overcoming these limitations for large-scale PDE simulations. However, these approaches predominantly emphasize global feature aggregation, often overlooking fine-scale dynamics and localized PDE behaviors essential for accurate solutions. To address these challenges, we propose the Locality-Aware Attention Transformer (LA2Former), which leverages K-nearest neighbors for dynamic patchifying and integrates global-local attention for enhanced PDE modeling. By combining linear attention for efficient global context encoding with pairwise attention for capturing intricate local interactions, LA2Former achieves an optimal balance between computational efficiency and predictive accuracy. Extensive evaluations across six benchmark datasets demonstrate that LA2Former improves predictive accuracy by over 50% relative to existing linear attention methods, while also outperforming full pairwise attention under optimal conditions. This work underscores the critical importance of localized feature learning in advancing Transformer-based neural operators for solving PDEs on complex and irregular domains.
zh
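
其中"K近邻动态划分局部区域"这一步可用如下 PyTorch 草图说明:对不规则网格上的每个点取 k 个最近邻构成局部 patch,随后可在 patch 内做成对注意力(局部)、patch 间做线性注意力(全局)。这是按论文描述自行写的示意,并非官方实现。

```python
# 草图:用 K 近邻为不规则网格上的点动态划分局部 patch(非官方实现)
import torch

def knn_patchify(coords: torch.Tensor, feats: torch.Tensor, k: int = 8):
    """coords: (N, d) 网格点坐标; feats: (N, c) 点特征。
    返回 (N, k, c):每个点及其 k 个最近邻组成的局部 patch 特征。"""
    dist = torch.cdist(coords, coords)            # (N, N) 两两欧氏距离
    idx = dist.topk(k, largest=False).indices     # (N, k),含点自身(距离为 0)
    return feats[idx]                             # (N, k, c) 局部邻域特征

coords = torch.rand(1024, 2)       # 不规则 2D 网格(占位数据)
feats = torch.rand(1024, 64)
patches = knn_patchify(coords, feats, k=8)
print(patches.shape)               # torch.Size([1024, 8, 64])
```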

[NLP-31] LLM Sensitivity Evaluation Framework for Clinical Diagnosis

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床诊断任务中对关键医学信息敏感性不足的问题。现有研究主要关注LLMs对无关上下文的敏感性,而忽视了对影响诊断推理的关键信息的处理能力。论文通过引入不同的扰动策略,评估了包括GPT-3.5、GPT-4、Gemini、Claude3和LLaMA2-7b在内的多种LLMs对关键医学信息的敏感性。结果显示,当前LLMs在保持对关键信息的敏感性方面存在显著局限性。解决方案的关键在于改进LLMs的可靠性,增强其对关键信息的敏感度,并有效利用这些信息,从而提升人类对LLMs的信任,促进其在实际场景中的应用。相关代码和数据集已公开。

链接: https://arxiv.org/abs/2504.13475
作者: Chenwei Yan,Xiangling Fu,Yuxuan Xiong,Tianyi Wang,Siu Cheung Hui,Ji Wu,Xien Liu
机构: School of Computer Science, Beijing University of Posts and Telecommunications (北京邮电大学计算机学院); Key Laboratory of Trustworthy Distributed Computing and Service(BUPT), Ministry of Education (可信分布式计算与服务教育部重点实验室(北邮)); Nanyang Technological University (南洋理工大学); Department of Electronic Engineering, Tsinghua University (清华大学电子工程系); College of AI, Tsinghua University (清华大学人工智能学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance across various domains. However, for clinical diagnosis, higher expectations are required for LLM’s reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at this https URL.
zh
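
下面的草图示意"对关键医学信息施加扰动"的做法:对同一份病例描述分别做否定、删除与数值替换三类扰动,再比较模型在原文与扰动文本上的诊断是否变化。具体扰动策略与关键词均为本文假设,未必与论文一致。

```python
# 草图:对临床描述中的关键信息做扰动,检验模型诊断是否随之变化(扰动方式为假设)
import re

def perturb_key_info(note: str, strategy: str = "negate") -> str:
    """对若干关键医学线索施加扰动;论文中的具体扰动策略可能不同。"""
    if strategy == "negate":      # 把关键症状改为其否定形式
        return re.sub(r"\b(fever|chest pain)\b", r"no \1", note)
    if strategy == "delete":      # 直接删除关键症状
        return re.sub(r"\b(fever|chest pain)\b,?\s*", "", note)
    if strategy == "swap_value":  # 替换关键数值(如体温)
        return re.sub(r"39\.\d", "36.5", note)
    return note

note = "Patient reports fever of 39.2 C and chest pain for two days."
for s in ("negate", "delete", "swap_value"):
    print(s, "->", perturb_key_info(note, s))
# 将原始与扰动后的描述分别送入 LLM,若诊断不变,则说明模型对关键信息不敏感
```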

[NLP-32] CodeVisionary: An Agent -based Framework for Evaluating Large Language Models in Code Generation

【速读】: 该论文旨在解决大型语言模型(LLMs)在代码生成任务中的评估问题。现有评估方法主要分为人本中心法、基于指标法和基于LLM法三类,但这些方法分别存在劳动密集型、过度依赖参考答案以及性能受限等问题。特别是基于LLM的评估方法,尽管具备更强的上下文理解能力和更高的效率,但仍因缺乏多源领域知识和对复杂代码理解不足而受到限制。

为了解决这些问题,论文提出了一种名为CodeVisionary的新框架。CodeVisionary的关键在于其两阶段设计:第一阶段是多源知识分析阶段,通过制定和执行逐步评估计划来收集多源且全面的领域知识;第二阶段是基于协商的评分阶段,让多位评审员参与讨论以更好地理解复杂代码,并就评估分数达成一致。实验结果表明,CodeVisionary在评估LLMs代码生成能力方面表现最佳,在Pearson、Spearman和Kendall-Tau系数上的平均提升分别为0.202、0.139和0.117。此外,该框架还能提供详细的评估报告,帮助开发者发现不足并进行改进。

链接: https://arxiv.org/abs/2504.13472
作者: Xinchen Wang,Pengfei Gao,Chao Peng,Ruida Hu,Cuiyun Gao
机构: Harbin Institute of Technology (哈尔滨工业大学); ByteDance (字节跳动)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities and superior efficiency. However, the performance of LLM-based approaches remains limited due to: (1) lack of multisource domain knowledge, and (2) insufficient comprehension of complex code. To mitigate the limitations, we propose CodeVisionary, the first LLM-based agent framework for evaluating LLMs in code generation. CodeVisionary consists of two stages: (1) Multisource knowledge analysis stage, which aims to gather multisource and comprehensive domain knowledge by formulating and executing a stepwise evaluation plan. (2) Negotiation-based scoring stage, which involves multiple judges engaging in discussions to better comprehend the complex code and reach a consensus on the evaluation score. Extensive experiments demonstrate that CodeVisionary achieves the best performance for evaluating LLMs in code generation, outperforming the best baseline methods with average improvements of 0.202, 0.139, and 0.117 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. Besides, CodeVisionary provides detailed evaluation reports, which assist developers in identifying shortcomings and making improvements. The resources of CodeVisionary are available at this https URL.
zh

[NLP-33] From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLM s

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)框架中存在的成本与性能之间的权衡问题。传统的一阶段直接部署方法虽然有效,但因需要大规模参数以达到满意效果,导致较高的成本和延迟。为应对这一挑战,论文提出了一种三阶段成本高效的端到端LLM部署管道,包括原型设计、知识迁移以及模型压缩。其关键在于通过功能调用驱动的LLM管道构建一个最优性能的原型系统作为教师模型,随后利用拒绝微调、强化学习和知识蒸馏等技术将知识高效迁移至较小规模的学生模型(0.5B),并在最终阶段采用量化和剪枝技术将模型进一步压缩至0.4B,从而实现极低的延迟和成本。这种模块化设计及其跨领域能力表明了该框架在其他自然语言处理领域的潜在适用性。

链接: https://arxiv.org/abs/2504.13471
作者: Jiliang Ni,Jiachen Pu,Zhongyi Yang,Kun Zhou,Hui Wang,Xiaoliang Xiao,Dakui Wang,Xin Li,Jingfeng Luo,Conggang Hu
机构: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have significantly advanced artificial intelligence by optimizing traditional Natural Language Processing (NLP) pipelines, improving performance and generalization. This has spurred their integration into various systems. Many NLP systems, including ours, employ a “one-stage” pipeline directly incorporating LLMs. While effective, this approach incurs substantial costs and latency due to the need for large model parameters to achieve satisfactory outcomes. This paper introduces a three-stage cost-efficient end-to-end LLM deployment pipeline, including prototyping, knowledge transfer, and model compression, to tackle the cost-performance dilemma in LLM-based frameworks. Our approach yields a super tiny model optimized for cost and performance in online systems, simplifying the system architecture. Initially, by transforming complex tasks into a function call-based LLM-driven pipeline, an optimal performance prototype system is constructed to produce high-quality data as a teacher model. The second stage combines techniques like rejection fine-tuning, reinforcement learning and knowledge distillation to transfer knowledge to a smaller 0.5B student model, delivering effective performance at minimal cost. The final stage applies quantization and pruning to compress the model to an extremely small 0.4B, achieving ultra-low latency and cost. The framework’s modular design and cross-domain capabilities suggest potential applicability in other NLP areas.
zh
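
其中第二阶段的知识蒸馏可用标准的"软标签 KL + 硬标签交叉熵"损失来示意(温度 T 与权重 alpha 为假设超参数;论文还结合了拒绝微调与强化学习,此处仅展示蒸馏一项):

```python
# 草图:教师→学生的标准知识蒸馏损失(超参数为假设值,非论文专有实现)
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """软标签 KL 项与硬标签交叉熵项的加权和。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # 温度缩放的标准修正因子
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 100)   # 学生 logits(占位)
t = torch.randn(4, 100)   # 教师 logits(占位)
y = torch.randint(0, 100, (4,))
print(distillation_loss(s, t, y).item())
```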

[NLP-34] D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

【速读】: 该论文致力于解决开放性生成评估中因响应格式不一致导致的难题,以及多选题(Multiple-choice, MC)评估中高质量干扰项(distractor)生成耗时且费力的问题。为应对这些挑战,论文提出了D-GEN,这是首个开源的干扰项生成模型,能够将开放性数据转化为MC格式。解决方案的关键在于引入了两种创新的评估方法:(1) 排序一致性评估,确保生成的干扰项保留真实干扰项的区分能力;(2) 熵分析,比较模型置信度分布与真实值的一致性。实验结果表明,D-GEN在排序一致性(Spearman’s rho 0.99,Kendall’s tau 0.94)和熵分布匹配方面表现优异,并通过人工评估验证了生成干扰项的流畅性、连贯性、干扰性和错误性。这项工作实现了高效且鲁棒的干扰项自动化生成与评估,为MC评估设定了新的标准。

链接: https://arxiv.org/abs/2504.13439
作者: Grace Byun,Jinho Choi
机构: Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman’s rho 0.99, Kendall’s tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.
zh
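
论文提出的两类干扰项质量指标都可以直接用 scipy 计算。下面用虚构的置信度数据演示"排序一致性 + 熵分析"的计算方式:

```python
# 草图:干扰项质量的两类指标(数据为虚构示例)
import numpy as np
from scipy.stats import spearmanr, kendalltau, entropy

# 假设:同一道题上模型对 4 个选项的置信度(真值干扰项 vs 生成干扰项)
conf_gt  = np.array([0.70, 0.15, 0.10, 0.05])   # ground-truth 干扰项下的分布
conf_gen = np.array([0.65, 0.18, 0.12, 0.05])   # 生成干扰项下的分布

# (1) 排序一致性:两种干扰项是否保持相同的区分排序
rho, _ = spearmanr(conf_gt, conf_gen)
tau, _ = kendalltau(conf_gt, conf_gen)

# (2) 熵分析:两个置信度分布的不确定性是否接近
ent_gap = abs(entropy(conf_gt) - entropy(conf_gen))

print(f"Spearman rho={rho:.3f}, Kendall tau={tau:.3f}, |entropy gap|={ent_gap:.4f}")
```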

[NLP-35] Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

【速读】: 该论文旨在解决现有 Retrieval-Augmented Generation (RAG) 系统在企业环境中因检索范围有限及数据安全风险而面临的挑战。具体问题包括:当无法获取相关内部文档时,系统难以生成准确且完整的内容;同时,使用闭源 Large Language Models (LLMs) 可能导致机密信息泄露。为应对这些问题,论文提出了一种名为 Secure Multifaceted-RAG (SecMulti-RAG) 的框架,其关键在于不仅从内部文档中检索,还结合了两个补充来源——预生成的专家知识(针对预期查询)和按需生成的外部 LLM 知识,并通过本地开源生成器和过滤机制确保安全性,从而实现内容完整性、防止数据泄露并降低运行成本。

链接: https://arxiv.org/abs/2504.13425
作者: Grace Byun,Shinsun Lee,Nayoung Choi,Jinho Choi
机构: Emory University (埃默里大学); Hyundai Motor Company (现代汽车公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
zh
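
下面的草图示意 SecMulti-RAG 的调度逻辑:三路知识来源依次补充上下文,且仅当提示通过安全过滤时才调用外部 LLM,最终由本地生成器作答。所有函数均为本文假设的占位实现,过滤规则也是示意性的。

```python
# 草图:三路检索 + 安全过滤的调度逻辑(所有函数均为假设占位)
def search_internal_docs(q): return [f"[internal] doc about {q}"]
def lookup_expert_knowledge(q): return [f"[expert] pre-generated note on {q}"]
def ask_external_llm(q): return [f"[external] knowledge for {q}"]
def local_generator(q, ctx): return f"Answer to '{q}' using {len(ctx)} contexts"

def is_safe(prompt: str) -> bool:
    # 占位:实际可用规则或分类器判断提示是否含机密信息
    return "confidential" not in prompt.lower()

def answer(query: str) -> str:
    contexts = []
    contexts += search_internal_docs(query)      # 1) 内部文档
    contexts += lookup_expert_knowledge(query)   # 2) 预生成的专家知识
    if is_safe(query):
        contexts += ask_external_llm(query)      # 3) 仅在提示安全时调用外部 LLM
    return local_generator(query, contexts)      # 本地开源模型负责最终生成

print(answer("summarize Q3 production report"))
```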

[NLP-36] STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings ICLR2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)预训练过程中未经授权使用数据的问题,特别是针对数据创作者担心其专有数据被用于模型训练而未获得授权或许可的情况。同时,论文也关注基准测试集可能因包含受污染的数据而受到损害的问题。为应对这些挑战,论文提出了一种名为STAMP的框架,用于检测数据集成员资格,即判断某个特定数据集是否出现在LLMs的预训练语料库中。

STAMP的关键在于通过生成多个改写版本(rephrases)来实现数据水印的嵌入,每个改写版本都带有唯一的秘密密钥。其中,一个版本会被公开发布,其余则保持私密状态。随后,数据创建者可以利用配对统计检验方法比较公共与私人版本之间的模型可能性,以此证明数据集成员资格。实验结果表明,STAMP能够在四个基准测试中成功检测到仅占总令牌数不到0.001%,且唯一一次出现在训练数据中的数据片段,其性能优于多种现有的数据污染检测及数据推断基线算法。此外,STAMP确保了原始数据在语义含义和实际应用价值上的完整性,并已在两个真实场景下验证了其有效性,包括确认论文摘要和博客文章是否被纳入LLMs的预训练语料库之中。

链接: https://arxiv.org/abs/2504.13416
作者: Saksham Rastogi,Pratyush Maini,Danish Pruthi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted at DATA-FM, WMark @ ICLR 2025. Project page at see this https URL

点击查看摘要

Abstract:Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and the utility of the original data in comparing different models. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
zh
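
其中"配对统计检验"一步可以用如下草图说明:若公开改写版进入过训练集,被检测模型对它的对数似然应系统性高于配对的私有改写版。数据为虚构,仅演示检验形式:

```python
# 草图:对"公开版 vs 私有版"改写的对数似然做配对 t 检验(数据为虚构)
# 实际中 logp 应来自被检测 LLM 对各改写文本的打分
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_docs = 200
logp_private = rng.normal(-120.0, 8.0, n_docs)               # 私有改写的对数似然
logp_public = logp_private + rng.normal(2.0, 1.0, n_docs)    # 公开版若被训练过,似然系统性偏高

t, p = ttest_rel(logp_public, logp_private, alternative="greater")
print(f"t={t:.2f}, p={p:.2e}")   # p 足够小即可作为数据集成员资格的统计证据
```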

[NLP-37] LangCoop: Collaborative Driving with Language

【速读】: 该论文旨在解决现有多智能体通信方法在带宽需求高、智能体异构性以及信息丢失等方面的局限性问题。为应对这些挑战,论文提出了一种名为LangCoop的新范式,其关键是通过自然语言作为紧凑且表达性强的媒介实现智能体间协作通信。LangCoop的关键创新包括用于结构化零样本视觉-语言推理的Mixture Model Modular Chain-of-thought (M³CoT) 和用于高效打包信息为简洁语言消息的Natural Language Information Packaging (LangPack)。实验结果表明,LangCoop相较于基于图像的通信可实现高达96%的带宽减少(每条消息仅需2KB),同时在闭环评估中保持了具有竞争力的驾驶性能。

链接: https://arxiv.org/abs/2504.13406
作者: Xiangbo Gao,Yuheng Wu,Rujia Wang,Chenxi Liu,Yang Zhou,Zhengzhong Tu
机构: Texas A&M University (德州农工大学); KAIST (韩国科学技术院); University of Utah (犹他大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi-agent communication approaches are hindered by limitations of existing communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter-agent communication. LangCoop features two key innovations: Mixture Model Modular Chain-of-thought (M³CoT) for structured zero-shot vision-language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language-based messages. Through extensive experiments conducted in the CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96% reduction in communication bandwidth (<2KB per message) compared to image-based communication, while maintaining competitive driving performance in the closed-loop evaluation.
zh

[NLP-38] A mean teacher algorithm for unlearning of language models

【速读】: 该论文旨在解决语言模型在实现特定文本实例遗忘的同时保持其整体能力这一挑战,特别是在减少大规模数据集记忆而不显著降低模型效用方面存在的困难。论文的关键解决方案是结合均值教师算法(Mean Teacher Algorithm)与一种新的无学习损失函数“负对数不似然”(Negative Log-Unlikelihood, NLUL)。均值教师算法能够近似慢速自然梯度下降(Slow Natural Gradient Descent, NGD)的轨迹,这种下降方式倾向于寻找低曲率更新以减少对模型效用的损害;而NLUL则通过避免梯度消失问题进一步增强这一过程的有效性。两者的结合提升了MUSE基准测试中某些指标的表现。

链接: https://arxiv.org/abs/2504.13388
作者: Yegor Klochkov
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model’s general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called “negative log-unlikelihood” (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
zh
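
下面给出均值教师更新与 NLUL 的一个直观草图:教师参数对学生参数做指数滑动平均(EMA),遗忘损失近似写成 -log(1 - p(待遗忘 token)) 以避开 -log p 在 p→0 处的梯度消失。NLUL 的精确定义以论文为准,此处仅为示意。

```python
# 草图:均值教师 EMA 更新 + NLUL 的一个直观近似(精确定义以论文为准)
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """教师参数向学生参数缓慢滑动,近似慢速自然梯度下降的低曲率更新轨迹。"""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1 - alpha)

def nlul_loss(logits, forget_tokens):
    """负对数不似然的直观写法:-log(1 - p(待遗忘 token))。"""
    p = F.softmax(logits, dim=-1).gather(-1, forget_tokens.unsqueeze(-1)).squeeze(-1)
    return -torch.log1p(-p.clamp(max=1 - 1e-6)).mean()

teacher, student = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
ema_update(teacher, student)                     # 每个训练步后调用一次

logits = torch.randn(4, 16, 32000)               # (batch, seq, vocab),占位数据
tokens = torch.randint(0, 32000, (4, 16))        # 希望降低概率的目标 token
print(nlul_loss(logits, tokens).item())
```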

[NLP-39] THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

【速读】: 该论文旨在解决生成式推理模型在处理不同难度任务时过度生成无用标记(overthinking)的问题,即这些模型倾向于消耗过多令牌(tokens)而未能有效提升答案准确性。研究引入了衡量问题难度的近似方法,并揭示了问题难度与最优令牌消耗之间的明确关系,同时评估了多种推理模型在高效分配最优令牌数量方面的校准情况。结果显示,大多数推理模型在校准方面表现不佳,尤其是在处理简单问题时。为评估模型在简单问题上的校准能力,作者构建了一个名为DUMB500的数据集,其中包含极其简单的数学、推理、代码和任务问题,并将这些简单示例与现有前沿基准测试中的极难示例进行联合评估。最终,论文提出了THOUGHTTERMINATOR,这是一种无需训练的黑盒解码技术,能够显著改善推理模型的校准性能。关键在于通过引入新的校准评估机制及开发无需训练的解码技术来优化模型在不同难度任务上的令牌使用效率。

链接: https://arxiv.org/abs/2504.13367
作者: Xiao Pu,Michael Saxon,Wenyue Hua,William Yang Wang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle at. However, many are plagued with the problem of overthinking–generating large amounts of unnecessary tokens which don’t improve accuracy on a question. We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists, and evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning model on these simple examples and extremely difficult examples from existing frontier benchmarks on the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
zh
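
"难度-令牌消耗"的校准程度可以按如下思路度量:用多次采样的错误率近似题目难度,再检验它与平均令牌消耗的秩相关。下面是一个用虚构数据的草图:

```python
# 草图:估计题目难度并检验"难度-令牌消耗"的校准程度(数据为虚构)
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_q, n_samples = 50, 16
# 难度近似:对每题采样 n_samples 次,错误率越高视为越难
correct = rng.random((n_q, n_samples)) < np.linspace(0.9, 0.2, n_q)[:, None]
difficulty = 1.0 - correct.mean(axis=1)
# 每题的平均令牌消耗(理想情况下应随难度单调增加)
tokens = 200 + 3000 * difficulty + rng.normal(0, 300, n_q)

rho, _ = spearmanr(difficulty, tokens)
print(f"难度与令牌消耗的 Spearman 相关:{rho:.3f}")  # 相关性低即说明校准不佳
```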

[NLP-40] Cost-of-Pass: An Economic Framework for Evaluating Language Models

【速读】: 该论文试图解决如何评估语言模型在经济价值与推理成本之间的权衡问题,提出了一种基于生产理论的框架,将模型的准确性与推理成本相结合进行综合评价。关键在于引入“通过成本(Cost-of-Pass)”这一指标,即生成正确解的预期货币成本,并定义了“前沿通过成本(Frontier Cost-of-Pass)”,即在可用模型或人类专家中可实现的最小通过成本。通过分析不同任务类型下的模型表现及成本变化趋势,论文揭示了轻量级模型、大型模型和推理模型在基本量化任务、知识密集型任务和复杂量化任务中的成本效益优势,并强调互补的模型层面创新是提升成本效率的主要驱动力,同时指出多数推理阶段技术(如多数投票和自优化)的边际收益难以抵消其成本。

链接: https://arxiv.org/abs/2504.13359
作者: Mehmet Hamza Erol,Batu El,Mirac Suzgun,Mert Yuksekgonul,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at: this https URL

点击查看摘要

Abstract:The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce “cost-of-pass”, the expected monetary cost of generating a correct solution. We then define the “frontier cost-of-pass” as the minimum cost-of-pass achievable across available models or the human expert, using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
zh
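
"通过成本"本身是一个很简单的算式:单次推理成本除以单次通过率;前沿通过成本则是所有可用模型(含人类专家)中的最小值。下面用虚构的价格与通过率演示这一计算:

```python
# 草图:通过成本(cost-of-pass)与前沿通过成本(数字均为虚构)
def cost_of_pass(cost_per_attempt: float, pass_rate: float) -> float:
    """生成一个正确解的期望货币成本 = 单次推理成本 / 单次通过率。"""
    return float("inf") if pass_rate <= 0 else cost_per_attempt / pass_rate

models = {
    "lightweight": {"cost": 0.001, "pass": 0.60},
    "large":       {"cost": 0.020, "pass": 0.85},
    "reasoning":   {"cost": 0.080, "pass": 0.95},
}
cops = {name: cost_of_pass(m["cost"], m["pass"]) for name, m in models.items()}
frontier = min(cops.values())    # 前沿通过成本:所有可用模型中的最小值
for name, v in cops.items():
    print(f"{name}: cost-of-pass = ${v:.4f}")
print(f"frontier cost-of-pass = ${frontier:.4f}")
```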

[NLP-41] Acoustic to Articulatory Inversion of Speech; Data Driven Approaches Challenges Applications and Future Scope

【速读】: 该论文旨在研究基于数据驱动的方法在语音声学-构音逆向映射(Acoustic-to-Articulatory Inversion, AAI)不同应用中的表现。论文重点分析了过去十年(2011-2021年)的相关工作,涉及说话人相关与无关的AAI类型、声学与构音特征之间的关联探索、自动语音识别(ASR)、构音特征空间选择、语言训练辅助框架等目标,并使用多模态语料库(如电磁构音描记法(EMA)、电腭图(EPG)、喉图、电声门图(EGG)、X射线荧光透视、超声波和实时磁共振成像(rtMRI))进行评估。论文的关键在于采用机器学习方法构建模型,并通过相关系数(CC)、均方根误差(RMSE)、均方误差(MSE)及格式化误差(MFE)来评价性能。其核心解决方案在于利用这些方法提供直观且用户友好的构音位置反馈系统,特别是舌头运动轨迹的可视化反馈,从而为病理学对象提供发音、语言及言语治疗支持。

链接: https://arxiv.org/abs/2504.13308
作者: Leena G Pillai,D. Muhammad Noorul Mubarak
机构: University of Kerala (喀拉拉大学); Digital University Kerala (数字大学喀拉拉)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: This is a review paper about Acoustic to Articulatory inversion of speech, presented in an international conference. This paper has 8 pages and 2 figures

点击查看摘要

Abstract:This review is focused on the data-driven approaches applied in different applications of Acoustic-to-Articulatory Inversion (AAI) of speech. This review paper considered the relevant works published in the last ten years (2011-2021). The selection criteria includes (a) type of AAI - Speaker Dependent and Speaker Independent AAI, (b) objectives of the work - Articulatory approximation, Articulatory Feature space selection and Automatic Speech Recognition (ASR), explore the correlation between acoustic and articulatory features, and framework for Computer-assisted language training, (c) Corpus - Simultaneously recorded speech (wav) and medical imaging models such as ElectroMagnetic Articulography (EMA), Electropalatography (EPG), Laryngography, Electroglottography (EGG), X-ray Cineradiography, Ultrasound, and real-time Magnetic Resonance Imaging (rtMRI), (d) Methods or models - recent works are considered, and therefore all the works are based on machine learning, (e) Evaluation - as AAI is a non-linear regression problem, the performance evaluation is mostly done by Correlation Coefficient (CC), Root Mean Square Error (RMSE), and also considered Mean Square Error (MSE), and Mean Format Error (MFE). The practical application of the AAI model can provide a better and user-friendly interpretable image feedback system of articulatory positions, especially tongue movement. Such trajectory feedback system can be used to provide phonetic, language, and speech therapy for pathological subjects.
zh

[NLP-42] Sentiment Analysis on the young peoples perception about the mobile Internet costs in Senegal

【速读】: 该论文试图探讨塞内加尔年轻人对移动互联网价格与服务质量感知之间关系的看法,并分析其情感倾向。解决方案的关键在于通过扫描Twitter和Facebook上的相关评论,应用情感分析模型来汇总和理解他们的总体感受。

链接: https://arxiv.org/abs/2504.13284
作者: Derguene Mbaye,Madoune Robert Seye,Moussa Diallo,Mamadou Lamine Ndiaye,Djiby Sow,Dimitri Samuel Adjanohoun,Tatiana Mbengue,Cheikh Samba Wade,De Roulet Pablo,Jean-Claude Baraka Munyaka,Jerome Chenal
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 19 pages, 14 figures, 10th International Congress on Information and Communication Technology (ICICT 2025)

点击查看摘要

Abstract:Internet penetration rates in Africa are rising steadily, and mobile Internet is getting an even bigger boost with the availability of smartphones. Young people are increasingly using the Internet, especially social networks, and Senegal is no exception to this revolution. Social networks have become the main means of expression for young people. Despite this evolution in Internet access, there are few operators on the market, which limits the alternatives available in terms of value for money. In this paper, we will look at how young people feel about the price of mobile Internet in Senegal, in relation to the perceived quality of the service, through their comments on social networks. We scanned a set of Twitter and Facebook comments related to the subject and applied a sentiment analysis model to gather their general feelings.
zh

[NLP-43] Interpersonal Theory of Suicide as a Lens to Examine Suicidal Ideation in Online Spaces

【速读】: 该论文旨在解决在线社交平台中识别高风险自杀意念(Suicidal Ideation, SI)及其支持响应中存在的理论框架缺失问题。论文的关键在于采用“人际自杀理论”(Interpersonal Theory of Suicide, IPTS)作为分析框架,将Reddit社区r/SuicideWatch中的59,607篇帖子分类为自杀意念的多个维度(如孤独感、缺乏互惠之爱、自我厌恶和脆弱性)及风险因素(如挫败归属感、感知累赘感和获得的自杀能力),以深入理解高危自杀意念的核心特征。此外,研究通过语言学与内容分析探讨了对不同阶段自杀意念帖子的支持性回应模式,并评估了AI聊天机器人在提供有效支持方面的表现。尽管AI在结构化支持方面有所改进,但专家评价表明其在动态、个性化及深层次共情支持方面仍存在不足。这一研究强调了在开发和应用AI驱动的心理健康干预措施时需进行更深入的反思与理论指导的重要性。

链接: https://arxiv.org/abs/2504.13277
作者: Soorya Ram Shimgekar,Violeta J. Rodriguez,Paul A. Bloom,Dong Whi Yoo,Koustuv Saha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Columbia University Irving Medical Center (哥伦比亚大学欧文医学中心), New York State Psychiatric Institute (纽约州精神卫生研究所); Kent State University (肯特州立大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Suicide is a critical global public health issue, with millions experiencing suicidal ideation (SI) each year. Online spaces enable individuals to express SI and seek peer support. While prior research has revealed the potential of detecting SI using machine learning and natural language analysis, a key limitation is the lack of a theoretical framework to understand the underlying factors affecting high-risk suicidal intent. To bridge this gap, we adopted the Interpersonal Theory of Suicide (IPTS) as an analytic lens to analyze 59,607 posts from Reddit’s r/SuicideWatch, categorizing them into SI dimensions (Loneliness, Lack of Reciprocal Love, Self Hate, and Liability) and risk factors (Thwarted Belongingness, Perceived Burdensomeness, and Acquired Capability of Suicide). We found that high-risk SI posts express planning and attempts, methods and tools, and weaknesses and pain. In addition, we also examined the language of supportive responses through psycholinguistic and content analyses to find that individuals respond differently to different stages of Suicidal Ideation (SI) posts. Finally, we explored the role of AI chatbots in providing effective supportive responses to suicidal ideation posts. We found that although AI improved structural coherence, expert evaluations highlight persistent shortcomings in providing dynamic, personalized, and deeply empathetic support. These findings underscore the need for careful reflection and deeper understanding in both the development and consideration of AI-driven interventions for effective mental health support.
zh

[NLP-44] CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在二语教育中教学语法知识评估不足的问题。解决方案的关键在于提出了首个专门设计的基准测试框架CPG-EVAL,该框架包含五个任务,旨在系统性评估LLMs在二语教学环境中的教学语法能力,涵盖语法识别、细粒度语法区分、类别辨别以及抗语言干扰能力。通过这一多层级的评估方法,论文揭示了不同规模模型的优势与局限,并强调了更佳的教学适配性和更严格的基准测试对于指导LLMs在教育领域应用的重要性。

链接: https://arxiv.org/abs/2504.13261
作者: Dong Wang
机构: Waseda University (早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: 12 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Purpose: The rapid emergence of large language models (LLMs) such as ChatGPT has significantly impacted foreign language education, yet their pedagogical grammar competence remains under-assessed. This paper introduces CPG-EVAL, the first dedicated benchmark specifically designed to evaluate LLMs’ knowledge of pedagogical grammar within the context of foreign language instruction. Methodology: The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference. Findings: Smaller-scale models can succeed in single language instance tasks, but struggle with multiple instance tasks and interference from confusing instances. Larger-scale models show better resistance to interference but still have significant room for accuracy improvement. The evaluation indicates the need for better instructional alignment and more rigorous benchmarks, to effectively guide the deployment of LLMs in educational contexts. Value: This study offers the first specialized, theory-driven, multi-tiered benchmark framework for systematically evaluating LLMs’ pedagogical grammar competence in Chinese language teaching contexts. CPG-EVAL not only provides empirical insights for educators, policymakers, and model developers to better gauge AI’s current abilities in educational settings, but also lays the groundwork for future research on improving model alignment, enhancing educational suitability, and ensuring informed decision-making concerning LLM integration in foreign language instruction.
zh

[NLP-45] ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLM s

【速读】: 该论文旨在解决部署大量任务特定大语言模型时的资源挑战(通过压缩增量模型(delta model)参数来缓解),并提出了一种名为ImPart的新型重要性感知增量稀疏化(importance-aware delta sparsification)方法。现有方法要么随机移除参数,要么在奇异值分解(Singular Value Decomposition, SVD)后直接截断奇异向量,但这些方法要么完全忽略参数的重要性,要么以过于粗粒度的方式评估其重要性。ImPart的关键创新在于利用SVD动态调整不同奇异向量的稀疏化比率,基于其重要性保留关键的任务特定知识,即使在高稀疏化比率下也能有效工作。实验表明,ImPart在相同性能水平下实现了比基线方法高出2倍的压缩比,并在与现有方法结合时,在delta量化和模型合并方面达到了新的性能高度。

链接: https://arxiv.org/abs/2504.13237
作者: Yan Yang,Yixia Li,Hongru Wang,Xuetao Wei,Jianqiao Yu,Yun Chen,Guanhua Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Southern University of Science and Technology (南方科技大学); Harbin Institute of Technology (哈尔滨工业大学 (深圳)); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating a 2× higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
zh
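
ImPart 的核心思想可以用如下 PyTorch 草图示意:对增量权重做 SVD,按奇异值大小为各奇异向量分配不同的保留比例,再按幅值稀疏化右奇异向量后重构。其中保留比例的线性插值方式为本文假设,并非官方实现。

```python
# 草图:重要性感知的增量稀疏化(保留比例的插值方式为本文假设)
import torch

def importance_aware_sparsify(delta: torch.Tensor, keep_hi=0.9, keep_lo=0.1):
    """delta: 微调权重与基座权重之差。奇异值越大的方向保留越多分量。"""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    ratios = keep_lo + (keep_hi - keep_lo) * (S / S.max())   # 每个奇异向量的保留比例
    for i in range(S.numel()):
        k = max(1, int(ratios[i] * Vh.shape[1]))             # 该行保留的元素个数
        mask = torch.zeros_like(Vh[i])
        mask[Vh[i].abs().topk(k).indices] = 1.0
        Vh[i] = Vh[i] * mask                                 # 按幅值稀疏化右奇异向量
    return (U * S) @ Vh                                      # 重构稀疏化后的增量

delta = torch.randn(256, 256)
approx = importance_aware_sparsify(delta)
print(f"相对重构误差: {(approx - delta).norm() / delta.norm():.3f}")
```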

[NLP-46] DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

【速读】: 该论文旨在解决多领域数据集上的领域级采样策略优化问题,现有方法在保持领域内一致性及准确衡量领域影响方面存在不足。论文提出Domain Impact-aware Data Sampling (DIDS),其关键是通过梯度聚类算法确保领域内一致性,同时采用代理语言模型与降维技术降低计算开销;开发基于Fisher Information Matrix (FIM) 的度量方法以精准量化特定领域参数更新对下游任务输出分布的影响,并结合FIM指导的领域影响评估与损失学习轨迹确定最优采样比例,同时考虑边际收益递减效应。实验表明,DIDS在提升平均性能的同时保持了相当的训练效率。

链接: https://arxiv.org/abs/2504.13227
作者: Weijie Shi,Jipeng Zhang,Yaguang Wu,Jingzhi Fang,Ruiyuan Zhang,Jiajie Xu,Jia Zhu,Hao Chen,Yao Zhao,Sirui Han,Xiaofang Zhou
机构: The Hong Kong University of Science and Technology (香港科技大学); MetaX; Soochow University (苏州大学); Zhejiang Normal University (浙江师范大学); Tencent Inc. (腾讯); Alibaba Inc. (阿里巴巴)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model’s output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.
zh
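
其中"FIM 引导的领域影响度量"可以从对角 Fisher 信息的经验估计入手:对某领域的批次数据计算负对数似然梯度的平方。下面以一个占位的线性模型演示这一估计(实际应在代理语言模型上进行):

```python
# 草图:用梯度平方近似 Fisher 信息矩阵对角,衡量某领域数据对参数的影响
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # 占位:实际应为(代理)语言模型
loss_fn = nn.CrossEntropyLoss()

def fisher_diag(model, batch_x, batch_y):
    """对一个领域的批次数据估计 FIM 对角:E[(∂log p/∂θ)²]。"""
    model.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)   # 负对数似然
    loss.backward()
    return torch.cat([p.grad.detach().pow(2).flatten() for p in model.parameters()])

x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
fim = fisher_diag(model, x, y)
print(f"FIM 对角元素数:{fim.numel()},均值:{fim.mean():.2e}")
# 不同领域的 FIM 再与下游任务的梯度信息对齐,即可得到领域影响分数,进而调整采样比例
```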

[NLP-47] Sustainability via LLM Right-sizing

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在组织工作流中部署时面临的性能、成本、能源消耗及数据主权等多维度权衡问题。具体而言,论文关注如何评估模型是否“足够好”以满足实际应用场景需求,而非单纯追求最先进的性能表现。论文的关键在于提出了一种基于双LLM的自动化评估框架,通过标准化十项涵盖输出质量、事实准确性及伦理责任的评价指标,系统性地比较了包括GPT-4o、Gemma-3和Phi-4在内的多种模型在十个日常职业任务中的表现。研究揭示了不同模型组别之间的权衡关系,并强调任务类型对模型效果的影响,最终倡导从单纯追求性能最大化的基准测试转向更贴合实际应用场景的任务适配性和情境意识评估方法,为负责任的LLMs部署提供可操作的指导。

链接: https://arxiv.org/abs/2504.13217
作者: Jennifer Haase,Finn Klessascheck,Jan Mendling,Sebastian Pokutta
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 Figures, 6 Tables

点击查看摘要

Abstract:Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model “good enough”? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups – premium all-rounders, competent generalists, and limited but safe performers – highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.
zh

[NLP-48] KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding

【速读】: 该论文试图解决现有英语为中心的大语言模型(Large Language Models, LLMs)评估基准在韩语金融领域的局限性问题。解决方案的关键在于构建KFinEval-Pilot,这是一个包含超过1,000个精心策划的问题的数据集,覆盖金融知识、法律推理和金融毒性三个关键领域。该基准通过半自动化的工作流构建,结合GPT-4生成的提示与专家验证,确保领域相关性和事实准确性。通过评估多种代表性LLMs,研究揭示了不同模型家族在任务准确性和输出安全性之间的权衡,强调了在高风险金融应用中使用LLMs时持续存在的推理和安全性挑战。

链接: https://arxiv.org/abs/2504.13216
作者: Bokwang Hwang,Seonkyu Lim,Taewoong Kim,Yongjae Geun,Sunghyun Bang,Sohyun Park,Jihyun Park,Myeonggyu Lee,Jinwoo Lee,Yerin Kim,Jinsun Yoo,Jingyeong Hong,Jina Park,Yongchan Kim,Suhyun Kim,Younggyun Hahm,Yiseul Lee,Yejee Kang,Chanhyuk Yoon,Chansu Lee,Heeyewon Jeong,Jiyeon Lee,Seonhye Gu,Hyebin Kang,Yousang Cho,Hangyeol Yoo,KyungTae Lim
机构: Korea Financial Telecommunications and Clearings Institute (韩国金融电信和清算研究所); Teddysum Inc. (泰迪熊公司); SELECTSTAR Inc. (星选公司); Konyang University (康洋大学); Seoultech (首尔科技学院); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.
zh

[NLP-49] X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

【速读】: 该论文试图解决多轮对话中语言模型(Language Models, LMs)的安全风险问题,特别是有害意图通过策略性分散在多轮交互中传播的风险。当前大多数研究集中于单轮安全,而多轮红队测试(red-teaming)仍面临适应性和多样性等关键挑战。论文提出的关键解决方案是X-Teaming框架,它系统性地探索看似无害的交互如何演变为有害结果,并生成相应的攻击场景。X-Teaming通过协作代理进行规划、攻击优化和验证,实现了最先进的多轮越狱(jailbreak)效果与多样性,在代表性开放权重和闭源模型上的成功率高达98.1%,尤其针对Claude 3.7 Sonnet模型达到了96.2%的成功率。基于此,论文进一步引入XGuard-Train数据集,用于增强语言模型的多轮安全性。

链接: https://arxiv.org/abs/2504.13203
作者: Salman Rahman,Liwei Jiang,James Shiffer,Genglin Liu,Sheriff Issaka,Md Rizwan Parvez,Hamid Palangi,Kai-Wei Chang,Yejin Choi,Saadia Gabriel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
zh

[NLP-50] he Quantum LLM : Modeling Semantic Spaces with Quantum Principles

【速读】: 该论文试图解决的问题是如何以一种新的视角理解大型语言模型(Large Language Models, LLMs)中的语义表示、交互及动态过程,并验证基于量子启发框架研究语义空间的有效性。论文的关键在于提出并详细阐述了六个核心原则,这些原则指导了LLMs中语义表示、交互及其动力学的构建。通过这一量子启发框架,论文不仅提供了对LLMs信息处理和响应生成的深刻洞察,还探讨了利用量子计算进一步开发更强大且高效的LLMs的可能性。

链接: https://arxiv.org/abs/2504.13202
作者: Timo Aukusti Laine
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:In the previous article, we presented a quantum-inspired framework for modeling semantic representation and processing in Large Language Models (LLMs), drawing upon mathematical tools and conceptual analogies from quantum mechanics to offer a new perspective on these complex systems. In this paper, we clarify the core assumptions of this model, providing a detailed exposition of six key principles that govern semantic representation, interaction, and dynamics within LLMs. The goal is to justify that a quantum-inspired framework is a valid approach to studying semantic spaces. This framework offers valuable insights into their information processing and response generation, and we further discuss the potential of leveraging quantum computing to develop significantly more powerful and efficient LLMs based on these principles.
zh

[NLP-51] BASIR: Budget-Assisted Sectoral Impact Ranking – A Dataset for Sector Identification and Performance Prediction Using Language Models

【速读】: 本文旨在解决政府财政政策(尤其是年度联盟预算)对特定行业股票表现的实时影响分析所面临的挑战。现有方法在多标签分类以及基于预算公告后行业表现预测排名方面存在困难且研究较少。为应对这一挑战,论文提出了一种名为BASIR(基于预算的行业影响排名)的框架,通过构建标注数据集映射预算文本片段到行业影响,并结合微调嵌入用于行业识别及语言模型进行性能排序,从而系统性地识别和排名可能从印度联盟预算公告中受益的行业。关键在于采用从1947年至2025年的全面印度联盟预算文本语料库开发的BASIR数据集,实现了0.605的F1得分(行业分类)和0.997的NDCG得分(预测行业排名),为投资者和决策者提供了量化财政政策影响的数据驱动洞见,弥补了手动分析中的重要空白。

链接: https://arxiv.org/abs/2504.13189
作者: Sohom Ghosh,Sudip Kumar Naskar
机构: Jadavpur University (贾达普大学)
类目: Computation and Language (cs.CL); Statistical Finance (q-fin.ST)
备注: The codes and the datasets can be accessed from this https URL

点击查看摘要

Abstract:Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India’s Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts. Our architecture incorporates fine-tuned embeddings for sector identification, coupled with language models that rank sectors based on their predicted performances. Our results demonstrate 0.605 F1-score in sector classification, and 0.997 NDCG score in predicting ranks of sectors based on post-budget performances. The methodology enables investors and policymakers to quantify fiscal policy impacts through structured, data-driven insights, addressing critical gaps in manual analysis. The annotated dataset has been released under CC-BY-NC-SA-4.0 license to advance computational economics research.
zh
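
文中用于排名评估的 NDCG 可直接用 scikit-learn 计算。下面以虚构的行业相关性与模型打分演示:

```python
# 草图:用 sklearn 计算行业排名预测的 NDCG(数据为虚构)
import numpy as np
from sklearn.metrics import ndcg_score

# 真实相关性:预算公布后各行业的实际表现打分(越高越好)
y_true = np.array([[3, 2, 3, 0, 1, 2]])
# 模型为各行业给出的排序分数
y_pred = np.array([[0.9, 0.7, 0.8, 0.1, 0.3, 0.6]])

print(f"NDCG = {ndcg_score(y_true, y_pred):.3f}")
```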

[NLP-52] Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis

【速读】: 该论文旨在评估五种主流大型语言模型(Large Language Models, LLMs)——Chat GPT 4o、Copilot Pro、Gemini Advanced、Claude Pro 和 Meta AI 在解决微积分求导问题上的性能表现。研究通过涵盖 13 种基础问题类型的系统性交叉评估框架来实现这一目标,其中每个模型均需解答由所有模型生成的问题。研究的关键在于采用这种交叉评估方法,以揭示不同模型在解决问题时的能力差异及局限性。结果显示,这些模型在程序化求导任务上表现出色,但在概念理解与代数操作方面存在显著差异,尤其是在涉及函数单调区间判定与优化应用题的问题上表现欠佳。此外,通过交叉评估矩阵发现,Claude Pro 生成的问题难度最大,这表明问题生成能力与求解能力之间可能存在差异。论文的核心贡献在于明确了 LLMs 在微积分学习工具中的潜力与限制,强调了其在程序化能力上的优势,同时指出其在概念理解方面仍远不及人类数学推理水平,从而突显了人工指导在深化数学理解中的持续重要性。

链接: https://arxiv.org/abs/2504.13187
作者: In Hak Moon
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents a comprehensive evaluation of five leading large language models (LLMs) - Chat GPT 4o, Copilot Pro, Gemini Advanced, Claude Pro, and Meta AI - on their performance in solving calculus differentiation problems. The investigation assessed these models across 13 fundamental problem types, employing a systematic cross-evaluation framework where each model solved problems generated by all models. Results revealed significant performance disparities, with Chat GPT 4o achieving the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%). All models excelled at procedural differentiation tasks but showed varying limitations with conceptual understanding and algebraic manipulation. Notably, problems involving increasing/decreasing intervals and optimization word problems proved most challenging across all models. The cross-evaluation matrix revealed that Claude Pro generated the most difficult problems, suggesting distinct capabilities between problem generation and problem-solving. These findings have significant implications for educational applications, highlighting both the potential and limitations of LLMs as calculus learning tools. While they demonstrate impressive procedural capabilities, their conceptual understanding remains limited compared to human mathematical reasoning, emphasizing the continued importance of human instruction for developing deeper mathematical comprehension.
zh

计算机视觉

[CV-0] Outlier-Robust Multi-Model Fitting on Quantum Annealers CVPR2025 ATC

【速读】:该论文旨在解决多模型拟合(Multi-model Fitting, MMF)任务中的组合性挑战,特别是在存在离群点(outliers)的情况下。传统方法在处理多模型场景时通常局限于无噪声数据集或需要先验知识,而量子计算虽有潜力解决NP难问题,但现有基于量子的方法要么仅适用于单模型,要么无法有效应对离群点。
解决方案的关键在于提出了一种鲁棒的量子多模型拟合算法(Robust Quantum Multi-Model Fitting, R-QuMF)。该算法利用量子硬件的固有能力,将问题形式化为绝热量子计算机(Adiabatic Quantum Computers, AQC)的最大集覆盖任务,从而无需提前知晓模型的确切数量即可实现高效求解,显著提升了实际应用中的适用性与性能表现。实验结果表明,R-QuMF在合成及真实世界3D数据集上均优于现有量子技术,验证了量子计算在处理复杂MMF问题中的潜力,尤其是在包含噪声和离群点的真实场景中表现出色。

链接: https://arxiv.org/abs/2504.13836
作者: Saurabh Pandey,Luca Magri,Federica Arrigoni,Vladislav Golyanik
机构: MPI for Informatics (马克斯·普朗克计算机科学研究所); Saarland University (萨尔州大学); Politecnico di Milano (米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025 Workshop “Image Matching: Local Features Beyond”

点击查看摘要

Abstract:Multi-model fitting (MMF) presents a significant challenge in Computer Vision, particularly due to its combinatorial nature. While recent advancements in quantum computing offer promise for addressing NP-hard problems, existing quantum-based approaches for model fitting are either limited to a single model or consider multi-model scenarios within outlier-free datasets. This paper introduces a novel approach, the robust quantum multi-model fitting (R-QuMF) algorithm, designed to handle outliers effectively. Our method leverages the intrinsic capabilities of quantum hardware to tackle combinatorial challenges inherent in MMF tasks, and it does not require prior knowledge of the exact number of models, thereby enhancing its practical applicability. By formulating the problem as a maximum set coverage task for adiabatic quantum computers (AQC), R-QuMF outperforms existing quantum techniques, demonstrating superior performance across various synthetic and real-world 3D datasets. Our findings underscore the potential of quantum computing in addressing the complexities of MMF, especially in real-world scenarios with noisy and outlier-prone data.
zh
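
论文把 MMF 形式化为最大集覆盖问题并交给绝热量子计算机求解;为说明这一形式化本身(而非量子求解过程),下面给出经典贪心近似的草图:每轮选出能解释最多未覆盖点的模型假设,未被任何假设覆盖的点即视为离群点。数据为虚构的玩具示例。

```python
# 草图:最大集覆盖视角下的多模型选择(经典贪心,仅示意问题形式化,非量子算法)
def greedy_max_coverage(hypotheses: dict, budget: int):
    """hypotheses: {模型假设: 其内点集合}。每轮取覆盖最多新点的假设。"""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(hypotheses, key=lambda h: len(hypotheses[h] - covered), default=None)
        if best is None or not (hypotheses[best] - covered):
            break                         # 再无新点可覆盖,提前停止
        covered |= hypotheses[best]
        chosen.append(best)
    return chosen, covered

hyps = {"line_A": {0, 1, 2, 3}, "line_B": {3, 4, 5}, "line_C": {6, 7}}
models, inliers = greedy_max_coverage(hyps, budget=2)
print(models, sorted(inliers))            # 未被覆盖的点被视为离群点
```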

[CV-1] CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning CVPR2025

【速读】:本文旨在解决构建自监督世界模型(Self-Supervised World Model)以有效编码放射影像领域中医学知识的问题。现有研究虽已探索利用自监督学习方法建立通用机器学习模型,但针对放射影像的专门化世界模型尚未有系统性尝试。为填补这一空白,论文提出CheXWorld框架,其关键在于同时建模三个至关重要的医学知识维度:1)局部解剖结构,用于描述局部组织的细微特征;2)全局解剖布局,用于刻画人体的整体组织架构;3)领域变化,促使模型学习不同成像条件(如清晰度、对比度和曝光差异)下的表观转换。通过这些设计,CheXWorld不仅在定性和定量分析中验证了对上述医学知识的成功捕获,还通过跨八个医学图像分类与分割基准的任务展示了显著优于现有自监督学习方法及大规模医学基础模型的性能表现。

链接: https://arxiv.org/abs/2504.13820
作者: Yang Yue,Yulin Wang,Chenxin Tao,Pan Liu,Shiji Song,Gao Huang
机构: Tsinghua University (清华大学); PLA General Hospital (中国人民解放军总医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code and pre-trained models are available at this https URL.
zh

[CV-2] RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud Completion

【速读】:本文旨在解决无配对点云补全任务中的挑战,即利用未经过真实标签(ground truth)训练的模型完成部分点云。现有方法通常针对特定类别设计(class-aware),需要为每种类别训练独立模型,这限制了它们在通用3D物体点云多样化场景下的泛化能力。为克服这些问题,本文提出了一种新的无配对点云补全框架——Reference-guided Completion (RefComp) 框架,其在类别感知与类别无关的训练设置下均表现出色。

解决方案的关键在于将无配对点云补全问题转化为形状翻译问题,并在部分点云的潜在特征空间中求解。为此,引入了通过以待补全的部分点云作为模板检索得到的部分-完整点云配对数据,这些配对数据作为参考信息引导补全过程。RefComp 框架采用共享参数的参考分支和目标分支,在Latent Shape Fusion Module (LSFM) 中实现形状融合与形状翻译,从而增强补全流程中的结构特征。实验表明,该框架不仅在类别感知训练设置中达到最先进的性能,还在类别无关训练设置下于虚拟扫描和真实世界数据集上取得具有竞争力的结果。

链接: https://arxiv.org/abs/2504.13788
作者: Yixuan Yang,Jinyu Yang,Zixiang Zhao,Victor Sanchez,Feng Zheng
机构: Southern University of Science and Technology (南方科技大学), Shenzhen, China; University of Warwick (华威大学), Coventry, U.K.; Tapall.ai; Photogrammetry and Remote Sensing, ETH Zürich (瑞士苏黎世联邦理工学院摄影测量与遥感研究所), 8093 Zürich, Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.
zh

[CV-3] Learning Through Retrospection: Improving Trajectory Prediction for Automated Driving with Error Feedback

【速读】:该论文致力于解决现有自动驾驶轨迹预测模型在推理过程中无法修正错误并重复犯错的问题。传统方法将轨迹预测视为基于观测信息的单一任务,每次预测独立进行,缺乏对历史预测的回顾与校正能力。为了解决这一局限,论文提出了一种新颖的回顾技术(retrospection technique)。其关键是通过闭环滚动训练(closed-loop rollouts),使模型能够学习利用累积反馈,在接收到新观测数据时回顾先前的预测结果并分析误差,从而优化后续预测的质量。这种方法使得模型能够在推理阶段学会修正系统性误差,显著提升了预测精度。实验结果显示,在nuScenes和Argoverse数据集上,该方法相比无回顾机制的最先进基线模型,最小平均位移误差(Minimum Average Displacement Error)降低了多达31.9%,同时展示了更好的处理分布外场景的能力。

链接: https://arxiv.org/abs/2504.13785
作者: Steffen Hagedorn,Aron Distelzweig,Marcel Hallgarten,Alexandru P. Condurache
机构: Robert Bosch GmbH (罗伯特博世有限公司); Institute for Neuro- and Bioinformatics, University of Lübeck (神经与生物信息学研究所,吕贝克大学); Department of Computer Science, University of Freiburg (弗赖堡大学计算机科学系); Cognitive Systems Group, University of Tübingen (认知系统小组,图宾根大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In automated driving, predicting trajectories of surrounding vehicles supports reasoning about scene dynamics and enables safe planning for the ego vehicle. However, existing models handle predictions as an instantaneous task of forecasting future trajectories based on observed information. As time proceeds, the next prediction is made independently of the previous one, which means that the model cannot correct its errors during inference and will repeat them. To alleviate this problem and better leverage temporal data, we propose a novel retrospection technique. Through training on closed-loop rollouts the model learns to use aggregated feedback. Given new observations it reflects on previous predictions and analyzes its errors to improve the quality of subsequent predictions. Thus, the model can learn to correct systematic errors during inference. Comprehensive experiments on nuScenes and Argoverse demonstrate a considerable decrease in minimum Average Displacement Error of up to 31.9% compared to the state-of-the-art baseline without retrospection. We further showcase the robustness of our technique by demonstrating a better handling of out-of-distribution scenarios with undetected road-users.
zh

[CV-4] Fighting Fires from Space: Leveraging Vision Transformers for Enhanced Wildfire Detection and Characterization

【速读】:该论文试图解决现代 wildfire 检测系统在应对持续性高强度野火季节时能力不足的问题。论文的关键解决方案在于探索 Vision Transformers (ViTs) 在卫星图像 wildfire 检测中的应用,与传统的 Convolutional Neural Networks (CNNs) 进行对比评估。研究发现,ViTs 能够有效利用局部和全局上下文信息,在特定数据集上优于传统 CNN 模型,提升了检测准确性(0.92% 的性能提升)。然而,基于 CNN 的 U-Net 实现在各项指标中表现最佳,展现了其在图像任务中的持续实用性。总体而言,ViTs 和 CNNs 在 wildfire 检测任务中具有相当的能力,但经过良好调优的 CNNs,特别是优化后的 U-Net,仍是最优选择,其 Intersection over Union (IoU) 达到 93.58%,较基准模型提升了 4.58%。

链接: https://arxiv.org/abs/2504.13776
作者: Aman Agarwal,James Gearon,Raksha Rank,Etienne Chenevert
机构: Indiana University (印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Wildfires are increasing in intensity, frequency, and duration across large parts of the world as a result of anthropogenic climate change. Modern hazard detection and response systems that deal with wildfires are under-equipped for sustained wildfire seasons. Recent work has shown that automated wildfire detection using Convolutional Neural Networks (CNNs) trained on satellite imagery is capable of high-accuracy results. However, CNNs are computationally expensive to train and only incorporate local image context. Recently, Vision Transformers (ViTs) have gained popularity for their efficient training and their ability to include both local and global contextual information. In this work, we show that ViTs can outperform well-trained and specialized CNNs to detect wildfires on a previously published dataset of LandSat-8 imagery. One of our ViTs outperforms the baseline CNN comparison by 0.92%. However, we find our own implementation of a CNN-based UNet to perform best in every category, showing their sustained utility in image tasks. Overall, ViTs are comparably capable of detecting wildfires as CNNs, though well-tuned CNNs are still the best technique for detecting wildfires, with our UNet providing an IoU of 93.58%, better than the baseline UNet by some 4.58%.
zh
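
上文摘要以 IoU(Intersection over Union)作为分割性能指标(UNet 达到 93.58%)。为便于理解该指标的定义,下面给出计算二值分割掩码 IoU 的最小 Python 示意,仅用于说明指标含义,与论文实现无关。

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """二值掩码的交并比(IoU):1 表示火点像素。"""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:  # 双方均为空掩码时约定 IoU 为 1
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0]])
target = np.array([[0, 1, 1, 0], [0, 1, 0, 0]])
print(f"IoU = {iou(pred, target):.2f}")  # 0.75
```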

[CV-5] Decoding Vision Transformers: the Diffusion Steering Lens CVPR2025

【速读】:该论文旨在解决在视觉变换器(Vision Transformer, ViT)中使用传统方法(如Logit Lens)无法充分捕捉视觉表示丰富性的问题。此外,虽然Diffusion Lens可以有效可视化图像编码器中的残差流表示,但其无法捕获单个子模块的直接贡献。论文的关键解决方案是提出一种新的免训练(training-free)方法——Diffusion Steering Lens (DSL),通过引导子模块输出并调整后续的间接贡献,从而提供ViT内部处理过程的直观且可靠的解释。

链接: https://arxiv.org/abs/2504.13763
作者: Ryota Takatsuki,Sonia Joseph,Ippei Fujisawa,Ryota Kanai
机构: Araya Inc. (阿赖亚公司); AI Alignment Network (人工智能对齐网络); The University of Tokyo (东京大学); Mila - Quebec AI Institute (麦吉尔大学魁北克人工智能研究所); McGill University (麦吉尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 17 figures. Accepted to the CVPR 2025 Workshop on Mechanistic Interpretability for Vision (MIV)

点击查看摘要

Abstract:Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024), who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose the Diffusion Steering Lens (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.
zh

[CV-6] Fragile Watermarking for Image Certification Using Deep Steganographic Embedding

【速读】:该论文旨在解决现代身份验证系统中基于国际民航组织(ICAO)标准的生物识别文档(如电子护照)面部图像在发行后可能遭受无意退化或恶意篡改的问题,这些变化可能导致面部识别系统的欺骗。论文的关键解决方案是提出一种基于深度隐写嵌入的脆弱水印技术,作为一种主动机制来认证符合ICAO标准的面部图像的真实性。通过在发行时将隐藏图像嵌入官方照片中,建立了对任何发行后修改敏感的完整性标记,并评估了一系列图像操作对恢复的隐藏图像的影响,显示退化伪影可作为稳健的法医学线索。此外,还提出了一种分类框架,通过分析揭示的内容来检测和分类所应用的操作类型。实验结果表明,该方法具有高检测精度,包括跨方法场景中的多个基于深度隐写的模型。这些发现支持了通过隐写嵌入实现的脆弱水印作为生物识别文档完整性验证的有效工具的可行性。

链接: https://arxiv.org/abs/2504.13759
作者: Davide Ghiani,Jefferson David Rodriguez Chivata,Stefano Lilliu,Simone Maurizio La Cava,Marco Micheletto,Giulia Orrù,Federico Lama,Gian Luca Marcialis
机构: University of Cagliari (萨萨里大学); Dedem S.p.A. (Dedem股份公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern identity verification systems increasingly rely on facial images embedded in biometric documents such as electronic passports. To ensure global interoperability and security, these images must comply with strict standards defined by the International Civil Aviation Organization (ICAO), which specify acquisition, quality, and format requirements. However, once issued, these images may undergo unintentional degradations (e.g., compression, resizing) or malicious manipulations (e.g., morphing) and deceive facial recognition systems. In this study, we explore fragile watermarking, based on deep steganographic embedding as a proactive mechanism to certify the authenticity of ICAO-compliant facial images. By embedding a hidden image within the official photo at the time of issuance, we establish an integrity marker that becomes sensitive to any post-issuance modification. We assess how a range of image manipulations affects the recovered hidden image and show that degradation artifacts can serve as robust forensic cues. Furthermore, we propose a classification framework that analyzes the revealed content to detect and categorize the type of manipulation applied. Our experiments demonstrate high detection accuracy, including cross-method scenarios with multiple deep steganography-based models. These findings support the viability of fragile watermarking via steganographic embedding as a valuable tool for biometric document integrity verification.
zh
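
论文使用的是深度隐写模型来嵌入隐藏图像;作为原理示意,这里用经典的 LSB(最低有效位)隐写替代深度编码器,说明"发行后任何修改都会扰动嵌入信息"这一脆弱水印思想。注意该替代方法并非论文所用技术,仅为概念演示。

```python
import numpy as np

def embed_lsb(cover: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """将比特序列写入像素最低有效位(深度隐写编码器的简化替身)。"""
    flat = cover.flatten().astype(np.uint8)
    flat[: len(bits)] = (flat[: len(bits)] & 0xFE) | bits
    return flat.reshape(cover.shape)

def extract_lsb(stego: np.ndarray, n: int) -> np.ndarray:
    """读取前 n 个像素的最低有效位;若图像被压缩或篡改,结果将明显受损。"""
    return stego.flatten().astype(np.uint8)[:n] & 1

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(8, 8))
bits = rng.integers(0, 2, size=16).astype(np.uint8)
stego = embed_lsb(cover, bits)
assert (extract_lsb(stego, 16) == bits).all()  # 未经篡改时可完整恢复
```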

[CV-7] Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis

【速读】:该论文旨在解决神经母细胞瘤病理切片图像分类中因主观手动诊断导致的准确性不一致问题,以及现有自动化方法存在的可解释性差、特征提取能力有限和计算成本高的局限。论文的关键解决方案是提出CMSwinKAN模型,这是一种基于对比学习的多尺度特征融合方法,通过在Swin Transformer架构中引入Kernel Activation Network,显著提升了模型的可解释性和分类准确性。此外,模型通过融合多尺度特征和采用对比学习策略,模仿临床医生的综合分析方式,有效捕捉组织的全局和局部特性。同时,论文还设计了一种基于临床洞察的启发式软投票机制,将局部预测无缝整合到整体图像级别分类中。

链接: https://arxiv.org/abs/2504.13754
作者: Zhu Zhu,Shuo Jiang,Jingyuan Zheng,Yawen Li,Yifei Chen,Manli Zhao,Weizhong Gu,Feiwei Qin,Jinhu Wang,Gang Yu
机构: Children’s Hospital, Zhejiang University School of Medicine (浙江大学医学院附属儿童医院); Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14pages, 8 figures

点击查看摘要

Abstract:Neuroblastoma, adrenal-derived, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent accuracy. Existing automated whole slide image classification methods encounter challenges such as poor interpretability, limited feature extraction capabilities, and high computational costs, restricting their practical clinical deployment. To overcome these limitations, we propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification, which enhances the Swin Transformer architecture by integrating a Kernel Activation Network within its multilayer perceptron and classification head modules, significantly improving both interpretability and accuracy. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians’ comprehensive approach, effectively capturing global and local tissue characteristics. Additionally, we introduce a heuristic soft voting mechanism guided by clinical insights to seamlessly bridge patch-level predictions to whole slide image-level classifications. We validate CMSwinKAN on the PpNTs dataset, which was collaboratively established with our partner hospital and the publicly accessible BreakHis dataset. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets. Our source code is available at this https URL.
zh

[CV-8] DAM-Net: Domain Adaptation Network with Micro-Labeled Fine-Tuning for Change Detection

【速读】:该论文旨在解决遥感影像变化检测(Change Detection, CD)领域中深度学习方法在跨数据集应用时面临的域适应性差的问题。当前方法在应用于新场景时,需要大量标注数据进行重新训练,这限制了其实际应用范围。为了解决这一问题,论文提出了一种名为DAM-Net的域适应网络,其关键在于结合对抗域适应策略和微标注精细调优方法。通过设计专门的分割判别器以及交替训练策略实现跨域的有效迁移,并通过选择性标注极少量样本(小于1%)来增强域适应能力。此外,网络还引入了多时相Transformer用于特征融合,并优化了主干结构。实验结果表明,DAM-Net不仅显著优于现有的域适应方法,而且仅使用0.3%的标注样本即可达到与需10%标注数据的半监督方法相当的性能。这一方案为遥感领域的高效域适应提供了新的思路。

链接: https://arxiv.org/abs/2504.13748
作者: Hongjia Chen,Xin Xu,Fangling Pu
机构: Collaborative Sensing Laboratory, Electronic Information School, Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Change detection (CD) in remote sensing imagery plays a crucial role in various applications such as urban planning, damage assessment, and resource management. While deep learning approaches have significantly advanced CD performance, current methods suffer from poor domain adaptability, requiring extensive labeled data for retraining when applied to new scenarios. This limitation severely restricts their practical applications across different datasets. In this work, we propose DAM-Net: a Domain Adaptation Network with Micro-Labeled Fine-Tuning for CD. Our network introduces adversarial domain adaptation to CD, utilizing a specially designed segmentation-discriminator and alternating training strategy to enable effective transfer between domains. Additionally, we propose a novel Micro-Labeled Fine-Tuning approach that strategically selects and labels a minimal amount of samples (less than 1%) to enhance domain adaptation. The network incorporates a Multi-Temporal Transformer for feature fusion and an optimized backbone structure based on previous research. Experiments conducted on the LEVIR-CD and WHU-CD datasets demonstrate that DAM-Net significantly outperforms existing domain adaptation methods, achieving comparable performance to semi-supervised approaches that require 10% labeled data while using only 0.3% labeled samples. Our approach significantly advances cross-dataset CD applications and provides a new paradigm for efficient domain adaptation in remote sensing. The source code of DAM-Net will be made publicly available upon publication.
zh

[CV-9] ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在渲染文本提示中描述的空间关系时存在的不足,即缺乏精确的空间信息。现有方法通常依赖外部网络条件和预定义布局,这不仅增加了计算成本,还降低了灵活性。为应对这一挑战,论文提出的关键解决方案包括:(1) 构建了一个包含明确空间提示的数据集,从LAION-400M数据集中精心提取和合成,以确保文本描述与空间布局之间的精确匹配;(2) 提出了一种基于低秩适应(Low-Rank Adaptation, LoRA)的灵活微调框架ESPLoRA,用于在不增加生成时间或牺牲输出质量的前提下增强生成模型的空间一致性;(3) 设计了改进的几何约束评估指标,捕捉三维空间关系,并通过TORE算法利用未完全消除的空间偏差进一步提升生成图像的空间一致性。实验结果表明,该方法在空间一致性基准测试中比当前最先进的框架CoMPaSS高出13.33%。

链接: https://arxiv.org/abs/2504.13745
作者: Andrea Rigo,Luca Stornaiuolo,Mauro Martino,Bruno Lepri,Nicu Sebe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms the current state-of-the-art framework, CoMPaSS, by 13.33% on established spatial consistency benchmarks.
zh

[CV-10] LimitNet: Progressive Content-Aware Image Offloading for Extremely Weak Devices & Networks

【速读】:该论文旨在解决物联网(IoT)设备受限于硬件能力和部署环境(如远程地区),无法直接运行先进视觉模型的问题。传统方法依赖于将任务卸载到云端,但低功耗广域网(LPWAN)的有限带宽、高丢包率及极低的工作周期,使得时间敏感型推理的快速卸载极具挑战性。特别是现有方案生成非渐进式比特流,在带宽受限或数据丢失情况下,云侧仅部分接收到数据时,解码质量显著下降。为此,论文提出LimitNet,这是一种面向极度受限设备与网络的渐进式、内容感知图像压缩模型。其关键在于轻量级渐进式编码器能够根据图像内容优先传输重要数据,使云侧即使在数据不完整的情况下也能进行推理。实验结果表明,相比当前最优方法(SOTA),LimitNet在ImageNet1000、CIFAR100和COCO数据集上的精度分别提升了14.01、18.01个百分点以及0.1 mAP@0.5,同时分别节省了61.24%、83.68%和42.25%的带宽,而编码时间仅增加4%。

链接: https://arxiv.org/abs/2504.13736
作者: Ali Hojjat,Janek Haberer,Tayyaba Zainab,Olaf Landsiedel
机构: Kiel University (基尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This is the author’s accepted manuscript. The Version of Record is available at: this https URL

点击查看摘要

Abstract:IoT devices have limited hardware capabilities and are often deployed in remote areas. Consequently, advanced vision models surpass such devices' processing and storage capabilities, requiring offloading of such tasks to the cloud. However, remote areas often rely on LPWANs technology with limited bandwidth, high packet loss rates, and extremely low duty cycles, which makes fast offloading for time-sensitive inference challenging. Today's approaches, which are deployable on weak devices, generate a non-progressive bit stream, and therefore, their decoding quality suffers strongly when data is only partially available on the cloud at a deadline due to limited bandwidth or packet losses. In this paper, we introduce LimitNet, a progressive, content-aware image compression model designed for extremely weak devices and networks. LimitNet's lightweight progressive encoder prioritizes critical data during transmission based on the content of the image, which gives the cloud the opportunity to run inference even with partial data availability. Experimental results demonstrate that LimitNet, on average, compared to SOTA, achieves 14.01 p.p. (percentage point) higher accuracy on ImageNet1000, 18.01 p.p. on CIFAR100, and 0.1 higher mAP@0.5 on COCO. Also, on average, LimitNet saves 61.24% bandwidth on ImageNet1000, 83.68% on CIFAR100, and 42.25% on the COCO dataset compared to SOTA, while it only has 4% more encoding time compared to JPEG (with a fixed quality) on STM32F7 (Cortex-M7). Journal reference: In Proceedings of the 22nd ACM International Conference on Mobile Systems, Applications, and Services (MobiSys '24), June 3-7, 2024, Minato-ku, Tokyo, Japan. DOI: https://doi.org/10.1145/3643832.3661856
zh
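
LimitNet 的核心思想是按内容重要性渐进地发送数据,使云端在截止时间前只收到部分数据时仍可解码。下面是该思想的一个极简示意(按显著性得分排序后分块发送);其中的分块与打分方式均为假设,并非论文编码器的实际实现。

```python
import numpy as np

def progressive_schedule(blocks, saliency):
    """按显著性降序排列待发送的隐变量块:最关键的内容最先传输。"""
    order = np.argsort(-np.asarray(saliency))
    return [blocks[i] for i in order]

blocks = ["b0", "b1", "b2", "b3"]
saliency = [0.1, 0.9, 0.4, 0.7]                # 假设由轻量编码器给出
print(progressive_schedule(blocks, saliency))  # ['b1', 'b3', 'b2', 'b0']
```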

[CV-11] MLEP: Multi-granularity Local Entropy Patterns for Universal AI-generated Image Detection

【速读】:该论文旨在解决AI生成图像(AIGI)检测中因现有方法缺乏源不变特征和有限泛化能力而导致的在不同生成模型和场景下性能不可靠的问题。论文的关键解决方案是探索图像熵作为AIGI检测线索的可能性,并提出了一种名为多粒度局部熵模式(MLEP)的方法,通过在多个图像尺度上计算打乱的小块图像的熵特征图,全面捕捉像素在维度和尺度上的关系,同时显著破坏图像语义以减少潜在的内容偏差。基于MLEP,可以训练出一个鲁棒的基于CNN的AIGI检测分类器。实验结果表明,该方法在开放世界场景下对32种不同生成模型合成的图像进行评估时,显著提升了准确性和泛化能力。

链接: https://arxiv.org/abs/2504.13726
作者: Lin Yuan,Xiaowan Li,Yan Zhang,Jiawei Zhang,Hongbo Li,Xinbo Gao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Advancements in image generation technologies have raised significant concerns about their potential misuse, such as producing misinformation and deepfakes. Therefore, there is an urgent need for effective methods to detect AI-generated images (AIGI). Despite progress in AIGI detection, achieving reliable performance across diverse generation models and scenes remains challenging due to the lack of source-invariant features and limited generalization capabilities in existing methods. In this work, we explore the potential of using image entropy as a cue for AIGI detection and propose Multi-granularity Local Entropy Patterns (MLEP), a set of entropy feature maps computed across shuffled small patches over multiple image scales. MLEP comprehensively captures pixel relationships across dimensions and scales while significantly disrupting image semantics, reducing potential content bias. Leveraging MLEP, a robust CNN-based classifier for AIGI detection can be trained. Extensive experiments conducted in an open-world scenario, evaluating images synthesized by 32 distinct generative models, demonstrate significant improvements over state-of-the-art methods in both accuracy and generalization.
zh
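
为直观理解"多粒度局部熵模式",下面给出在多个块尺寸下计算灰度图局部香农熵特征图的简化示意;论文中的块打乱(shuffle)与特征组织方式此处省略,块尺寸亦为假设值。

```python
import numpy as np

def shannon_entropy(patch: np.ndarray) -> float:
    """单个小块灰度直方图的香农熵(单位:bit)。"""
    hist, _ = np.histogram(patch, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def local_entropy_map(img: np.ndarray, ps: int) -> np.ndarray:
    """以 ps×ps 的非重叠小块为单位计算熵特征图(单一粒度)。"""
    h, w = img.shape
    return np.array([[shannon_entropy(img[i:i + ps, j:j + ps])
                      for j in range(0, w - ps + 1, ps)]
                     for i in range(0, h - ps + 1, ps)])

img = np.random.default_rng(0).integers(0, 256, size=(64, 64))
maps = [local_entropy_map(img, ps) for ps in (4, 8, 16)]  # 多粒度
```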

[CV-12] Human-aligned Deep Learning: Explainability, Causality and Biological Inspiration

【速读】:该论文旨在解决深度学习模型在图像分类任务中缺乏高效性、可解释性和鲁棒性的问题,尤其关注医学影像领域的应用。论文从可解释性(Explainability)、因果性(Causality)和生物视觉(Biological Vision)三个视角提出了解决方案。关键在于通过设计具有可解释性的神经网络可视化方法(如针对乳腺肿块分类的原型部件学习),结合因果信号挖掘模块(如CROCODILE框架,集成因果概念、对比学习、特征解耦与先验知识以提升泛化能力),以及借鉴生物视觉机制的连接启发式网络(如CoCoReco,引入上下文感知注意力机制)。这些方法共同实现了更有效的预测、更强的泛化能力和更高的诊断可信度,为实现人类对齐的深度学习提供了路径,并促进了研究向临床应用的转化。

链接: https://arxiv.org/abs/2504.13717
作者: Gianluca Carloni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
备注: Personal adaptation and expansion of doctoral thesis (originally submitted in Oct 2024, revisioned in Jan 2025)

点击查看摘要

Abstract:This work aligns deep learning (DL) with human reasoning capabilities and needs to enable more efficient, interpretable, and robust image classification. We approach this from three perspectives: explainability, causality, and biological vision. Introduction and background open this work before diving into operative chapters. First, we assess neural networks’ visualization techniques for medical images and validate an explainable-by-design method for breast mass classification. A comprehensive review at the intersection of XAI and causality follows, where we introduce a general scaffold to organize past and future research, laying the groundwork for our second perspective. In the causality direction, we propose novel modules that exploit feature co-occurrence in medical images, leading to more effective and explainable predictions. We further introduce CROCODILE, a general framework that integrates causal concepts, contrastive learning, feature disentanglement, and prior knowledge to enhance generalization. Lastly, we explore biological vision, examining how humans recognize objects, and propose CoCoReco, a connectivity-inspired network with context-aware attention mechanisms. Overall, our key findings include: (i) simple activation maximization lacks insight for medical imaging DL models; (ii) prototypical-part learning is effective and radiologically aligned; (iii) XAI and causal ML are deeply connected; (iv) weak causal signals can be leveraged without a priori information to improve performance and interpretability; (v) our framework generalizes across medical domains and out-of-distribution data; (vi) incorporating biological circuit motifs improves human-aligned recognition. This work contributes toward human-aligned DL and highlights pathways to bridge the gap between research and clinical adoption, with implications for improved trust, diagnostic accuracy, and safe deployment.
zh

[CV-13] SLAMRender: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM

【速读】:该论文旨在解决Simultaneous Localization and Mapping (SLAM)与新型视图合成(novel view synthesis)和场景渲染(scene rendering)领域之间的方法交叉应用所面临的挑战。当前数据集未能涵盖两个领域特有的复杂性,如SLAM中的多模态性和序列相关性,以及神经渲染(neural rendering)中的跨视角和光照条件泛化能力。为填补这一空白,论文提出了SLAMRender数据集,其关键在于提供一个综合性的基准数据集,包含40个同步的RGB、深度、IMU、机器人运动学数据及真实姿态流序列,并设计了覆盖多种场景和光照条件的实验设置,同时区分训练和测试轨迹以及对象重排。通过这些精心设计的数据特性,SLAMRender能够有效评估SLAM与新型视图渲染方法在跨领域融合中的性能,从而验证其作为新兴研究方向相关基准的有效性。

链接: https://arxiv.org/abs/2504.13713
作者: Samuel Cerezo,Gaetano Meli,Tomás Berriel Martins,Kirill Safronov,Javier Civera
机构: Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza (萨拉戈萨大学计算机与系统工程系); Technology & Innovation Center, KUKA Deutschland GmbH (库卡德国公司技术与创新中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Models and methods originally developed for novel view synthesis and scene rendering, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, are increasingly being adopted as representations in Simultaneous Localization and Mapping (SLAM). However, existing datasets fail to include the specific challenges of both fields, such as multimodality and sequentiality in SLAM or generalization across viewpoints and illumination conditions in neural rendering. To bridge this gap, we introduce SLAMRender, a novel dataset designed to benchmark methods in the intersection between SLAM and novel view rendering. It consists of 40 sequences with synchronized RGB, depth, IMU, robot kinematic data, and ground-truth pose streams. By releasing robot kinematic data, the dataset also enables the assessment of novel SLAM strategies when applied to robot manipulators. The dataset sequences span five different setups featuring consumer and industrial objects under four different lighting conditions, with separate training and test trajectories per scene, as well as object rearrangements. Our experimental results, obtained with several baselines from the literature, validate SLAMRender as a relevant benchmark for this emerging research area.
zh

[CV-14] Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

【速读】:该论文致力于解决少样本条件下的指代视频对象分割(Few-Shot Referring Video Object Segmentation)问题,即通过自然语言描述指导对视频中对象的分割。为实现这一目标,论文提出了一种基于Transformer的模型FS-RVOS,其关键在于两个核心组件:跨模态亲和模块(cross-modal affinity module)和实例序列匹配策略(instance sequence matching strategy)。此外,该方法进一步扩展至多对象分割任务(FS-RVMOS)。实验结果表明,FS-RVOS及其扩展版本在多个基准数据集上优于现有最先进的方法,展现出卓越的鲁棒性和准确性。

链接: https://arxiv.org/abs/2504.13710
作者: Heng Liu,Guanghui Li,Mingqi Gao,Xiantong Zhen,Feng Zheng,Yang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
zh

[CV-15] Green Robotic Mixed Reality with Gaussian Splatting

【速读】:本文旨在解决机器人混合现实(RoboMR)系统中绿色通信的挑战,特别是通过无线信道以高频率上传高分辨率图像的问题。论文提出了一种基于高斯点绘(Gaussian Splatting, GS)的RoboMR(GSRMR),其核心在于构建一个GS模型,使模拟器能够从机器人的姿态机会性地渲染出逼真的视图,从而减少对过度图像上传的需求。解决方案的关键在于通过GS模型实现高效渲染,并进一步提出GS跨层优化(GSCLO)框架,联合优化内容切换决策与不同帧间的功率分配。该优化问题通过加速惩罚优化(Accelerated Penalty Optimization, APO)算法求解。实验表明,GSRMR相比传统RoboMR可将通信能耗降低超过10倍,并在峰值信噪比(PSNR)和结构相似性指数(SSIM)方面优于多种基线方案。

链接: https://arxiv.org/abs/2504.13697
作者: Chenxuan Liu,He Li,Zongze Li,Shuai Wang,Wei Xu,Kejiang Ye,Derrick Wing Kwan Ng,Chengzhong Xu
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); University of Chinese Academy of Sciences (中国科学院大学); State Key Laboratory of IOTSC, Department of Computer and Information Science, University of Macau (澳门大学物联网科学与技术国家重点实验室计算机与信息科学系); Peng Cheng Laboratory (鹏城实验室); Manifold Tech Limited (万维数科有限公司); School of Electrical Engineering and Telecommunications, the University of New South Wales (新南威尔士大学电气工程与电信学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 6 pages, 5 figures, accepted by IEEE INFOCOM 2025 Workshop on Networked Robotics and Communication Systems

点击查看摘要

Abstract:Realizing green communication in robotic mixed reality (RoboMR) systems presents a challenge, due to the necessity of uploading high-resolution images at high frequencies through wireless channels. This paper proposes Gaussian splatting (GS) RoboMR (GSRMR), which achieves a lower energy consumption and makes a concrete step towards green RoboMR. The crux to GSRMR is to build a GS model which enables the simulator to opportunistically render a photo-realistic view from the robot’s pose, thereby reducing the need for excessive image uploads. Since the GS model may involve discrepancies compared to the actual environments, a GS cross-layer optimization (GSCLO) framework is further proposed, which jointly optimizes content switching (i.e., deciding whether to upload image or not) and power allocation across different frames. The GSCLO problem is solved by an accelerated penalty optimization (APO) algorithm. Experiments demonstrate that the proposed GSRMR reduces the communication energy by over 10x compared with RoboMR. Furthermore, the proposed GSRMR with APO outperforms extensive baseline schemes, in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).
zh
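
上文摘要以 PSNR(峰值信噪比)与 SSIM 评估渲染质量。下面给出 PSNR 标准定义的最小实现,便于理解该指标;与论文的实验代码无关。

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """两幅图像间的峰值信噪比(单位:dB),MSE 为 0 时返回无穷大。"""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4)); b = np.full((4, 4), 5.0)
print(f"{psnr(a, b):.2f} dB")  # 34.15 dB
```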

[CV-16] Zebrafish Counting Using Event Stream Data

【速读】:该论文旨在解决医学实验室中因斑马鱼个体微小导致手动视觉计数困难的问题,以及现有计数方法在处理小型鱼类时适用性差或局限性过多的挑战。论文提出了一种基于事件流数据的斑马鱼计数算法作为解决方案。其关键在于利用事件相机采集数据,并通过相机校准与图像融合技术提高数据质量;进一步结合轨迹信息优化计数准确性;最终通过对多次计数结果取平均并向上取整获得最终结果。实验结果显示,在100次计数试验中,该算法的平均精度达到97.95%,相比传统算法实现了更简单的实现方式和更高的准确性。

链接: https://arxiv.org/abs/2504.13692
作者: Qianghua Chen,Huiyu Wang,Li Ming,Ying Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zebrafish share a high degree of homology with human genes and are commonly used as model organisms in biomedical research. For medical laboratories, counting zebrafish is a daily task. Due to the tiny size of zebrafish, manual visual counting is challenging. Existing counting methods are either not applicable to small fishes or have too many limitations. The paper proposed a zebrafish counting algorithm based on the event stream data. Firstly, an event camera is applied for data acquisition. Secondly, camera calibration and image fusion were performed successively. Then, the trajectory information was used to improve the counting accuracy. Finally, the counting results were averaged over an empirical period and rounded up to get the final results. To evaluate the accuracy of the algorithm, 20 zebrafish were put in a four-liter breeding tank. Among 100 counting trials, the average accuracy reached 97.95%. As compared with traditional algorithms, the proposed one offers a simpler implementation and achieves higher accuracy.
zh
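
摘要提到"对一段经验时长内的计数结果取平均并向上取整"。下面用几行 Python 说明这一后处理步骤;其中窗口长度为假设值,并非论文给出的参数。

```python
import math

def final_count(per_frame_counts, window: int = 30) -> int:
    """对最近 window 次瞬时计数取平均,再向上取整得到最终鱼数。"""
    recent = per_frame_counts[-window:]
    return math.ceil(sum(recent) / len(recent))

print(final_count([19, 20, 21, 20, 19, 20]))  # -> 20
```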

[CV-17] Analysing the Robustness of Vision-Language-Models to Common Corruptions

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在常见图像损坏(image corruptions)下的鲁棒性不足问题。论文的关键在于通过全面分析19种来自ImageNet-C基准的图像损坏类型(涵盖噪声、模糊、天气和数字失真四类),揭示不同任务下VLMs的脆弱性模式,并提出两个新的基准(TextVQA-C和GQA-C)以系统评估场景文本理解和基于对象推理在损坏条件下的性能变化。研究发现,基于Transformer的VLMs在处理低频信息时存在固有偏见,这解释了其在不同损坏类型下的鲁棒性差异。通过这些观察,论文为开发更适用于实际应用的抗损坏视觉-语言模型提供了重要见解。

链接: https://arxiv.org/abs/2504.13690
作者: Muhammad Usama,Syeda Aisha Asim,Syed Bilal Ali,Syed Talal Wasim,Umair Bin Mansoor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2304.10592, arXiv:2301.12597 by other authors

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers’ inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
zh

[CV-18] AnyTSR: Any-Scale Thermal Super-Resolution for UAV

【速读】:该论文旨在解决热成像在智能无人飞行器(UAV)应用于复杂环境时,因热传感器固有低分辨率导致的细节不足和边界模糊问题。现有的超分辨率(Super-resolution, SR)方法多针对固定尺度设计,在实际应用中计算成本高且缺乏灵活性。为此,论文提出了一种新颖的任意尺度热超分辨率方法(AnyScale Thermal Super-Resolution, AnyTSR),适用于单个模型的UAV场景。解决方案的关键在于:首先,引入一种新的图像编码器,通过显式分配特定特征码实现更精确和灵活的表示;其次,通过有效嵌入坐标偏移信息到局部特征集成中,设计出创新的任意尺度上采样器,以更好地理解空间关系并减少伪影;此外,构建了一个包含陆地和水域场景的新数据集(UAV-TSR)用于热超分辨率任务。实验结果表明,所提方法在所有缩放因子下均优于现有最先进方法,并能生成更准确和详细的高分辨率图像。

链接: https://arxiv.org/abs/2504.13682
作者: Mengyuan Li,Changhong Fu,Ziyu Lu,Zijie Zhang,Haobo Zuo,Liangliang Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, while most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature code to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors as well as generates more accurate and detailed high-resolution images. The code is located at this https URL.
zh

[CV-19] EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

【速读】:该论文旨在解决智能眼科诊断领域面临的三个关键挑战:(i) 数据稀缺性与质量不足,特别是缺乏高质量、多模态的眼科视觉指令数据;(ii) 缺乏全面系统的基准来评估诊断性能;(iii) 现有视觉架构难以适应细粒度、区域特异性的眼科病变识别。为应对这些挑战,论文提出Eyecare Kit,包含定制的数据集、基准和模型作为解决方案的核心。具体而言,首先构建一个多代理数据引擎以生成高质量的眼科视觉指令数据集Eyecare-100K;其次设计了综合评估智能眼科诊断任务性能的基准Eyecare-Bench;最后开发了优化细粒度眼科视觉理解的EyecareGPT,其关键在于引入自适应分辨率机制和分层密集连接器。实验结果表明,EyecareGPT在多种眼科任务中达到当前最优性能,凸显其推动智能眼科诊断开放研究的巨大潜力。

链接: https://arxiv.org/abs/2504.13650
作者: Sijing Li,Tianwei Lin,Lingshuai Lin,Wenqiao Zhang,Jiang Liu,Xiaoda Yang,Juncheng Li,Yucheng He,Xiaohui Song,Jun Xiao,Yueting Zhuang,Beng Chin Ooi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with a tailored dataset, benchmark and model: First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, thoroughly optimized for fine-grained ophthalmic visual understanding, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at this https URL.
zh

[CV-20] Enhancing Pothole Detection and Characterization: Integrated Segmentation and Depth Estimation in Road Anomaly Systems

【速读】:本文旨在解决传统方法在道路坑洼(road potholes)表征上的不足,即现有基于机器学习的路障检测方法虽能部分自动化分析,但通常无法提供坑洼的完整表征。为应对这一挑战,论文提出利用迁移学习(transfer learning),采用预训练的YOLOv8-seg模型,通过车载摄像头采集的数字图像实现坑洼的自动表征。关键在于构建了一个包含图像及其对应深度图(depth maps)的新颖数据集,并结合图像分割与深度信息融合技术,以精确定位坑洼位置、计算其面积并提取详细的深度特征。这种方法相较于以往基于深度学习的路障检测系统,提供了更全面的坑洼表征,不仅有助于提升自动驾驶车辆对道路危险的识别能力,还支持道路管理部门更高效地应对路面损坏问题。

链接: https://arxiv.org/abs/2504.13648
作者: Uthman Baroudi,Alala BaHamid,Yasser Elalfy,Ziad Al Alami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Road anomaly detection plays a crucial role in road maintenance and in enhancing the safety of both drivers and vehicles. Recent machine learning approaches for road anomaly detection have overcome the tedious and time-consuming process of manual analysis and anomaly counting; however, they often fall short in providing a complete characterization of road potholes. In this paper, we leverage transfer learning by adopting a pre-trained YOLOv8-seg model for the automatic characterization of potholes using digital images captured from a dashboard-mounted camera. Our work includes the creation of a novel dataset, comprising both images and their corresponding depth maps, collected from diverse road environments in Al-Khobar city and the KFUPM campus in Saudi Arabia. Our approach performs pothole detection and segmentation to precisely localize potholes and calculate their area. Subsequently, the segmented image is merged with its depth map to extract detailed depth information about the potholes. This integration of segmentation and depth data offers a more comprehensive characterization compared to previous deep learning-based road anomaly detection systems. Overall, this method not only has the potential to significantly enhance autonomous vehicle navigation by improving the detection and characterization of road hazards but also assists road maintenance authorities in responding more effectively to road damage.
zh

[CV-21] Lightweight LiDAR-Camera 3D Dynamic Object Detection and Multi-Class Trajectory Prediction

【速读】:该论文旨在解决服务移动机器人在任务执行过程中需要实时感知动态物体(如行人、车辆和骑行者)并预测其轨迹的问题,同时受限于有限的计算资源。论文的关键在于提出了一种轻量级的多模态框架,用于三维物体检测和轨迹预测。该框架通过协同整合激光雷达(LiDAR)和摄像头输入实现三维空间中的实时感知,并引入两个创新模块:1)跨模态可变形Transformer(CMDT),用于高精度且计算开销可接受的物体检测;2)基于参考轨迹的多类别Transformer(RTMCT),用于高效且多样化的多类别对象轨迹预测,支持灵活的轨迹长度。实验表明,该系统在CODa基准测试中显著优于现有方法,在平均精度均值(mAP)检测和最小平均位移误差(minADE5)轨迹预测等指标上分别提升了2.03%和减少了0.408米的误差,同时展示了出色的部署能力,在配备入门级NVIDIA 3060 GPU的轮椅机器人上实现了13.2帧每秒的实时推理速度。

链接: https://arxiv.org/abs/2504.13647
作者: Yushen He,Lei Zhao,Tianchen Deng,Zipeng Fang,Weidong Chen
机构: Institute of Medical Robotics and Department of Automation, Shanghai Jiao Tong University (上海交通大学), Key Laboratory of System Control and Information Processing, Ministry of Education (教育部系统控制与信息处理重点实验室), Shanghai 200240, China (中国)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Service mobile robots are often required to avoid dynamic objects while performing their tasks, but they usually have only limited computational resources. So we present a lightweight multi-modal framework for 3D object detection and trajectory prediction. Our system synergistically integrates LiDAR and camera inputs to achieve real-time perception of pedestrians, vehicles, and riders in 3D space. The framework proposes two novel modules: 1) a Cross-Modal Deformable Transformer (CMDT) for object detection with high accuracy and an acceptable amount of computation, and 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for efficient and diverse trajectory prediction of multi-class objects with flexible trajectory lengths. Evaluations on the CODa benchmark demonstrate superior performance over existing methods across detection (+2.03% in mAP) and trajectory prediction (-0.408m in minADE5 of pedestrians) metrics. Remarkably, the system exhibits exceptional deployability - when implemented on a wheelchair robot with an entry-level NVIDIA 3060 GPU, it achieves real-time inference at 13.2 fps. To facilitate reproducibility and practical deployment, we release the related code of the method at this https URL and its ROS inference version at this https URL.
zh
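
摘要使用 minADE5(5 条候选轨迹中的最小平均位移误差)评估轨迹预测。以下是该指标标准定义的最小实现示意,与论文实现无关。

```python
import numpy as np

def min_ade(preds: np.ndarray, gt: np.ndarray) -> float:
    """minADE:preds 形状 (k, T, 2),gt 形状 (T, 2),单位为米。
    对每条假设轨迹求逐时刻 L2 位移的均值,再取 k 条中的最小值。"""
    ade = np.linalg.norm(preds - gt[None], axis=-1).mean(axis=-1)  # (k,)
    return float(ade.min())

preds = np.ones((5, 12, 2))           # 假设 5 条轨迹均偏离真值
preds[0] *= 0.3                       # 最优假设每步两个坐标各偏差 0.3 m
gt = np.zeros((12, 2))
print(round(min_ade(preds, gt), 3))   # 0.424 ≈ 0.3 * sqrt(2)
```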

[CV-22] Efficient Parameter Adaptation for Multi-Modal Medical Image Segmentation and Prognosis

【速读】:该论文旨在解决医学影像领域中肿瘤分割和预后分析依赖于CT-PET联合数据的问题,由于PET扫描数据的可用性有限,这种依赖性成为瓶颈。论文提出了一种参数高效的多模态适应(Parameter-Efficient Multi-Modal Adaptation, PEMMA)框架,用于轻量级升级仅基于CT训练的Transformer模型,使其能够高效适配PET扫描数据。此外,该框架还扩展到预后任务,保持跨模态微调的高效性。关键在于利用Transformer架构的模块化特性,通过低秩适应(Low-Rank Adaptation, LoRA)和分解低秩适应(Decomposed Low-Rank Adaptation, DoRA)实现参数高效的模态间迁移,并通过最小化模态间的纠缠确保单一模态更新时不会导致灾难性遗忘。

链接: https://arxiv.org/abs/2504.13645
作者: Numan Saeed,Shahad Hardan,Muhammad Ridzuan,Nada Saadi,Karthik Nandakumar,Mohammad Yaqub
机构: MBZUAI (Mohammad Bin Zayed University of Artificial Intelligence)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cancer detection and prognosis relies heavily on medical imaging, particularly CT and PET scans. Deep Neural Networks (DNNs) have shown promise in tumor segmentation by fusing information from these modalities. However, a critical bottleneck exists: the dependency on CT-PET data concurrently for training and inference, posing a challenge due to the limited availability of PET scans. Hence, there is a clear need for a flexible and efficient framework that can be trained with the widely available CT scans and can still be adapted for PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans such that it can be efficiently adapted for use with PET scans when they become available. This framework is further extended to perform the prognosis task while maintaining the same efficient cross-modal fine-tuning approach. The proposed approach is tested with two well-known segmentation backbones, namely UNETR and Swin UNETR. Our approach offers two main advantages. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) as well as decomposed low-rank adaptation (DoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, by minimizing cross-modal entanglement, PEMMA allows updates using only one modality without causing catastrophic forgetting in the other. Our method achieves comparable performance to early fusion, but with only 8% of the trainable parameters, and demonstrates a significant +28% Dice score improvement on PET scans when trained with a single modality. Furthermore, in prognosis, our method improves the concordance index by +10% when adapting a CT-pretrained model to include PET scans, and by +23% when adapting for both PET and EHR data.
zh
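
PEMMA 对注意力权重做 LoRA 低秩适配。下面是 LoRA 线性层的通用最小实现(冻结基座权重 W,仅训练低秩增量 BA);秩与缩放系数取惯用默认值,并非论文的具体配置。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x,其中 W 冻结,A、B 可训练。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # 保留(如 CT 域)预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 初始增量为 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))              # 输出形状 (2, 64)
```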

[CV-23] DenSe-AdViT: A novel Vision Transformer for Dense SAR Object Detection

【速读】:该论文旨在解决密集合成孔径雷达(SAR)目标检测中,传统视觉Transformer(ViT)在提取多尺度局部特征方面的不足,特别是面对小目标且目标密集分布时性能受限的问题。论文的关键创新在于提出了Density-Sensitive Vision Transformer with Adaptive Tokens (DenSe-AdViT),其核心解决方案包括设计一个Density-Aware Module (DAM),通过目标分布生成密度张量,并基于精心构建的目标度量指标,实现对物体空间分布与密度的精确捕捉;同时引入Density-Enhanced Fusion Module (DEFM),将卷积神经网络(CNN)增强的多尺度信息与Transformer提取的全局特征有效融合,借助密度掩码和多源特征优化注意力机制,从而提升对密集目标区域的检测性能。实验结果表明,DenSe-AdViT在RSDD数据集上达到79.8%的mAP,在SIVED数据集上达到92.5%,验证了其在密集车辆目标检测中的有效性。

链接: https://arxiv.org/abs/2504.13638
作者: Yang Zhang,Jingyi Cao,Yanan You,Yuanyuan Qiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformer (ViT) has achieved remarkable results in object detection for synthetic aperture radar (SAR) images, owing to its exceptional ability to extract global features. However, it struggles with the extraction of multi-scale local features, leading to limited performance in detecting small targets, especially when they are densely arranged. Therefore, we propose Density-Sensitive Vision Transformer with Adaptive Tokens (DenSe-AdViT) for dense SAR target detection. We design a Density-Aware Module (DAM) as a preliminary component that generates a density tensor based on target distribution. It is guided by a meticulously crafted objective metric, enabling precise and effective capture of the spatial distribution and density of objects. To integrate the multi-scale information enhanced by convolutional neural networks (CNNs) with the global features derived from the Transformer, a Density-Enhanced Fusion Module (DEFM) is proposed. It effectively refines attention toward target-survival regions with the assistance of the density mask and multi-source features. Notably, our DenSe-AdViT achieves 79.8% mAP on the RSDD dataset and 92.5% on the SIVED dataset, both of which feature a large number of densely distributed vehicle targets.
zh

[CV-24] Visual Intention Grounding for Egocentric Assistants

【速读】:该论文旨在解决从第三人称视角到第一人称视角下视觉接地(Visual Grounding)任务的挑战,特别是在输入为第一人称视角且对象可能通过需求和意图隐式描述的应用场景中。论文的关键在于引入了首个针对第一人称视觉意图接地的数据集EgoIntention,并提出了Reason-to-Ground (RoG) 指令微调方法。RoG 方法通过链式意图推理与对象接地机制,实现了正常描述和第一人称意图的混合训练,在处理背景对象误识别及非通用对象功能理解方面显著优于直接微调和混合训练,同时保持或略微提升常规描述接地性能,从而实现第一人称和第三人称视觉输入的统一接地能力。

链接: https://arxiv.org/abs/2504.13621
作者: Pengzhan Sun,Junbin Xiao,Tze Ho Elden Tse,Yicong Li,Arjun Akula,Angela Yao
机构: National University of Singapore (新加坡国立大学); Google DeepMind (谷歌深思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts – inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training with normal descriptions and egocentric intentions with a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.
zh

[CV-25] Compile Scene Graphs with Reinforcement Learning

【速读】:该论文旨在解决利用大型语言模型(Large Language Models, LLMs)实现端到端结构化视觉表示(如场景图 Scene Graphs, SGG)提取的问题。目前,基于LLMs的多模态模型(Multimodal LLMs, M-LLMs)在文本生成方面表现出色,但在处理需要生成对象及其关系三元组的任务时仍显不足。论文的关键解决方案在于结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL):首先通过SFT在场景图数据集上训练模型以完成基本任务,随后利用强化学习设计了一种以图为中心的奖励函数(包括节点级奖励、边级奖励及格式一致性奖励),显著提升了模型生成场景图的能力,并实现了零失败率的性能表现,而这是仅依赖SFT难以达到的效果。

链接: https://arxiv.org/abs/2504.13617
作者: Zuyao Chen,Jinlin Wu,Zhen Lei,Marc Pollefeys,Chang Wen Chen
机构: The Hong Kong Polytechnic University (香港理工大学); ETH Zürich (苏黎世联邦理工学院); CAIR, HKISI-CAS (中国科学院自动化研究所香港中文大学联合实验室); Institute of Automation, CAS (中国科学院自动化研究所); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Next token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. Given the structured nature of scene graphs, we design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward. Our experiments demonstrate that rule-based RL substantially enhances model performance in the SGG task, achieving a zero failure rate–unlike supervised fine-tuning (SFT), which struggles to generalize effectively. Our code is available at this https URL.
zh
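
论文为 RL 阶段设计了融合节点级、边级与格式一致性的图中心奖励。下面给出该思路的玩具化示意:以对象集合与三元组集合的 F1 加上可解析性奖励加权求和。其中权重、匹配规则与字段名均为本文的假设,并非论文的原始定义。

```python
def f1(pred: set, gold: set) -> float:
    """集合级 F1:用于对象(节点)与三元组(边)的匹配打分。"""
    tp = len(pred & gold)
    if tp == 0 or not pred or not gold:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def scene_graph_reward(pred: dict, gold: dict,
                       w_node=1.0, w_edge=1.0, w_fmt=0.5) -> float:
    """节点级 + 边级 + 格式一致性的加权奖励(示意)。"""
    r_node = f1(pred["objects"], gold["objects"])
    r_edge = f1(pred["triplets"], gold["triplets"])
    r_fmt = 1.0 if pred.get("valid", False) else 0.0  # 输出可被解析
    return w_node * r_node + w_edge * r_edge + w_fmt * r_fmt

pred = {"objects": {"man", "horse"},
        "triplets": {("man", "riding", "horse")}, "valid": True}
gold = {"objects": {"man", "horse", "hat"},
        "triplets": {("man", "riding", "horse"), ("man", "wearing", "hat")}}
print(round(scene_graph_reward(pred, gold), 3))  # 0.8 + 0.667 + 0.5 ≈ 1.967
```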

[CV-26] Cross-Hierarchical Bidirectional Consistency Learning for Fine-Grained Visual Classification

【速读】:该论文旨在解决细粒度视觉分类(Fine-Grained Visual Classification, FGVC)任务中因类间差异小、类内变化大而导致的分类准确性与一致性不足的问题。现有方法通常依赖额外标注进行图像分类,而忽视了嵌套在层次标签结构(Tree Hierarchies)中的有价值信息。论文的关键在于提出了一种新颖的跨层次双向一致性学习(Cross-Hierarchical Bidirectional Consistency Learning, CHBC)框架,通过设计专门的模块来分解和增强注意力掩码及特征,并利用双向一致性损失在不同层次间调节分类结果,确保标签预测的一致性并减少误分类。实验结果验证了CHBC框架的有效性,消融研究进一步揭示了特征增强与一致性约束的应用策略及其重要贡献。

链接: https://arxiv.org/abs/2504.13608
作者: Pengxiang Gao,Yihao Liang,Yanzhi Song,Zhouwang Yang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-Grained Visual Classification (FGVC) aims to categorize closely related subclasses, a task complicated by minimal inter-class differences and significant intra-class variance. Existing methods often rely on additional annotations for image classification, overlooking the valuable information embedded in Tree Hierarchies that depict hierarchical label relationships. To leverage this knowledge to improve classification accuracy and consistency, we propose a novel Cross-Hierarchical Bidirectional Consistency Learning (CHBC) framework. The CHBC framework extracts discriminative features across various hierarchies using a specially designed module to decompose and enhance attention masks and features. We employ bidirectional consistency loss to regulate the classification outcomes across different hierarchies, ensuring label prediction consistency and reducing misclassification. Experiments on three widely used FGVC datasets validate the effectiveness of the CHBC framework. Ablation studies further investigate the application strategies of feature enhancement and consistency constraints, underscoring the significant contributions of the proposed modules.
zh
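
对于"双向一致性损失"的一个方向(细粒度到粗粒度),一种常见做法是将细类概率按父类汇聚后与粗类预测对齐。以下 PyTorch 示意即表达此约束;具体形式为本文假设,并非论文的原始公式。

```python
import torch
import torch.nn.functional as F

def fine_to_coarse_consistency(logits_fine, logits_coarse, parent_of):
    """parent_of[j] 给出细类 j 的父(粗)类索引;
    将细类概率按父类求和,与粗类头的分布做 KL 对齐。"""
    p_fine = logits_fine.softmax(dim=-1)                    # (B, C_fine)
    agg = logits_fine.new_zeros(logits_fine.size(0), logits_coarse.size(-1))
    agg.index_add_(1, parent_of, p_fine)                    # 按父类汇聚子类概率
    return F.kl_div(logits_coarse.log_softmax(-1), agg, reduction="batchmean")

parent_of = torch.tensor([0, 0, 1, 1, 1])                  # 5 个细类对应 2 个粗类
loss = fine_to_coarse_consistency(torch.randn(4, 5), torch.randn(4, 2), parent_of)
```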

[CV-27] FocusTrack: A Self-Adaptive Local Sampling Algorithm for Efficient Anti-UAV Tracking

【速读】:本文旨在解决反无人机(Anti-UAV)跟踪中的挑战,包括小目标尺寸、突然的相机运动以及复杂的红外背景。现有跟踪方法可分为全局方法和局部方法两大类。全局方法如SiamDT虽然精度高,但计算开销过大,限制了其实际应用;而局部方法如OSTrack和ROMTrack虽高效,但在相机运动导致目标显著位移时表现不佳。论文通过初步实验发现,结合自适应搜索区域调整的局部跟踪器可显著提高跟踪精度,缩小局部与全局跟踪器之间的差距。为应对这一挑战,本文提出FocusTrack框架,通过动态优化搜索区域和增强特征表示,在计算效率与跟踪准确性之间实现最佳平衡。关键在于Search Region Adjustment (SRA)策略和Attention-to-Mask (ATM)模块:SRA策略估计目标存在概率并自适应调整视野,确保目标始终处于焦点内;ATM模块则整合层次信息,丰富目标细节表达以对抗搜索区域变化引起的特征退化。实验表明,FocusTrack在AntiUAV和AntiUAV410数据集上的AUC分别达到67.7%和62.8%,相比基线提升8.5%和9.1%,同时在效率上超越全局方法,实现了实时跟踪。

链接: https://arxiv.org/abs/2504.13604
作者: Ying Wang,Tingfa Xu,Jianan Li
机构: Beijing Institute of Technology (北京理工大学); Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education of China (教育部光电成像技术与系统重点实验室); Chongqing Innovation Center, Beijing Institute of Technology (北京理工大学重庆创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages, 13 figures

点击查看摘要

Abstract:Anti-UAV tracking poses significant challenges, including small target sizes, abrupt camera motion, and cluttered infrared backgrounds. Existing tracking paradigms can be broadly categorized into global- and local-based methods. Global-based trackers, such as SiamDT, achieve high accuracy by scanning the entire field of view but suffer from excessive computational overhead, limiting real-world deployment. In contrast, local-based methods, including OSTrack and ROMTrack, efficiently restrict the search region but struggle when targets undergo significant displacements due to abrupt camera motion. Through preliminary experiments, it is evident that a local tracker, when paired with adaptive search region adjustment, can significantly enhance tracking accuracy, narrowing the gap between local and global trackers. To address this challenge, we propose FocusTrack, a novel framework that dynamically refines the search region and strengthens feature representations, achieving an optimal balance between computational efficiency and tracking accuracy. Specifically, our Search Region Adjustment (SRA) strategy estimates the target presence probability and adaptively adjusts the field of view, ensuring the target remains within focus. Furthermore, to counteract feature degradation caused by varying search regions, the Attention-to-Mask (ATM) module is proposed. This module integrates hierarchical information, enriching the target representations with fine-grained details. Experimental results demonstrate that FocusTrack achieves state-of-the-art performance, obtaining 67.7% AUC on AntiUAV and 62.8% AUC on AntiUAV410, outperforming the baseline tracker by 8.5% and 9.1% AUC, respectively. In terms of efficiency, FocusTrack surpasses global-based trackers, requiring only 30G MACs and achieving 143 fps with FocusTrack (SRA) and 44 fps with the full version, both enabling real-time tracking.
zh
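
代码示意:SRA 的核心直觉可以用几行 Python 表达。以下为假设性示意(函数名、放大系数与线性插值规则均为本文虚构,并非论文公式):目标存在概率越低,搜索窗放得越大,以容忍相机突变带来的大位移。

```python
def adjust_search_region(prev_bbox, presence_prob,
                         base_factor=4.0, max_factor=8.0):
    """根据目标存在概率自适应调整方形搜索窗(假设性规则)。"""
    cx, cy, w, h = prev_bbox                     # 上一帧目标框 (中心x, 中心y, 宽, 高)
    # 置信度越低,上下文放大系数越大
    factor = base_factor + (1.0 - presence_prob) * (max_factor - base_factor)
    side = factor * (w * h) ** 0.5               # 以目标尺度 sqrt(w*h) 为基准
    return cx, cy, side

print(adjust_search_region((320, 240, 20, 12), presence_prob=0.9))  # 高置信度:紧凑搜索窗
print(adjust_search_region((320, 240, 20, 12), presence_prob=0.2))  # 低置信度:扩大视野
```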

[CV-28] LMPOcc: 3D Semantic Occupancy Prediction Utilizing Long-Term Memory Prior from Historical Traversals

【速读】:本文旨在解决基于视觉的3D语义占用预测在自动驾驶场景中的局限性,特别是现有方法未能有效利用从历史遍历中获取的感知信息的问题。这些信息通常被忽略,尽管它们可能在环境条件变化(如天气和光照)下对重复访问的相同地理区域具有重要价值。为了解决这一问题,论文提出了一种名为Longterm Memory Prior Occupancy (LMPOcc) 的方法,这是首个利用从历史遍历中提取的长期记忆先验进行3D占用预测的技术。其关键在于引入了一个即插即用的架构,通过开发高效的轻量级当前-先验融合模块,将长期记忆先验与当前感知信息自适应聚合,同时构建全局占用表示。此外,为了确保与多种占用预测基线模型的兼容性,提出了一个模型无关的先验格式。实验结果表明,LMPOcc 在Occ3D-nuScenes基准测试中达到了最先进的性能,特别是在静态语义类别上,并展示了通过多车辆众包构建全局占用的能力。

链接: https://arxiv.org/abs/2504.13596
作者: Shanshuai Yuan,Julong Wei,Muer Tie,Xiangyun Ren,Zhongxue Gan,Wenchao Ding
机构: Academy for Engineering and Technology, Fudan University (复旦大学工程与技术学院); Chongqing Changan Automobile CO., Ltd. (重庆长安汽车股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-based 3D semantic occupancy prediction is critical for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. In practice, autonomous vehicles may repeatedly traverse identical geographic locations under varying environmental conditions, such as weather fluctuations and illumination changes. Existing methods in 3D occupancy prediction predominantly integrate adjacent temporal contexts. However, these works neglect to leverage perceptual information, which is acquired from historical traversals of identical geographic locations. In this paper, we propose Longterm Memory Prior Occupancy (LMPOcc), the first 3D occupancy prediction methodology that exploits long-term memory priors derived from historical traversal perceptual outputs. We introduce a plug-and-play architecture that integrates long-term memory priors to enhance local perception while simultaneously constructing global occupancy representations. To adaptively aggregate prior features and current features, we develop an efficient lightweight Current-Prior Fusion module. Moreover, we propose a model-agnostic prior format to ensure compatibility across diverse occupancy prediction baselines. LMPOcc achieves state-of-the-art performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Additionally, experimental results demonstrate LMPOcc’s ability to construct global occupancy through multi-vehicle crowdsourcing.
zh
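
代码示意:下面用 PyTorch 给出“当前-先验”门控融合的最小示意(按摘要思路的假设性实现,并非论文的 Current-Prior Fusion 模块本身):由当前特征与长期记忆先验共同预测逐位置门控,以残差方式按需注入先验。

```python
import torch
import torch.nn as nn

class CurrentPriorFusion(nn.Module):
    """当前特征与长期记忆先验的轻量门控融合(示意实现)。"""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, current, prior):
        # 由两路特征共同预测逐位置门控 g ∈ (0, 1)
        g = self.gate(torch.cat([current, prior], dim=1))
        # 残差式混合:当前感知保持主导,先验按需注入
        return current + g * prior

fused = CurrentPriorFusion(64)(torch.randn(1, 64, 50, 50),
                               torch.randn(1, 64, 50, 50))
print(fused.shape)  # torch.Size([1, 64, 50, 50])
```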

[CV-29] KAN or MLP? Point Cloud Shows the Way Forward

【速读】:该论文旨在解决多层感知器(MLPs)在点云复杂几何结构分析中的局限性,具体表现为固定激活函数难以高效捕捉局部几何特征,同时存在参数效率低和模型冗余高的问题。为了解决这些问题,论文提出了PointKAN,通过引入Kolmogorov-Arnold网络(KANs)来探索其在分层特征表示中的有效性。关键解决方案包括:(1) 设计了几何仿射模块(GAM),以增强模型对几何变化的鲁棒性;(2) 在局部特征处理(LFP)中采用并行结构提取组级特征与全局上下文,实现细节与整体结构的丰富表达;(3) 在全局特征处理(GFP)中整合和进一步加工这些特征,逐步扩展感受野以捕获完整的点云几何信息。此外,为了应对标准KANs的高参数量和计算低效问题,论文还开发了PointKAN-elite的高效版本(Efficient-KANs),显著减少了参数量同时保持精度。实验结果表明,PointKAN在ModelNet40、ScanObjectNN和ShapeNetPart等基准数据集上的表现优于PointMLP,尤其在少量学习任务中表现出色,并大幅降低了参数量和计算复杂度(FLOPs)。

链接: https://arxiv.org/abs/2504.13593
作者: Yan Shi,Qingdong He,Yijun Liu,Xiaoyu Liu,Jingyong Su
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)); Tencent(腾讯); University of Pennsylvania(宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-Layer Perceptrons (MLPs) have become one of the fundamental architectural components in point cloud analysis due to their effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs’ fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model’s robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in the Few-shot Learning task. Additionally, PointKAN achieves substantial reductions in parameter counts and computational complexity (FLOPs). This work highlights the potential of KANs-based architectures in 3D vision and opens new avenues for research in point cloud understanding.
zh

[CV-30] HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering CVPR

【速读】:该论文试图解决传统3D场景理解技术在处理大规模城市级数据集时效率低下以及依赖手工标注的问题。论文提出了一种名为Hierarchical vocab-Agnostic Expert Clustering (HAEC) 的方法,其关键在于基于超级点图聚类,并采用了一种新颖的专家图Transformer作为主干网络。此外,论文还展示了一个完全从原始点云中自动生成标签的合成标注流程,无需人工标注。这种方法能够助力复杂的城市密集3D场景操作,并为数字孪生的数据处理开辟新路径。

链接: https://arxiv.org/abs/2504.13590
作者: Alexander Rusnak,Frédéric Kaplan
机构: École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication through the upcoming CVPR Workshop on open scene understanding with foundation models (OPENSUN3D)

点击查看摘要

Abstract:Traditional 3D scene understanding techniques are generally predicated on hand-annotated label sets, but in recent years a new class of open-vocabulary 3D scene understanding techniques has emerged. Despite the success of this paradigm on small scenes, existing approaches cannot scale efficiently to city-scale 3D datasets. In this paper, we present Hierarchical vocab-Agnostic Expert Clustering (HAEC), after the Latin word for ‘these’, a superpoint graph clustering based approach which utilizes a novel mixture of experts graph transformer for its backbone. We administer this highly scalable approach to the first application of open-vocabulary scene understanding on the SensatUrban city-scale dataset. We also demonstrate a synthetic labeling pipeline which is derived entirely from the raw point clouds with no hand-annotation. Our technique can help unlock complex operations on dense urban 3D scenes and open a new path forward in the processing of digital twins.
zh

[CV-31] Leverag ing Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding

【速读】:该论文旨在解决高精度3D场景理解中生成高质量3D标注困难的问题。论文的关键解决方案是利用近期在自动检索合成CAD模型方面的技术进步,证明了由这些方法生成的数据可以作为高质量的真实标签用于监督深度学习模型的训练。具体而言,作者采用了一种类似于先前用于ScanNet场景中物体9D姿态和CAD模型自动标注的流水线,并将其应用于缺乏此类标注的ScanNet++ v1数据集。研究发现,基于自动获取的标注训练的深度学习模型不仅可行,而且在点云补全和单视图CAD模型检索与配准两项任务中的表现优于使用人工标注数据训练的模型。这表明自动3D标注不仅能提升模型性能,还能大幅降低标注成本。为了支持未来3D场景理解的研究,作者将发布其生成的标注(SCANnotate++)以及训练好的模型。

链接: https://arxiv.org/abs/2504.13580
作者: Yuchen Rao,Stefan Ainetter,Sinisa Stekovic,Vincent Lepetit,Friedrich Fraundorfer
机构: Inst. of Visual Computing, Graz Univ. of Technology (格拉茨技术大学视觉计算研究所), Austria; LIGM, École des Ponts et Chaussees, IP Paris, CNRS (法国巴黎理工学院国立高等桥梁学校, 法国国家科学研究中心), France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Github Page: this https URL

点击查看摘要

Abstract:High-level 3D scene understanding is essential in many applications. However, the challenges of generating accurate 3D annotations make development of deep learning models difficult. We turn to recent advancements in automatic retrieval of synthetic CAD models, and show that data generated by such methods can be used as high-quality ground truth for training supervised deep learning models. More exactly, we employ a pipeline akin to the one previously used to automatically annotate objects in ScanNet scenes with their 9D poses and CAD models. This time, we apply it to the recent ScanNet++ v1 dataset, which previously lacked such annotations. Our findings demonstrate that it is not only possible to train deep learning models on these automatically-obtained annotations but that the resulting models outperform those trained on manually annotated data. We validate this on two distinct tasks: point cloud completion and single-view CAD model retrieval and alignment. Our results underscore the potential of automatic 3D annotations to enhance model performance while significantly reducing annotation costs. To support future research in 3D scene understanding, we will release our annotations, which we call SCANnotate++, along with our trained models.
zh

[CV-32] HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework

【速读】:该论文旨在解决RGB-D语义分割在室内场景中有效融合RGB图像的丰富色彩信息与深度图像的空间距离信息的关键挑战。现有方法大多忽视了RGB和深度图像在表达信息上的固有差异,而未能充分挖掘这两种模态的独特优势。论文提出的关键解决方案是设计了一种新颖的异构双分支框架HDBFormer,通过专门处理两种模态的差异来充分利用其特性。对于包含丰富细节的RGB图像,采用基础和细节编码器提取局部和全局特征;而对于更简单的深度图像,则提出了轻量级分层编码器LDFormer以高效提取深度特征。此外,引入了模态信息交互模块MIIM,结合Transformer和大核卷积,有效实现跨模态的全局与局部信息交互。实验表明,HDBFormer在NYUDepthv2和SUN-RGBD数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2504.13579
作者: Shuobin Wei,Zhuang Zhou,Zhengan Lu,Zizhao Yuan,Binghua Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, published to IEEE Signal Processing Letter

点击查看摘要

Abstract:In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Properly distinguishing the processing of RGB and depth images is essential to fully exploiting their unique and significant characteristics. To address this, we propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic and detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large kernel convolutions to interact global and local information across modalities efficiently. Extensive experiments show that HDBFormer achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. The code is available at: this https URL.
zh

[CV-33] MAAM: A Lightweight Multi-Agent Aggregation Module for Efficient Image Classification Based on the MindSpore Framework

【速读】:该论文旨在解决在资源受限环境中(如边缘设备或实时系统)图像分类任务中轻量级模型的需求,即在计算效率与鲁棒特征表示之间找到平衡的问题。传统注意力机制虽然具备强大的特征建模能力,但通常面临高计算复杂度和结构刚性的挑战,限制了其在有限计算资源场景中的应用。为了解决这一问题,论文提出了一种名为多智能体聚合模块(Multi-Agent Aggregation Module, MAAM)的轻量级注意力架构,并将其集成到MindSpore框架中。MAAM的关键在于通过三个并行智能体分支独立提取异构特征,并利用可学习的标量权重自适应融合这些特征,随后通过卷积压缩层进一步优化。此外,借助MindSpore的动态计算图和算子融合技术,MAAM实现了显著的性能提升,在CIFAR-10数据集上的准确率达到87.0%,同时大幅优于传统CNN(58.3%)和MLP(49.6%)模型,并提升了30%的训练效率。消融实验验证了智能体注意力(移除后准确率降至32.0%)和压缩模块(省略后降至25.5%)的重要性,证明了它们对于保持判别性特征学习的必要性。硬件加速能力和极低的内存占用进一步展示了其实际应用价值,为资源受限环境下的图像分类提供了一个不妥协精度的部署方案。

链接: https://arxiv.org/abs/2504.13574
作者: Zhenkai Qin,Feng Zhu,Huan Zeng,Xunyi Nong
机构: Guangxi Police College (广西警察学院); Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The demand for lightweight models in image classification tasks under resource-constrained environments necessitates a balance between computational efficiency and robust feature representation. Traditional attention mechanisms, despite their strong feature modeling capability, often struggle with high computational complexity and structural rigidity, limiting their applicability in scenarios with limited computational resources (e.g., edge devices or real-time systems). To address this, we propose the Multi-Agent Aggregation Module (MAAM), a lightweight attention architecture integrated with the MindSpore framework. MAAM employs three parallel agent branches with independently parameterized operations to extract heterogeneous features, adaptively fused via learnable scalar weights, and refined through a convolutional compression layer. Leveraging MindSpore’s dynamic computational graph and operator fusion, MAAM achieves 87.0% accuracy on the CIFAR-10 dataset, significantly outperforming conventional CNN (58.3%) and MLP (49.6%) models, while improving training efficiency by 30%. Ablation studies confirm the critical role of agent attention (accuracy drops to 32.0% if removed) and compression modules (25.5% if omitted), validating their necessity for maintaining discriminative feature learning. The framework’s hardware acceleration capabilities and minimal memory footprint further demonstrate its practicality, offering a deployable solution for image classification in resource-constrained scenarios without compromising accuracy.
zh
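
代码示意:论文基于 MindSpore 实现,这里用 PyTorch 给出 MAAM 的结构性最小示意(分支内部操作为占位假设,并非论文的具体智能体操作):三路独立参数化的并行分支经可学习标量权重融合,再过 1x1 卷积压缩层。

```python
import torch
import torch.nn as nn

class MAAMSketch(nn.Module):
    """三路并行分支 + 可学习标量融合 + 卷积压缩(结构示意)。"""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(3)          # 三个独立参数化的“智能体”分支(占位实现)
        ])
        self.weights = nn.Parameter(torch.ones(3))            # 可学习标量权重
        self.compress = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.softmax(self.weights, dim=0)                # 保持凸组合
        fused = sum(wi * branch(x) for wi, branch in zip(w, self.branches))
        return self.compress(fused)

print(MAAMSketch(32)(torch.randn(2, 32, 32, 32)).shape)
```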

[CV-34] WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion

【速读】:该论文旨在解决大规模恶劣天气下激光雷达(LiDAR)数据生成质量受限的问题。现有激光雷达模拟器只能针对单一恶劣天气使用单一物理模型进行仿真,生成数据的保真度较低。为应对这一挑战,论文提出WeatherGen,这是一种统一的多样化天气激光雷达数据扩散生成框架,显著提升了生成数据的保真度。

解决方案的关键在于:首先设计了一种基于地图的数据生产器,提供大量高质量的多样化天气数据用于训练;其次利用去噪扩散范式构建扩散模型,并提出一种“蜘蛛曼巴生成器”(spider mamba generator),逐步恢复被扰动的多样化天气数据,通过扫描激光雷达光束圆或中心射线来建模特征交互,有效保持激光雷达数据的物理结构;接着通过潜在特征对齐器将现实世界知识转移到生成器中;最后设计了一种基于对比学习的控制器,通过语言监督赋予天气控制信号紧凑的语义知识,指导扩散模型生成更具判别性的数据。这些创新点共同实现了高质量的多样化天气激光雷达数据生成。

链接: https://arxiv.org/abs/2504.13561
作者: Yang Wu,Yun Zhu,Kaihua Zhang,Jianjun Qian,Jin Xie,Jian Yang
机构: PCA Lab, Nanjing University of Science and Technology (南京理工大学 PCA 实验室); School of Automation, Southeast University (东南大学自动化学院); State Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); School of Intelligence Science and Technology, Nanjing University (南京大学智能科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single adverse weather with a single physical model, and the fidelity of the generated data is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which can provide a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Among them, we propose a spider mamba generator to restore the disturbed diverse weather data gradually. The spider mamba models the feature interactions by scanning the LiDAR beam circle or central ray, excellently maintaining the physical structure of the LiDAR data. Subsequently, following the generator to transfer real-world knowledge, we design a latent feature aligner. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision, guiding the diffusion model to generate more discriminative data. Extensive evaluations demonstrate the high generation quality of WeatherGen. Through WeatherGen, we construct the mini-weather dataset, promoting the performance of the downstream task under adverse weather conditions. Code is available: this https URL
zh

[CV-35] Zero-Shot Industrial Anomaly Segmentation with Image-Aware Prompt Generation PAKDD2025

【速读】:该论文旨在解决现有文本引导的零样本异常分割模型在工业场景中因固定提示而适应性不足的问题。论文的关键解决方案是提出Image-Aware Prompt Anomaly Segmentation (IAP-AS),通过结合图像标注模型和大型语言模型(LLM)生成动态且上下文感知的提示,从而增强异常分割的适应性和泛化能力。这一方法利用从图像中提取的对象属性生成上下文相关的提示,显著提升了在动态和非结构化工业环境中的性能,在实验中将F1-max指标提高了多达10%,展示了其卓越的适应性和泛化能力。

链接: https://arxiv.org/abs/2504.13560
作者: SoYoung Park,Hyewon Lee,Mingyu Choi,Seunghoon Han,Jong-Ryul Lee,Sungsu Lim,Tae-Ho Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to PAKDD 2025, 12 pages

点击查看摘要

Abstract:Anomaly segmentation is essential for industrial quality, maintenance, and stability. Existing text-guided zero-shot anomaly segmentation models are effective but rely on fixed prompts, limiting adaptability in diverse industrial scenarios. This highlights the need for flexible, context-aware prompting strategies. We propose Image-Aware Prompt Anomaly Segmentation (IAP-AS), which enhances anomaly segmentation by generating dynamic, context-aware prompts using an image tagging model and a large language model (LLM). IAP-AS extracts object attributes from images to generate context-aware prompts, improving adaptability and generalization in dynamic and unstructured industrial environments. In our experiments, IAP-AS improves the F1-max metric by up to 10%, demonstrating superior adaptability and generalization. It provides a scalable solution for anomaly segmentation across industries.
zh

[CV-36] Beyond One-Hot Labels: Semantic Mixing for Model Calibration

【速读】:该论文旨在解决现有模型校准方法因依赖单一热标签(one-hot labels)数据集而导致的对不确定性建模不足的问题。传统方法隐式假设所有标注具有完全确定性,这在分类任务中有效,但在需要反映预测置信度真实可能性的场景下显得知识不足。为克服这一局限,论文提出通过创建包含丰富数值化真实置信值的合成数据集来改进模型校准。解决方案的关键在于引入了一种名为校准感知语义混合(Calibration-aware Semantic Mixing, CSM)的新框架,该框架利用扩散模型生成具有混合类别特性的训练样本,并为其分配不同的置信分数。此外,针对扩散逆过程中的注释置信度与混合比例之间的不匹配问题,提出了校准重标注策略,并探索了更适合新数据表示范式的损失函数。实验结果表明,CSM 在模型校准性能上优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.13548
作者: Haoyang Luo,Linwei Tao,Minjing Dong,Chang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provides insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available as real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present Calibration-aware Semantic Mixing (CSM), a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. Besides, we explore the loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Code is available at this http URL.
zh
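
代码示意:下面用一个玩具函数说明“校准感知”标注的思想(高度简化:CSM 通过扩散模型在语义层面混合类别并重标注置信度,此处仅演示混合样本如何获得数值化的真实置信标签,函数名为本文虚构):

```python
import torch

def mixed_confidence_target(num_classes, class_a, class_b, ratio):
    """为混合样本生成数值化的真实置信标签(玩具示例)。"""
    target = torch.zeros(num_classes)
    target[class_a] = ratio           # 类别 A 的真实置信度
    target[class_b] = 1.0 - ratio     # 类别 B 的真实置信度
    return target

# 一个 70% 像类别 3、30% 像类别 7 的混合样本,其标签不再是 one-hot
print(mixed_confidence_target(10, class_a=3, class_b=7, ratio=0.7))
```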

[CV-37] EG-Gaussian: Epipolar Geometry and Graph Network Enhanced 3D Gaussian Splatting

【速读】:本文探讨了一个关于从图像重建3D场景的开放性研究问题。现有方法采用3D高斯点撒布(3D Gaussian Splatting, 3DGS)来生成3D场景,因其高效的训练过程而受到青睐。然而,这些方法可能产生不完整的3D场景或模糊的多视图,原因在于:(1) 3DGS点初始化不准确;(2) 在稀疏视角输入情况下,3DGS倾向于使3D高斯分布变平。为了解决这些问题,本文提出了一种新颖的框架EG-Gaussian,它利用了对极几何(epipolar geometry)和图网络(graph networks)进行3D场景重建。关键在于:首先,将对极几何集成到3DGS初始化阶段以增强初始3DGS点的构建;其次,专门设计了一个图学习模块来优化3DGS的空间特征,其中结合了邻近点之间的空间坐标和角度关系。实验结果表明,与基于3DGS的方法相比,该方法显著提高了重建精度。

链接: https://arxiv.org/abs/2504.13540
作者: Beizhen Zhao,Yifan Zhou,Zijian Wang,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州));
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we explore an open research problem concerning the reconstruction of 3D scenes from images. Recent methods have adopted 3D Gaussian Splatting (3DGS) to produce 3D scenes due to its efficient training process. However, these methodologies may generate incomplete 3D scenes or blurred multiviews. This is because of (1) inaccurate 3DGS point initialization and (2) the tendency of 3DGS to flatten 3D Gaussians with the sparse-view input. To address these issues, we propose a novel framework EG-Gaussian, which utilizes epipolar geometry and graph networks for 3D scene reconstruction. Initially, we integrate epipolar geometry into the 3DGS initialization phase to enhance initial 3DGS point construction. Then, we specifically design a graph learning module to refine 3DGS spatial features, in which we incorporate both spatial coordinates and angular relationships among neighboring points. Experiments on indoor and outdoor benchmark datasets demonstrate that our approach significantly improves reconstruction accuracy compared to 3DGS-based methods.
zh

[CV-38] OBIFormer: A Fast Attentive Denoising Framework for Oracle Bone Inscriptions

【速读】:该论文旨在解决因自然风化、腐蚀及人为破坏导致大量甲骨文(Oracle Bone Inscriptions, OBIs)碎片严重退化的问题,这使得甲骨文自动识别极具挑战性。传统方法要么关注像素级信息,要么利用普通变换器进行基于字形的甲骨文去噪,但这些方法计算开销巨大。为此,论文提出了一种快速注意力去噪框架OBIFormer,其关键在于利用通道自注意力、字形提取以及选择性核特征融合,能够在保证计算效率的同时精确重建去噪图像。实验表明,OBIFormer在合成与原始甲骨文数据集上的PSNR和SSIM指标达到了最先进的去噪性能,并在真实甲骨数据集的实验中展示了其在辅助甲骨文自动识别方面的巨大潜力。

链接: https://arxiv.org/abs/2504.13524
作者: Jinhao Li,Zijian Chen,Tingzhu Chen,Zhiji Liu,Changbo Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Oracle bone inscriptions (OBIs) are the earliest known form of Chinese characters and serve as a valuable resource for research in anthropology and archaeology. However, most excavated fragments are severely degraded due to thousands of years of natural weathering, corrosion, and man-made destruction, making automatic OBI recognition extremely challenging. Previous methods either focus on pixel-level information or utilize vanilla transformers for glyph-based OBI denoising, which leads to tremendous computational overhead. Therefore, this paper proposes a fast attentive denoising framework for oracle bone inscriptions, i.e., OBIFormer. It leverages channel-wise self-attention, glyph extraction, and selective kernel feature fusion to reconstruct denoised images precisely while being computationally efficient. Our OBIFormer achieves state-of-the-art denoising performance for PSNR and SSIM metrics on synthetic and original OBI datasets. Furthermore, comprehensive experiments on a real oracle dataset demonstrate the great potential of our OBIFormer in assisting automatic OBI recognition. The code will be made available at this https URL.
zh
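
代码示意:通道自注意力的一个常见最小写法如下(按 Restormer 式转置注意力作假设性简化,省略多头与可学习温度,并非 OBIFormer 的原始模块):注意力在通道维而非像素维计算,代价随 C^2 而非 (HW)^2 增长,这正是此类设计计算高效的来源。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """通道维自注意力(简化示意,省略多头与可学习温度)。"""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)   # (b, c, c):通道间注意力
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)

print(ChannelSelfAttention(16)(torch.randn(1, 16, 32, 32)).shape)
```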

[CV-39] U-Shape Mamba: State Space Model for faster diffusion CVPR2025

【速读】:该论文旨在解决扩散模型在高质量图像生成中的高计算成本问题。为应对这一挑战,论文提出了一种名为U-Shape Mamba (USM)的新扩散模型,其关键在于利用基于Mamba的层构建了一个类似于U-Net的分层结构。通过在编码器中逐步减少序列长度并在解码器中通过Mamba块恢复序列长度,USM显著降低了计算开销,同时保持了强大的生成能力。实验结果表明,USM相比当前最高效的基于Mamba的扩散模型Zigma,实现了GFlops降低至三分之一、更低的内存需求及更快的运行速度,并且在图像质量上超越了Zigma,分别在AFHQ、CelebAHQ和COCO数据集上的Frechet Inception Distance (FID)指标提升了15.3、0.84和2.7点。这些结果凸显了USM作为高效且可扩展的扩散模型解决方案的优势,使高质量图像合成更加便捷,同时降低了计算成本。

链接: https://arxiv.org/abs/2504.13499
作者: Alex Ergasti,Filippo Botti,Tomaso Fontanini,Claudio Ferrari,Massimo Bertozzi,Andrea Prati
机构: University of Parma (帕尔马大学), Department of Engineering and Architecture (工程与建筑系); University of Siena (锡耶纳大学), Department of Information engineering and mathematics (信息工程与数学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepeted at CVPR 2025 eLVM workshop

点击查看摘要

Abstract:Diffusion models have become the most popular approach for high-quality image generation, but their high computational cost still remains a significant challenge. To address this problem, we propose U-Shape Mamba (USM), a novel diffusion model that leverages Mamba-based layers within a U-Net-like hierarchical structure. By progressively reducing sequence length in the encoder and restoring it in the decoder through Mamba blocks, USM significantly lowers computational overhead while maintaining strong generative capabilities. Experimental results against Zigma, which is currently the most efficient Mamba-based diffusion model, demonstrate that USM achieves one-third the GFlops, requires less memory and is faster, while outperforming Zigma in image quality. Frechet Inception Distance (FID) is improved by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO datasets, respectively. These findings highlight USM as a highly efficient and scalable solution for diffusion-based generative models, making high-quality image synthesis more accessible to the research community while reducing computational costs.
zh

[CV-40] Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing

【速读】:该论文旨在解决基于扩散模型的指令引导图像编辑中因随机噪声引起的背景失真等问题,以及现有方法依赖试错调整种子(seed)或提示(prompt)导致效率低下的挑战。此外,虽然已有针对Text-to-Image (T2I)生成的种子选择方法,但这些方法通常依赖外部验证器且评估多组种子会增加计算复杂度,限制了其适用性。

论文的关键解决方案包括两个方面:首先提出了一种基于多个种子的图像编辑基线方法,利用背景一致性分数实现了无监督的最佳结果(Best-of-N);其次引入了ELECT(Early-timestep Latent Evaluation for Candidate Selection),这是一种零样本框架,通过在扩散过程的早期时间步估计背景不匹配来选择可靠的种子,从而识别出能够保持背景不变而仅修改前景的种子。ELECT通过背景不一致分数对候选种子进行排序,在早期过滤掉不适合的样本,同时保留可编辑性。除了独立的种子选择外,ELECT还可集成到指令引导的编辑管道中,并扩展到联合种子和提示的选择任务,进一步提升性能。实验表明,ELECT平均减少了41%的计算成本,最高可达61%,同时显著提高了背景一致性和指令遵循度,在之前失败的情况下达到了约40%的成功率,且无需任何外部监督或额外训练。

链接: https://arxiv.org/abs/2504.13490
作者: Joowon Kim,Ziseok Lee,Donghyeon Cho,Sanghyun Jo,Yeonsung Jung,Kyungsu Kim,Eunho Yang
机构: KAIST; Seoul National University (首尔国立大学); OGQ
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting applicability, and evaluating multiple seeds increases computational complexity. To address this, we first establish a multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates in previously failed cases - without any external supervision or training.
zh
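
代码示意:早期时间步种子筛选的流程可以示意如下(edit_fn 为本文虚构的接口,代表“跑前若干扩散步并解码预览”的任意实现;打分方式也是简化版本,并非论文的精确定义):

```python
import torch

def select_seed(edit_fn, bg_mask, seeds, early_t=5):
    """在早期扩散时间步上按背景不一致分数挑选种子(简化示意)。"""
    best_seed, best_score = None, float("inf")
    for seed in seeds:
        preview, source = edit_fn(seed, num_steps=early_t)   # 前 early_t 步的解码预览
        # 背景不一致分数:背景掩码内的平均绝对变化
        score = ((preview - source).abs() * bg_mask).mean().item()
        if score < best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score

# 用随机张量代替真实预览的演示
dummy_edit = lambda seed, num_steps: (torch.randn(3, 64, 64), torch.randn(3, 64, 64))
print(select_seed(dummy_edit, bg_mask=torch.ones(3, 64, 64), seeds=[0, 1, 2]))
```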

[CV-41] Variational Autoencoder Framework for Hyperspectral Retrievals (Hyper-VAE) of Phytoplankton Absorption and Chlorophyll a in Coastal Waters for NASAs EMIT and PACE Missions

【速读】:该论文旨在解决利用高光谱遥感数据精确反演浮游植物吸收系数(aphy)和叶绿素a浓度(Chl-a)的问题,特别是在光学特性复杂的河口和近海区域。这些区域由于浮游植物群落组成的多样性及光学性质的复杂性,传统海洋色彩遥感方法面临显著挑战。论文的关键解决方案在于创新性地将变分自编码器(Variational Autoencoder, VAE)模型应用于高光谱遥感反射率(Rrs)的反演任务,并通过特定设计优化其在多分布预测问题中的表现。与混合密度网络(Mixture Density Network, MDN)相比,VAE展现出更高的精度和更低的偏差,尤其在处理高维数据集(如PACE任务)时具有明显优势。这一研究为结合人工智能技术提升对水生生态系统中浮游植物群落动态的理解提供了有力支持。

链接: https://arxiv.org/abs/2504.13476
作者: Jiadong Lou,Bingqing Liu,Yuanheng Xiong,Xiaodong Zhang,Xu Yuan
机构: Department of Computer & Information Sciences, University of Delaware (特拉华大学计算机与信息系统系); School of Geosciences, University of Louisiana at Lafayette (路易斯安那大学拉法叶分校地球科学学院); School of Ocean Science and Engineering, The University of Southern Mississippi (南密西西比大学海洋科学与工程学院); Department of Computer & Information Sciences, University of Delaware (特拉华大学计算机与信息系统系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Phytoplankton absorb and scatter light in unique ways, subtly altering the color of water, changes that are often minor for human eyes to detect but can be captured by sensitive ocean color instruments onboard satellites from space. Hyperspectral sensors, paired with advanced algorithms, are expected to significantly enhance the characterization of phytoplankton community composition, especially in coastal waters where ocean color remote sensing applications have historically encountered significant challenges. This study presents novel machine learning-based solutions for NASA’s hyperspectral missions, including EMIT and PACE, tackling high-fidelity retrievals of phytoplankton absorption coefficient and chlorophyll a from their hyperspectral remote sensing reflectance. Given that a single Rrs spectrum may correspond to varied combinations of inherent optical properties and associated concentrations, the Variational Autoencoder (VAE) is used as a backbone in this study to handle such multi-distribution prediction problems. For the first time, we tailor the VAE model with innovative designs to achieve hyperspectral retrievals of aphy and Chl-a from hyperspectral Rrs in optically complex estuarine-coastal waters. Validation with extensive experimental observations demonstrates superior performance of the VAE models with high precision and low bias. The in-depth analysis of VAE’s advanced model structures and learning designs highlights the improvement and advantages of VAE-based solutions over the mixture density network (MDN) approach, particularly on high-dimensional data, such as PACE. Our study provides strong evidence that current EMIT and PACE hyperspectral data as well as the upcoming Surface Biology Geology mission will open new pathways toward a better understanding of phytoplankton community dynamics in aquatic ecosystems when integrated with AI technologies.
zh
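
代码示意:下面给出面向光谱回归的最小 VAE(波段数、网络宽度与输出维度均为示意值,并非论文结构):隐变量的随机采样使同一条 Rrs 光谱可以映射到反演结果的分布而非单点估计,这正是 VAE 适合此类多分布预测问题的原因。

```python
import torch
import torch.nn as nn

class SpectraVAE(nn.Module):
    """从 Rrs 光谱反演水体参数的最小 VAE(结构与维度均为示意)。"""
    def __init__(self, n_bands=200, latent=16, n_out=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bands, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent)
        self.to_logvar = nn.Linear(128, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                 nn.Linear(128, n_out))

    def forward(self, rrs):
        h = self.enc(rrs)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # 重参数化采样
        return self.dec(z), mu, logvar                            # 训练时另加 KL 项

chla, mu, logvar = SpectraVAE()(torch.rand(4, 200))
print(chla.shape)  # torch.Size([4, 1])
```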

[CV-42] HMPE:HeatMap Embedding for Efficient Transformer-Based Small Object Detection

【速读】:该论文旨在解决小目标检测中基于Transformer方法存在的显著不足,通过引入HeatMap Position Embedding (HMPE),提出了一种新颖的Transformer优化技术,其关键是动态融合位置编码与语义检测信息,并通过热图引导实现自适应优化。此外,论文设计了Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) 和 HeatMap Induced High-Quality Queries for Decoder (HIDQ) 模块,分别用于增强编码器和解码器性能,同时结合热图嵌入与Linear-Snake Conv (LSConv) 特征工程,提升小目标类别多样性的嵌入表达能力并减少解码器多头注意力层的数量,从而加速推理和训练过程。实验结果表明,该方法在小目标数据集(NWPU VHR-10) 上提升了1.9% 的mAP,在通用数据集(PASCAL VOC) 上提升了1.2%,并通过HMPE优化将解码器层数从8层降至最少3层,显著降低了计算成本。

链接: https://arxiv.org/abs/2504.13469
作者: YangChen Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Current Transformer-based methods for small object detection continue to emerge, yet they still exhibit significant shortcomings. This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer optimization technique that enhances object detection performance by dynamically integrating positional encoding with semantic detection information through heatmap-guided adaptive fusion. We also innovatively visualize the HMPE method, offering clear visualization of embedded information for parameter tuning. We then create Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) and HeatMap Induced High-Quality Queries for Decoder (HIDQ) modules. These are designed for the encoder and decoder, respectively, to generate high-quality queries and reduce background noise. Combining both heatmap embedding and Linear-Snake Conv (LSConv) feature engineering, we enhance the embedding of massively diverse small object categories and reduce the decoder multihead layers, thereby accelerating both inference and training. In the generalization experiments, our approach outperforms the baseline mAP by 1.9% on the small object dataset (NWPU VHR-10) and by 1.2% on the general dataset (PASCAL VOC). By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.
zh

[CV-43] Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

【速读】:该论文试图解决传统少样本时序动作定位(Few-shot Temporal Action Localization, Few-shot TAL)方法仅依赖视频级信息而忽略文本信息的问题。文本信息可以提供有价值的语义支持以增强定位性能。为了解决这一问题,论文的关键在于提出了一种基于Chain-of-Thought文本推理的新少样本时序动作定位方法。该方法设计了一个新颖的少样本学习框架,利用文本语义信息提升模型捕捉动作共同点和变化的能力,包括一个语义感知的文本-视觉对齐模块,用于在不同层次上对查询视频和支持视频进行对齐。同时,为了更好地表达文本层面的动作之间的时间依赖性和因果关系以辅助动作定位,设计了一种类似于Chain of Thought (CoT) 的推理方法,逐步引导视觉语言模型(Vision-Language Model, VLM)和大型语言模型(Large Language Model, LLM)生成类似于CoT的文本描述,这些生成的文本能够捕获比视觉特征更多的动作变化。

链接: https://arxiv.org/abs/2504.13460
作者: Hongwei Ji,Wulian Yun,Mengshi Qi,Huadong Ma
机构: State Key Laboratory of Networking and Switching Technology (网络与交换技术国家重点实验室),
Beijing University of Posts and Telecommunications (北京邮电大学), China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model’s ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We introduce the first dataset named Human-related Anomaly Localization and explore the application of the TAL task in human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.
zh

[CV-44] Learning from Noisy Pseudo-labels for All-Weather Land Cover Mapping

【速读】:该论文旨在解决合成孔径雷达(SAR)图像语义分割中的伪标签噪声问题,由于SAR传感器对云层和光照条件不敏感,SAR图像在遥感领域受到关注,但其详细信息不足且存在显著斑点噪声,导致人工标注或自动分割困难。现有方法通过配对光学-SAR图像利用光学图像分割网络生成伪标签,但这些伪标签包含大量噪声,影响了SAR图像分割性能。为解决此问题,论文提出了一种更精确的伪标签生成方法,结合半监督学习与新型图像分辨率对齐增强技术。关键在于引入对称交叉熵损失函数以减轻噪声伪标签的影响,并采用一系列训练和测试技巧优化分割效果。实验结果表明,该方法在GRSS数据融合竞赛中表现优异,排名第一。

链接: https://arxiv.org/abs/2504.13458
作者: Wang Liu,Zhiyu Wang,Xin Guo,Puhong Duan,Xudong Kang,Shutao Li
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of SAR images has garnered significant attention in remote sensing due to the immunity of SAR sensors to cloudy weather and light conditions. Nevertheless, SAR imagery lacks detailed information and is plagued by significant speckle noise, rendering the annotation or segmentation of SAR images a formidable task. Recent efforts have resorted to annotating paired optical-SAR images to generate pseudo-labels through the utilization of an optical image segmentation network. However, these pseudo-labels are laden with noise, leading to suboptimal performance in SAR image segmentation. In this study, we introduce a more precise method for generating pseudo-labels by incorporating semi-supervised learning alongside a novel image resolution alignment augmentation. Furthermore, we introduce a symmetric cross-entropy loss to mitigate the impact of noisy pseudo-labels. Additionally, a bag of training and testing tricks is utilized to generate better land-cover mapping results. Our experiments on the GRSS data fusion contest indicate the effectiveness of the proposed method, which achieves first place. The code is available at this https URL.
zh
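
代码示意:摘要中的对称交叉熵(symmetric cross-entropy, Wang et al., 2019)有标准写法,如下(超参数为示意值,并非论文设定;分割任务中按像素逐点计算,此处以分类形状演示):CE 项拟合可能含噪的伪标签,反向 CE 项通过把 one-hot 标签中的 log(0) 截断为常数 A 而对标签噪声鲁棒。

```python
import math
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, target, alpha=0.1, beta=1.0, A=-4.0):
    """对称交叉熵 SCE = alpha * CE + beta * RCE(超参数为示意值)。"""
    ce = F.cross_entropy(logits, target)                   # 拟合(可能含噪的)伪标签
    pred = F.softmax(logits, dim=1).clamp(min=1e-7)
    # 反向交叉熵:交换预测与标签的角色,并把 one-hot 中的 log(0) 截断为 A
    one_hot = F.one_hot(target, logits.size(1)).float().clamp(min=math.exp(A))
    rce = -(pred * one_hot.log()).sum(dim=1).mean()
    return alpha * ce + beta * rce

loss = symmetric_cross_entropy(torch.randn(8, 5), torch.randint(0, 5, (8,)))
print(loss.item())
```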

[CV-45] Neural Ganglion Sensors: Learning Task-specific Event Cameras Inspired by the Neural Circuit of the Human Retina

【速读】:本文旨在解决传统事件相机在处理视觉信息时未能充分利用局部空间上下文以及缺乏多任务适应性的问题。论文的关键创新在于引入了神经节感测器(Neural Ganglion Sensors),这是一种对传统事件相机的扩展,通过学习任务特定的空间-时间视网膜核(即模仿视网膜神经节细胞“事件”的行为)来提升性能。这种方法不仅提高了视觉信息处理的效果,还在视频插值和光流估计等任务中减少了整体事件带宽,从而展示了受神经节细胞启发的事件传感器在边缘设备及其它低功耗实时应用中的潜力。

链接: https://arxiv.org/abs/2504.13457
作者: Haley M. So,Gordon Wetzstein
机构: Department of Electrical Engineering, Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Inspired by the data-efficient spiking mechanism of neurons in the human eye, event cameras were created to achieve high temporal resolution with minimal power and bandwidth requirements by emitting asynchronous, per-pixel intensity changes rather than conventional fixed-frame rate images. Unlike retinal ganglion cells (RGCs) in the human eye, however, which integrate signals from multiple photoreceptors within a receptive field to extract spatio-temporal features, conventional event cameras do not leverage local spatial context when deciding which events to fire. Moreover, the eye contains around 20 different kinds of RGCs operating in parallel, each attuned to different features or conditions. Inspired by this biological design, we introduce Neural Ganglion Sensors, an extension of traditional event cameras that learns task-specific spatio-temporal retinal kernels (i.e., RGC “events”). We evaluate our design on two challenging tasks: video interpolation and optical flow. Our results demonstrate that our biologically inspired sensing improves performance relative to conventional event cameras while reducing overall event bandwidth. These findings highlight the promise of RGC-inspired event sensors for edge devices and other low-power, real-time applications requiring efficient, high-resolution visual streams.
zh
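
代码示意:“学习型 RGC 事件单元”可以用可学习 3D 时空卷积核加可学习阈值来直观理解。以下为玩具示意(核尺寸与阈值规则均为本文假设,训练时还需对硬阈值使用可导替代,论文的实际设计可能不同):

```python
import torch
import torch.nn as nn

class GanglionEvents(nn.Module):
    """K 类可学习时空核 + 可学习阈值的事件发放单元(玩具示意)。"""
    def __init__(self, k_types=4, t=5, s=5):
        super().__init__()
        # 每类“神经节细胞”对应一个 3D 时空卷积核,利用局部空间上下文
        self.kernels = nn.Conv3d(1, k_types, kernel_size=(t, s, s),
                                 padding=(t // 2, s // 2, s // 2))
        self.threshold = nn.Parameter(torch.full((k_types,), 0.5))

    def forward(self, clip):                        # clip: (B, 1, T, H, W)
        r = self.kernels(clip)
        th = self.threshold.view(1, -1, 1, 1, 1)
        # 硬阈值发放事件;实际训练需用可导替代(如 sigmoid 近似)
        return (r.abs() > th).float()

events = GanglionEvents()(torch.randn(1, 1, 8, 32, 32))
print(events.shape, events.mean().item())           # 平均发放率对应事件带宽
```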

[CV-46] MicroFlow: Domain-Specific Optical Flow for Ground Deformation Estimation in Seismic Events

【速读】:该论文旨在解决地质研究中密集地面位移测量的需求与直接采集的不现实性之间的矛盾,传统方法依赖光学卫星图像的时间序列进行块匹配以估计位移场,但这种方法存在缺乏真实的地面实况数据、需要亚像素精度以及受地质或人为变化导致的时间变异性等挑战。特别是,基于深度学习的模型在估计真实场景中小位移时表现不佳,因为它们依赖显式的相关层。论文的关键解决方案在于提出一种采用迭代细化机制、包含显式扭曲层和独立于相关性的主干网络的模型,从而实现亚像素精度。此外,引入非凸形式的总变分正则化,在保持边缘锐利的同时确保其他区域平滑。该模型在半合成基准测试中显著优于广泛使用的地球物理方法,并且在由中高分辨率传感器捕获的具有挑战性的实际场景中表现出良好的泛化能力。

链接: https://arxiv.org/abs/2504.13452
作者: Juliette Bertrand,Sophie Giffard-Roisin,James Hollingsworth,Julien Mairal
机构: Univ. Grenoble Alpes (格勒诺布尔-阿尔卑斯大学); Inria (法国国家信息与自动化研究所); CNRS (法国国家科学研究中心); Grenoble INP (格勒诺布尔国立理工学院); LJK (格勒诺布尔数学实验室); Univ. Savoie Mont Blanc (萨瓦大学); IRD (法国发展研究院); Univ. Gustave Eiffel (埃菲尔大学); ISTerre (地球科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle at estimating small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. Project page: this https URL.
zh
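
代码示意:非凸 TV 正则的常见形式是对位移场梯度施加次线性增长的鲁棒罚。下面以 Lorentzian/Cauchy 罚为例作示意(势函数与 sigma 为本文选择,并非论文的具体形式):小梯度被平滑,而断层线处的大跳变只受次线性惩罚,因而边缘保持锐利。

```python
import torch

def nonconvex_tv(flow, sigma=0.1):
    """对位移场梯度施加 Lorentzian 罚的非凸 TV(势函数为示意选择)。
    flow: (B, 2, H, W) 的二维位移场。"""
    dx = flow[..., :, 1:] - flow[..., :, :-1]      # 水平方向差分
    dy = flow[..., 1:, :] - flow[..., :-1, :]      # 垂直方向差分
    rho = lambda g: torch.log1p((g / sigma) ** 2)  # 次线性增长:大跳变惩罚饱和
    return rho(dx).mean() + rho(dy).mean()

print(nonconvex_tv(torch.randn(1, 2, 32, 32)).item())
```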

[CV-47] SatelliteCalculator: A Multi-Task Vision Foundation Model for Quantitative Remote Sensing Inversion

【速读】:该论文试图解决定量遥感反演中物理可解释回归任务应用不足的问题,以及多光谱特性和地理空间异质性对模型泛化性的挑战。论文的关键解决方案在于提出SatelliteCalculator,这是一种专为定量遥感反演设计的视觉基础模型。通过利用物理定义的指数公式,自动构建了一个包含超过一百万个样本对的大规模数据集,覆盖八个核心生态指标。模型采用冻结的Swin Transformer骨干网络,并结合提示引导架构,引入了交叉注意力适配器和轻量级任务特定MLP解码器。这种设计不仅实现了在所有任务上的竞争性精度,还显著降低了推理成本。

链接: https://arxiv.org/abs/2504.13442
作者: Zhenyu Yu,Mohd. Yamani Idna Idris,Pei Wang
机构: Faculty of Computer Science and Information Technology, Universiti Malaya (马来亚大学), Kuala Lumpur 50603, Malaysia; Faculty of Information Engineering and Automation, Kunming University of Science and Technology (昆明理工大学), Kunming 210098, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantitative remote sensing inversion plays a critical role in environmental monitoring, enabling the estimation of key ecological variables such as vegetation indices, canopy structure, and carbon stock. Although vision foundation models have achieved remarkable progress in classification and segmentation tasks, their application to physically interpretable regression remains largely unexplored. Furthermore, the multi-spectral nature and geospatial heterogeneity of remote sensing data pose significant challenges for generalization and transferability. To address these issues, we introduce SatelliteCalculator, the first vision foundation model tailored for quantitative remote sensing inversion. By leveraging physically defined index formulas, we automatically construct a large-scale dataset of over one million paired samples across eight core ecological indicators. The model integrates a frozen Swin Transformer backbone with a prompt-guided architecture, featuring cross-attentive adapters and lightweight task-specific MLP decoders. Experiments on the Open-Canopy benchmark demonstrate that SatelliteCalculator achieves competitive accuracy across all tasks while significantly reducing inference cost. Our results validate the feasibility of applying foundation models to quantitative inversion, and provide a scalable framework for task-adaptive remote sensing estimation.
zh
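
代码示意:物理定义的指数公式可直接从多光谱影像逐像素生成稠密回归标签,这正是论文自动构建百万级样本对的原理。以 NDVI 为例(波段来源与 eps 为示意):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red),逐像素生成稠密回归标签。"""
    return (nir - red) / (nir + red + eps)

nir, red = np.random.rand(64, 64), np.random.rand(64, 64)
labels = ndvi(nir, red)          # 无需人工标注即可得到训练目标
print(labels.min(), labels.max())
```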

[CV-48] mporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation

【速读】:该论文旨在解决机器人辅助腹腔镜手术场景分割中的两个关键挑战:(1) 静态图像的局限性,包括局部特征相似性的模糊性和精细结构细节的复杂性;(2) 动态视频的复杂性,源于器械快速运动和持续视觉遮挡。现有方法主要关注空间特征提取,却忽略了手术视频流中的时间依赖关系。为了解决这些问题,论文提出了一种双向注意架构——时间非对称特征传播网络,其核心在于通过跨帧特征传播增强时间相关性建模。解决方案的关键在于引入时间查询传播器(temporal query propagator),以整合多方向一致性约束来提升特定帧的特征表示能力,并设计了一个聚合的非对称特征金字塔模块,用于保留解剖结构和手术器械的判别性特征。这种方法独特之处在于同时实现了时间引导和上下文推理,从而显著提升了手术场景理解的能力。在两个公开数据集上的综合评估表明,该方法在EndoVis2018上提高了+16.4%的mIoU,在Endoscapes2023上提高了+3.3%的mAP,大幅超越当前最先进的方法。

链接: https://arxiv.org/abs/2504.13440
作者: Cheng Yuan,Yutong Ban
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical scene segmentation is crucial for robot-assisted laparoscopic surgery understanding. Current approaches face two challenges: (i) static image limitations including ambiguous local feature similarities and fine-grained structural details, and (ii) dynamic video complexities arising from rapid instrument motion and persistent visual occlusions. While existing methods mainly focus on spatial feature extraction, they fundamentally overlook temporal dependencies in surgical video streams. To address this, we present temporal asymmetric feature propagation network, a bidirectional attention architecture enabling cross-frame feature propagation. The proposed method contains a temporal query propagator that integrates multi-directional consistency constraints to enhance frame-specific feature representation, and an aggregated asymmetric feature pyramid module that preserves discriminative features for anatomical structures and surgical instruments. Our framework uniquely enables both temporal guidance and contextual reasoning for surgical scene understanding. Comprehensive evaluations on two public benchmarks show the proposed method outperforms the current SOTA methods by a large margin, with +16.4% mIoU on EndoVis2018 and +3.3% mAP on Endoscapes2023. The code will be publicly available after paper acceptance.
zh

[CV-49] Circular Image Deturbulence using Quasi-conformal Geometry

【速读】:该论文旨在解决光学传感器与物体之间存在非均匀介质导致成像输出失真的问题,这显著增加了下游图像处理任务的复杂性。论文的关键挑战在于缺乏用于训练监督模型的高质量配对标记图像。为了解决这一问题,论文引入了圆形准共形去湍流(Circular Quasi-Conformal Deturbulence, CQCD)框架,这是一种无监督方法,通过圆形架构去除图像失真。该方案的关键在于设计了一个确保恢复图像在几何上精确且视觉上忠实的圆形恢复过程,并结合正向和逆向映射。为了保证估计的非刚性变形具有双射性,利用计算准共形几何理论来正则化映射,强制其保持同胚性质,从而确保定义明确的变换,保留结构完整性并防止产生不必要的伪影。此外,还集成了紧框架块以编码对失真敏感的特征,实现精确恢复。实验结果表明,CQCD 在图像恢复质量和变形场估计准确性方面均优于现有最先进的去湍流方法。

链接: https://arxiv.org/abs/2504.13432
作者: Chu Chen,Han Zhang,Lok Ming Lui
机构: City University of Hong Kong (香港城市大学); Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The presence of inhomogeneous media between optical sensors and objects leads to distorted imaging outputs, significantly complicating downstream image-processing tasks. A key challenge in image restoration is the lack of high-quality, paired-label images required for training supervised models. In this paper, we introduce the Circular Quasi-Conformal Deturbulence (CQCD) framework, an unsupervised approach for removing image distortions through a circular architecture. This design ensures that the restored image remains both geometrically accurate and visually faithful while preventing the accumulation of incorrect estimations. The circular restoration process involves both forward and inverse mapping. To ensure the bijectivity of the estimated non-rigid deformations, computational quasi-conformal geometry theories are leveraged to regularize the mapping, enforcing its homeomorphic properties. This guarantees a well-defined transformation that preserves structural integrity and prevents unwanted artifacts. Furthermore, tight-frame blocks are integrated to encode distortion-sensitive features for precise recovery. To validate the performance of our approach, we conduct evaluations on various synthetic and real-world captured images. Experimental results demonstrate that CQCD not only outperforms existing state-of-the-art deturbulence methods in terms of image restoration quality but also provides highly accurate deformation field estimations.
zh
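
代码示意:拟共形几何中的核心量是 Beltrami 系数 mu = f_zbar / f_z,处处满足 |mu| < 1 即保证映射为双射(同胚)。下面按标准定义由平面映射的偏导数数值计算 mu(差分与网格细节为示意,与论文实现无关):

```python
import torch

def beltrami_coefficient(ux, uy, vx, vy):
    """由平面映射 f = u + i*v 的偏导数计算 Beltrami 系数 mu = f_zbar / f_z。"""
    fz    = 0.5 * torch.complex(ux + vy, vx - uy)   # f_z    = (f_x - i*f_y) / 2
    fzbar = 0.5 * torch.complex(ux - vy, vx + uy)   # f_zbar = (f_x + i*f_y) / 2
    return fzbar / (fz + 1e-8)

# 玩具映射:u = x + 0.2*y, v = y(恒定剪切),mu 为常数
ux, uy = torch.ones(8, 8), 0.2 * torch.ones(8, 8)
vx, vy = torch.zeros(8, 8), torch.ones(8, 8)
mu = beltrami_coefficient(ux, uy, vx, vy)
print(mu.abs().max().item())    # < 1,映射保持双射
```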

[CV-50] HSACNet: Hierarchical Scale-Aware Consistency Regularized Semi-Supervised Change Detection ICME2025

【速读】:该论文旨在解决半监督变化检测(Semi-supervised Change Detection, SSCD)在复杂场景下性能不佳的问题,特别是面对噪声数据时表现较差,并且现有方法通常忽视了层内多尺度特征的重要性,仅关注层间融合,导致不同尺度变化目标的完整性受损。为了解决这些问题,论文提出了HSACNet(Hierarchical Scale-Aware Consistency regularized Network),其关键是通过引入Segment Anything Model 2 (SAM2) 的Hiera骨干作为编码器以提取层间多尺度特征,并使用适配器实现高效的参数微调;同时设计了Scale-Aware Differential Attention Module (SADAM),用于精确捕获层内多尺度变化特征并抑制噪声;此外,采用了双增强一致性正则化策略来有效利用未标注数据。实验结果表明,HSACNet在多个变化检测基准测试中达到了最先进的性能,同时减少了参数量和计算成本。

链接: https://arxiv.org/abs/2504.13428
作者: Qi’ao Xu,Pengfei Wang,Yanjun Li,Tianwen Qian,Xiaoling Wang
机构: School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院), Shanghai, 200062, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 8 figures, accepted by ICME 2025

点击查看摘要

Abstract:Semi-supervised change detection (SSCD) aims to detect changes between bi-temporal remote sensing images by utilizing limited labeled data and abundant unlabeled data. Existing methods struggle in complex scenarios, exhibiting poor performance when confronted with noisy data. They typically neglect intra-layer multi-scale features while emphasizing inter-layer fusion, harming the integrity of change objects with different scales. In this paper, we propose HSACNet, a Hierarchical Scale-Aware Consistency regularized Network for SSCD. Specifically, we integrate Segment Anything Model 2 (SAM2), using its Hiera backbone as the encoder to extract inter-layer multi-scale features and applying adapters for parameter-efficient fine-tuning. Moreover, we design a Scale-Aware Differential Attention Module (SADAM) that can precisely capture intra-layer multi-scale change features and suppress noise. Additionally, a dual-augmentation consistency regularization strategy is adopted to effectively utilize the unlabeled data. Extensive experiments across four CD benchmarks demonstrate that our HSACNet achieves state-of-the-art performance, with reduced parameters and computational cost.
zh
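
代码示意:双增强一致性正则的通用写法如下(示意:弱增强视角的预测作为强增强视角的软目标;为简化以单图分割头代替论文的双时相变化检测模型,weak_aug/strong_aug 为假设的增强函数,并非论文的具体损失):

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, weak_aug, strong_aug):
    """弱增强预测作为强增强视角的软目标(通用一致性正则写法)。"""
    with torch.no_grad():
        target = model(weak_aug(x_unlabeled)).softmax(dim=1)  # 软伪标签,不回传梯度
    pred = model(strong_aug(x_unlabeled))
    return F.cross_entropy(pred, target)    # PyTorch >= 1.10 支持概率型目标

model = torch.nn.Conv2d(3, 2, kernel_size=1)   # 以单输入分割头代替双时相 CD 模型
x = torch.randn(4, 3, 32, 32)
loss = consistency_loss(model, x, weak_aug=lambda t: t,
                        strong_aug=lambda t: t + 0.1 * torch.randn_like(t))
print(loss.item())
```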

[CV-51] Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction

【速读】:该论文致力于解决现有基于匹配的多视图3D重建模型在挑战性区域(如纹理不足区域和低光照条件)中重建质量显著下降的问题。解决方案的关键在于引入了一个单目引导的细化(refinement)模块,通过将单目几何先验整合到多视图重建框架中,利用单目几何估计的固有鲁棒性来弥补匹配方法的不足,从而大幅提升多视图重建系统的稳健性,并实现高质量的前馈重建。

链接: https://arxiv.org/abs/2504.13419
作者: Wenyu Li,Sidun Liu,Peng Qiao,Yong Dou
机构: National University of Defence Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.
zh

[CV-52] How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings

【速读】:该论文旨在解决神经网络在低维空间映射中普遍存在的一种谱偏置(spectral bias)问题,即在朴素实现中无法有效学习高频信息。为缓解这一问题,论文对比分析了两种常见技术:Fourier特征编码(Fourier Feature Encodings, FFE)和多网格参数编码(Multigrid Parametric Encodings, MPE)。关键在于,尽管FFE被广泛认为是低维映射的标准方法,但MPE通过其自适应网格结构能够以更高的分辨率和更精细的细节表现超越FFE,并且避免了FFE因基于傅里叶变换而可能产生的混叠效应(aliasing)。通过使用神经切线核(Neural Tangent Kernel, NTK)进行分析,论文证明MPE通过其网格结构而非可学习嵌入提升了性能,而FFE则完全依赖于其嵌入空间来改善表现。实验结果表明,MPE相比基线提升了最小特征值8个数量级,相比FFE提升了2个数量级,从而显著增强了高频信息的学习能力。

链接: https://arxiv.org/abs/2504.13412
作者: Samuel Audia,Soheil Feizi,Matthias Zwicker,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学帕克分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural networks that map between low dimensional spaces are ubiquitous in computer graphics and scientific computing; however, in their naive implementation, they are unable to learn high frequency information. We present a comprehensive analysis comparing the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE). FFEs are seen as the standard for low dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail. FFE’s roots in the Fourier transform make it susceptible to aliasing if pushed too far, while MPEs, which use a learned grid structure, have no such limitation. To understand the difference in performance, we use the neural tangent kernel (NTK) to evaluate these encodings through the lens of an analogous kernel regression. By finding a lower bound on the smallest eigenvalue of the NTK, we prove that MPEs improve a network’s performance through the structure of their grid and not their learnable embedding. This mechanism is fundamentally different from FFEs, which rely solely on their embedding space to improve performance. Results are empirically validated on a 2D image regression task using images taken from 100 synonym sets of ImageNet and 3D implicit surface regression on objects from the Stanford graphics dataset. Using peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) to evaluate how well fine details are learned, we show that the MPE increases the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over the FFE. The increase in spectrum corresponds to a 15 dB (PSNR) / 0.65 (MS-SSIM) increase over baseline and a 12 dB (PSNR) / 0.33 (MS-SSIM) increase over the FFE.
zh
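
代码示意:标准 Fourier 特征编码为 gamma(v) = [cos(2*pi*Bv), sin(2*pi*Bv)],其中随机投影矩阵 B 的尺度决定可编码的频带上限,推得过高就会出现论文在与 MPE 对比时提到的混叠问题(以下尺度与维度均为示意):

```python
import math
import torch

def fourier_features(v, B):
    """gamma(v) = [cos(2*pi*B*v), sin(2*pi*B*v)];B 的尺度决定频带上限。"""
    proj = 2 * math.pi * v @ B.T
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

B = 10.0 * torch.randn(256, 2)       # 2 维坐标 -> 512 维编码;尺度 10 为示意
coords = torch.rand(1024, 2)         # 例如 [0, 1]^2 内的像素坐标
print(fourier_features(coords, B).shape)   # torch.Size([1024, 512])
```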

[CV-53] LoRA-Based Continual Learning with Constraints on Critical Parameter Changes

【速读】:该论文旨在解决在基于正交LoRA (orthogonal LoRA) 微调的连续学习任务中,尽管能够有效缓解灾难性遗忘 (catastrophic forgetting),但预任务的关键参数在学习后任务后仍然会发生显著变化的问题。为了解决这一问题,论文提出在学习后任务之前,预先冻结视觉Transformer (Vision Transformer, ViT) 中用于预任务的最为核心参数矩阵。此外,论文还基于正交LoRA微调,提出了正交LoRA组合 (LoRAC),通过QR分解进一步提升方法的可塑性 (plasticity)。实验表明,所提出的方法在多个著名的连续学习基准数据集上实现了最先进的性能,例如,在Split CIFAR-100数据集上,相比现有方法,其准确率提升了6.35%,遗忘率降低了3.24%。

链接: https://arxiv.org/abs/2504.13407
作者: Shimou Ling,Liang Zhang,Jiangwei Zhao,Lili Pan,Hongliang Li
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we directly propose freezing the most critical parameter matrices in the Vision Transformer (ViT) for pre-tasks before learning post-tasks. In addition, building on orthogonal LoRA tuning, we propose orthogonal LoRA composition (LoRAC) based on QR decomposition, which may further enhance the plasticity of our method. Elaborate ablation studies and extensive comparisons demonstrate the effectiveness of our proposed method. Our results indicate that our method achieves state-of-the-art (SOTA) performance on several well-known continual learning benchmarks. For instance, on the Split CIFAR-100 dataset, our method shows a 6.35% improvement in accuracy and a 3.24% reduction in forgetting compared to previous methods. Our code is available at this https URL.
zh
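
代码示意:基于 QR 分解的正交化可以示意如下(这是对“正交 LoRA 组合”思想的一种假设性解读,并非论文的具体算法):先对旧任务的 LoRA 因子做 QR 得到正交基,再把新任务因子投影到其正交补上,使新任务的更新不会覆盖旧任务占用的方向。

```python
import torch

def orthogonalize_lora(A_new, A_prev):
    """把新任务的 LoRA 因子投影到旧任务子空间的正交补上(示意)。
    A_prev: (d, r_prev) 旧任务因子堆叠;A_new: (d, r_new)。"""
    Q, _ = torch.linalg.qr(A_prev)            # 旧子空间的正交基
    return A_new - Q @ (Q.T @ A_new)          # 去除与旧方向重叠的分量

A_prev, A_new = torch.randn(128, 8), torch.randn(128, 8)
A_orth = orthogonalize_lora(A_new, A_prev)
print(torch.norm(A_prev.T @ A_orth).item())   # 接近 0:与旧子空间正交
```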

[CV-54] ProgRoCC: A Progressive Approach to Rough Crowd Counting

【速读】:该论文旨在解决大规模人群计数中基于枚举的技术因个体数量增加而导致的可行性下降及估计可靠性降低的问题。论文提出了一种基于估算的方法,即粗糙人群计数(Rough Crowd Counting),其关键在于利用更容易获取的粗略标注数据(而非传统昂贵的逐目标精确标注)来实现更高的准确性。解决方案的核心是引入了一种基于CLIP的渐进式估计学习策略(ProgRoCC),通过由粗到精的方式确定目标数量,并设计了一个视觉-语言匹配适配器以优化模态间的有效匹配,从而改进视觉特征并提升最终性能。实验结果表明,该方法在半监督和弱监督人群计数任务中超越了现有最先进方法。

链接: https://arxiv.org/abs/2504.13405
作者: Shengqin Jiang,Linfei Li,Haokui Zhang,Qingshan Liu,Amin Beheshti,Jian Yang,Anton van den Hengel,Quan Z. Sheng,Yuankai Qi
机构: School of Computer Science, Nanjing University of Information Science and Technology (南京信息工程大学计算机科学学院); School of Cybersecurity, Northwestern Polytechnical University (西北工业大学网络空间安全学院); School of Computer Science, Nanjing University of Posts and Telecommunications (南京邮电大学计算机科学学院); School of Computing, Macquarie University (麦考瑞大学计算学院); Australian Institute for Machine Learning, The University of Adelaide (阿德莱德大学机器学习研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem, which we label Rough Crowd Counting; it delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number of targets in an image, instead of the more traditional, and far more expensive, per-target annotations. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach. This approach delivers answers quickly and outperforms the state-of-the-art in semi- and weakly-supervised crowd counting. In addition, we design a vision-language matching adapter that optimizes key-value pairs by mining effective matches of two modalities to refine the visual features, thereby improving the final performance. Extensive experimental results on three widely adopted crowd counting datasets demonstrate the effectiveness of our method.
zh

[CV-55] CytoFM: The first cytology foundation model

【速读】:该论文旨在解决数字细胞学领域中开发鲁棒深度学习模型的挑战,这些问题包括样本染色与制备方法的异质性、不同器官之间的差异,以及大规模多样化标注数据集的稀缺性。为应对这些挑战,论文提出的关键解决方案是引入CytoFM——首个细胞学自监督基础模型。通过采用结合掩码图像建模和自蒸馏的Vision Transformer (ViT) 训练框架iBOT,在多样化的细胞学数据集上进行预训练,CytoFM能够学习到稳健且可迁移的表征。这种任务不可知的预训练方法使得CytoFM在乳腺癌分类和细胞类型识别等多个下游细胞学任务中表现出色,其性能优于基于组织病理学或自然图像预训练的基础模型。

链接: https://arxiv.org/abs/2504.13402
作者: Vedrana Ivezić,Ashwath Radhachandran,Ekaterina Redekop,Shreeram Athreya,Dongwoo Lee,Vivek Sant,Corey Arnold,William Speier
机构: University of California, Los Angeles (加州大学洛杉矶分校); UT Southwestern Medical Center (德克萨斯大学西南医学中心, 达拉斯); University of California, Los Angeles (加州大学洛杉矶分校); University of California, Los Angeles (加州大学洛杉矶分校); University of California, Los Angeles (加州大学洛杉矶分校); UT Southwestern Medical Center (德克萨斯大学西南医学中心, 达拉斯); University of California, Los Angeles (加州大学洛杉矶分校); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cytology is essential for cancer diagnostics and screening due to its minimally invasive nature. However, the development of robust deep learning models for digital cytology is challenging due to the heterogeneity in staining and preparation methods of samples, differences across organs, and the limited availability of large, diverse, annotated datasets. Developing a task-specific model for every cytology application is impractical and non-cytology-specific foundation models struggle to generalize to tasks in this domain where the emphasis is on cell morphology. To address these challenges, we introduce CytoFM, the first cytology self-supervised foundation model. Using iBOT, a self-supervised Vision Transformer (ViT) training framework incorporating masked image modeling and self-distillation, we pretrain CytoFM on a diverse collection of cytology datasets to learn robust, transferable representations. We evaluate CytoFM on multiple downstream cytology tasks, including breast cancer classification and cell type identification, using an attention-based multiple instance learning framework. Our results demonstrate that CytoFM performs better on two out of three downstream tasks than existing foundation models pretrained on histopathology (UNI) or natural images (iBOT-Imagenet). Visualizations of learned representations demonstrate our model is able to attend to cytologically relevant features. Despite a small pre-training dataset, CytoFM’s promising results highlight the ability of task-agnostic pre-training approaches to learn robust and generalizable features from cytology data.
zh

[CV-56] Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

【速读】:该论文旨在解决自动驾驶中视觉数据(尤其是视频流)中异常危险检测的挑战,特别是现有模型难以应对未预定义类别或不可预测的危险。解决方案的关键在于提出了一种多模态方法,结合视觉-语言推理与零样本物体检测技术,通过引入OpenAI的CLIP模型提升物体检测的定位精度,并利用语义相似性评估检测结果。此外,论文创建了一个扩展的基准数据集并开发相关工具,以支持大规模危险检测任务的数据管理与评价。
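
针对摘要中“用余弦相似度衡量预测危险描述与标注语义一致性”的评估步骤,下面给出一个示意草图。嵌入模型 all-MiniLM-L6-v2 为本示例自选,并非论文指定:

```python
from sentence_transformers import SentenceTransformer, util

# 评估示意:用句向量的余弦相似度衡量预测危险描述与人工标注的语义一致性
model = SentenceTransformer("all-MiniLM-L6-v2")

pred = "A deer is crossing the highway in front of the vehicle."
gt = "An animal suddenly enters the road ahead, forcing the driver to brake."

emb = model.encode([pred, gt], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"semantic similarity: {score:.3f}")
```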

链接: https://arxiv.org/abs/2504.13399
作者: Shashank Shriram,Srinivasa Perisetla,Aryan Keskar,Harsha Krishnaswamy,Tonko Emil Westerhof Bossen,Andreas Møgelmose,Ross Greer
机构: University of California, Merced (加州大学默塞德分校); Aalborg Universitet (奥尔堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline consists of a Vision-Language Model (VLM) and a Large Language Model (LLM), used together to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI’s CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of hazard detection and labeling evaluation on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at this https URL
zh

[CV-57] BeetleVerse: A study on taxonomic classification of ground beetles

【速读】:该论文旨在解决基于形态学细微差异进行甲虫分类时因人工专家操作繁琐而导致的实际应用受限的问题。论文通过评估12种视觉模型在四个多样化数据集上的分类性能,探索了样本高效学习(sample efficiency)和跨域适应(domain adaptation)在真实世界中的应用潜力。解决方案的关键在于提出了一种结合Vision and Language Transformer与MLP头的模型架构,该模型在属水平达到97%的准确率,在种水平达到94%,同时通过样本高效学习策略将训练数据需求减少了高达50%。此外,研究揭示了实验室到实地图像迁移过程中显著的领域差距(domain gap),强调了跨域适应的重要性。这一研究为大规模自动化的甲虫分类以及其他长尾生态数据集的自动化分类奠定了基础,并推动了样本高效学习与跨域适应技术的发展。

链接: https://arxiv.org/abs/2504.13393
作者: S M Rayeed,Alyson East,Samuel Stevens,Sydne Record,Charles V Stewart
机构: Rensselaer Polytechnic Institute; The University of Maine; The Ohio State University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. However, they are currently an underutilized resource due to the manual effort required by taxonomic experts to perform challenging species differentiations based on subtle morphological differences, precluding widespread applications. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets spanning over 230 genera and 1769 species, with images ranging from controlled laboratory settings to challenging field-collected (in-situ) photographs. We further explore taxonomic classification in two important real-world contexts: sample efficiency and domain adaptation. Our results show that the Vision and Language Transformer combined with an MLP head is the best performing model, with 97% accuracy at genus and 94% at species level. Sample efficiency analysis shows that we can reduce train data requirements by up to 50% with minimal compromise in performance. The domain adaptation experiments reveal significant challenges when transferring models from lab to in-situ images, highlighting a critical domain gap. Overall, our study lays a foundation for large-scale automated taxonomic classification of beetles, and beyond that, advances sample-efficient learning and cross-domain adaptation for diverse long-tailed ecological datasets.
zh

[CV-58] POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation

【速读】:该论文试图解决视觉生成式 AI 工具在创意任务早期阶段应用时存在的局限性,具体表现为输出结果趋于常规化且缺乏多样性,以及交互方式可能对初学者不够友好。这些局限性限制了工具在支持用户探索创造性想法方面的潜力,尤其是在需要高度个性化和多样化输出的场景中。论文指出,由于创意用户的行为通常具有多样性和不可预测性,生成工具需要提供更多样化和个性化的功能。

解决方案的关键在于提出 POET(Personalized Open-Ended Tool),这是一个实时交互式工具,具备以下三个核心能力:(1) 自动发现文本到图像生成模型中的同质化维度;(2) 扩展这些维度以增加生成图像的输出空间多样性;(3) 根据用户反馈学习并个性化扩展结果。通过这种方法,POET 不仅提高了生成结果的感知多样性,还减少了用户达到满意结果所需的提示次数,从而鼓励用户在协同创作过程中更深入地反思和探索更多可能性。

链接: https://arxiv.org/abs/2504.13392
作者: Evans Xu Han,Alice Qian Zhang,Hong Shen,Haiyi Zhu,Paul Pu Liang,Jane Hsieh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks – offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET’s ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.
zh

[CV-59] Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

【速读】:该论文旨在解决语音驱动的3D头像在实现高质量唇形同步(lip-sync)的同时生成丰富表情的问题。确定性模型可以生成高质量的唇形同步但缺乏丰富的表情,而随机性模型虽然能够生成多样化的表情,但唇形同步质量较低。为了解决这一矛盾,论文的关键在于开发一种具有准确唇形同步的随机模型。为此,作者提出通过观察得出的方法:如果一种方法能够生成逼真的3D唇部动作,则可以从这些唇部动作推断出对应的语音。推断出的语音应与原始输入音频匹配,而错误的预测则会形成一种新的监督信号,用于训练具有准确唇形同步的3D说话头像。论文提出的解决方案是THUNDER框架,它通过可微声音生成引入了一种新颖的监督机制。首先训练一个从面部动画回归音频的网格到语音模型,然后将其整合到基于扩散的说话头像框架中,在训练过程中通过分析-综合的音频监督循环提高唇形同步的质量,同时保持多样化且高质量的表情生成能力。
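
下面用占位网络勾勒“analysis-by-audio-synthesis”监督回路的形态:由生成的面部动画回归音频特征,并与输入语音特征比较,把梯度回传给动画生成器。其中网络结构、顶点数(5023,取自 FLAME 拓扑)与特征维度均为示例假设:

```python
import torch
import torch.nn as nn

# 占位的 mesh-to-speech 回归器(设定为已预训练并冻结)
mesh_to_speech = nn.Sequential(
    nn.Flatten(), nn.Linear(5023 * 3, 512), nn.ReLU(), nn.Linear(512, 80)
)
for p in mesh_to_speech.parameters():
    p.requires_grad_(False)

def audio_synthesis_loss(pred_vertices, target_audio_feats):
    """由生成的面部动画回归音频特征,与输入语音特征比较;梯度可回传到动画生成器。"""
    pred_audio = mesh_to_speech(pred_vertices)        # (B, 80),例如 mel 特征
    return nn.functional.l1_loss(pred_audio, target_audio_feats)

verts = torch.randn(2, 5023, 3, requires_grad=True)   # 生成的动画顶点(示例)
loss = audio_synthesis_loss(verts, torch.randn(2, 80))
loss.backward()                                        # 梯度流向动画侧
```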

链接: https://arxiv.org/abs/2504.13386
作者: Radek Daněček,Carolin Schmitt,Senya Polikovsky,Michael J. Black
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
zh

[CV-60] SMPL-GPTexture: Dual-View 3D Human Texture Estimation using Text-to-Image Generation Models

【速读】:该论文旨在解决生成高质量、照片级真实感的人体纹理的问题,这一任务在计算机视觉和多媒体领域具有基础性和挑战性。然而,由于隐私、伦理及获取成本等原因,真实配对的人体正反面图像数据稀缺,限制了数据的可扩展性。此外,利用深度生成模型(如GAN或扩散模型)从图像输入中学习先验知识以推断未见区域(如人体背面)时,常会导致伪影、结构不一致或细节丢失等问题。为应对这些挑战,论文提出了一种名为SMPL-GPTexture的新方法,其关键是通过先进的文本到图像生成模型,以自然语言提示作为输入生成配对的高分辨率正反面人体图像,以此作为纹理估计的起点。该方法首先通过人体网格恢复模型实现图像像素与3D模型UV坐标的鲁棒2D到3D对齐,接着采用显式的逆光栅化技术将输入图像的颜色投影到UV空间生成精确完整的纹理贴图,并最终利用基于扩散的修补模块填充缺失区域,结合融合机制生成统一的完整纹理贴图。

链接: https://arxiv.org/abs/2504.13378
作者: Mingxiao Tu,Shuchang Ye,Hoijoon Jung,Jinman Kim
机构: University of Sydney (悉尼大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating high-quality, photorealistic textures for 3D human avatars remains a fundamental yet challenging task in the computer vision and multimedia fields. However, real paired front and back images of human subjects are rarely available due to privacy and ethical concerns and the cost of acquisition, which restricts the scalability of such data. Additionally, learning priors from image inputs using deep generative models, such as GANs or diffusion models, to infer unseen regions such as the human back often leads to artifacts, structural inconsistencies, or loss of fine-grained detail. To address these issues, we present SMPL-GPTexture (skinned multi-person linear model - general purpose Texture), a novel pipeline that takes natural language prompts as input and leverages a state-of-the-art text-to-image generation model to produce paired high-resolution front and back images of a human subject as the starting point for texture estimation. Using the generated paired dual-view images, we first employ a human mesh recovery model to obtain a robust 2D-to-3D SMPL alignment between image pixels and the 3D model’s UV coordinates for each view. Second, we use an inverted rasterization technique that explicitly projects the observed colour from the input images into the UV space, thereby producing accurate, complete texture maps. Finally, we apply a diffusion-based inpainting module to fill in the missing regions, and the fusion mechanism then combines these results into a unified full texture map. Extensive experiments show that our SMPL-GPTexture can generate high-resolution textures aligned with the user’s prompts.
zh

[CV-61] VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture

【速读】:该论文旨在解决现代智慧农业中目标检测模型训练面临的两大挑战:一是需要大规模数据收集以确保模型性能,二是分布式敏感农业数据带来的隐私泄露风险。为应对这些挑战,论文提出了一种基于视觉-语言模型(Vision-Language Model, VLM)的轻量级联邦学习框架(VLLFL)。其关键在于结合VLM的泛化能力和上下文感知检测能力,同时利用联邦学习的隐私保护特性,在不同农场间部署紧凑型提示生成器以提升VLM性能,从而在保障隐私的同时显著降低通信开销。实验结果表明,VLLFL不仅提升了VLM性能达14.53%,还减少了99.3%的通信开销,提供了一种高效、可扩展且隐私友好的农业应用解决方案。
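
下面以经典的 FedAvg 为例,给出“聚合各农场的紧凑提示生成器”的一个极简草图。提示生成器的结构与加权方式均为示例假设,并非论文的全部细节:

```python
import copy
import torch

def fed_avg(global_gen, client_state_dicts, weights):
    """对各农场上传的提示生成器参数做加权平均(FedAvg 示意)。"""
    avg = copy.deepcopy(client_state_dicts[0])
    for k in avg:
        avg[k] = sum(w * sd[k] for sd, w in zip(client_state_dicts, weights))
    global_gen.load_state_dict(avg)
    return global_gen

gen = torch.nn.Linear(16, 77 * 512)      # 占位的紧凑提示生成器(77x512 仅为示例维度)
clients = [copy.deepcopy(gen) for _ in range(3)]
# ...各客户端在本地数据上训练后:
states = [c.state_dict() for c in clients]
gen = fed_avg(gen, states, weights=[0.5, 0.3, 0.2])   # 权重可按数据量分配
```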

链接: https://arxiv.org/abs/2504.13365
作者: Long Li,Jiajia Li,Dong Chen,Lina Pu,Haibo Yao,Yanbo Huang
机构: Department of Electrical and Computer Engineering, The University of Alabama (电气与计算机工程系,阿拉巴马大学); Electrical and Computer Engineering, Michigan State University (电气与计算机工程系,密歇根州立大学); Agricultural and Biological Engineering, Mississippi State University (农业与生物工程系,密西西比州立大学); Department of Computer Science, University of Alabama (计算机科学系,阿拉巴马大学); USDA-ARS Genetics and Sustainable Agriculture (美国农业部农业研究服务局遗传与可持续农业实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and monitoring of resources. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework (VLLFL). It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL achieves 14.53% improvement in the performance of VLM while reducing 99.3% communication overhead. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.
zh

[CV-62] Wearable-Derived Behavioral and Physiological Biomarkers for Classifying Unipolar and Bipolar Depression Severity

【速读】:本文旨在解决抑郁症亚型(单相抑郁与双相抑郁)分类的难题,现有研究多局限于健康个体与整体抑郁症群体的二元区分,未能充分捕捉抑郁症的异质性。为此,论文提出利用可穿戴设备预测抑郁症亚型,并识别特异性生物标志物以提升诊断精确度并支持个性化治疗策略。解决方案的关键在于构建CALYPSO数据集,通过无创方式收集生理与行为信号(如血容量脉搏、皮肤电活动、体温及三维加速度),并采用标准机器学习方法建立基准模型。初步结果显示,从加速度数据提取的物理活动相关特征在区分两种亚型方面最为有效,准确率达到96.77%,而基于温度的特征也表现出高判别能力,准确率为93.55%。这些发现凸显了生理与行为监测在改进抑郁症亚型分类中的潜力,为更精准的临床干预铺平道路。

链接: https://arxiv.org/abs/2504.13331
作者: Yassine Ouzar,Clémence Nineuil,Fouad Boutaleb,Emery Pierson,Ali Amad,Mohamed Daoudi
机构: Univ. Lille (里尔大学); CNRS (法国国家科学研究中心); Centrale Lille (中央理工大学); Institut Mines-Télécom (矿业与电信学院); Inserm (法国国家健康与医学研究院); CHU Lille (里尔大学医院); LIX, École Polytechnique (Polytechnique 工科大学教学与研究实验室); IPP Paris (巴黎综合理工学院); IMT Nord Europe (北方矿业与电信学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2025

点击查看摘要

Abstract:Depression is a complex mental disorder characterized by a diverse range of observable and measurable indicators that go beyond traditional subjective assessments. Recent research has increasingly focused on objective, passive, and continuous monitoring using wearable devices to gain more precise insights into the physiological and behavioral aspects of depression. However, most existing studies primarily distinguish between healthy and depressed individuals, adopting a binary classification that fails to capture the heterogeneity of depressive disorders. In this study, we leverage wearable devices to predict depression subtypes-specifically unipolar and bipolar depression-aiming to identify distinctive biomarkers that could enhance diagnostic precision and support personalized treatment strategies. To this end, we introduce the CALYPSO dataset, designed for non-invasive detection of depression subtypes and symptomatology through physiological and behavioral signals, including blood volume pulse, electrodermal activity, body temperature, and three-axis acceleration. Additionally, we establish a benchmark on the dataset using well-known features and standard machine learning methods. Preliminary results indicate that features related to physical activity, extracted from accelerometer data, are the most effective in distinguishing between unipolar and bipolar depression, achieving an accuracy of 96.77% . Temperature-based features also showed high discriminative power, reaching an accuracy of 93.55% . These findings highlight the potential of physiological and behavioral monitoring for improving the classification of depressive subtypes, paving the way for more tailored clinical interventions.
zh

[CV-63] SAR Object Detection with Self-Supervised Pretraining and Curriculum-Aware Sampling ICLR2025

【速读】:该论文旨在解决卫星载荷合成孔径雷达(SAR)图像中目标检测面临的挑战,特别是小目标检测困难以及标注数据稀缺的问题。传统方法受限于SAR数据的空间分辨率低、噪声高以及缺乏大规模标注数据集,难以有效发展监督学习驱动的目标检测模型。论文的关键创新在于提出了一种名为TRANSAR的新颖自监督端到端视觉Transformer(Vision Transformer, ViT)模型,通过在超过25,700平方公里未标注SAR图像数据集上的掩码图像预训练实现特征学习。此外,为了解决类别不平衡问题,论文引入了一种自适应采样调度器,利用基于课程学习和模型反馈机制动态调整训练过程中的目标类分布。这些方案共同实现了对传统监督架构及现有自监督学习模型的有效超越。
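
下面给出“课程学习 + 模型反馈”式自适应采样权重的一个示意写法:训练初期按逆频率纠偏,随训练推进逐步偏向当前召回率低的类别。具体调度公式为本示例假设,并非论文原式:

```python
import numpy as np

def update_sampling_weights(class_freq, per_class_recall, t, T, eps=1e-6):
    """
    自适应采样权重示意:课程进度 alpha 从 0 到 1,
    权重由逆频率项逐步过渡到模型反馈项(1 - 召回率)。
    """
    inv_freq = 1.0 / (class_freq + eps)
    feedback = 1.0 - per_class_recall       # 召回越低,采样权重越高
    alpha = t / T                           # 课程进度
    w = (1 - alpha) * inv_freq + alpha * feedback
    return w / w.sum()

freq = np.array([0.95, 0.04, 0.01])         # 背景、大目标、小目标的占比
recall = np.array([0.99, 0.70, 0.35])       # 当前模型反馈
print(update_sampling_weights(freq, recall, t=30, T=100))
```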

链接: https://arxiv.org/abs/2504.13310
作者: Yasin Almalioglu,Andrzej Kucik,Geoffrey French,Dafni Antotsiou,Alexander Adam,Cedric Archambeau
机构: Helsing (海伦斯)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025 ML4RS this https URL

点击查看摘要

Abstract:Object detection in satellite-borne Synthetic Aperture Radar (SAR) imagery holds immense potential in tasks such as urban monitoring and disaster response. However, the inherent complexities of SAR data and the scarcity of annotations present significant challenges in the advancement of object detection in this domain. Notably, the detection of small objects in satellite-borne SAR images poses a particularly intricate problem, because of the technology’s relatively low spatial resolution and inherent noise. Furthermore, the lack of large labelled SAR datasets hinders the development of supervised deep learning-based object detection models. In this paper, we introduce TRANSAR, a novel self-supervised end-to-end vision transformer-based SAR object detection model that incorporates masked image pre-training on an unlabeled SAR image dataset that spans more than 25,700 km² of ground area. Unlike traditional object detection formulation, our approach capitalises on auxiliary binary semantic segmentation, designed to segregate objects of interest during the post-tuning, especially the smaller ones, from the background. In addition, to address the innate class imbalance due to the disproportion of the object to the image size, we introduce an adaptive sampling scheduler that dynamically adjusts the target class distribution during training based on curriculum learning and model feedback. This approach allows us to outperform conventional supervised architectures such as DeepLabv3 or UNet, and state-of-the-art self-supervised learning-based architectures such as DPT, SegFormer or UperNet, as shown by extensive evaluations on benchmark SAR datasets.
zh

[CV-64] Weak Cube R-CNN: Weakly Supervised 3D Detection using only 2D Bounding Boxes

【速读】:该论文旨在解决单目3D目标检测中对昂贵的3D标注数据的高度依赖问题,通过弱监督方法减少对标注数据的需求。论文的关键在于提出了一种名为Weak Cube R-CNN的通用模型,该模型仅需利用2D边界框标注进行训练,通过挖掘三维立方体在二维投影之间的关系来预测3D物体。解决方案的关键在于利用预训练的冻结基础2D模型估计深度和朝向信息,并将其作为伪真值用于训练;同时设计损失函数将外部模型的信息融入其中,从而隐式地从大规模基础2D模型中迁移知识,而无需显式的3D边界框标注。这种基于单目相机系统的方案显著降低了对昂贵LiDAR传感器或多摄像头设置的依赖。
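
下面给出该弱监督约束中核心一步的极简草图:把 3D 立方体的 8 个角点投影到图像平面,取最小外接矩形作为预测 2D 框,再与标注 2D 框做损失。相机内参与损失形式均为示例假设:

```python
import torch

def cube_to_2d_bbox(corners_3d, K):
    """
    将 3D 立方体角点 (B, 8, 3) 用内参 K (3, 3) 投影到图像平面,
    取最小外接矩形作为预测 2D 框 (B, 4),格式为 (x1, y1, x2, y2)。
    """
    proj = corners_3d @ K.T                       # (B, 8, 3)
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
    x1y1 = uv.min(dim=1).values                   # (B, 2)
    x2y2 = uv.max(dim=1).values
    return torch.cat([x1y1, x2y2], dim=-1)

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
corners = torch.randn(2, 8, 3) + torch.tensor([0., 0., 5.])   # 置于相机前方
pred_box = cube_to_2d_bbox(corners, K)
gt_box = torch.tensor([[100., 80., 300., 260.]]).repeat(2, 1)  # 标注 2D 框
loss = torch.nn.functional.l1_loss(pred_box, gt_box)           # 弱监督损失示意
```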

链接: https://arxiv.org/abs/2504.13297
作者: Andreas Lau Hansen,Lukas Wanzeck,Dim P. Papadopoulos
机构: Technical University of Denmark (丹麦技术大学); Pioneer Center for AI (先驱人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures. Accepted for 23rd Scandinavian Conference, SCIA 2025, Reykjavik, Iceland

点击查看摘要

Abstract:Monocular 3D object detection is an essential task in computer vision, and it has several applications in robotics and virtual reality. However, 3D object detectors are typically trained in a fully supervised way, relying extensively on 3D labeled data, which is labor-intensive and costly to annotate. This work focuses on weakly-supervised 3D detection to reduce data needs using a monocular method that leverages a singlecamera system over expensive LiDAR sensors or multi-camera setups. We propose a general model Weak Cube R-CNN, which can predict objects in 3D at inference time, requiring only 2D box annotations for training by exploiting the relationship between 2D projections of 3D cubes. Our proposed method utilizes pre-trained frozen foundation 2D models to estimate depth and orientation information on a training set. We use these estimated values as pseudo-ground truths during training. We design loss functions that avoid 3D labels by incorporating information from the external models into the loss. In this way, we aim to implicitly transfer knowledge from these large foundation 2D models without having access to 3D bounding box annotations. Experimental results on the SUN RGB-D dataset show increased performance in accuracy compared to an annotation time equalized Cube R-CNN baseline. While not precise for centimetre-level measurements, this method provides a strong foundation for further research.
zh

[CV-65] LIFT+: Lightweight Fine-Tuning for Long-Tail Learning

【速读】:该论文旨在解决现有微调范式在长尾学习任务中的效率与准确性不足的问题。论文揭示了当前方法对微调策略的不当使用,表明重度微调(fine-tuning)可能导致尾部类别性能显著下降,而轻量级微调则表现出更好的有效性。这一现象的关键在于重度微调会引入不一致的类别条件分布。基于此洞察,论文提出LIFT+框架,通过优化一致的类别条件,并结合语义感知初始化、极简的数据增强以及测试时集成等技术,提升基础模型的适应性和泛化能力。LIFT+不仅大幅减少了训练轮次(从~100降至≤15)和学习参数规模(小于1%),同时在多项实验中超越了现有最先进方法,提供了一个高效且准确的微调管道。

链接: https://arxiv.org/abs/2504.13282
作者: Jiang-Xin Shi,Tong Wei,Yu-Feng Li
机构: National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China (国家重点实验室,南京大学); School of Artificial Intelligence, Nanjing University, Nanjing 210023, China (人工智能学院,南京大学); School of Computer Science and Engineering, Southeast University, Nanjing 210096, China (计算机科学与工程学院,东南大学); Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education, China (计算机网络与信息集成教育部重点实验室,东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The fine-tuning paradigm has emerged as a prominent approach for addressing long-tail learning tasks in the era of foundation models. However, the impact of fine-tuning strategies on long-tail learning performance remains unexplored. In this work, we disclose that existing paradigms exhibit a profound misuse of fine-tuning methods, leaving significant room for improvement in both efficiency and accuracy. Specifically, we reveal that heavy fine-tuning (fine-tuning a large proportion of model parameters) can lead to non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning demonstrates superior effectiveness. Through comprehensive theoretical and empirical validation, we identify this phenomenon as stemming from inconsistent class conditional distributions induced by heavy fine-tuning. Building on this insight, we propose LIFT+, an innovative lightweight fine-tuning framework to optimize consistent class conditions. Furthermore, LIFT+ incorporates semantic-aware initialization, minimalist data augmentation, and test-time ensembling to enhance adaptation and generalization of foundation models. Our framework provides an efficient and accurate pipeline that facilitates fast convergence and model compactness. Extensive experiments demonstrate that LIFT+ significantly reduces both training epochs (from ~100 to ≤15) and learned parameters (less than 1%), while surpassing state-of-the-art approaches by a considerable margin. The source code is available at this https URL.
zh

[CV-66] A Stochastic Nonlinear Dynamical System for Smoothing Noisy Eye Gaze Data

【速读】:该论文旨在解决屏幕注视位置(gaze location)因眼动仪局限性、校准漂移、环境光照变化及眨眼等因素引起的噪声问题,以提高注视点追踪的准确性。论文的关键解决方案是采用扩展卡尔曼滤波器(Extended Kalman Filter, EKF),通过平滑眼动实验采集的数据,并系统性探索不同系统参数的影响。研究结果表明,EKF显著降低了噪声,从而大幅提升跟踪精度,同时提出的随机非线性动力学模型与真实实验数据高度吻合,展现出在相关领域的应用潜力。
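
下面给出一个匀速模型下的卡尔曼滤波示意,用于平滑含噪注视点;论文的 EKF 需在预测步以非线性动力学 f 及其雅可比矩阵替换这里的常数矩阵 F。采样率与噪声协方差均为示例假设:

```python
import numpy as np

dt = 1 / 60                                       # 假设 60 Hz 眼动仪
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]])        # 状态: [x, y, vx, vy]
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])        # 只观测位置
Q = np.eye(4) * 1e-4                              # 过程噪声
R = np.eye(2) * 4.0                               # 观测噪声(像素^2)

def kf_step(x, P, z):
    x = F @ x                                     # 预测
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                # 卡尔曼增益
    x = x + K @ (z - H @ x)                       # 用测量 z 更新
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
for z in np.random.randn(100, 2) * 2 + 200:       # 模拟含噪注视测量
    x, P = kf_step(x, P, z)
print(x[:2])                                      # 平滑后的注视位置
```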

链接: https://arxiv.org/abs/2504.13278
作者: Thoa Thieu,Roderick Melnik
机构: 未知
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:In this study, we address the challenges associated with accurately determining gaze location on a screen, which is often compromised by noise from factors such as eye tracker limitations, calibration drift, ambient lighting changes, and eye blinks. We propose the use of an extended Kalman filter (EKF) to smooth the gaze data collected during eye-tracking experiments, and systematically explore the interaction of different system parameters. Our results demonstrate that the EKF significantly reduces noise, leading to a marked improvement in tracking accuracy. Furthermore, we show that our proposed stochastic nonlinear dynamical model aligns well with real experimental data and holds promise for applications in related fields.
zh

[CV-67] ChartQA-X: Generating Explanations for Charts

【速读】:该论文试图解决在数据驱动决策过程中,如何有效解释和解读图表中复杂信息的问题。解决方案的关键在于提出ChartQA-X数据集,并结合多模型生成与筛选机制。通过利用六种不同模型生成解释性内容,并依据忠实性(faithfulness)、信息量(informativeness)、连贯性(coherence)和困惑度(perplexity)等指标选择最佳响应,该方法显著提升了模型在问答任务中的性能,并增强了智能代理传达复杂信息的能力,从而提高用户理解并建立对生成结果的信任。

链接: https://arxiv.org/abs/2504.13275
作者: Shamanthak Hegde,Pooyan Fazli,Hasti Seifi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to interpret and explain complex information from visual data in charts is crucial for data-driven decision-making. In this work, we address the challenge of providing explanations alongside answering questions about chart images. We present ChartQA-X, a comprehensive dataset comprising various chart types with 28,299 contextually relevant questions, answers, and detailed explanations. These explanations are generated by prompting six different models and selecting the best responses based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our experiments show that models fine-tuned on our dataset for explanation generation achieve superior performance across various metrics and demonstrate improved accuracy in question-answering tasks on new datasets. By integrating answers with explanatory narratives, our approach enhances the ability of intelligent agents to convey complex information effectively, improve user understanding, and foster trust in the generated responses.
zh

[CV-68] Dynamic Memory-enhanced Transformer for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(HSI)分类任务中由于复杂的空间-光谱相关性导致的挑战。现有Transformer模型虽擅长捕捉长距离依赖关系,但通常存在信息冗余和注意力效率低下的问题,限制了其建模高分辨率空间-光谱关系的能力,这对HSI分类至关重要。论文的关键解决方案是提出了MemFormer,一种轻量级且增强内存的Transformer模型。MemFormer通过引入增强记忆的多头注意力机制,迭代优化动态内存模块,从而在减少跨层冗余的同时提升特征提取能力;同时采用动态内存富集策略逐步捕获复杂的空谱依赖关系,形成更具表达力的特征表示。此外,为了进一步提高结构一致性,论文设计了一种专用于HSI数据的空间-光谱位置编码(SSPE),确保连续性而不增加基于卷积方法的计算负担。实验结果表明,MemFormer在基准数据集上的分类精度优于现有最先进的方法。
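
下面给出“记忆增强注意力”的一个极简草图:把可学习的记忆 token 拼接进 K/V 参与注意力。记忆槽数量与维度均为示例假设,论文中对记忆模块的迭代细化此处从略:

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """将可学习的记忆 token 拼接到 K/V 参与注意力(示意实现)。"""
    def __init__(self, dim=64, heads=4, mem_slots=8):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(mem_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (B, N, dim) 空谱 token
        mem = self.mem.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([mem, x], dim=1)             # 记忆参与键/值
        out, _ = self.attn(x, kv, kv)
        return out

tokens = torch.randn(2, 25, 64)                     # 例如 5x5 patch 的 HSI token
print(MemoryAttention()(tokens).shape)              # torch.Size([2, 25, 64])
```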

链接: https://arxiv.org/abs/2504.13242
作者: Muhammad Ahmad,Manuel Mazzara,Salvatore Distefano,Adil Mehmood Khan
机构: SDAIA-KFUPM, Joint Research Center for Artificial Intelligence (JRCAI), King Fahd University of Petroleum and Minerals (KFUPM); Dipartimento di Matematica e Informatica—MIFT, University of Messina (意大利墨西拿大学); Institute of Software Development and Engineering, Innopolis University (俄罗斯因诺波利斯大学); School of Computer Science, University of Hull (英国赫尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) classification remains a challenging task due to the intricate spatial-spectral correlations. Existing transformer models excel in capturing long-range dependencies but often suffer from information redundancy and attention inefficiencies, limiting their ability to model fine-grained relationships crucial for HSI classification. To overcome these limitations, this work proposes MemFormer, a lightweight and memory-enhanced transformer. MemFormer introduces a memory-enhanced multi-head attention mechanism that iteratively refines a dynamic memory module, enhancing feature extraction while reducing redundancy across layers. Additionally, a dynamic memory enrichment strategy progressively captures complex spatial and spectral dependencies, leading to more expressive feature representations. To further improve structural consistency, we incorporate a spatial-spectral positional encoding (SSPE) tailored for HSI data, ensuring continuity without the computational burden of convolution-based approaches. Extensive experiments on benchmark datasets demonstrate that MemFormer achieves superior classification accuracy, outperforming state-of-the-art methods.
zh

[CV-69] WildFireCan-MMD: A Multimodal dataset for Classification of User-generated Content During Wildfires in Canada

【速读】:该论文旨在解决在野火期间快速获取相关信息的挑战:传统数据源速度慢且成本高,而社交媒体虽能提供实时更新,但从中提取相关洞见仍具挑战。论文提出WildFireCan-MMD,这是一个由加拿大近期野火期间X(原Twitter)平台帖子构成的新多模态数据集,标注涵盖13个关键主题,并在其上评估了视觉语言模型与定制训练的分类器。结果显示,尽管零样本提示可以快速部署,但当有标注数据可用时,即使是简单的训练模型也能超越零样本方法,优势最高可达23%。研究的关键在于强调定制化数据集和任务特定训练的持续重要性,且这些数据集应本地化,以适应不同地区和情境下的灾害响应需求。

链接: https://arxiv.org/abs/2504.13231
作者: Braeden Sherritt,Isar Nejadgholi,Marzieh Amini
机构: Carleton University; National Research Council Canada (加拿大国家研究委员会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across 13 key themes. Evaluating both Vision Language Models and custom-trained classifiers, we show that while zero-shot prompting offers quick deployment, even simple trained models outperform them when labelled data is available, by up to 23%. Our findings highlight the enduring importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.
zh

[CV-70] ICAS: IP Adapter and ControlNet-based Attention Structure for Multi-Subject Style Transfer Optimization

【速读】:该论文旨在解决多主体风格化图像生成中的两个主要挑战:一是风格属性(如颜色、纹理、氛围和结构)定义的模糊性以及在多个主体间一致应用这些属性的难度;二是现有基于扩散模型的方法通常依赖于计算成本高昂的反转过程或大规模风格化数据集,且在保持多主体语义保真度的同时面临高推理成本的问题。论文提出了一种名为ICAS(基于IP-Adapter和ControlNet的注意力结构)的新框架,其关键是通过仅微调预训练扩散模型的内容注入分支来实现高效且可控的多主体风格迁移,从而在保留身份特定语义的同时增强风格控制能力。此外,结合IP-Adapter进行自适应风格注入与ControlNet进行结构条件约束,确保全局布局的忠实保存及局部风格合成的准确性。同时,引入循环多主体内容嵌入机制,在有限数据条件下实现有效的风格迁移,无需依赖大规模风格化数据集。实验结果表明,ICAS在结构保真度、风格一致性及推理效率方面表现出色,为实际应用中的多主体风格迁移建立了新范式。

链接: https://arxiv.org/abs/2504.13224
作者: Fuwei Liu
机构: Northeastern University at Qinhuangdao (东北大学秦皇岛分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Generating multi-subject stylized images remains a significant challenge due to the ambiguity in defining style attributes (e.g., color, texture, atmosphere, and structure) and the difficulty in consistently applying them across multiple subjects. Although recent diffusion-based text-to-image models have achieved remarkable progress, existing methods typically rely on computationally expensive inversion procedures or large-scale stylized datasets. Moreover, these methods often struggle with maintaining multi-subject semantic fidelity and are limited by high inference costs. To address these limitations, we propose ICAS (IP-Adapter and ControlNet-based Attention Structure), a novel framework for efficient and controllable multi-subject style transfer. Instead of full-model tuning, ICAS adaptively fine-tunes only the content injection branch of a pre-trained diffusion model, thereby preserving identity-specific semantics while enhancing style controllability. By combining IP-Adapter for adaptive style injection with ControlNet for structural conditioning, our framework ensures faithful global layout preservation alongside accurate local style synthesis. Furthermore, ICAS introduces a cyclic multi-subject content embedding mechanism, which enables effective style transfer under limited-data settings without the need for extensive stylized corpora. Extensive experiments show that ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency, establishing a new paradigm for multi-subject style transfer in real-world applications.
zh

[CV-71] SSTAF: Spatial-Spectral-Temporal Attention Fusion Transformer for Motor Imagery Classification

【速读】:该论文旨在解决基于脑电图(EEG)的运动想象分类中跨受试者模型鲁棒性不足的问题,主要源于EEG信号的非平稳特性以及受试者间显著的个体差异。为应对这一挑战,论文提出了一种新颖的空间-频谱-时间注意融合(SSTAF)Transformer架构,专为上肢运动想象分类设计。其关键在于通过集成频谱Transformer、空间Transformer及注意力机制,动态关注多域中最具判别性的模式,包括频谱频率、空间电极位置和时间动态,并结合短时傅里叶变换提取时频域特征以增强特征区分能力。
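
下面给出用短时傅里叶变换为多通道 EEG 提取时频特征的示意写法(基于 torch.stft;窗长、跳步等参数均为示例假设):

```python
import torch

def stft_features(eeg, n_fft=64, hop=16):
    """
    对多通道 EEG (B, C, T) 做短时傅里叶变换,
    返回功率谱时频特征 (B, C, F, T'),供光谱/空间 Transformer 使用。
    """
    B, C, T = eeg.shape
    spec = torch.stft(eeg.reshape(B * C, T), n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    power = spec.abs() ** 2                        # 功率谱
    return power.reshape(B, C, *power.shape[-2:])

x = torch.randn(4, 64, 640)                        # 4 试次、64 电极、640 采样点
print(stft_features(x).shape)                      # torch.Size([4, 64, 33, 41])
```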

链接: https://arxiv.org/abs/2504.13220
作者: Ummay Maria Muna,Md. Mehedi Hasan Shawon,Md Jobayer,Sumaiya Akter,Saifur Rahman Sabuj
机构: Department of Electrical and Electronic Engineering, BRAC University, Dhaka-1212, Bangladesh (电气与电子工程系, BRAC 大学, 孟加拉国达卡-1212); Department of Electrical and Computer Engineering, University of Maryland College Park, College Park, Maryland 20742, USA (电气与计算机工程系, 马里兰大学帕克分校, 美国马里兰州学院公园, 20742)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:Brain-computer interfaces (BCI) in electroencephalography (EEG)-based motor imagery classification offer promising solutions in neurorehabilitation and assistive technologies by enabling communication between the brain and external devices. However, the non-stationary nature of EEG signals and significant inter-subject variability cause substantial challenges for developing robust cross-subject classification models. This paper introduces a novel Spatial-Spectral-Temporal Attention Fusion (SSTAF) Transformer specifically designed for upper-limb motor imagery classification. Our architecture consists of a spectral transformer and a spatial transformer, followed by a transformer block and a classifier network. Each module is integrated with attention mechanisms that dynamically attend to the most discriminative patterns across multiple domains, such as spectral frequencies, spatial electrode locations, and temporal dynamics. The short-time Fourier transform is incorporated to extract features in the time-frequency domain to make it easier for the model to obtain a better feature distinction. We evaluated our SSTAF Transformer model on two publicly available datasets, the EEGMMIDB dataset and BCI Competition IV-2a. SSTAF Transformer achieves accuracies of 76.83% and 68.30% on the two datasets, respectively, outperforming traditional CNN-based architectures and a few existing transformer-based approaches.
zh

[CV-72] Wavelet-based Variational Autoencoders for High-Resolution Image Generation

【速读】:该论文旨在解决传统变分自编码器(Variational Autoencoders, VAEs)生成图像模糊的问题,主要由于其假设各向同性的高斯潜空间以及在捕捉高频细节方面的局限性。论文提出了一种基于小波的新方法(Wavelet-VAE),通过多尺度Haar小波系数构建潜空间,并将图像特征编码为多尺度细节和逼近系数,同时引入可学习噪声参数以保持随机性。关键在于重新设计重参数化技巧,处理KL散度项,并将小波稀疏性原理融入训练目标。实验结果表明,Wavelet-VAE在CIFAR-10等数据集上提升了视觉保真度并恢复了更高分辨率的细节。
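
下面给出单层 2D Haar 小波分解,以及“以小波系数为潜变量、配可学习噪声尺度做重参数化”的极简草图。噪声参数的具体形式为本示例假设:

```python
import torch

def haar_dwt2(x):
    """单层 2D Haar 小波:输入 (B, C, H, W),返回 LL 与 (LH, HL, HH)。"""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2                       # 逼近系数
    lh = (a - b + c - d) / 2                       # 细节系数
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

img = torch.rand(8, 3, 32, 32)
ll, details = haar_dwt2(img)
log_sigma = torch.nn.Parameter(torch.zeros(1))     # 可学习噪声参数(假设形式)
z = ll + torch.exp(log_sigma) * torch.randn_like(ll)  # 保持随机性的重参数化采样
print(z.shape)                                     # torch.Size([8, 3, 16, 16])
```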

链接: https://arxiv.org/abs/2504.13214
作者: Andrew Kiruluta
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs) are powerful generative models capable of learning compact latent representations. However, conventional VAEs often generate relatively blurry images due to their assumption of an isotropic Gaussian latent space and constraints in capturing high-frequency details. In this paper, we explore a novel wavelet-based approach (Wavelet-VAE) in which the latent space is constructed using multi-scale Haar wavelet coefficients. We propose a comprehensive method to encode the image features into multi-scale detail and approximation coefficients and introduce a learnable noise parameter to maintain stochasticity. We thoroughly discuss how to reformulate the reparameterization trick, address the KL divergence term, and integrate wavelet sparsity principles into the training objective. Our experimental evaluation on CIFAR-10 and other high-resolution datasets demonstrates that the Wavelet-VAE improves visual fidelity and recovers higher-resolution details compared to conventional VAEs. We conclude with a discussion of advantages, potential limitations, and future research directions for wavelet-based generative modeling.
zh

[CV-73] Mirror: Multimodal Cognitive Reframing Therapy for Rolling with Resistance

【速读】:该论文旨在解决文本驱动的认知行为疗法(CBT)模型在面对客户抗拒时表现不佳的问题,这种抗拒会削弱治疗联盟。为了解决这一挑战,论文提出了一种多模态方法,通过整合非语言线索,使AI治疗师能够更好地调整其回应以匹配客户的负面情绪状态。该方案的关键在于引入了一个新的合成数据集——Multimodal Interactive Rolling with Resistance (Mirror),该数据集将客户的陈述与相应的面部图像配对。利用此数据集,研究人员训练了基础视觉-语言模型(VLMs),这些模型可以分析面部线索、推断情绪,并生成共情回应来有效管理抗拒。评估结果显示,Mirror显著提升了AI治疗师处理抗拒的能力,优于现有的基于文本的CBT方法。

链接: https://arxiv.org/abs/2504.13211
作者: Subin Kim,Hoonrae Kim,Jihyun Lee,Yejin Jeon,Gary Geunbae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, allowing the AI therapist to better align its responses with the client’s negative emotional state. Specifically, we introduce Multimodal Interactive Rolling with Resistance (Mirror), a novel synthetic dataset that pairs client statements with corresponding facial images. Using this dataset, we train baseline Vision-Language Models (VLMs) that can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage resistance. They are then evaluated in terms of both the therapist’s counseling skills and the strength of the therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist’s ability to handle resistance, outperforming existing text-based CBT approaches.
zh

[CV-74] Intelligent road crack detection and analysis based on improved YOLOv8

【速读】:该论文旨在解决城市化加速和交通流量增加背景下,路面裂缝(Pothole)检测效率低、成本高且依赖人工的问题,这对道路安全和服务寿命构成了严重威胁。论文的关键解决方案是提出了一种基于增强版YOLOv8深度学习框架的智能道路裂缝检测与分析系统。该系统通过训练包含4029张图像的数据集开发了一个目标分割模型,能够高效且精确地识别和分割道路裂缝区域,并进一步分析这些区域以准确计算裂缝的最大宽度、最小宽度及其精确位置。此外,引入ECA(Efficient Channel Attention)和CBAM(Convolutional Block Attention Module)注意力机制显著提升了模型的检测精度和效率,为道路维护和安全监测提供了创新性解决方案。
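
针对“由分割结果计算裂缝最大/最小宽度及位置”这一步,下面给出一个基于距离变换与骨架化的示意写法。该测宽方案为常见做法,并非论文指定;cv2.ximgproc.thinning 需要安装 opencv-contrib-python:

```python
import cv2
import numpy as np

def crack_width_stats(mask):
    """
    由裂缝二值掩码估计最大/最小宽度及其位置:
    距离变换给出每个裂缝像素到背景的距离,骨架上距离的 2 倍近似局部宽度。
    """
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    skeleton = cv2.ximgproc.thinning(mask)          # 需要 opencv-contrib
    widths = dist[skeleton > 0] * 2.0
    ys, xs = np.nonzero(skeleton)
    i_max, i_min = widths.argmax(), widths.argmin()
    return (widths.max(), (xs[i_max], ys[i_max]),
            widths.min(), (xs[i_min], ys[i_min]))

mask = np.zeros((100, 100), np.uint8)
cv2.line(mask, (10, 50), (90, 50), 255, thickness=5)  # 合成一条“裂缝”
print(crack_width_stats(mask))
```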

链接: https://arxiv.org/abs/2504.13208
作者: Haomin Zuo,Zhengyang Li,Jiangchuan Gong,Zhen Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by IEEE - ICAACE 2025

点击查看摘要

Abstract:As urbanization speeds up and traffic flow increases, the issue of pavement distress is becoming increasingly pronounced, posing a severe threat to road safety and service life. Traditional methods of pothole detection rely on manual inspection, which is not only inefficient but also costly. This paper proposes an intelligent road crack detection and analysis system, based on the enhanced YOLOv8 deep learning framework. A target segmentation model has been developed through the training of 4029 images, capable of efficiently and accurately recognizing and segmenting crack regions in roads. The model also analyzes the segmented regions to precisely calculate the maximum and minimum widths of cracks and their exact locations. Experimental results indicate that the incorporation of ECA and CBAM attention mechanisms substantially enhances the model’s detection accuracy and efficiency, offering a novel solution for road maintenance and safety monitoring.
zh

[CV-75] Universal Representations for Classification-enhanced Lossy Compression

【速读】:该论文试图解决在多目标优化场景下(如同时考虑压缩率、分类准确性和感知质量)如何设计通用编码器以避免针对每个特定权衡点重新训练的问题。解决方案的关键在于提出一种能够实现多个解码目标的通用表示方法,通过单一编码器在不同失真与分类(或感知)约束条件下保持性能。实验验证表明,这种通用编码器在感知图像压缩任务中的性能降级极小,与专门优化的编码器相当,但在分类-失真权衡的场景下,直接复用为某一特定分类-失真权衡优化的编码器会导致显著的失真惩罚。

链接: https://arxiv.org/abs/2504.13191
作者: Nam Nguyen
机构: Oregon State University (俄勒冈州立大学), Oregon, United States
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:In lossy compression, the classical tradeoff between compression rate and reconstruction distortion has traditionally guided algorithm design. However, Blau and Michaeli [5] introduced a generalized framework, known as the rate-distortion-perception (RDP) function, incorporating perceptual quality as an additional dimension of evaluation. More recently, the rate-distortion-classification (RDC) function was investigated in [19], evaluating compression performance by considering classification accuracy alongside distortion. In this paper, we explore universal representations, where a single encoder is developed to achieve multiple decoding objectives across various distortion and classification (or perception) constraints. This universality avoids retraining encoders for each specific operating point within these tradeoffs. Our experimental validation on the MNIST dataset indicates that a universal encoder incurs only minimal performance degradation compared to individually optimized encoders for perceptual image compression tasks, aligning with prior results from [23]. Nonetheless, we also identify that in the RDC setting, reusing an encoder optimized for one specific classification-distortion tradeoff leads to a significant distortion penalty when applied to alternative points.
zh

[CV-76] SupResDiffGAN a new approach for the Super-Resolution task

【速读】:该论文旨在解决超分辨率任务中扩散模型效率较低的问题,并尝试在生成质量与推理速度之间取得平衡。论文提出了一种名为SupResDiffGAN的新型混合架构,结合了生成式对抗网络(Generative Adversarial Networks, GANs)和扩散模型的优势。其关键解决方案包括利用潜在空间表示以减少扩散步骤,从而显著加快推理时间;同时通过自适应噪声腐蚀(adaptive noise corruption)机制防止判别器过拟合,确保生成器与判别器在训练过程中的稳定交互。实验结果表明,该方法在效率和图像质量上均优于传统扩散模型(如SR3和I²SB),为扩散模型在实时高分辨率图像生成中的应用奠定了基础。
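
下面给出“自适应噪声腐蚀”思想的一个极简示意:依据判别器近期的真假判别准确率调节注入噪声的强度,以维持生成器与判别器的平衡。具体调节规则为本示例假设,并非论文原式:

```python
import torch

class AdaptiveNoise:
    """判别器过强(准确率高于目标)时加大噪声,反之减小(示意)。"""
    def __init__(self, sigma=0.1, target=0.6, lr=0.01):
        self.sigma, self.target, self.lr = sigma, target, lr

    def __call__(self, images, d_real_acc):
        # 以判别器准确率与目标值之差作为反馈信号
        self.sigma = max(0.0, self.sigma + self.lr * (d_real_acc - self.target))
        return images + self.sigma * torch.randn_like(images)

corrupt = AdaptiveNoise()
real = torch.rand(4, 3, 64, 64)
noisy_real = corrupt(real, d_real_acc=0.85)   # 判别器太准 -> sigma 上调
```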

链接: https://arxiv.org/abs/2504.13622
作者: Dawid Kopeć,Wojciech Kozłowski,Maciej Wizerkaniuk,Dawid Krutul,Jan Kocoń,Maciej Zięba
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 25th International Conference on Computational Science

点击查看摘要

Abstract:In this work, we present SupResDiffGAN, a novel hybrid architecture that combines the strengths of Generative Adversarial Networks (GANs) and diffusion models for super-resolution tasks. By leveraging latent space representations and reducing the number of diffusion steps, SupResDiffGAN achieves significantly faster inference times than other diffusion-based super-resolution models while maintaining competitive perceptual quality. To prevent discriminator overfitting, we propose adaptive noise corruption, ensuring a stable balance between the generator and the discriminator during training. Extensive experiments on benchmark datasets show that our approach outperforms traditional diffusion models such as SR3 and I²SB in efficiency and image quality. This work bridges the performance gap between diffusion- and GAN-based methods, laying the foundation for real-time applications of diffusion models in high-resolution image generation.
zh

[CV-77] ViG3D-UNet: Volumetric Vascular Connectivity-Aware Segmentation via 3D Vision Graph Representation

【速读】:该论文旨在解决冠状动脉血管分割中因不连续性和端点缺失导致的挑战,这些问题严重影响冠状动脉可视化与冠心病诊断的准确性。为应对这些挑战,论文提出了一种名为ViG3D-UNet的3D视觉图神经网络框架。其关键在于结合3D图表示与聚合技术于U形架构中,通过ViG3D模块捕捉血管的体素连通性与拓扑结构,同时利用卷积模块提取精细的血管细节,并借助通道注意力机制将这两者融合形成编码特征。此外,采用类似回形针形状的偏移解码器减少稀疏特征空间中的冗余计算,恢复特征图尺寸以匹配原始输入,从而实现连续且精确的血管分割。

链接: https://arxiv.org/abs/2504.13599
作者: Bowen Liu,Chunlei Meng,Wei Lin,Hongda Zhang,Ziqing Zhou,Zhongxue Gan,Chun Ouyang
机构: Academy for Engineering and Technology, Fudan University (复旦大学工程与技术学院); Shanghai Engineering Research Center of AI & Robotics, Fudan University (复旦大学人工智能与机器人工程研究中心); Engineering Research Center of AI & Robotics, Ministry of Education, China (中国教育部人工智能与机器人工程研究中心); CFFF platform of Fudan University (复旦大学CFFF平台)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate vascular segmentation is essential for coronary visualization and the diagnosis of coronary heart disease. This task involves the extraction of sparse tree-like vascular branches from the volumetric space. However, existing methods have faced significant challenges due to discontinuous vascular segmentation and missing endpoints. To address this issue, a 3D vision graph neural network framework, named ViG3D-UNet, was introduced. This method integrates 3D graph representation and aggregation within a U-shaped architecture to facilitate continuous vascular segmentation. The ViG3D module captures volumetric vascular connectivity and topology, while the convolutional module extracts fine vascular details. These two branches are combined through channel attention to form the encoder feature. Subsequently, a paperclip-shaped offset decoder minimizes redundant computations in the sparse feature space and restores the feature map size to match the original input dimensions. To evaluate the effectiveness of the proposed approach for continuous vascular segmentation, evaluations were performed on two public datasets, ASOCA and ImageCAS. The segmentation results show that the ViG3D-UNet surpassed competing methods in maintaining vascular segmentation connectivity while achieving high segmentation accuracy. Our code will be available soon.
zh

[CV-78] FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention

【速读】:该论文旨在解决现有结肠镜息肉分割模型因依赖单模态和单中心数据而导致在真实临床环境中效果不佳的问题。为克服这一局限,论文提出了一种名为FocusNet的Transformer增强型聚焦注意力网络。FocusNet的关键在于其包含三个核心模块:Cross-semantic Interaction Decoder Module (CIDM),用于生成粗略分割图;Detail Enhancement Module (DEM),用于优化浅层特征;以及Focus Attention Module (FAM),通过局部和池化注意力机制平衡局部细节与全局上下文,从而提升分割性能。这些模块共同确保了模型在多模态和多中心数据上的可靠性和准确性。

链接: https://arxiv.org/abs/2504.13597
作者: Jun Zeng,KC Santosh,Deepak Rajan Nayak,Thomas de Lange,Jonas Varkey,Tyler Berzin,Debesh Jha
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); University of South Dakota (南达科他大学); Malaviya National Institute of Technology Jaipur (马拉维国家技术学院斋浦尔); Harvard Medical School (哈佛医学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM), to balance local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced dataset with multi-modality and multi-center data for building more reliable segmentation methods. Extensive experiments showed that FocusNet consistently outperforms existing state-of-the-art approaches with high Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI, and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at this https URL.
zh

[CV-79] A Novel Hybrid Approach for Retinal Vessel Segmentation with Dynamic Long-Range Dependency and Multi-Scale Retinal Edge Fusion Enhancement

【速读】:该论文致力于解决视网膜血管分割中因多尺度血管变化、复杂曲率及边界模糊所导致的挑战,特别是现有方法在血管不连续性和边缘特征模糊上的局限性。论文提出的关键解决方案是一种新颖的混合框架,通过协同整合卷积神经网络(CNNs)与Mamba模型实现高精度视网膜血管分割。其核心创新点包括:1)高分辨率边缘融合网络(High-Resolution Edge Fuse Network),结合多尺度主干网络与多尺度视网膜边缘融合模块(MREF),提升边缘特征以确保分割准确性;2)动态蛇视觉状态空间块,通过动态蛇卷积与Mamba结合,自适应捕捉血管曲率细节及长程依赖,并通过改进的八方向二维蛇选择扫描机制和动态加权策略增强复杂血管拓扑结构的感知能力;3)MREF模块通过多尺度边缘特征聚合提高边界精度,在抑制噪声的同时突出关键血管结构。实验结果表明,该方法在三个公开数据集上达到最先进的性能,特别是在保持血管连续性和分割低对比度区域中的血管方面表现优异。

链接: https://arxiv.org/abs/2504.13553
作者: Yihao Ouyang,Xunheng Kuang,Mengjia Xiong,Zhida Wang,Yuanquan Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate retinal vessel segmentation provides essential structural information for ophthalmic image analysis. However, existing methods struggle with challenges such as multi-scale vessel variability, complex curvatures, and ambiguous boundaries. While Convolutional Neural Networks (CNNs), Transformer-based models and Mamba-based architectures have advanced the field, they often suffer from vascular discontinuities or edge feature ambiguity. To address these limitations, we propose a novel hybrid framework that synergistically integrates CNNs and Mamba for high-precision retinal vessel segmentation. Our approach introduces three key innovations: 1) The proposed High-Resolution Edge Fuse Network is a high-resolution preserving hybrid segmentation framework that combines a multi-scale backbone with the Multi-scale Retina Edge Fusion (MREF) module to enhance edge features, ensuring accurate and robust vessel segmentation. 2) The Dynamic Snake Visual State Space block combines Dynamic Snake Convolution with Mamba to adaptively capture vessel curvature details and long-range dependencies. An improved eight-directional 2D Snake-Selective Scan mechanism and a dynamic weighting strategy enhance the perception of complex vascular topologies. 3) The MREF module enhances boundary precision through multi-scale edge feature aggregation, suppressing noise while emphasizing critical vessel structures across scales. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance, particularly in maintaining vascular continuity and effectively segmenting vessels in low-contrast regions. This work provides a robust method for clinical applications requiring accurate retinal vessel analysis. The code is available at this https URL.
zh

[CV-80] Quantum Walks-Based Adaptive Distribution Generation with Efficient CUDA-Q Acceleration

【速读】:本文旨在解决高效生成高精度目标概率分布的问题。解决方案的关键在于提出了一种基于量子行走(Quantum Walks)的自适应分布生成器,通过整合变分量子电路与离散时间量子行走(特别是split-step量子行走及其纠缠扩展),实现对硬币参数的动态调整以及量子态向期望分布演化的引导。这种方法不仅支持一维概率建模应用(如金融模拟),还能生成二维结构化模式(如数字0到9的表示)。借助CUDA-Q框架利用GPU加速,显著降低了计算开销并提升了可扩展性。实验结果表明,该方法在模拟保真度方面表现出色,并弥合了理论量子算法与实际高性能计算之间的差距。

链接: https://arxiv.org/abs/2504.13532
作者: Yen-Jui Chang,Wei-Ting Wang,Chen-Yu Liu,Yun-Yuan Wang,Ching-Ray Chang
机构: Cyber University of China (中华科技大学生技学系); National Taiwan University (台湾大学); Nvidia(英伟达); National Taiwan University (台湾大学)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Pricing of Securities (q-fin.PR)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:We present a novel Adaptive Distribution Generator that leverages a quantum walks-based approach to generate high precision and efficiency of target probability distributions. Our method integrates variational quantum circuits with discrete-time quantum walks, specifically, split-step quantum walks and their entangled extensions, to dynamically tune coin parameters and drive the evolution of quantum states towards desired distributions. This enables accurate one-dimensional probability modeling for applications such as financial simulation and structured two-dimensional pattern generation exemplified by digit representations(0~9). Implemented within the CUDA-Q framework, our approach exploits GPU acceleration to significantly reduce computational overhead and improve scalability relative to conventional methods. Extensive benchmarks demonstrate that our Quantum Walks-Based Adaptive Distribution Generator achieves high simulation fidelity and bridges the gap between theoretical quantum algorithms and practical high-performance computation.
zh

[CV-81] Filter2Noise: Interpretable Self-Supervised Single-Image Denoising for Low-Dose CT with Attention-Guided Bilateral Filtering

【速读】:该论文旨在解决低剂量CT成像中有效降噪的问题,以增强细微结构和低对比度病灶的可视化,同时避免诊断错误。传统监督方法受限于有限的配对数据集,而现有的自监督方法通常需要多张噪声图像,并依赖深度网络(如U-Net),但缺乏对降噪机制的可解释性。为应对这些挑战,论文提出了一种可解释的自监督单图像降噪框架——Filter2Noise (F2N)。其关键在于引入了一个注意力引导的双边滤波器(Attention-Guided Bilateral Filter),通过轻量级模块预测空间变化的滤波参数,这些参数可以在训练后进行可视化和调整,从而实现用户在感兴趣区域的可控降噪。此外,为了实现单图像训练,提出了新颖的下采样洗牌策略和新的自监督损失函数,将Noise2Noise的概念扩展到单张图像,并处理空间相关噪声。实验结果显示,F2N在Mayo Clinic 2016低剂量CT数据集上的性能优于现有最先进的自监督单图像方法(ZS-N2N),同时提升了透明度、用户控制能力和参数效率,这些特性对于需要精确且可解释性降噪的医学应用具有重要价值。
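
下面给出逐像素参数化、可微的双边滤波的一个单通道草图:sigma_s、sigma_r 本应由轻量模块从含噪输入预测,此处直接给定以示意。窗口大小等均为示例假设,并非 F2N 的原实现:

```python
import torch
import torch.nn.functional as F

def bilateral_filter(x, sigma_s, sigma_r, k=5):
    """
    可微双边滤波(单通道示意):x (B,1,H,W),
    sigma_s / sigma_r 为逐像素参数 (B,1,H,W)。
    """
    pad = k // 2
    B, _, H, W = x.shape
    patches = F.unfold(x, k, padding=pad).view(B, k * k, H, W)
    ii, jj = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
    d2 = ((ii - pad) ** 2 + (jj - pad) ** 2).float().view(1, k * k, 1, 1)
    w_s = torch.exp(-d2 / (2 * sigma_s ** 2))                   # 空间权重
    w_r = torch.exp(-(patches - x) ** 2 / (2 * sigma_r ** 2))   # 灰度权重
    w = w_s * w_r
    return (w * patches).sum(1, keepdim=True) / w.sum(1, keepdim=True)

ct = torch.rand(1, 1, 64, 64)
s_s = torch.full((1, 1, 64, 64), 1.5)     # 本应由预测模块输出(示例固定)
s_r = torch.full((1, 1, 64, 64), 0.1)
out = bilateral_filter(ct, s_s, s_r)
```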

链接: https://arxiv.org/abs/2504.13519
作者: Yipeng Sun,Linda-Sophie Schneider,Mingxuan Gu,Siyuan Mei,Chengze Ye,Fabian Wagner,Siming Bayer,Andreas Maier
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:Effective denoising is crucial in low-dose CT to enhance subtle structures and low-contrast lesions while preventing diagnostic errors. Supervised methods struggle with limited paired datasets, and self-supervised approaches often require multiple noisy images and rely on deep networks like U-Net, offering little insight into the denoising mechanism. To address these challenges, we propose an interpretable self-supervised single-image denoising framework – Filter2Noise (F2N). Our approach introduces an Attention-Guided Bilateral Filter that is adapted to each noisy input through a lightweight module that predicts spatially varying filter parameters, which can be visualized and adjusted post-training for user-controlled denoising in specific regions of interest. To enable single-image training, we introduce a novel downsampling shuffle strategy with a new self-supervised loss function that extends the concept of Noise2Noise to a single image and addresses spatially correlated noise. On the Mayo Clinic 2016 low-dose CT dataset, F2N outperforms the leading self-supervised single-image method (ZS-N2N) by 4.59 dB PSNR while improving transparency, user control, and parametric efficiency. These features provide key advantages for medical applications that require precise and interpretable noise reduction. Our code is demonstrated at this https URL.
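
The core filtering operation can be sketched as a bilateral filter whose spatial and range sigmas vary per pixel. In F2N these sigma maps would be predicted by the attention module; in this minimal NumPy sketch they are fixed constants, and all names are illustrative.

```python
import numpy as np

def bilateral_filter(img, sigma_s, sigma_r, radius=3):
    """Bilateral filter with per-pixel spatial/range sigma maps (H, W)."""
    H, W = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    dist2 = xs ** 2 + ys ** 2
    pad = np.pad(img, radius, mode="reflect")
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # spatial weight uses sigma_s[i, j], range weight uses sigma_r[i, j]
            w = np.exp(-dist2 / (2 * sigma_s[i, j] ** 2)
                       - (patch - img[i, j]) ** 2 / (2 * sigma_r[i, j] ** 2))
            out[i, j] = (w * patch).sum() / w.sum()
    return out

noisy = np.random.rand(64, 64).astype(np.float32)
sigma_s = np.full((64, 64), 2.0)   # in F2N these maps would be predicted
sigma_r = np.full((64, 64), 0.1)   # by the lightweight attention module
print(bilateral_filter(noisy, sigma_s, sigma_r).shape)
```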

[CV-82] DADU: Dual Attention-based Deep Supervised UNet for Automated Semantic Segmentation of Cardiac Images

[Quick Read]: This paper aims to improve the segmentation accuracy of the left ventricle, right ventricle, and myocardial scar tissue in cardiac magnetic resonance (CMR) images. To this end, it proposes an enhanced deep learning model that combines a UNet architecture, a dual attention mechanism based on channel and spatial attention, edge-detection-based skip connections, and deeply supervised learning. The dual attention mechanism improves feature representation by jointly capturing the importance of feature channels and spatial locations; the edge-detection-based skip connections improve the quality of images reconstructed from feature maps; and deep supervision mitigates vanishing gradients in deep-network classification. Experiments show that the method outperforms other leading techniques on both the Dice Similarity Score (DSC) and the Hausdorff Distance (HD).

Link: https://arxiv.org/abs/2504.13415
Authors: Racheal Mukisa, Arvind K. Bansal
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures

Click to view abstract

Abstract:We propose an enhanced deep learning-based model for image segmentation of the left and right ventricles and myocardium scar tissue from cardiac magnetic resonance (CMR) images. The proposed technique integrates UNet, channel and spatial attention, edge-detection based skip-connection and deep supervised learning to improve the accuracy of the CMR image-segmentation. Images are processed using multiple channels to generate multiple feature-maps. We built a dual attention-based model to integrate channel and spatial attention. The use of extracted edges in skip connection improves the reconstructed images from feature-maps. The use of deep supervision reduces vanishing gradient problems inherent in classification based on deep neural networks. The algorithms for dual attention-based model, corresponding implementation and performance results are described. The performance results show that this approach has attained high accuracy: 98% Dice Similarity Score (DSC) and significantly lower Hausdorff Distance (HD). The performance results outperform other leading techniques both in DSC and HD.
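
A common way to realize channel-plus-spatial dual attention is the CBAM-style block below. This PyTorch sketch illustrates the general mechanism only; the paper's exact module, layer sizes, and reduction ratio are not specified here and the values used are assumptions.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM-style)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: squeeze spatial dims, re-weight channels
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # spatial attention: pool over channels, then a 7x7 convolution
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(pooled))

feat = torch.randn(2, 32, 64, 64)
print(DualAttention(32)(feat).shape)  # torch.Size([2, 32, 64, 64])
```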

[CV-83] Cardiac MRI Semantic Segmentation for Ventricles and Myocardium using Deep Learning

[Quick Read]: This paper tackles the insufficient accuracy of automated semantic segmentation of cardiac magnetic resonance (CMR) images, aiming to localize the major cardiac structures (the left ventricle cavity, right ventricle cavity, and left ventricle myocardium) more precisely and thereby improve the diagnosis of cardiovascular disease. The key idea is to modify a U-Net so that edge attributes and context information are extracted during downsampling and infused during upsampling, strengthening the spatial localization of the target structures. With this approach, the paper reports a 2%-11% improvement in the Dice similarity coefficient (DSC) and a reduction of the Hausdorff distance (HD) by 1.6 to 5.7 mm.

Link: https://arxiv.org/abs/2504.13391
Authors: Racheal Mukisa, Arvind K. Bansal
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures

Click to view abstract

Abstract:Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost-effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardiovascular disease such as cardiomyopathy, valvular diseases, abnormalities related to septum perforations, and blood-flow rate. Semantic segmentation labels the CMR image at the pixel level, and localizes its subcomponents to facilitate the detection of abnormalities, including abnormalities in cardiac wall motion in an aging heart with muscle abnormalities, vascular abnormalities, and valvular abnormalities. In this paper, we describe a model to improve semantic segmentation of CMR images. The model extracts edge-attributes and context information during down-sampling of the U-Net and infuses this information during up-sampling to localize three major cardiac structures: left ventricle cavity (LV); right ventricle cavity (RV); and LV myocardium (LMyo). We present an algorithm and performance results. A comparison of our model with previous leading models, using similarity metrics between actual image and segmented image, shows that our approach improves Dice similarity coefficient (DSC) by 2%-11% and lowers Hausdorff distance (HD) by 1.6 to 5.7 mm.

[CV-84] Accelerated Optimization of Implicit Neural Representations for CT Reconstruction

[Quick Read]: This paper addresses the slow optimization of implicit neural representations (INRs) for low-dose/sparse-view X-ray computed tomography (CT) reconstruction. The key of the solution is two acceleration strategies: (1) a modified loss function with improved conditioning, and (2) an algorithm based on the alternating direction method of multipliers (ADMM). The study shows that both approaches significantly accelerate INR-based reconstruction of a synthetic breast CT phantom in a sparse-view setting.

Link: https://arxiv.org/abs/2504.13390
Authors: Mahrokh Najaf, Gregory Ongie
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE ISBI 2025

Click to view abstract

Abstract:Inspired by their success in solving challenging inverse problems in computer vision, implicit neural representations (INRs) have been recently proposed for reconstruction in low-dose/sparse-view X-ray computed tomography (CT). An INR represents a CT image as a small-scale neural network that takes spatial coordinates as inputs and outputs attenuation values. Fitting an INR to sinogram data is similar to classical model-based iterative reconstruction methods. However, training INRs with losses and gradient-based algorithms can be prohibitively slow, taking many thousands of iterations to converge. This paper investigates strategies to accelerate the optimization of INRs for CT reconstruction. In particular, we propose two approaches: (1) using a modified loss function with improved conditioning, and (2) an algorithm based on the alternating direction method of multipliers. We illustrate that both of these approaches significantly accelerate INR-based reconstruction of a synthetic breast CT phantom in a sparse-view setting.
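
The INR itself is just a coordinate MLP fitted through the forward projection operator. The sketch below uses a random matrix as a stand-in projection and plain Adam, i.e. the slow gradient-based baseline that the paper accelerates; the proposed preconditioned loss and ADMM variants are not reproduced here, and all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class INR(nn.Module):
    """Coordinate MLP: (x, y) -> attenuation value."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, xy):
        return self.net(xy)

torch.manual_seed(0)
# pixel-center coordinates of a tiny 32x32 image
coords = torch.stack(torch.meshgrid(
    torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32),
    indexing="ij"), dim=-1).reshape(-1, 2)
A = torch.randn(64, coords.shape[0]) / coords.shape[0] ** 0.5  # toy "projector"
y = torch.randn(64, 1)                                         # toy "sinogram"

model = INR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    loss = ((A @ model(coords)) - y).pow(2).mean()  # sinogram-domain data fit
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```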

[CV-85] Putting the Segment Anything Model to the Test with 3D Knee MRI – A Comparison with State-of-the-Art Performance BMVC2024

[Quick Read]: This paper targets the automated segmentation of the knee menisci in 3D knee MRI, to enable earlier detection and treatment of meniscal abnormalities and to shed light on the role of the menisci in the pathogenesis of knee osteoarthritis (OA). Prior work has mainly relied on variants of convolutional networks, without exploiting recent large vision transformer segmentation models. The key contribution is an evaluation of the Segment Anything Model (SAM), a foundation segmentation model: after end-to-end fine-tuning, SAM reaches a Dice score of 0.87±0.03, comparable to a 3D U-Net and close to the winning score of the IWOAI Knee MRI Segmentation Challenge 2019 (0.88±0.03). However, SAM is inferior to the 3D U-Net on the Hausdorff distance, indicating limitations in matching fine anatomical morphology. Despite its generality, SAM may therefore be unsuitable for 3D medical segmentation tasks involving fine anatomical structures with low contrast and poorly defined boundaries.

Link: https://arxiv.org/abs/2504.13340
Authors: Oliver Mills, Philip Conaghan, Nishant Ravikumar, Samuel Relton
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work accepted at BMVC 2024. Minor changes to the camera-ready version since acceptance include a corrected running header and the addition of an Acknowledgments section (including code availability)

Click to view abstract

Abstract:Menisci are cartilaginous tissue found within the knee that contribute to joint lubrication and weight dispersal. Damage to menisci can lead to onset and progression of knee osteoarthritis (OA), a condition that is a leading cause of disability, and for which there are few effective therapies. Accurate automated segmentation of menisci would allow for earlier detection and treatment of meniscal abnormalities, as well as shedding more light on the role the menisci play in OA pathogenesis. Focus in this area has mainly used variants of convolutional networks, but there has been no attempt to utilise recent large vision transformer segmentation models. The Segment Anything Model (SAM) is a so-called foundation segmentation model, which has been found useful across a range of different tasks due to the large volume of data used for training the model. In this study, SAM was adapted to perform fully-automated segmentation of menisci from 3D knee magnetic resonance images. A 3D U-Net was also trained as a baseline. It was found that, when fine-tuning only the decoder, SAM was unable to compete with 3D U-Net, achieving a Dice score of 0.81±0.03, compared to 0.87±0.03, on a held-out test set. When fine-tuning SAM end-to-end, a Dice score of 0.87±0.03 was achieved. The performance of both the end-to-end trained SAM configuration and the 3D U-Net were comparable to the winning Dice score (0.88±0.03) in the IWOAI Knee MRI Segmentation Challenge 2019. Performance in terms of the Hausdorff Distance showed that both configurations of SAM were inferior to 3D U-Net in matching the meniscus morphology. Results demonstrated that, despite its generalisability, SAM was unable to outperform a basic 3D U-Net in meniscus segmentation, and may not be suitable for similar 3D medical image segmentation tasks also involving fine anatomical structures with low contrast and poorly-defined boundaries.
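
The Dice score used to compare SAM and the 3D U-Net is straightforward to compute. A small NumPy helper, with random masks standing in for predictions and ground truth:

```python
import numpy as np

def dice(pred, gt, eps=1e-6):
    """Dice similarity coefficient for binary masks of any shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = np.random.rand(16, 64, 64) > 0.5   # stand-in for a SAM / 3D U-Net mask
gt = np.random.rand(16, 64, 64) > 0.5     # stand-in for the manual annotation
print(round(dice(pred, gt), 3))           # ~0.5 for independent random masks
```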

[CV-86] Focus3D: A Practical Method to Adaptively Focus ISAR Data and Provide 3-D Information for Automatic Target Recognition

[Quick Read]: This paper addresses image focusing and pose estimation for automatic target recognition (ATR) of ships at sea. Traditional methods provide only focused images, making it hard to determine the ship's pose (for example, profile view versus plan view), which limits recognition accuracy. The key idea is to combine a focus algorithm with a method that models the ship's angles relative to the radar, introducing a two-degree-of-freedom pose description (aspect angle and tilt angle) to precisely model the ship's rotation in the horizontal plane and the variation of the effective grazing angle, thereby improving identification based on matching known ship features. This removes the single-angle assumption of prior work and offers a more complete treatment of multi-angle data.

Link: https://arxiv.org/abs/2504.13321
Authors: John R. Bennett
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:To improve ATR identification of ships at sea requires an advanced ISAR processor - one that not only provides focused images but can also determine the pose of the ship. This tells us whether the image shows a profile (vertical plane) view, a plan (horizontal plane) view or some view in between. If the processor can provide this information, then the ATR processor can try to match the images with known vertical or horizontal features of ships and, in conjunction with estimated ship length, narrow the set of possible identifications. This paper extends the work of Melendez and Bennett [M-B, Ref. 1] by combining a focus algorithm with a method that models the angles of the ship relative to the radar. In M-B the algorithm was limited to a single angle and the plane of rotation was not determined. This assumption may be fine for a short time image where there is limited data available to determine the pose. However, the present paper models the ship rotation with two angles - aspect angle, representing rotation in the horizontal plane, and tilt angle, representing variations in the effective grazing angle to the ship.

[CV-87] Efficient Brain Tumor Segmentation Using a Dual-Decoder 3D U-Net with Attention Gates (DDUNet)

[Quick Read]: This paper addresses the trade-off between efficiency and accuracy in brain tumor segmentation under resource-constrained settings. State-of-the-art segmentation methods typically require extensive computational resources and long training times, limiting their practical use. The key is a novel dual-decoder U-Net architecture with attention-gated skip connections, which maintains high segmentation accuracy while markedly reducing training demands, making it well suited to MRI analysis. On the BraTS 2020 dataset, the model achieves Dice scores of 85.06% for Whole Tumor (WT), 80.61% for Tumor Core (TC), and 71.26% for Enhancing Tumor (ET) within only 50 epochs, outperforming several commonly used U-Net variants. This resource-efficient model offers a viable option for improving early detection and diagnosis of brain tumors and, ultimately, patient outcomes.

Link: https://arxiv.org/abs/2504.13200
Authors: Mohammad Mahdi Danesh Pajouh
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Cancer remains one of the leading causes of mortality worldwide, and among its many forms, brain tumors are particularly notorious due to their aggressive nature and the critical challenges involved in early diagnosis. Recent advances in artificial intelligence have shown great promise in assisting medical professionals with precise tumor segmentation, a key step in timely diagnosis and treatment planning. However, many state-of-the-art segmentation methods require extensive computational resources and prolonged training times, limiting their practical application in resource-constrained settings. In this work, we present a novel dual-decoder U-Net architecture enhanced with attention-gated skip connections, designed specifically for brain tumor segmentation from MRI scans. Our approach balances efficiency and accuracy by achieving competitive segmentation performance while significantly reducing training demands. Evaluated on the BraTS 2020 dataset, the proposed model achieved Dice scores of 85.06% for Whole Tumor (WT), 80.61% for Tumor Core (TC), and 71.26% for Enhancing Tumor (ET) in only 50 epochs, surpassing several commonly used U-Net variants. Our model demonstrates that high-quality brain tumor segmentation is attainable even under limited computational resources, thereby offering a viable solution for researchers and clinicians operating with modest hardware. This resource-efficient model has the potential to improve early detection and diagnosis of brain tumors, ultimately contributing to better patient outcomes.
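
An attention-gated skip connection of the kind used here typically weights encoder features by a gating signal from the decoder before concatenation. A minimal 3D PyTorch sketch of that idea follows; channel counts and spatial sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate on a skip connection (Attention U-Net style)."""
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)
        self.w_g = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: encoder skip features; g: decoder gating signal (same D,H,W here)
        a = torch.sigmoid(self.psi(torch.relu(self.w_x(x) + self.w_g(g))))
        return x * a  # suppress irrelevant regions before concatenation

x = torch.randn(1, 32, 16, 32, 32)
g = torch.randn(1, 64, 16, 32, 32)
print(AttentionGate(32, 64, 16)(x, g).shape)  # torch.Size([1, 32, 16, 32, 32])
```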

[CV-88] Advanced Deep Learning and Large Language Models : Comprehensive Insights for Cancer Detection

[Quick Read]: This paper addresses the lack of a comprehensive analysis of the role of deep learning (DL) in cancer detection. Although many reviews cover DL in healthcare, most focus on specific aspects, leaving gaps in the overall understanding. To fill these gaps, the paper reviews advanced DL techniques including transfer learning (TL), reinforcement learning (RL), federated learning (FL), Transformers, and large language models (LLMs). These techniques improve diagnostic accuracy, mitigate data scarcity, and enable decentralized learning while preserving data privacy: TL adapts pre-trained models to new datasets, improving performance with limited labeled data; RL optimizes diagnostic pathways and treatment strategies; FL fosters collaborative model development without sharing sensitive data; and Transformers and LLMs, borrowed from natural language processing, improve the interpretability of medical data. The review also examines the efficiency of these techniques in cancer diagnosis, addresses challenges such as data imbalance, and proposes solutions, serving as a resource for researchers and practitioners and as guidance for future research on advanced DL for cancer detection.

Link: https://arxiv.org/abs/2504.13186
Authors: Yassine Habchi, Hamza Kheddar, Yassine Himeur, Adel Belouchrani, Erchin Serpedin, Fouad Khelifi, Muhammad E.H. Chowdhury
Affiliations: University Center Salhi Ahmed, Naama, Algeria; University of Medea, 26000, Algeria; University of Dubai, Dubai, UAE; Ecole Nationale Polytechnique / LDCCP Lab., El Harrach, Algiers, Algeria; Texas A&M University, College Station, TX 77843-3128, USA; Northumbria University at Newcastle; Qatar University, Doha 2713, Qatar
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid advancement of deep learning (DL) has transformed healthcare, particularly in cancer detection and diagnosis. DL surpasses traditional machine learning and human accuracy, making it a critical tool for identifying diseases. Despite numerous reviews on DL in healthcare, a comprehensive analysis of its role in cancer detection remains limited. Existing studies focus on specific aspects, leaving gaps in understanding its broader impact. This paper addresses these gaps by reviewing advanced DL techniques, including transfer learning (TL), reinforcement learning (RL), federated learning (FL), Transformers, and large language models (LLMs). These approaches enhance accuracy, tackle data scarcity, and enable decentralized learning while maintaining data privacy. TL adapts pre-trained models to new datasets, improving performance with limited labeled data. RL optimizes diagnostic pathways and treatment strategies, while FL fosters collaborative model development without sharing sensitive data. Transformers and LLMs, traditionally used in natural language processing, are now applied to medical data for improved interpretability. Additionally, this review examines these techniques’ efficiency in cancer diagnosis, addresses challenges like data imbalance, and proposes solutions. It serves as a resource for researchers and practitioners, providing insights into current trends and guiding future research in advanced DL for cancer detection.

Artificial Intelligence

[AI-0] Parameter-Efficient Continual Fine-Tuning: A Survey

[Quick Read]: This paper addresses a fundamental limitation that traditional machine learning models face when adapting to dynamic learning scenarios: their strong dependence on the i.i.d. assumption. In particular, it asks how large pre-trained models can efficiently adapt to continuously evolving environments (such as the real world) in which new data and tasks arrive sequentially. This challenge defines the field of Continual Learning (CL), whose goal is to develop neural models capable of learning throughout their lifetime. The key focus is Parameter-Efficient Fine-Tuning (PEFT) and its continual variant, Parameter-Efficient Continual Fine-Tuning (PECFT): methods that achieve performance comparable to full fine-tuning through small, efficient modifications while tackling catastrophic forgetting, thereby improving a model's ability to adapt to multiple tasks sequentially.

Link: https://arxiv.org/abs/2504.13822
Authors: Eric Nuertey Coleman, Luigi Quarantiello, Ziyue Liu, Qinwen Yang, Samrat Mukherjee, Julio Hurtado, Vincenzo Lomonaco
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The emergence of large pre-trained networks has revolutionized the AI field, unlocking new possibilities and achieving unprecedented performance. However, these models inherit a fundamental limitation from traditional Machine Learning approaches: their strong dependence on the i.i.d. assumption hinders their adaptability to dynamic learning scenarios. We believe the next breakthrough in AI lies in enabling efficient adaptation to evolving environments – such as the real world – where new data and tasks arrive sequentially. This challenge defines the field of Continual Learning (CL), a Machine Learning paradigm focused on developing lifelong learning neural models. One alternative to efficiently adapt these large-scale models is known as Parameter-Efficient Fine-Tuning (PEFT). These methods tackle the issue of adapting the model to a particular data or scenario by performing small and efficient modifications, achieving similar performance to full fine-tuning. However, these techniques still lack the ability to adjust the model to multiple tasks continually, as they suffer from the issue of Catastrophic Forgetting. In this survey, we first provide an overview of CL algorithms and PEFT methods before reviewing the state-of-the-art on Parameter-Efficient Continual Fine-Tuning (PECFT). We examine various approaches, discuss evaluation metrics, and explore potential future research directions. Our goal is to highlight the synergy between CL and Parameter-Efficient Fine-Tuning, guide researchers in this field, and pave the way for novel future research directions.
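
As one concrete PEFT example of the kind surveyed here, a LoRA-style adapter freezes the pretrained weight and learns only a low-rank update. A minimal PyTorch sketch, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # base(x) + scale * x A^T B^T, i.e. a rank-r correction of the weight
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank factors are updated
```

In a PECFT setting, keeping one such adapter per task is a common way to sidestep catastrophic forgetting, since the shared backbone is never overwritten.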

[AI-1] Imitation Learning with Precisely Labeled Human Demonstrations

[Quick Read]: This paper addresses the challenges of training generalist robots from human demonstrations within the imitation learning paradigm, including inferring precise actions, bridging the embodiment gap, and integrating with frontier generalist robot training pipelines. The key innovation is to give the hand-held gripper a unique, easily segmentable color and to combine RANSAC and ICP registration to obtain precise end-effector pose estimates. Exploiting this user-controlled appearance simplifies and stabilizes the processing of human demonstration data; in simulation, precisely labeled human demonstrations alone reach on average 88.1% of the performance obtained with robot demonstrations, and they further boost policy performance when combined with robot demonstrations despite the inherent embodiment gap.

Link: https://arxiv.org/abs/2504.13803
Authors: Yilong Song
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Within the imitation learning paradigm, training generalist robots requires large-scale datasets obtainable only through diverse curation. Due to the relative ease to collect, human demonstrations constitute a valuable addition when incorporated appropriately. However, existing methods utilizing human demonstrations face challenges in inferring precise actions, ameliorating embodiment gaps, and fusing with frontier generalist robot training pipelines. In this work, building on prior studies that demonstrate the viability of using hand-held grippers for efficient data collection, we leverage the user’s control over the gripper’s appearance–specifically by assigning it a unique, easily segmentable color–to enable simple and reliable application of the RANSAC and ICP registration method for precise end-effector pose estimation. We show in simulation that precisely labeled human demonstrations on their own allow policies to reach on average 88.1% of the performance of using robot demonstrations, and boost policy performance when combined with robot demonstrations, despite the inherent embodiment gap.
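
The color-segmentation-plus-ICP idea can be sketched with Open3D (an assumed dependency; `gripper_pose`, the color thresholds, and the per-point RGB layout are all hypothetical). A RANSAC-based global initialization, which the paper uses, is replaced here by an identity initialization for brevity:

```python
import numpy as np
import open3d as o3d  # assumed dependency: pip install open3d

def gripper_pose(rgb, points, template, color_low, color_high):
    """Segment a uniquely colored gripper by RGB threshold, then register a
    gripper template point cloud to the segmented points with ICP."""
    mask = np.all((rgb >= color_low) & (rgb <= color_high), axis=-1)
    observed = o3d.geometry.PointCloud(
        o3d.utility.Vector3dVector(points[mask]))
    result = o3d.pipelines.registration.registration_icp(
        template, observed, max_correspondence_distance=0.01,
        init=np.eye(4),
        estimation_method=o3d.pipelines.registration
            .TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 end-effector pose estimate

# toy demo: per-point colors, a template, and a slightly shifted observation
pts = np.random.rand(500, 3)
template = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
rgb = np.tile([255, 0, 0], (500, 1))  # pretend every point is "gripper red"
print(gripper_pose(rgb, pts + 0.005, template, (200, 0, 0), (255, 80, 80)))
```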

[AI-2] Meta-Learning and Knowledge Discovery based Physics-Informed Neural Network for Remaining Useful Life Prediction

[Quick Read]: This paper addresses remaining useful life (RUL) prediction for rotating machinery, which is hampered by scarce target-domain data and unclear degradation dynamics. It proposes a Meta-Learning and Knowledge Discovery-based Physics-Informed Neural Network (MKDPINN). The key is a Hidden State Mapper (HSM) that maps noisy sensor data into a low-dimensional hidden state space, together with a Physics-Guided Regulator (PGR) that learns the unknown nonlinear PDEs governing degradation evolution, embedding these physical constraints into the PINN framework and thereby combining data-driven and physics-based modeling. Meta-learning across source-domain meta-tasks then enables few-shot adaptation to new target tasks, improving generalization and accuracy under data scarcity.

Link: https://arxiv.org/abs/2504.13797
Authors: Yu Wang, Shujie Liu, Shuai Lv, Gengshuo Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages, 20 figures

Click to view abstract

Abstract:Predicting the remaining useful life (RUL) of rotating machinery is critical for industrial safety and maintenance, but existing methods struggle with scarce target-domain data and unclear degradation dynamics. We propose a Meta-Learning and Knowledge Discovery-based Physics-Informed Neural Network (MKDPINN) to address these challenges. The method first maps noisy sensor data to a low-dimensional hidden state space via a Hidden State Mapper (HSM). A Physics-Guided Regulator (PGR) then learns unknown nonlinear PDEs governing degradation evolution, embedding these physical constraints into the PINN framework. This integrates data-driven and physics-based approaches. The framework uses meta-learning, optimizing across source-domain meta-tasks to enable few-shot adaptation to new target tasks. Experiments on industrial data and the C-MAPSS benchmark show MKDPINN outperforms baselines in generalization and accuracy, proving its effectiveness for RUL prediction under data scarcity.
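
The physics-guided part of a PINN of this kind boils down to an autograd residual between the time derivative of the hidden state and a learned dynamics term. A toy PyTorch sketch with a 1-D hidden state; every network and shape here is illustrative, not the paper's architecture:

```python
import torch

# mapper: (sensor feature, time) -> hidden degradation state h
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
# learned "physics" term f(h, t), playing the role of the unknown dynamics
f = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 1))

t = torch.rand(256, 1, requires_grad=True)   # time samples
x = torch.rand(256, 1)                       # sensor-derived input
h = net(torch.cat([x, t], dim=1))            # hidden degradation state
dh_dt = torch.autograd.grad(h, t, torch.ones_like(h), create_graph=True)[0]
residual = dh_dt - f(torch.cat([h, t], dim=1))
loss_physics = residual.pow(2).mean()        # added to the data-fit loss
print(float(loss_physics))
```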

[AI-3] Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion

[Quick Read]: This paper addresses the noticeable gap in naturalness between real and generated speech samples in GAN-based voice conversion (VC) models. It also observes that many GAN models use a single-generator, single-discriminator scheme, whereas the target data distribution can be optimized more effectively with a single generator and multiple discriminators. The key is a new model, the Collective Learning Mechanism-based Optimal Transport GAN (CLOT-GAN), which integrates multiple discriminators, including a deep convolutional neural network (DCNN), a Vision Transformer (ViT), and a conformer, so that a collective learning mechanism captures the formant distribution of mel-spectrograms. An Optimal Transport (OT) loss is added to precisely bridge the gap between the source and target data distributions using OT theory. Experiments on the VCC 2018, VCTK, and CMU-Arctic datasets confirm that CLOT-GAN-VC outperforms existing VC models in both objective and subjective evaluations.

Link: https://arxiv.org/abs/2504.13791
Authors: Sandipan Dhar, Md. Tousin Akhter, Nanda Dulal Jana, Swagatam Das
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 2 figures, 3 tables

Click to view abstract

Abstract:After demonstrating significant success in image synthesis, Generative Adversarial Network (GAN) models have likewise made significant progress in the field of speech synthesis, leveraging their capacity to adapt the precise distribution of target data through adversarial learning processes. Notably, in the realm of State-Of-The-Art (SOTA) GAN-based Voice Conversion (VC) models, there exists a substantial disparity in naturalness between real and GAN-generated speech samples. Furthermore, while many GAN models currently operate on a single generator discriminator learning approach, optimizing target data distribution is more effectively achievable through a single generator multi-discriminator learning scheme. Hence, this study introduces a novel GAN model named Collective Learning Mechanism-based Optimal Transport GAN (CLOT-GAN) model, incorporating multiple discriminators, including the Deep Convolutional Neural Network (DCNN) model, Vision Transformer (ViT), and conformer. The objective of integrating various discriminators lies in their ability to comprehend the formant distribution of mel-spectrograms, facilitated by a collective learning mechanism. Simultaneously, the inclusion of Optimal Transport (OT) loss aims to precisely bridge the gap between the source and target data distribution, employing the principles of OT theory. The experimental validation on VCC 2018, VCTK, and CMU-Arctic datasets confirms that the CLOT-GAN-VC model outperforms existing VC models in objective and subjective assessments.

[AI-4] Probabilistic Stability Guarantees for Feature Attributions

[Quick Read]: This paper addresses the overly conservative nature of existing stability guarantees for feature attributions. The key is the notion of soft stability together with a simple, model-agnostic, and sample-efficient Stability Certification Algorithm (SCA) that provides non-trivial and interpretable guarantees for any attribution method. With mild smoothing, SCA achieves a graceful trade-off between accuracy and stability, rather than the more drastic compromise required by prior certification methods. The paper also gives a novel characterization of stability under smoothing using Boolean function analysis, and validates soft stability on vision and language tasks as a measure of the robustness of explanation methods.

Link: https://arxiv.org/abs/2504.13787
Authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Stability guarantees are an emerging tool for evaluating feature attributions, but existing certification methods rely on smoothed classifiers and often yield conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, and sample-efficient stability certification algorithm (SCA) that provides non-trivial and interpretable guarantees for any attribution. Moreover, we show that mild smoothing enables a graceful tradeoff between accuracy and stability, in contrast to prior certification methods that require a more aggressive compromise. Using Boolean function analysis, we give a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks, and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
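
The flavor of a sampling-based stability certificate can be seen in a toy Monte Carlo estimate: perturb the input, check whether the top-attributed features survive, and bound the hit rate (for example with a Hoeffding bound). This is only a conceptual stand-in for SCA, run on a linear toy model with made-up names:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=20)          # toy linear "model"
x = rng.normal(size=20)

def attribution(x):
    return w * x                 # toy feature attribution (input x gradient)

def top_k(a, k=3):
    return frozenset(np.argsort(-np.abs(a))[:k])

def soft_stability(x, sigma=0.1, n=2000, k=3):
    """Fraction of random perturbations that keep the same top-k features.
    The empirical rate, combined with a concentration bound, yields a
    probabilistic stability estimate for the attribution."""
    base = top_k(attribution(x), k)
    hits = sum(top_k(attribution(x + rng.normal(scale=sigma, size=x.shape)), k)
               == base for _ in range(n))
    return hits / n

print(soft_stability(x))
```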

[AI-5] DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs

[Quick Read]: This paper addresses the ethical and legal risks that arise when large language models (LLMs) inadvertently memorize sensitive or copyrighted information during training and later disclose it to users at inference time. The key of the proposed DP2Unlearning framework is to train on textual data protected with ε-differential privacy (DP), which later enables formal forgetting guarantees without retraining the model from scratch, at a cost far below full retraining. This mechanism ensures the target data are effectively forgotten while preserving model utility, outperforming conventional approximate unlearning methods.

Link: https://arxiv.org/abs/2504.13774
Authors: Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 49 pages

Click to view abstract

Abstract:Large language models (LLMs) have recently revolutionized language processing tasks but have also brought ethical and legal issues. LLMs have a tendency to memorize potentially private or copyrighted information present in the training data, which might then be delivered to end users at inference time. When this happens, a naive solution is to retrain the model from scratch after excluding the undesired data. Although this guarantees that the target data have been forgotten, it is also prohibitively expensive for LLMs. Approximate unlearning offers a more efficient alternative, as it consists of ex post modifications of the trained model itself to prevent undesirable results, but it lacks forgetting guarantees because it relies solely on empirical evidence. In this work, we present DP2Unlearning, a novel LLM unlearning framework that offers formal forgetting guarantees at a significantly lower cost than retraining from scratch on the data to be retained. DP2Unlearning involves training LLMs on textual data protected using ε-differential privacy (DP), which later enables efficient unlearning with the guarantees against disclosure associated with the chosen ε. Our experiments demonstrate that DP2Unlearning achieves similar model performance post-unlearning, compared to an LLM retraining from scratch on retained data – the gold standard exact unlearning – but at approximately half the unlearning cost. In addition, with a reasonable computational cost, it outperforms approximate unlearning methods at both preserving the utility of the model post-unlearning and effectively forgetting the targeted information.

[AI-6] A Survey for What Developers Require in AI-powered Tools that Aid in Component Selection in CBSD

[Quick Read]: This paper tackles the absence of a standard, widely accepted method or tool for component selection in industry, a problem rooted in the gap between academia and practice. Through a mixed-methods survey of nearly 100 people engaged in component-based software engineering practice or research, the paper seeks a deeper understanding of the challenges industry faces, how these needs could be addressed, and current best practices, and it identifies and prioritizes quality criteria for component selection from an industry perspective. Responding to calls to incorporate recent technical advances into CBSD component selection tools, it also explores professionals' perceptions of AI-driven tools, both existing and envisioned. The key lies in synthesizing industry needs with academic research to propose component selection methods and quality-assessment frameworks that bridge the gap, while examining the potential of AI to improve the efficiency and effectiveness of component selection.

Link: https://arxiv.org/abs/2504.13751
Authors: Mahdi Jaberzadeh Ansari, Ann Barcomb
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 4 figures, The 29th International Conference on Evaluation and Assessment in Software Engineering, 17 to 20 June, 2025, Istanbul, Turkey

Click to view abstract

Abstract:Although it has been more than four decades that the first components-based software development (CBSD) studies were conducted, there is still no standard method or tool for component selection which is widely accepted by the industry. The gulf between industry and academia contributes to the lack of an accepted tool. We conducted a mixed methods survey of nearly 100 people engaged in component-based software engineering practice or research to better understand the problems facing industry, how these needs could be addressed, and current best practices employed in component selection. We also sought to identify and prioritize quality criteria for component selection from an industry perspective. In response to the call for CBSD component selection tools to incorporate recent technical advances, we also explored the perceptions of professionals about AI-driven tools, present and envisioned.

[AI-7] Exploring Multimodal Prompt for Visualization Authoring with Large Language Models

[Quick Read]: This paper addresses the misinterpretation of user intent and the time-consuming iterations that arise when visualization authoring is guided by natural-language instructions to large language models (LLMs), since natural language is limited in precision and expressiveness. The key solution is to introduce visual prompts as a complementary input modality to text prompts, clarifying user intent and improving the LLM's ability to interpret it. Building on an empirical study of how LLMs misread ambiguous or incomplete text prompts, the authors design VisPilot, a system that lets users create visualizations intuitively and efficiently with multimodal prompts, including text, sketches, and direct manipulation of existing visualizations. Two case studies and a controlled user study show that multimodal prompting improves the usability of LLMs for visualization authoring without hurting task efficiency, and the paper discusses its potential for enhancing human-AI collaboration in creative visualization tasks.

Link: https://arxiv.org/abs/2504.13700
Authors: Zhen Wen, Luoxuan Weng, Yinghao Tang, Runjin Zhang, Yuxin Liu, Bo Pan, Minfeng Zhu, Wei Chen
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have shown great potential in automating the process of visualization authoring through simple natural language utterances. However, instructing LLMs using natural language is limited in precision and expressiveness for conveying visualization intent, leading to misinterpretation and time-consuming iterations. To address these limitations, we conduct an empirical study to understand how LLMs interpret ambiguous or incomplete text prompts in the context of visualization authoring, and the conditions making LLMs misinterpret user intent. Informed by the findings, we introduce visual prompts as a complementary input modality to text prompts, which help clarify user intent and improve LLMs’ interpretation abilities. To explore the potential of multimodal prompting in visualization authoring, we design VisPilot, which enables users to easily create visualizations using multimodal prompts, including text, sketches, and direct manipulations on existing visualizations. Through two case studies and a controlled user study, we demonstrate that VisPilot provides a more intuitive way to create visualizations without affecting the overall task efficiency compared to text-only prompting approaches. Furthermore, we analyze the impact of text and visual prompts in different visualization tasks. Our findings highlight the importance of multimodal prompting in improving the usability of LLMs for visualization authoring. We discuss design implications for future visualization systems and provide insights into how multimodal prompts can enhance human-AI collaboration in creative visualization tasks. All materials are available at this https URL.

[AI-8] Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction

[Quick Read]: This paper addresses the tedium of manually identifying vulnerabilities in web applications and APIs, and the tendency of traditional static security scanners to produce many false positives. Machine learning-based approaches are promising, but they usually perform well only when training and test data are closely related. A key challenge for ML-based vulnerability detection is providing the model with suitable and concise code context, since overly long contexts impair code comprehension, particularly for smaller models.

To this end, the paper introduces Trace Gadgets, a novel code representation that minimizes code context by removing unrelated code and precisely captures the statements on the path to the vulnerability. As input, Trace Gadgets give ML models a minimal but complete context, improving detection performance. The authors also build a large-scale dataset from real-world applications with manually curated labels to further improve ML-based vulnerability detectors. With Trace Gadgets as input, state-of-the-art ML models outperform industry-standard static scanners such as GitHub's CodeQL by at least 4% on a fully unseen dataset, and applying the framework to real applications uncovered previously unknown vulnerabilities in widely deployed software.

Link: https://arxiv.org/abs/2504.13676
Authors: Felix Mächtle, Nils Loose, Tim Schulz, Florian Sieck, Jan-Niclas Serr, Ralf Möller, Thomas Eisenbarth
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub’s CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.

[AI-9] Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code

[Quick Read]: This paper examines the inconsistent performance, hallucinations, and quality issues of large language models (LLMs) in code generation, which complicate program comprehension and hinder maintainability. It focuses on prompt engineering as a possible remedy and, specifically, investigates the effect of prompt patterns on code quality, namely maintainability, security, and reliability. The key finding, from an empirical analysis of prompting styles such as Zero-Shot, Zero-Shot with Chain-of-Thought, and Few-Shot on the Dev-GPT dataset, is that prompt structure has little substantial impact on these quality metrics.

Link: https://arxiv.org/abs/2504.13656
Authors: Antonio Della Porta, Stefano Lambiase, Fabio Palomba
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have rapidly transformed software development, especially in code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering-the practice of designing inputs to direct LLMs toward generating relevant outputs-may help address these challenges. In this regard, researchers have introduced prompt patterns, structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. An improved understanding of this relationship would be essential to advancing our collective knowledge on how to effectively use LLMs for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the Dev-GPT dataset. Results show that Zero-Shot prompting is most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially impact these quality metrics in ChatGPT-assisted code generation.
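
The statistical test behind this conclusion is a Kruskal-Wallis test across prompt-pattern groups; with SciPy it is a one-liner. The scores below are made up for illustration, not the paper's data:

```python
from scipy.stats import kruskal

# maintainability scores of code generated under three prompt patterns
zero_shot = [71, 68, 75, 80, 66, 73]
zero_shot_cot = [74, 69, 77, 72, 70, 76]
few_shot = [70, 72, 71, 78, 69, 74]   # illustrative numbers only

stat, p = kruskal(zero_shot, zero_shot_cot, few_shot)
print(f"H={stat:.2f}, p={p:.3f}")  # p > 0.05 -> no significant difference
```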

[AI-10] Multi-modal Knowledge Graph Generation with Semantics-enriched Prompts IJCNN2025

[Quick Read]: This paper addresses the scarcity of multi-modal knowledge graphs (MMKGs) and the challenges of constructing them from conventional knowledge graphs (KGs), in particular how to efficiently generate high-quality, contextually relevant images to enrich a KG. The key is the Visualizable Structural Neighbor Selection (VSNS) method, consisting of two modules: Visualizable Neighbor Selection (VNS), which filters out relations that are difficult to visualize, and Structural Neighbor Selection (SNS), which selects the neighbors that best capture the structural characteristics of an entity. Qualitative and quantitative evaluations on two datasets (MKG-Y and DB15K) show that selecting neighbors with VSNS yields higher-quality images that are more relevant to the knowledge graph.

Link: https://arxiv.org/abs/2504.13631
Authors: Yajing Xu, Zhiqiang Liu, Jiaoyan Chen, Mingchen Tu, Zhuo Chen, Jeff Z. Pan, Yichi Zhang, Yushan Zhu, Wen Zhang, Huajun Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by IJCNN 2025

Click to view abstract

Abstract:Multi-modal Knowledge Graphs (MMKGs) have been widely applied across various domains for knowledge representation. However, the existing MMKGs are significantly fewer than required, and their construction faces numerous challenges, particularly in ensuring the selection of high-quality, contextually relevant images for knowledge graph enrichment. To address these challenges, we present a framework for constructing MMKGs from conventional KGs. Furthermore, to generate higher-quality images that are more relevant to the context in the given knowledge graph, we designed a neighbor selection method called Visualizable Structural Neighbor Selection (VSNS). This method consists of two modules: Visualizable Neighbor Selection (VNS) and Structural Neighbor Selection (SNS). The VNS module filters relations that are difficult to visualize, while the SNS module selects neighbors that most effectively capture the structural characteristics of the entity. To evaluate the quality of the generated images, we performed qualitative and quantitative evaluations on two datasets, MKG-Y and DB15K. The experimental results indicate that using the VSNS method to select neighbors results in higher-quality images that are more relevant to the knowledge graph.

[AI-11] Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation

[Quick Read]: This paper addresses the noise and static-representation limitations of graph-based methods in personalized recommendation. The key is ALDA4Rec, which builds an item-item graph and filters noise via community detection while enriching user-item interactions; it uses Graph Convolutional Networks (GCNs) to learn short-term representations, models long-term embeddings with averaging, Gated Recurrent Units (GRUs), and attention, and adds an MLP-based adaptive weighting strategy to dynamically optimize long-term user preferences. Experiments on four real-world datasets show that ALDA4Rec clearly improves both accuracy and robustness over state-of-the-art baselines.

Link: https://arxiv.org/abs/2504.13614
Authors: Zahra Akhlaghi, Mostafa Haghir Chehreghani
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:The rapid growth of the internet has made personalized recommendation systems indispensable. Graph-based sequential recommendation systems, powered by Graph Neural Networks (GNNs), effectively capture complex user-item interactions but often face challenges such as noise and static representations. In this paper, we introduce the Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation (ALDA4Rec) method, a novel model that constructs an item-item graph, filters noise through community detection, and enriches user-item interactions. Graph Convolutional Networks (GCNs) are then employed to learn short-term representations, while averaging, GRUs, and attention mechanisms are utilized to model long-term embeddings. An MLP-based adaptive weighting strategy is further incorporated to dynamically optimize long-term user preferences. Experiments conducted on four real-world datasets demonstrate that ALDA4Rec outperforms state-of-the-art baselines, delivering notable improvements in both accuracy and robustness. The source code is available at this https URL.

[AI-12] Entropic Time Schedulers for Generative Diffusion Models

[Quick Read]: This paper addresses the performance limitations of generative diffusion models caused by a poor choice of the noise scheduling function. The key is an entropy-based time scheduler that replaces uniformly spaced sampling times with non-uniform ones chosen so that every sampling point contributes an equal amount of information to the final generation. The paper proves that this time reparameterization does not depend on the initial choice of time, and provides a tractable exact formula to estimate the entropic time of a trained model from the training loss without substantial overhead. Motivated by optimality results, a rescaled entropic time is also introduced. Experiments on Gaussian mixtures and ImageNet show that using the (rescaled) entropic times greatly improves the inference performance of trained models, especially in the regime of few function evaluations (NFEs): the quality of pretrained EDM2 models, measured by FID and FD-DINO, increases substantially under the rescaled entropic time reparameterization without increasing the number of function evaluations.

Link: https://arxiv.org/abs/2504.13612
Authors: Dejan Stancevic, Luca Ambrogioni
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages

Click to view abstract

Abstract:The practical performance of generative diffusion models depends on the appropriate choice of the noise scheduling function, which can also be equivalently expressed as a time reparameterization. In this paper, we present a time scheduler that selects sampling points based on entropy rather than uniform time spacing, ensuring that each point contributes an equal amount of information to the final generation. We prove that this time reparameterization does not depend on the initial choice of time. Furthermore, we provide a tractable exact formula to estimate this entropic time for a trained model using the training loss without substantial overhead. Alongside the entropic time, inspired by the optimality results, we introduce a rescaled entropic time. In our experiments with mixtures of Gaussian distributions and ImageNet, we show that using the (rescaled) entropic times greatly improves the inference performance of trained models. In particular, we found that the image quality in pretrained EDM2 models, as evaluated by FID and FD-DINO scores, can be substantially increased by the rescaled entropic time reparameterization without increasing the number of function evaluations, with greater improvements in the few NFEs regime.
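
The scheduling idea, picking sampling times so that each step carries an equal share of the total information, can be sketched by inverting a cumulative information profile. The sketch below assumes such a profile is already estimated (in the paper it is derived from the training loss); the function name and the toy density are hypothetical:

```python
import numpy as np

def entropic_schedule(t_grid, info_rate, n_steps):
    """Pick sampling times so each step carries equal cumulative information.
    info_rate: estimated information-contribution density over t_grid."""
    cdf = np.cumsum(info_rate)
    cdf /= cdf[-1]                               # normalize to [0, 1]
    targets = np.linspace(0, 1, n_steps + 1)[1:] # equal information quantiles
    return t_grid[np.searchsorted(cdf, targets)]

t = np.linspace(0, 1, 1000)
rate = np.exp(-4 * t)            # toy density concentrated at small t
print(entropic_schedule(t, rate, 8))  # times cluster where information is dense
```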

[AI-13] RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines

[Quick Read]: This paper addresses two main challenges in building AI assistants with external, domain-specific knowledge on retrieval-augmented generation (RAG) pipelines: first, retrieval and generation components are tightly intertwined, making it hard to localize which component causes errors in the final output; second, assessing the effect of parameter changes on output quality requires long pre-processing, producing prohibitively slow feedback cycles. The key is RAGGY, a developer tool that combines a Python library of composable RAG primitives with an interactive interface for real-time debugging, enabling faster parameter tuning and error diagnosis.

Link: https://arxiv.org/abs/2504.13587
Authors: Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 2 tables

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) pipelines have become the de-facto approach for building AI assistants with access to external, domain-specific knowledge. Given a user query, RAG pipelines typically first retrieve (R) relevant information from external sources, before invoking a Large Language Model (LLM), augmented (A) with this information, to generate (G) responses. Modern RAG pipelines frequently chain multiple retrieval and generation components, in any order. However, developing effective RAG pipelines is challenging because retrieval and generation components are intertwined, making it hard to identify which component(s) cause errors in the eventual output. The parameters with the greatest impact on output quality often require hours of pre-processing after each change, creating prohibitively slow feedback cycles. To address these challenges, we present RAGGY, a developer tool that combines a Python library of composable RAG primitives with an interactive interface for real-time debugging. We contribute the design and implementation of RAGGY, insights into expert debugging patterns through a qualitative study with 12 engineers, and design implications for future RAG tools that better align with developers’ natural workflows.
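
The notion of composable RAG primitives can be illustrated with two tiny functions whose intermediate outputs stay inspectable, which is what makes stepwise debugging possible. This sketch is not RAGGY's actual API; the `retrieve`/`generate` signatures and the retriever and LLM stand-ins are deliberately trivial:

```python
from typing import Callable, List

def retrieve(index: dict, query: str, k: int = 2) -> List[str]:
    # toy lexical retriever: rank documents by word overlap with the query
    scored = sorted(index.items(),
                    key=lambda kv: -len(set(query.split()) & set(kv[1].split())))
    return [doc for _, doc in scored[:k]]

def generate(llm: Callable[[str], str], query: str, docs: List[str]) -> str:
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}"
    return llm(prompt)

def rag_pipeline(index, llm, query):
    docs = retrieve(index, query)       # R: retrieval step, inspectable on its own
    return generate(llm, query, docs)   # A+G: augmented generation

index = {"d1": "retrieval augmented generation uses external context",
         "d2": "unrelated text about cooking"}
echo_llm = lambda p: p[:60] + "..."     # stand-in for a real LLM call
print(rag_pipeline(index, echo_llm, "what is retrieval augmented generation"))
```

Because each stage is a separate primitive, a developer can re-run or swap one stage without re-running the whole pipeline, which is the workflow pattern the paper's interactive interface builds on.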

[AI-14] MetaDSE: A Few-shot Meta-learning Framework for Cross-workload CPU Design Space Exploration

[Quick Read]: This paper addresses the challenges of overfitting, data ambiguity, and workload dissimilarity in cross-workload CPU design space exploration (DSE). To tackle them, it reframes cross-workload CPU DSE as a few-shot meta-learning problem and proposes MetaDSE. The key is to use model-agnostic meta-learning (MAML) so that the model adapts quickly to new target workloads, greatly improving the efficiency of cross-workload DSE. MetaDSE also introduces a novel knowledge-transfer method, the workload-adaptive architectural mask algorithm, which uncovers intrinsic properties of the architecture. Experiments on SPEC CPU 2017 show that MetaDSE reduces prediction error by 44.3% compared with the state of the art.

Link: https://arxiv.org/abs/2504.13568
Authors: Runzhen Xue, Hao Wu, Mingyu Yan, Ziheng Xiao, Xiaochun Ye, Dongrui Fan
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures. Accepted by DAC 2025

Click to view abstract

Abstract:Cross-workload design space exploration (DSE) is crucial in CPU architecture design. Existing DSE methods typically employ the transfer learning technique to leverage knowledge from source workloads, aiming to minimize the requirement of target workload simulation. However, these methods struggle with overfitting, data ambiguity, and workload dissimilarity. To address these challenges, we reframe the cross-workload CPU DSE task as a few-shot meta-learning problem and further introduce MetaDSE. By leveraging model agnostic meta-learning, MetaDSE swiftly adapts to new target workloads, greatly enhancing the efficiency of cross-workload CPU DSE. Additionally, MetaDSE introduces a novel knowledge transfer method called the workload-adaptive architectural mask algorithm, which uncovers the inherent properties of the architecture. Experiments on SPEC CPU 2017 demonstrate that MetaDSE significantly reduces prediction error by 44.3% compared to the state-of-the-art. MetaDSE is open-sourced and available at an anonymous GitHub repository (this https URL).
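
Model-agnostic meta-learning itself is compact: adapt a copy of the parameters on a support set, then update the meta-parameters from the query loss of the adapted copy. A toy second-order MAML loop on sine regression, with tasks standing in for workloads; nothing here is MetaDSE's code:

```python
import torch

def mlp(p, x):  # tiny functional MLP so we can differentiate through adaptation
    return torch.tanh(x @ p[0] + p[1]) @ p[2] + p[3]

p = [(torch.randn(1, 32) * 0.5).requires_grad_(),
     torch.zeros(32, requires_grad=True),
     (torch.randn(32, 1) * 0.5).requires_grad_(),
     torch.zeros(1, requires_grad=True)]
opt = torch.optim.Adam(p, lr=1e-3)

def task():  # each "task" plays the role of one workload
    x = torch.rand(16, 1) * 6 - 3
    a = torch.rand(1) * 2 + 0.5
    return x, a * torch.sin(x)

for it in range(200):
    x, y = task()
    xs, ys, xq, yq = x[:8], y[:8], x[8:], y[8:]   # support / query split
    # inner step: one gradient step on the support set (graph kept for MAML)
    inner = torch.autograd.grad(((mlp(p, xs) - ys) ** 2).mean(), p,
                                create_graph=True)
    p_fast = [w - 0.05 * g for w, g in zip(p, inner)]
    # outer step: query loss of the adapted parameters updates the meta-params
    meta_loss = ((mlp(p_fast, xq) - yq) ** 2).mean()
    opt.zero_grad(); meta_loss.backward(); opt.step()
print(float(meta_loss))
```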

[AI-15] Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

[Quick Read]: This paper studies how Transformers can approximate the class of Hölder continuous functions and overcome the curse of dimensionality. The key is the construction of Transformers consisting of a single one-head self-attention layer with softmax activation plus several feedforward layers. Using activations such as ReLU and floor, only a logarithmic number of feedforward layers is needed, with widths depending on the approximation accuracy and the smoothness of the function; if other activations are allowed, the width can be reduced further to a constant. The construction is based on the Kolmogorov-Arnold representation theorem and does not require the notion of contextual mapping, making the proof more intuitive than previous Transformer approximation works; a translation technique also carries existing approximation results for feedforward networks over to the study of Transformers.

Link: https://arxiv.org/abs/2504.13558
Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_Q^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ layers of feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Representation Theorem and does not require the concept of contextual mapping, hence our proof is more intuitively clear compared to previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of feedforward neural networks to Transformer research.

[AI-16] ask Assignment and Exploration Optimization for Low Altitude UAV Rescue via Generative AI Enhanced Multi-agent Reinforcement Learning

[Quick Read]: This paper addresses the instability that arises when low-altitude UAVs performing rescue, inspection, and surveillance in unknown environments face computational demands exceeding a single UAV's capacity, compounded by the limited and dynamic resources of ground computing nodes (GCNs). It proposes a cooperation framework involving UAVs, ground-embedded robots (GERs), and high-altitude platforms (HAPs), in which UAV-to-GER (U2G) and UAV-to-HAP (U2H) communication enables resource pooling and computing services for offloaded tasks. The key is to formulate task assignment and exploration optimization as a dynamic long-term optimization problem that minimizes task completion time and energy consumption while keeping the system stable: Lyapunov optimization first transforms the original stability-constrained problem into a per-slot deterministic one, and an algorithm named HG-MADDPG then combines the Hungarian algorithm, used for exploration area selection to improve the UAVs' interaction with the environment, with a generative diffusion model (GDM)-based multi-agent deep deterministic policy gradient (MADDPG) method that optimizes task assignment decisions such as task offloading and resource allocation. Simulations show clear improvements in offloading efficiency, latency reduction, and system stability over baseline methods.

Link: https://arxiv.org/abs/2504.13554
Authors: Xin Tang, Qian Chen, Wenjie Weng, Chao Jin, Zhang Liu, Jiacheng Wang, Geng Sun, Xiaohuan Li, Dusit Niyato
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Artificial Intelligence (AI)-driven convolutional neural networks enhance rescue, inspection, and surveillance tasks performed by low-altitude uncrewed aerial vehicles (UAVs) and ground computing nodes (GCNs) in unknown environments. However, their high computational demands often exceed a single UAV’s capacity, leading to system instability, further exacerbated by the limited and dynamic resources of GCNs. To address these challenges, this paper proposes a novel cooperation framework involving UAVs, ground-embedded robots (GERs), and high-altitude platforms (HAPs), which enable resource pooling through UAV-to-GER (U2G) and UAV-to-HAP (U2H) communications to provide computing services for UAV offloaded tasks. Specifically, we formulate the multi-objective optimization problem of task assignment and exploration optimization in UAVs as a dynamic long-term optimization problem. Our objective is to minimize task completion time and energy consumption while ensuring system stability over time. To achieve this, we first employ the Lyapunov optimization technique to transform the original problem, with stability constraints, into a per-slot deterministic problem. We then propose an algorithm named HG-MADDPG, which combines the Hungarian algorithm with a generative diffusion model (GDM)-based multi-agent deep deterministic policy gradient (MADDPG) approach. We first introduce the Hungarian algorithm as a method for exploration area selection, enhancing UAV efficiency in interacting with the environment. We then innovatively integrate the GDM and multi-agent deep deterministic policy gradient (MADDPG) to optimize task assignment decisions, such as task offloading and resource allocation. Simulation results demonstrate the effectiveness of the proposed approach, with significant improvements in task offloading efficiency, latency reduction, and system stability compared to baseline methods.

[AI-17] SwitchMT: An Adaptive Context Switching Methodology for Scalable Multi-Task Learning in Intelligent Autonomous Agents

[Quick Read]: This paper addresses multi-task adaptive learning for intelligent autonomous agents (such as mobile robots) in dynamic real-world environments. Current state-of-the-art reinforcement learning methods excel only in single-task settings, generalize poorly across tasks because of task interference, and must additionally handle streaming data. The key ideas of the proposed SwitchMT, a novel adaptive task-switching methodology for RL-based multi-task learning, are: (1) a Deep Spiking Q-Network with active dendrites and a dueling structure that uses task-specific context signals to form specialized sub-networks; and (2) an adaptive task-switching policy that exploits both rewards and the internal dynamics of network parameters. Experiments show that SwitchMT outperforms state-of-the-art multi-task learning methods, achieving competitive scores on several Atari games (Pong: -8.8, Breakout: 5.6, Enduro: 355.2) and demonstrating better generalized learning. These results confirm that SwitchMT mitigates task interference while automating multi-task learning through adaptive task switching, paving the way for more efficient generalist agents with scalable multi-task learning.

Link: https://arxiv.org/abs/2504.13541
Authors: Avaneesh Devkota, Rachmad Vidya Wicaksana Putra, Muhammad Shafique
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages, 7 figures, 3 tables

Click to view abstract

Abstract:The ability to train intelligent autonomous agents (such as mobile robots) on multiple tasks is crucial for adapting to dynamic real-world environments. However, state-of-the-art reinforcement learning (RL) methods only excel in single-task settings, and still struggle to generalize across multiple tasks due to task interference. Moreover, real-world environments also demand the agents to have data stream processing capabilities. Toward this, a state-of-the-art work employs Spiking Neural Networks (SNNs) to improve multi-task learning by exploiting temporal information in data stream, while enabling low-power/energy event-based operations. However, it relies on fixed context/task-switching intervals during its training, hence limiting the scalability and effectiveness of multi-task learning. To address these limitations, we propose SwitchMT, a novel adaptive task-switching methodology for RL-based multi-task learning in autonomous agents. Specifically, SwitchMT employs the following key ideas: (1) a Deep Spiking Q-Network with active dendrites and dueling structure, that utilizes task-specific context signals to create specialized sub-networks; and (2) an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves superior performance in multi-task learning compared to state-of-the-art methods. It achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) compared to the state-of-the-art, showing its better generalized learning capability. These results highlight the effectiveness of our SwitchMT methodology in addressing task interference while enabling multi-task learning automation through adaptive task switching, thereby paving the way for more efficient generalist agents with scalable multi-task learning capabilities.

[AI-18] Deep Learning Models Meet Financial Data Modalities

[Quick Read]: This paper addresses the underuse of deep learning for structured financial data, in particular for signal extraction and predictive performance in algorithmic trading. Its goal is to combine deep learning models with multiple financial data modalities to improve prediction for trading strategies and portfolio management.

The key of the solution is a novel way of incorporating limit order book (LOB) analysis into algorithmic trading: embedding techniques are developed, and sequential snapshots of the limit order book are treated as distinct input channels of an image-based representation. This processing of limit order book data achieves state-of-the-art performance in high-frequency trading algorithms, underscoring the potential and effectiveness of deep learning in financial applications.

Link: https://arxiv.org/abs/2504.13521
Authors: Kasymkhan Khubiev, Michail Semenov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Statistical Finance (q-fin.ST)
Comments: 15 pages, 14 images, 7 tables

Click to view abstract

Abstract:Algorithmic trading relies on extracting meaningful signals from diverse financial data sources, including candlestick charts, order statistics on put and canceled orders, traded volume data, limit order books, and news flow. While deep learning has demonstrated remarkable success in processing unstructured data and has significantly advanced natural language processing, its application to structured financial data remains an ongoing challenge. This study investigates the integration of deep learning models with financial data modalities, aiming to enhance predictive performance in trading strategies and portfolio optimization. We present a novel approach to incorporating limit order book analysis into algorithmic trading by developing embedding techniques and treating sequential limit order book snapshots as distinct input channels in an image-based representation. Our methodology for processing limit order book data achieves state-of-the-art performance in high-frequency trading algorithms, underscoring the effectiveness of deep learning in financial applications.
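
Treating sequential LOB snapshots as channels of an image-like tensor can be sketched in a few lines. The (bid price, bid size, ask price, ask size) layout, the depth, and the normalization below are assumptions for illustration, not the paper's exact preprocessing:

```python
import numpy as np

def lob_to_tensor(snapshots, depth=10):
    """Stack sequential limit order book snapshots as image-like channels.
    snapshots: list of (bid_prices, bid_sizes, ask_prices, ask_sizes) arrays.
    Returns an array of shape (T, 4, depth): T snapshots, each a 4 x depth
    'image' that a CNN can consume as stacked channels."""
    frames = []
    for bid_p, bid_s, ask_p, ask_s in snapshots:
        frames.append(np.stack([bid_p[:depth], bid_s[:depth],
                                ask_p[:depth], ask_s[:depth]]))
    x = np.stack(frames)                       # (T, 4, depth)
    return (x - x.mean()) / (x.std() + 1e-8)   # simple global normalization

T, depth = 8, 10
snaps = [(np.sort(np.random.rand(depth))[::-1], np.random.rand(depth),
          np.sort(np.random.rand(depth)), np.random.rand(depth))
         for _ in range(T)]
print(lob_to_tensor(snaps).shape)  # (8, 4, 10)
```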

[AI-19] Optimizing Electric Vehicle Charging Station Locations: A Data-driven System with Multi-source Fusion

[Quick Read]: This paper addresses the challenges urban planners face in optimally siting electric vehicle (EV) charging stations, notably range anxiety during long-distance travel and the inadequate distribution of residential charging stations. The key is a data-driven system built on existing EV trips in New South Wales (NSW), Australia, which integrates multiple factors that enhance the geographical feasibility of recommended charging stations: EV trip data, geographical data such as routes and Local Government Area (LGA) boundaries, and features such as fire and flood risks and Points of Interest (POIs). Through multi-source data fusion and visualization, the system produces reasonable estimates and deployments of charging demand, validated through case studies, to guide where future EV charging stations should be placed.

Link: https://arxiv.org/abs/2504.13517
Authors: Lihuan Li, Du Yin, Hao Xue, David Lillo-Trynes, Flora Salim
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 4-page short paper

Click to view abstract

Abstract:With the growing electric vehicles (EVs) charging demand, urban planners face the challenges of providing charging infrastructure at optimal locations. For example, range anxiety during long-distance travel and the inadequate distribution of residential charging stations are the major issues many cities face. To achieve reasonable estimation and deployment of the charging demand, we develop a data-driven system based on existing EV trips in New South Wales (NSW) state, Australia, incorporating multiple factors that enhance the geographical feasibility of recommended charging stations. Our system integrates data sources including EV trip data, geographical data such as route data and Local Government Area (LGA) boundaries, as well as features like fire and flood risks, and Points of Interest (POIs). We visualize our results to intuitively demonstrate the findings from our data-driven, multi-source fusion system, and evaluate them through case studies. The outcome of this work can provide a platform for discussion to develop new insights that could be used to give guidance on where to position future EV charging stations.
zh

[AI-20] Large Language Models for Validating Network Protocol Parsers

【速读】:该论文旨在解决网络协议解析器实现与官方协议标准之间一致性验证的挑战。传统方法要么需要大量人工参与,要么忽略协议标准,难以有效检测语义违规问题。论文的关键创新在于提出PARVAL,这是一个基于大型语言模型(Large Language Models, LLMs)的多智能体框架。PARVAL通过利用LLMs理解自然语言和代码的能力,将协议标准和其对应的实现转换为统一的中间表示——格式规范(format specifications),并通过差分比较揭示不一致之处。这种自动化方法显著提升了验证效率,并在实验中成功识别出多个未被发现的潜在漏洞。
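
PARVAL 的"格式规范差分比较"可用如下极简示意理解:把协议标准与实现分别抽取为"字段 → 约束"的结构化描述,再逐字段比对。以下字段与取值均为本文虚构的示例,并非 BFD 协议的真实规范:

```python
# 从 RFC 文本抽取出的格式规范(示意,字段为虚构示例)
spec_from_rfc  = {"version": {"bits": 3, "range": (1, 1)},
                  "detect_mult": {"bits": 8, "range": (1, 255)}}
# 从源码抽取出的格式规范(示意)
spec_from_code = {"version": {"bits": 3, "range": (0, 7)},
                  "detect_mult": {"bits": 8, "range": (1, 255)}}

def diff_specs(rfc: dict, impl: dict) -> list:
    """逐字段比较两份格式规范,返回不一致报告。"""
    issues = []
    for field, rule in rfc.items():
        if field not in impl:
            issues.append(f"{field}: 实现中缺失")
        elif rule != impl[field]:
            issues.append(f"{field}: RFC={rule} 实现={impl[field]}")
    return issues

print(diff_specs(spec_from_rfc, spec_from_code))
# -> version 字段约束不一致,提示实现未校验其合法取值
```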

链接: https://arxiv.org/abs/2504.13515
作者: Mingwei Zheng,Danning Xie,Xiangyu Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Network protocol parsers are essential for enabling correct and secure communication between devices. Bugs in these parsers can introduce critical vulnerabilities, including memory corruption, information leakage, and denial-of-service attacks. An intuitive way to assess parser correctness is to compare the implementation with its official protocol standard. However, this comparison is challenging because protocol standards are typically written in natural language, whereas implementations are in source code. Existing methods like model checking, fuzzing, and differential testing have been used to find parsing bugs, but they either require significant manual effort or ignore the protocol standards, limiting their ability to detect semantic violations. To enable more automated validation of parser implementations against protocol standards, we propose PARVAL, a multi-agent framework built on large language models (LLMs). PARVAL leverages the capabilities of LLMs to understand both natural language and code. It transforms both protocol standards and their implementations into a unified intermediate representation, referred to as format specifications, and performs a differential comparison to uncover inconsistencies. We evaluate PARVAL on the Bidirectional Forwarding Detection (BFD) protocol. Our experiments demonstrate that PARVAL successfully identifies inconsistencies between the implementation and its RFC standard, achieving a low false positive rate of 5.6%. PARVAL uncovers seven unique bugs, including five previously unknown issues.
zh

[AI-21] Statistical Validation in Cultural Adaptations of Cognitive Tests: A Multi-Regional Systematic Review

【速读】:该论文试图解决跨文化背景下认知评估工具方法学适应性的问题,重点探讨如何通过系统化的方法确保这些工具在不同文化和语言环境中的有效性和可靠性。解决方案的关键在于采用整体模型(holistic models)来应对人口统计学变化,并强调社区反馈的重要性,同时结合标准化翻译协议和严格的统计验证方法,以实现文化适配过程的科学性和严谨性。例如,文中提到教育水平可以解释MoCA-H得分变异的26.76%,而文化-语言因素在欧洲MoCA-H适应中解释了6.89%的变异。此外,使用曼彻斯特翻译评估清单(Manchester Translation Evaluation Checklist)评价文化适应时达到了78.5%的评分者间一致性,进一步证明了综合方法的必要性。

链接: https://arxiv.org/abs/2504.13495
作者: Miit Daga,Priyasha Mohanty,Ram Krishna,Swarna Priya RM
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: This paper is accepted and presented in the International Conference Challenges & Opportunities in Artificial Intelligence: Engineering & Management Applications (COAIEMA 2025) and to be published in Taylor & Francis Proceedings

点击查看摘要

Abstract:This systematic review discusses the methodological approaches and statistical confirmations of cross-cultural adaptations of cognitive evaluation tools used with different populations. The review considers six seminal studies on the methodology of cultural adaptation in Europe, Asia, Africa, and South America. The results indicate that proper adaptations need holistic models with demographic changes, and education explained as much as 26.76% of the variance in MoCA-H scores. Cultural-linguistic factors explained 6.89% of the variance in European adaptations of MoCA-H; however, another study on adapted MMSE and BCSB among Brazilian Indigenous populations reported excellent diagnostic performance, with a sensitivity of 94.4% and specificity of 99.2%. There was 78.5% inter-rater agreement on the evaluation of cultural adaptation using the Manchester Translation Evaluation Checklist. A paramount message of the paper is that community feedback is necessary for culturally appropriate preparation, standardized translation protocols also must be included, along with robust statistical validation methodologies for developing cognitive assessment instruments. This review supplies evidence-based frameworks for the further adaptation of cognitive assessments in increasingly diverse global health settings.
zh

[AI-22] Creating Full-Stack Hybrid Reasoning Systems that Prioritize and Enhance Human Intelligence

【速读】:该论文试图解决在混合智能(Hybrid Intelligence)框架下,如何通过结合人类与人工智能的能力,弥补人类推理中的缺陷与局限性,以应对未来社会面临的挑战。论文指出,尽管当前研究主要集中在优化人工智能方面,但更紧迫的需求在于提升人类的批判性思维、创造力及智慧。解决方案的关键在于开发基于生成式 AI (Generative AI) 的工具,这些工具不仅能增强人类对问题的反思能力,还能促进对技术细节的探索。此外,论文提出了一种高层次模型,旨在以集中化的人类参与和控制为核心,整合人工智能与人类能力。

链接: https://arxiv.org/abs/2504.13477
作者: Sean Koon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages; 3 figures; 1 table

点击查看摘要

Abstract:The idea of augmented or hybrid intelligence offers a compelling vision for combining human and AI capabilities, especially in tasks where human wisdom, expertise, or common sense are essential. Unfortunately, human reasoning can be flawed and shortsighted, resulting in adverse individual impacts or even long-term societal consequences. While strong efforts are being made to develop and optimize the AI aspect of hybrid reasoning, the real urgency lies in fostering wiser and more intelligent human participation. Tools that enhance critical thinking, ingenuity, expertise, and even wisdom could be essential in addressing the challenges of our emerging future. This paper proposes the development of generative AI-based tools that enhance both the human ability to reflect upon a problem as well as the ability to explore the technical aspects of it. A high-level model is also described for integrating AI and human capabilities in a way that centralizes human participation and control.
zh

[AI-23] Ascribe New Dimensions to Scientific Data Visualization with VR

【速读】:该论文旨在解决传统计算机鼠标及2D可视化方法在探索复杂多尺度科学图像时的局限性问题,特别是对于固有三维结构的直观分析挑战。论文提出的关键解决方案是ASCRIBE-VR平台,这是一个集成了AI驱动算法与科学图像的虚拟现实(Virtual Reality, VR)工具。其关键是通过将基于AI的分割结果与迭代反馈过程无缝集成到VR环境中,支持对大规模三维图像的沉浸式浏览和交互式分析,从而增强材料研究中计算分析与人类直觉之间的桥梁,并实现人机协同与数字孪生的连接。

链接: https://arxiv.org/abs/2504.13448
作者: Daniela Ushizima,Guilherme Melo dos Santos,Zineb Sordo,Ronald Pandolfi,Jeffrey Donatelli
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:For over half a century, the computer mouse has been the primary tool for interacting with digital data, yet it remains a limiting factor in exploring complex, multi-scale scientific images. Traditional 2D visualization methods hinder intuitive analysis of inherently 3D structures. Virtual Reality (VR) offers a transformative alternative, providing immersive, interactive environments that enhance data comprehension. This article introduces ASCRIBE-VR, a VR platform of Autonomous Solutions for Computational Research with Immersive Browsing \ Exploration, which integrates AI-driven algorithms with scientific images. ASCRIBE-VR enables multimodal analysis, structural assessments, and immersive visualization, supporting scientific visualization of advanced datasets such as X-ray CT, Magnetic Resonance, and synthetic 3D imaging. Our VR tools, compatible with Meta Quest, can consume the output of our AI-based segmentation and iterative feedback processes to enable seamless exploration of large-scale 3D images. By merging AI-generated results with VR visualization, ASCRIBE-VR enhances scientific discovery, bridging the gap between computational analysis and human intuition in materials research, connecting human-in-the-loop with digital twins.
zh

[AI-24] Trust but verify

【速读】:该论文旨在解决在去中心化人工智能代理网络中,如何确保节点运行授权且正确的大型语言模型(LLMs)以维持服务质量的问题。论文的关键解决方案是通过同行节点间的社会共识机制,检测运行未经授权或错误 LLM 的节点。此外,论文还提出了一种互主体验证系统(intersubjective validation system),并将其实现为 EigenLayer 主动验证服务(AVS),通过引入金融激励与惩罚措施,鼓励节点诚实运行 LLM。

链接: https://arxiv.org/abs/2504.13443
作者: Michael J. Yuan,Carlos Campoy,Sydney Lai,James Snewin,Ju Long
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Decentralized AI agent networks, such as Gaia, allow individuals to run customized LLMs on their own computers and then provide services to the public. However, in order to maintain service quality, the network must verify that individual nodes are running their designated LLMs. In this paper, we demonstrate that in a cluster of mostly honest nodes, we can detect nodes that run unauthorized or incorrect LLM through social consensus of their peers. We will discuss the algorithm and experimental data from the Gaia network. We will also discuss the intersubjective validation system, implemented as an EigenLayer AVS to introduce financial incentives and penalties to encourage honest behavior from LLM nodes.
zh

[AI-25] Bounded and Uniform Energy-based Out-of-distribution Detection for Graphs

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在检测节点级分布外(Out-of-Distribution, OOD)数据时的性能局限性问题。尽管近期提出的GNNSAFE框架通过负能量分数的聚合显著提升了GNNs的OOD检测能力,但研究表明,由于负能量分数和logit偏移的无界性,节点间分数聚合容易受到极端值的影响,从而严重限制了检测精度。为应对这一挑战,论文提出NODESAFE方法,其关键在于通过引入两个优化项,使负能量分数有界并缓解logit偏移,从而减少节点极端分数的生成。实验结果表明,该方法在节点级OOD数据检测任务中表现优异,例如,在结构操纵诱导的OOD检测场景下,FPR95指标在有/无OOD数据暴露的情况下分别较当前最优方法降低了28.4%和22.7%。
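
GNNSAFE 一类方法所用的(负)能量分数由 logits 的 log-sum-exp 给出,天然无界;下面用 PyTorch 给出该基础分数的最小计算示意,帮助理解"极端值主导聚合"的问题(NODESAFE 所加的两个优化项在摘要中未给出闭式,此处不做实现):

```python
import torch

def negative_energy(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """标准能量分数 E(x) = -T * logsumexp(logits / T);此处返回负能量 -E(x)。
    分数随 logits 线性增长、无上下界,节点间聚合时易被极端节点主导。"""
    return T * torch.logsumexp(logits / T, dim=-1)

logits = torch.tensor([[2.0, 1.0, 0.5],      # 正常节点
                       [50.0, -3.0, -8.0]])  # logit 偏移极端的节点
print(negative_energy(logits))  # 第二个节点的分数远大于第一个
```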

链接: https://arxiv.org/abs/2504.13429
作者: Shenzhi Yang,Bin Liang,An Liu,Lin Gui,Xingkai Yao,Xiaofang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2302.02914 by other authors

点击查看摘要

Abstract:Given the critical role of graphs in real-world applications and their high-security requirements, improving the ability of graph neural networks (GNNs) to detect out-of-distribution (OOD) data is an urgent research problem. The recent work GNNSAFE proposes a framework based on the aggregation of negative energy scores that significantly improves the performance of GNNs to detect node-level OOD data. However, our study finds that score aggregation among nodes is susceptible to extreme values due to the unboundedness of the negative energy scores and logit shifts, which severely limits the accuracy of GNNs in detecting node-level OOD data. In this paper, we propose NODESAFE: reducing the generation of extreme scores of nodes by adding two optimization terms that make the negative energy scores bounded and mitigate the logit shift. Experimental results show that our approach dramatically improves the ability of GNNs to detect OOD data at the node level, e.g., in detecting OOD data induced by Structure Manipulation, the metric of FPR95 (lower is better) in scenarios without (with) OOD data exposure are reduced from the current SOTA by 28.4% (22.7%).
zh

[AI-26] he Impact of AI on the Cyber Offense-Defense Balance and the Character of Cyber Conflict

【速读】:该论文试图探讨人工智能(AI)在网络安全领域中对攻击与防御平衡的影响,并分析随着AI技术的进步,网络安全冲突与竞争的本质将如何变化。论文的关键在于综合已有的学术观点,归纳出九个支持网络攻击优势的论点和九个支持防御优势的论点,并进一步整合Healey、Jervis和Nandrajog分别收集的其他四十八个相关论点,评估这些论点在不同AI发展水平下的潜在变化。最终,论文得出结论:网络安全领域过于复杂,无法简单判断AI总体上会增强进攻还是防御能力;AI将在某些方面提升效率,在另一些方面造成阻碍,同时对部分领域影响有限,并总结出四十四种预期的AI影响方式。

链接: https://arxiv.org/abs/2504.13371
作者: Andrew J. Lohn
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Unlike other domains of conflict, and unlike other fields with high anticipated risk from AI, the cyber domain is intrinsically digital with a tight feedback loop between AI training and cyber application. Cyber may have some of the largest and earliest impacts from AI, so it is important to understand how the cyber domain may change as AI continues to advance. Our approach reviewed the literature, collecting nine arguments that have been proposed for offensive advantage in cyber conflict and nine proposed arguments for defensive advantage. We include an additional forty-eight arguments that have been proposed to give cyber conflict and competition its character as collected separately by Healey, Jervis, and Nandrajog. We then consider how each of those arguments and propositions might change with varying degrees of AI advancement. We find that the cyber domain is too multifaceted for a single answer to whether AI will enhance offense or defense broadly. AI will improve some aspects, hinder others, and leave some aspects unchanged. We collect and present forty-four ways that we expect AI to impact the cyber offense-defense balance and the character of cyber conflict and competition.
zh

[AI-27] An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning ICLR2025

【速读】:该论文试图解决离线强化学习(Offline RL)中因数据分布偏移导致的性能下降问题。现有方法在估计状态访问分布比率时存在不足,无法充分利用离线数据集。为了解决这一问题,论文提出了迭代式双重强化学习(Iterative Dual Reinforcement Learning, IDRL),其关键是通过一种迭代校正机制,逐步逼近离线数据集中的最优状态访问分布比率。具体而言,IDRL 在每次迭代中利用前一次迭代学到的比例去除零权重的次优转移,并在剩余子数据集上运行改进的双重重估方法,从而实现访问分布的优化。这种迭代过程相当于引入了一个逐步改进的课程学习策略,使得访问分布更接近于理想判别器权重,从而显著提升了算法在多种离线数据集上的性能与稳定性。
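
论文动机实验中的"判别器加权行为克隆"可按如下最小示意理解:先训练判别器 d(s,a) 区分专家数据与离线数据,再以 d/(1-d)(近似密度比)加权行为克隆损失。以下网络结构与数据维度均为本文假设:

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))   # 判别器 d(s,a)
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2)) # 策略网络

s = torch.randn(128, 4)      # 离线数据集的状态(假设维度为 4)
a = torch.randn(128, 2)      # 对应动作(假设维度为 2)

# 判别器输出"来自专家数据"的概率 d;权重 w = d / (1 - d) 近似专家/行为密度比
with torch.no_grad():
    d = torch.sigmoid(disc(torch.cat([s, a], dim=-1))).squeeze(-1)
    w = d / (1.0 - d).clamp_min(1e-6)

# 加权行为克隆:权重越大的转移在模仿中占比越高
bc_loss = (w * ((policy(s) - a) ** 2).mean(dim=-1)).mean()
bc_loss.backward()
```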

链接: https://arxiv.org/abs/2504.13368
作者: Haoran Xu,Shuozhe Li,Harshit Sikchi,Scott Niekum,Amy Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2025

点击查看摘要

Abstract:We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment in which we find training a discriminator using the offline dataset plus an additional expert dataset and then performing discriminator-weighted behavior cloning gives strong results on various types of datasets. That optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL, however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method to iteratively approach the optimal visitation distribution ratio in the offline dataset given no additional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically gives a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability, on all datasets.
zh

[AI-28] In between myth and reality: AI for math – a case study in category theory

【速读】:该论文试图探索人工智能系统在数学研究中的性能表现,并通过实验评估两个当代领先的AI系统在数学问题求解方面的辅助能力。论文的核心目标一是理解AI系统如何助力数学研究,二是为AI系统的开发者提供改进建议以明确优化方向。解决方案的关键在于设计针对性的实验,以客观评估AI系统的性能,并基于实验结果提炼具体的改进意见。

链接: https://arxiv.org/abs/2504.13360
作者: Răzvan Diaconescu
机构: 未知
类目: Artificial Intelligence (cs.AI); History and Overview (math.HO); Logic (math.LO)
备注:

点击查看摘要

Abstract:Recently, there is an increasing interest in understanding the performance of AI systems in solving math problems. A multitude of tests have been performed, with mixed conclusions. In this paper we discuss an experiment we have made in the direction of mathematical research, with two of the most prominent contemporary AI systems. One of the objectives of this experiment is to get an understanding of how AI systems can assist mathematical research. Another objective is to support the AI systems developers by formulating suggestions for directions of improvement.
zh

[AI-29] Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models ICRA2025

【速读】:该论文旨在解决从人类视频中学习操作任务时,因视觉数据无法捕捉任务执行过程中需动态调整的控制参数(如力)而导致的问题。解决方案的关键在于引入Chain-of-Modality (CoM) 提示策略,通过结合感知设备(如测量肌肉活动的臂环和记录声音的麦克风)获取多模态人类演示数据(视频与肌肉或音频信号耦合),使视觉语言模型能够逐步整合各模态信息,从而细化任务计划并生成详细的控制参数,使机器人能够基于单一多模态人类视频提示完成操作任务。实验结果表明,CoM 在提取任务计划和控制参数方面比基线方法提升了三倍准确性,并在真实机器人实验中展现出对新任务设置和物体的强大泛化能力。
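
Chain-of-Modality 的"逐模态递进整合"落到提示词层面,大致是每一步把一种新模态的观测文本化后拼入提示,让模型在上一步任务计划的基础上继续细化。以下提示模板与模态描述均为本文虚构的示意:

```python
def chain_of_modality_prompt(modality_notes: dict) -> str:
    """按 视觉 -> 肌电 -> 音频 的顺序逐步整合各模态信息,生成最终提示。"""
    prompt = "你是机器人操作规划助手。请根据以下多模态人类演示逐步细化任务计划。\n"
    for step, (modality, note) in enumerate(modality_notes.items(), 1):
        prompt += f"\n[第{step}步 - {modality}] {note}\n"
        prompt += "请基于以上全部信息更新任务计划与控制参数。\n"
    return prompt

notes = {"视觉": "人手握住瓶盖并旋转。",
         "肌电(臂环)": "旋转阶段肌肉活动显著升高,提示需要较大扭矩。",
         "音频": "听到两次'咔哒'声,提示旋转约两圈后瓶盖松开。"}
print(chain_of_modality_prompt(notes))
```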

链接: https://arxiv.org/abs/2504.13351
作者: Chen Wang,Fei Xia,Wenhao Yu,Tingnan Zhang,Ruohan Zhang,C. Karen Liu,Li Fei-Fei,Jie Tan,Jacky Liang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: ICRA 2025

点击查看摘要

Abstract:Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data – videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at this https URL
zh

[AI-30] On the Definition of Robustness and Resilience of AI Agents for Real-time Congestion Management

【速读】:该论文旨在解决欧盟《人工智能法案》中针对高风险领域提出的鲁棒性、韧性和安全性要求缺乏详细评估方法的问题。论文提出了一种新颖的框架,用于定量评估强化学习代理在拥堵管理中的鲁棒性和韧性。解决方案的关键在于利用AI友好的数字环境Grid2Op,通过扰动代理模拟自然与对抗性干扰,同时保持环境实际状态不变,从而在不同场景下评估人工智能系统的性能。鲁棒性通过稳定性与奖励影响指标衡量,而韧性则量化性能退化后的恢复能力。研究结果验证了该框架在识别漏洞以及提升关键应用中人工智能鲁棒性和韧性的有效性。

链接: https://arxiv.org/abs/2504.13314
作者: Timothy Tjhay,Ricardo J. Bessa,Jose Paulos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: IEEE PowerTech 2025 Conference

点击查看摘要

Abstract:The European Union’s Artificial Intelligence (AI) Act defines robustness, resilience, and security requirements for high-risk sectors but lacks detailed methodologies for assessment. This paper introduces a novel framework for quantitatively evaluating the robustness and resilience of reinforcement learning agents in congestion management. Using the AI-friendly digital environment Grid2Op, perturbation agents simulate natural and adversarial disruptions by perturbing the input of AI systems without altering the actual state of the environment, enabling the assessment of AI performance under various scenarios. Robustness is measured through stability and reward impact metrics, while resilience quantifies recovery from performance degradation. The results demonstrate the framework’s effectiveness in identifying vulnerabilities and improving AI robustness and resilience for critical applications.
zh

[AI-31] Enhanced Pruning Strategy for Multi-Component Neural Architectures Using Component-Aware Graph Analysis

【速读】:该论文致力于解决深度神经网络(Deep Neural Networks, DNNs)在资源受限环境中部署困难的问题,尤其是在多组件神经架构(Multi-Component Neural Architectures, MCNAs)中,传统基于参数依赖性分析的全面结构化剪枝框架可能导致网络功能完整性受损。论文的关键解决方案在于提出了一种组件感知的剪枝策略(component-aware pruning strategy),通过扩展依赖图(dependency graphs)将单个组件及其组件间流隔离,从而形成更小且针对性更强的剪枝组群,确保功能完整性的保留。实验表明,该方法在控制任务中实现了更高的稀疏性和更低的性能退化,为高效优化复杂的多组件DNNs开辟了新途径。
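
"组件感知剪枝"的要点是把剪枝组限制在单个组件内部,避免跨组件形成过大的参数组。下面是一个最小示意,按组件独立计算输出通道的 L1 范数并在组件内选择待剪通道(组件划分与剪枝率均为本文假设,并非论文的依赖图算法本身):

```python
import torch

def component_wise_prune_masks(components: dict, ratio: float = 0.3) -> dict:
    """components: {组件名: 该组件某层的权重张量 (out_ch, in_ch, k, k)}。
    在每个组件内部独立选出 L1 范数最小的通道,避免跨组件的大剪枝组。"""
    masks = {}
    for name, w in components.items():
        norms = w.abs().flatten(1).sum(dim=1)          # 每个输出通道的 L1 范数
        k = int(len(norms) * ratio)
        prune_idx = torch.argsort(norms)[:k]
        mask = torch.ones(len(norms), dtype=torch.bool)
        mask[prune_idx] = False
        masks[name] = mask                              # True = 保留该通道
    return masks

components = {"encoder": torch.randn(32, 16, 3, 3),
              "controller": torch.randn(16, 8, 3, 3)}
masks = component_wise_prune_masks(components)
print({k: int(v.sum()) for k, v in masks.items()})  # 各组件各自保留的通道数
```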

链接: https://arxiv.org/abs/2504.13296
作者: Ganesh Sundaram,Jonas Ulmen,Daniel Görges
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, IFAC J3C

点击查看摘要

Abstract:Deep neural networks (DNNs) deliver outstanding performance, but their complexity often prohibits deployment in resource-constrained settings. Comprehensive structured pruning frameworks based on parameter dependency analysis reduce model size with specific regard to computational performance. When applying them to Multi-Component Neural Architectures (MCNAs), they risk network integrity by removing large parameter groups. We introduce a component-aware pruning strategy, extending dependency graphs to isolate individual components and inter-component flows. This creates smaller, targeted pruning groups that conserve functional integrity. Demonstrated effectively on a control task, our approach achieves greater sparsity and reduced performance degradation, opening a path for optimizing complex, multi-component DNNs efficiently.
zh

[AI-32] Causal-Copilot: An Autonomous Causal Analysis Agent

【速读】:该论文旨在解决因果方法在实际应用中的可及性问题,即由于概念和算法的复杂性,领域专家难以利用因果学习的最新进展,同时因果研究者缺乏广泛的现实世界部署来测试和改进其方法。为了解决这一问题,论文提出的关键解决方案是开发Causal-Copilot,这是一个基于大型语言模型框架的自主代理,能够将专家级因果分析操作化。Causal-Copilot自动化处理表格数据和时间序列数据的完整因果分析流程,包括因果发现、因果推理、算法选择、超参数优化、结果解释以及生成可操作见解,并通过自然语言实现交互式细化,从而降低非专业人士的使用门槛,同时保持方法论严谨性。通过整合超过20种最先进的因果分析技术,该系统促进了良性循环,不仅扩展了领域专家访问高级因果方法的机会,还产生了丰富的实际应用场景,以推动因果理论的发展。实证评估表明,Causal-Copilot相比现有基线表现出色,提供了一个可靠、可扩展且可扩展的解决方案,弥合了因果分析中理论复杂性和实际适用性之间的差距。

链接: https://arxiv.org/abs/2504.13263
作者: Xinyue Wang,Kun Zhou,Wenyi Wu,Har Simrat Singh,Fang Nan,Songyao Jin,Aryan Philip,Saloni Patnaik,Hou Zhu,Shivam Singh,Parjanya Prashant,Qian Shen,Biwei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal analysis plays a foundational role in scientific discovery and reliable decision-making, yet it remains largely inaccessible to domain experts due to its conceptual and algorithmic complexity. This disconnect between causal methodology and practical usability presents a dual challenge: domain experts are unable to leverage recent advances in causal learning, while causal researchers lack broad, real-world deployment to test and refine their methods. To address this, we introduce Causal-Copilot, an autonomous agent that operationalizes expert-level causal analysis within a large language model framework. Causal-Copilot automates the full pipeline of causal analysis for both tabular and time-series data – including causal discovery, causal inference, algorithm selection, hyperparameter optimization, result interpretation, and generation of actionable insights. It supports interactive refinement through natural language, lowering the barrier for non-specialists while preserving methodological rigor. By integrating over 20 state-of-the-art causal analysis techniques, our system fosters a virtuous cycle – expanding access to advanced causal methods for domain experts while generating rich, real-world applications that inform and advance causal theory. Empirical evaluations demonstrate that Causal-Copilot achieves superior performance compared to existing baselines, offering a reliable, scalable, and extensible solution that bridges the gap between theoretical sophistication and real-world applicability in causal analysis.
zh

[AI-33] Recursive Deep Inverse Reinforcement Learning

【速读】:该论文旨在解决在非合作多智能体系统(如网络安全、军事及策略游戏领域)中实时推断对手目标的问题。现有基于最大熵原理的深度逆强化学习(Deep Inverse Reinforcement Learning, IRL)方法虽能有效恢复对手的目标,但通常依赖离线训练、大批次数据以及一阶更新,限制了其在实时场景中的应用。论文的关键在于提出了一种在线递归深度逆强化学习(Recursive Deep Inverse Reinforcement Learning, RDIRL)方法,通过采用序列二阶牛顿更新来最小化标准引导成本学习(Guided Cost Learning, GCL)目标函数的上界,类似于扩展卡尔曼滤波器(Extended Kalman Filter, EKF),从而实现快速收敛的学习算法。实验结果表明,RDIRL不仅能够有效恢复专家代理的成本和奖励函数,还在标准与对抗性基准任务中优于多种领先IRL算法。

链接: https://arxiv.org/abs/2504.13241
作者: Paul Ghanem,Michael Potter,Owen Howell,Pau Closas,Alireza Ramezani,Deniz Erdogmus,Robert Platt,Tales Imbiriba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inferring an adversary’s goals from exhibited behavior is crucial for counterplanning and non-cooperative multi-agent systems in domains like cybersecurity, military, and strategy games. Deep Inverse Reinforcement Learning (IRL) methods based on maximum entropy principles show promise in recovering adversaries’ goals but are typically offline, require large batch sizes with gradient descent, and rely on first-order updates, limiting their applicability in real-time scenarios. We propose an online Recursive Deep Inverse Reinforcement Learning (RDIRL) approach to recover the cost function governing the adversary actions and goals. Specifically, we minimize an upper bound on the standard Guided Cost Learning (GCL) objective using sequential second-order Newton updates, akin to the Extended Kalman Filter (EKF), leading to a fast (in terms of convergence) learning algorithm. We demonstrate that RDIRL is able to recover cost and reward functions of expert agents in standard and adversarial benchmark tasks. Experiments on benchmark tasks show that our proposed approach outperforms several leading IRL algorithms.
zh

[AI-34] Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning

【速读】:该论文旨在解决大规模迁移学习模型和数据集在适应性调整和存储优化方面的需求,当前方法主要依赖于实例级难度评估,忽视了类别级特征,导致少数类别的代表性不足。为了解决这一局限性,论文提出了一种名为非均匀类别级核心集选择(Non-Uniform Class-Wise Coreset Selection, NUCS)的新框架,其关键在于结合了类别级和实例级标准,通过自动分配每个类别的数据选择预算,并基于类别内在难度自适应地选择最优难度范围内的样本,从而构建一个更加平衡且具有代表性的核心集。这种方法不仅弥补了先前方法的关键缺陷,还通过理论分析验证了自适应预算分配和样本选择的合理性,并通过广泛的实验展示了NUCS在14个多样化数据集和模型架构上的持续改进,实现了优于现有技术的准确性和计算效率。例如,在CIFAR100和Food101数据集上,NUCS仅保留30%的样本即达到了与全数据训练相当的精度,同时将计算时间减少了60%。这项工作强调了在核心集选择中表征类别难度的重要性,为迁移学习提供了稳健且高效的数据解决方案。
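
NUCS 的第一步"按类别固有难度非均匀分配选样预算"可用如下示意理解(难度得分与分配公式均为本文假设的简化版本,论文的具体度量以原文为准):

```python
import numpy as np

def allocate_budgets(class_difficulty: np.ndarray, total_budget: int,
                     min_per_class: int = 1) -> np.ndarray:
    """难度越高的类别分到越多预算,同时保证每类至少 min_per_class 个样本。"""
    share = class_difficulty / class_difficulty.sum()
    budgets = np.maximum(min_per_class, np.round(share * total_budget)).astype(int)
    return budgets

difficulty = np.array([0.9, 0.4, 0.1, 0.6])   # 假设的 4 个类别的固有难度
print(allocate_budgets(difficulty, total_budget=100))
# -> [45 20  5 30]:少数/高难度类别不再被均匀选样稀释
```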

链接: https://arxiv.org/abs/2504.13234
作者: Hanyu Zhang,Zhen Xing,Wenxuan Yang,Chenxi Ma,Weimin Tan,Bo Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:As transfer learning models and datasets grow larger, efficient adaptation and storage optimization have become critical needs. Coreset selection addresses these challenges by identifying and retaining the most informative samples, constructing a compact subset for target domain training. However, current methods primarily rely on instance-level difficulty assessments, overlooking crucial category-level characteristics and consequently under-representing minority classes. To overcome this limitation, we propose Non-Uniform Class-Wise Coreset Selection (NUCS), a novel framework that integrates both class-level and instance-level criteria. NUCS automatically allocates data selection budgets for each class based on intrinsic category difficulty and adaptively selects samples within optimal difficulty ranges. By explicitly incorporating category-specific insights, our approach achieves a more balanced and representative coreset, addressing key shortcomings of prior methods. Comprehensive theoretical analysis validates the rationale behind adaptive budget allocation and sample selection, while extensive experiments across 14 diverse datasets and model architectures demonstrate NUCS’s consistent improvements over state-of-the-art methods, achieving superior accuracy and computational efficiency. Notably, on CIFAR100 and Food101, NUCS matches full-data training accuracy while retaining just 30% of samples and reducing computation time by 60%. Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
zh

[AI-35] Scaling Laws for Data-Efficient Visual Transfer Learning

【速读】:该论文致力于解决视觉人工智能模型在数据受限的下游任务中性能如何随数据规模扩展的问题,并探索知识蒸馏在有限数据条件下的有效性。论文的关键在于提出了首个针对视觉迁移学习的数据高效缩放律框架,并通过系统分析揭示了“蒸馏边界理论”。这一理论阐明了蒸馏效率的一个关键转折点:在数据稀缺条件下,蒸馏模型显著优于未蒸馏模型,能够有效利用继承的知识弥补训练样本的不足;而当预训练数据超过某一临界阈值时,未蒸馏模型逐渐超越蒸馏版本,表明从知识继承中获得的收益会随着任务特定数据的增加而递减。通过跨多种模型规模(2.5M 至 38M 参数)和数据量的实证验证,论文验证了这些性能转折点,从而重新定义了数据受限场景下的缩放规律,填补了大规模预训练与实际下游应用之间的重要知识空白。
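
"蒸馏边界"对应蒸馏与非蒸馏两条误差-数据量曲线的交点。下面的示意用幂律 err(N) = a·N^(-b) + c 分别拟合两组(虚构的)实验点,并在数据量网格上求交叉位置,仅演示方法本身:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, b, c):
    return a * N ** (-b) + c

N = np.array([1e3, 1e4, 1e5, 1e6])
err_distill = np.array([0.30, 0.22, 0.18, 0.165])   # 蒸馏模型(虚构数据)
err_plain = np.array([0.40, 0.25, 0.17, 0.150])     # 非蒸馏模型(虚构数据)

p_d, _ = curve_fit(power_law, N, err_distill, p0=[1.0, 0.3, 0.1], maxfev=10000)
p_p, _ = curve_fit(power_law, N, err_plain, p0=[1.0, 0.3, 0.1], maxfev=10000)

grid = np.logspace(3, 6, 1000)
gap = power_law(grid, *p_d) - power_law(grid, *p_p)   # 负值表示蒸馏占优
cross = grid[np.argmin(np.abs(gap))]
print(f"误差曲线交点(蒸馏边界)约在 N ≈ {cross:.0f}")
```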

链接: https://arxiv.org/abs/2504.13219
作者: Wenxuan Yang,Qingqu Wei,Chenxi Ma,Weimin Tan,Bo Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K-1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distillation counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrate these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, addressing a critical barrier to understanding vision model scaling behaviors and optimizing computational resource allocation.
zh

[AI-36] Harmony: A Unified Framework for Modality Incremental Learning

【速读】:该论文致力于解决跨连续演化的模态序列进行增量学习的可行性问题,特别是当数据来自全新模态时所面临的挑战。传统方法多集中于一致模态的单模态或跨模态增量学习,而本文提出了一种新的范式——模态增量学习(Modality Incremental Learning, MIL),其每个学习阶段涉及的数据来自不同的模态。
解决方案的关键在于提出的名为Harmony的框架,该框架通过自适应兼容特征调制(adaptive compatible feature modulation)和累积模态桥接(cumulative modal bridging)实现模态对齐与知识保留。这些组件通过构建历史模态特征、执行模态知识积累与对齐,协作弥合模态差异并保持知识传递,即使在每个学习阶段仅能获得单一模态数据的情况下也能有效工作。实验结果表明,所提方法显著优于现有增量学习方法,在MIL场景中验证了其有效性。

链接: https://arxiv.org/abs/2504.13218
作者: Yaguang Song,Xiaoshan Yang,Dongmei Jiang,Yaowei Wang,Changsheng Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Incremental learning aims to enable models to continuously acquire knowledge from evolving data streams while preserving previously learned capabilities. While current research predominantly focuses on unimodal incremental learning and multimodal incremental learning where the modalities are consistent, real-world scenarios often present data from entirely new modalities, posing additional challenges. This paper investigates the feasibility of developing a unified model capable of incremental learning across continuously evolving modal sequences. To this end, we introduce a novel paradigm called Modality Incremental Learning (MIL), where each learning stage involves data from distinct modalities. To address this task, we propose a novel framework named Harmony, designed to achieve modal alignment and knowledge retention, enabling the model to reduce the modal discrepancy and learn from a sequence of distinct modalities, ultimately completing tasks across multiple modalities within a unified framework. Our approach introduces the adaptive compatible feature modulation and cumulative modal bridging. Through constructing historical modal features and performing modal knowledge accumulation and alignment, the proposed components collaboratively bridge modal differences and maintain knowledge retention, even with solely unimodal data available at each learning stage. Extensive experiments on the MIL task demonstrate that our proposed method significantly outperforms existing incremental learning methods, validating its effectiveness in MIL scenarios.
zh

[AI-37] Graphical Models for Decision-Making: Integrating Causality and Game Theory

【速读】:该论文试图解决如何将因果关系(Causality)与博弈论(Game Theory)框架有效结合,并在概率图模型(Probabilistic Graphical Models)的背景下实现其实际应用的问题。论文的关键在于明确因果关系与博弈论交汇处的核心概念,通过严谨分析和直观示例阐明这些模型的输入需求及应用场景,帮助实践者理解如何根据不同场景选择和应用相关模型,同时引用现有研究支持其实现。这一工作旨在推动这些模型在现实世界中的更广泛应用。

链接: https://arxiv.org/abs/2504.13210
作者: Maarten C. Vonk,Mauricio Gonzalez Soto,Anna V. Kononova
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Probability (math.PR)
备注:

点击查看摘要

Abstract:Causality and game theory are two influential fields that contribute significantly to decision-making in various domains. Causality defines and models causal relationships in complex policy problems, while game theory provides insights into strategic interactions among stakeholders with competing interests. Integrating these frameworks has led to significant theoretical advancements with the potential to improve decision-making processes. However, practical applications of these developments remain underexplored. To support efforts toward implementation, this paper clarifies key concepts in game theory and causality that are essential to their intersection, particularly within the context of probabilistic graphical models. By rigorously examining these concepts and illustrating them with intuitive, consistent examples, we clarify the required inputs for implementing these models, provide practitioners with insights into their application and selection across different scenarios, and reference existing research that supports their implementation. We hope this work encourages broader adoption of these models in real-world scenarios.
zh

[AI-38] On the Feasibility of Using Multimodal LLMs to Execute AR Social Engineering Attacks

【速读】:该论文旨在研究利用多模态大语言模型(Multimodal Large Language Models, LLMs)驱动增强现实(Augmented Reality, AR)进行社会工程攻击的可行性,并首次提出了一种名为SEAR的框架来系统性地实现这一目标。论文的关键在于通过三个核心阶段构建和执行此类攻击:(1) 基于AR的社会上下文合成,融合视觉、听觉及环境线索等多模态输入;(2) 基于角色的多模态检索增强生成(Retrieval-Augmented Generation, RAG),动态检索并整合上下文数据以保持角色差异性;(3) 交互式社会工程代理(ReInteract),通过推理交互循环实施自适应多阶段攻击策略。这些关键步骤共同构成了一个完整的框架,用于探索AR与LLMs结合带来的新型社会工程威胁及其潜在影响。

链接: https://arxiv.org/abs/2504.13209
作者: Ting Bi,Chenghang Ye,Zheyu Yang,Ziyi Zhou,Cui Tang,Jun Zhang,Zui Tao,Kailong Wang,Liting Zhou,Yang Yang,Tianlong Yu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Augmented Reality (AR) and Multimodal Large Language Models (LLMs) are rapidly evolving, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for social engineering. In this paper, we systematically investigate the feasibility of orchestrating AR-driven Social Engineering attacks using Multimodal LLM for the first time, via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses Multimodal inputs (visual, auditory and environmental cues); (2) role-based Multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates contextual data while preserving character differentiation; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants in three experimental configurations (unassisted, AR+LLM, and full SEAR pipeline) compiling a new dataset of 180 annotated conversations in simulated social scenarios. Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants susceptible to email phishing). The framework was particularly effective in building trust, with 85% of targets willing to accept an attacker’s call after an interaction. Also, we identified notable limitations such as ``occasionally artificial’’ due to perceived authenticity gaps. This work provides proof-of-concept for AR-LLM driven social engineering attacks and insights for developing defensive countermeasures against next-generation augmented reality threats.
zh

[AI-39] On-Device Watermarking: A Socio-Technical Imperative For Authenticity In The Age of Generative AI ICLR2025

【速读】:该论文试图解决生成式 AI 输出检测与水印技术面临的局限性问题,主张当前研究方向(即专注于AI生成内容的水印)存在偏差,应转向基于可信内容的加密签名认证。论文指出,对于音频-视觉内容,真实世界的数据源自物理环境并通过硬件传感器捕获,这为在硬件层进行水印提供了独特机会。关键解决方案在于提出一种社会-技术框架,并借鉴HTTPS认证和Blu-Ray验证协议的经验,强调硬件层面的身份验证具有更强的可操作性和政策可行性。同时,论文警告过度依赖AI水印可能忽视其技术局限性,建议将研究资源更多集中在文本和大型语言模型(LLM)领域,因其不直接关联物理传感器。
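
论文主张的"对可信内容做密码学签名"可用如下最小流程示意:传感器捕获内容时即刻对其哈希并签名,下游任何一方都能验证内容未被篡改。这里用 Python 标准库的 HMAC 代替真实硬件安全模块中的非对称签名,仅演示流程:

```python
import hashlib
import hmac

SENSOR_KEY = b"key-provisioned-in-hardware"   # 示意:真实场景应为 HSM 中的私钥

def sign_capture(raw_bytes: bytes) -> str:
    """传感器捕获时即刻签名:对内容哈希做 HMAC。"""
    digest = hashlib.sha256(raw_bytes).digest()
    return hmac.new(SENSOR_KEY, digest, hashlib.sha256).hexdigest()

def verify_capture(raw_bytes: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign_capture(raw_bytes), signature)

frame = b"\x00\x01..."                     # 模拟一帧相机原始数据
sig = sign_capture(frame)
print(verify_capture(frame, sig))          # True:内容可信
print(verify_capture(frame + b"x", sig))   # False:内容被篡改或并非来自该传感器
```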

链接: https://arxiv.org/abs/2504.13205
作者: Houssam Kherraz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, ICLR 2025, this https URL

点击查看摘要

Abstract:As generative AI models produce increasingly realistic output, both academia and industry are focusing on the ability to detect whether an output was generated by an AI model or not. Many of the research efforts and policy discourse are centered around robust watermarking of AI outputs. While plenty of progress has been made, all watermarking and AI detection techniques face severe limitations. In this position paper, we argue that we are adopting the wrong approach, and should instead focus on watermarking, via cryptographic signatures, trustworthy content rather than AI generated ones. For audio-visual content, in particular, all real content is grounded in the physical world and captured via hardware sensors. This presents a unique opportunity to watermark at the hardware layer, and we lay out a socio-technical framework and draw parallels with HTTPS certification and Blu-Ray verification protocols. While acknowledging implementation challenges, we contend that hardware-based authentication offers a more tractable path forward, particularly from a policy perspective. As generative models approach perceptual indistinguishability, the research community should be wary of being overly optimistic with AI watermarking, and we argue that AI watermarking research efforts are better spent in the text and LLM space, which are ultimately not traceable to a physical sensor.
zh

[AI-40] Building Trustworthy Multimodal AI: A Review of Fairness, Transparency and Ethics in Vision-Language Tasks

【速读】:该论文旨在解决多模态人工智能(Multimodal Artificial Intelligence, M-AI)系统,特别是视觉-语言任务的可信性(Trustworthiness)问题,重点关注公平性(Fairness)、透明性(Transparency)和伦理影响(Ethical Implications)。论文通过对比分析视觉问答(Visual Question Answering, VQA)、图像描述生成(Image Captioning)和视觉对话(Visual Dialogue)等核心任务,揭示了这些系统在实际应用中的关键挑战,并总结了近年来的研究趋势、难点及前沿解决方案。

解决方案的关键在于从透明性、公平性和伦理角度出发,提出具体的技术手段和框架设计。透明性方面,通过注意力图(Attention Maps)和基于梯度的方法(Gradient-based Methods)提升模型的可解释性;公平性方面,强调在VQA和视觉对话系统中缓解数据或模型偏差,确保跨不同人口统计群体的结果无偏;伦理影响方面,重点解决多语言模型中的偏见问题,并保障数据处理的合规性。最终,论文呼吁将公平性、透明性和伦理考量统一整合到视觉-语言模型的开发框架中,以实现更可靠和负责任的人工智能系统。

链接: https://arxiv.org/abs/2504.13199
作者: Mohammad Saleha,Azadeh Tabatabaeib
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: This review explores the trustworthiness of multimodal artificial intelligence (AI) systems, specifically focusing on vision-language tasks. It addresses critical challenges related to fairness, transparency, and ethical implications in these systems, providing a comparative analysis of key tasks such as Visual Question Answering (VQA), image captioning, and visual dialogue. Background: Multimodal models, particularly vision-language models, enhance artificial intelligence (AI) capabilities by integrating visual and textual data, mimicking human learning processes. Despite significant advancements, the trustworthiness of these models remains a crucial concern, particularly as AI systems increasingly confront issues regarding fairness, transparency, and ethics. Methods: This review examines research conducted from 2017 to 2024 focusing on forenamed core vision-language tasks. It employs a comparative approach to analyze these tasks through the lens of trustworthiness, underlining fairness, explainability, and ethics. This study synthesizes findings from recent literature to identify trends, challenges, and state-of-the-art solutions. Results: Several key findings were highlighted. Transparency: Explainability of vision language tasks is important for user trust. Techniques, such as attention maps and gradient-based methods, have successfully addressed this issue. Fairness: Bias mitigation in VQA and visual dialogue systems is essential for ensuring unbiased outcomes across diverse demographic groups. Ethical Implications: Addressing biases in multilingual models and ensuring ethical data handling is critical for the responsible deployment of vision-language systems. Conclusion: This study underscores the importance of integrating fairness, transparency, and ethical considerations in developing vision-language models within a unified framework.
zh

[AI-41] Investigating cybersecurity incidents using large language models in latest-generation wireless networks

【速读】:本文旨在解决基于现代生成模型(Generative Models)检测网络安全事件、支持决策以及评估应对信息安全威胁措施有效性的问题。研究的关键在于通过模拟MIMO系统中的信号传播数据、合成对抗样本、执行针对机器学习模型的对抗攻击,以及对大规模语言模型(Large Language Models, LLMs)进行微调以检测这些对抗攻击,从而实现对网络安全事件的检测与分析。此外,还利用提示技术(Prompts Technique)解释检测决策,并首次通过大规模语言模型对数据投毒攻击进行了二分类,同时探索了LLMs在最新一代无线网络中调查网络安全事件的可能性。研究的关键解决方案是通过对模拟无线网络段的数据进行微调,比较了六种大规模语言模型在检测对抗攻击方面的性能,其中Gemma-7b模型在精确度(Precision=0.89)、召回率(Recall=0.89)和F1分数(F1-Score=0.89)方面表现最优,并展示了其在解释决策、分析特征重要性及提供缓解对抗攻击后果建议方面的显著潜力。

链接: https://arxiv.org/abs/2504.13196
作者: Leonid Legashev,Arthur Zhigalov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:The purpose of research: Detection of cybersecurity incidents and analysis of decision support and assessment of the effectiveness of measures to counter information security threats based on modern generative models. The methods of research: Emulation of signal propagation data in MIMO systems, synthesis of adversarial examples, execution of adversarial attacks on machine learning models, fine tuning of large language models for detecting adversarial attacks, explainability of decisions on detecting cybersecurity incidents based on the prompts technique. Scientific novelty: A binary classification of data poisoning attacks was performed using large language models, and the possibility of using large language models for investigating cybersecurity incidents in the latest generation wireless networks was investigated. The result of research: Fine-tuning of large language models was performed on the prepared data of the emulated wireless network segment. Six large language models were compared for detecting adversarial attacks, and the capabilities of explaining decisions made by a large language model were investigated. The Gemma-7b model showed the best results according to the metrics Precision = 0.89, Recall = 0.89 and F1-Score = 0.89. Based on various explainability prompts, the Gemma-7b model notes inconsistencies in the compromised data under study, performs feature importance analysis and provides various recommendations for mitigating the consequences of adversarial attacks. Large language models integrated with binary classifiers of network threats have significant potential for practical application in the field of cybersecurity incident investigation, decision support and assessing the effectiveness of measures to counter information security threats.
zh

[AI-42] Optimizing Multi-Gateway LoRaWAN via Cloud-Edge Collaboration and Knowledge Distillation

【速读】:该论文针对大规模多网关LoRaWAN网络中终端节点资源分配与决策效率低的问题,提出了一种基于边缘智能的云边协同资源分配与决策方法HEAT-LDL (HEAT-Local Distill Lyapunov)。其关键在于结合Actor-Critic架构与Lyapunov优化方法实现下行链路控制与网关负载均衡,并通过云边知识蒸馏在终端节点侧优化自主决策能力。当下行决策指令丢失时,终端节点利用学生模型与基于先验知识及本地历史的边缘决策器进行协作自主决策,从而显著提升数据包成功传输率与能量效率。
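
HEAT-LDL 的"云边知识蒸馏"环节可以用标准的带温度 KL 蒸馏损失来理解:终端侧学生模型拟合网关侧 HEAT 教师模型的决策分布。以下温度与张量形状均为本文假设,非论文给定参数:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 2.0):
    """带温度 T 的标准蒸馏损失 KL(teacher || student)。"""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

teacher_logits = torch.randn(32, 8)   # 教师(云端 HEAT)对 8 种参数配置的打分
student_logits = torch.randn(32, 8, requires_grad=True)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
```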

链接: https://arxiv.org/abs/2504.13194
作者: Hong Yang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:For large-scale multi-gateway LoRaWAN networks, this study proposes a cloud-edge collaborative resource allocation and decision-making method based on edge intelligence, HEAT-LDL (HEAT-Local Distill Lyapunov), which realizes collaborative decision-making between gateways and terminal nodes. HEAT-LDL combines the Actor-Critic architecture and the Lyapunov optimization method to achieve intelligent downlink control and gateway load balancing. When the signal quality is good, the network server uses the HEAT algorithm to schedule the terminal nodes. To improve the efficiency of autonomous decision-making of terminal nodes, HEAT-LDL performs cloud-edge knowledge distillation on the HEAT teacher model on the terminal node side. When the downlink decision instruction is lost, the terminal node uses the student model and the edge decider based on prior knowledge and local history to make collaborative autonomous decisions. Simulation experiments show that compared with the optimal results of all compared algorithms, HEAT-LDL improves the packet success rate and energy efficiency by 20.5% and 88.1%, respectively.
zh

[AI-43] HEAT: History-Enhanced Dual-phase Actor-Critic Algorithm with A Shared Transformer

【速读】:该论文旨在解决单网关 LoRaWAN 网络性能优化的问题。为实现这一目标,论文提出了一种基于历史增强的两阶段 Actor-Critic 算法与共享 Transformer 算法(History-Enhanced Two-phase Actor-Critic with Shared Transformer, HEAT)。HEAT 的关键创新在于同时考虑了上行参数和通常被忽略的下行参数,并有效结合了离线强化学习(Offline Reinforcement Learning)和在线强化学习(Online Reinforcement Learning),通过利用历史数据与实时交互提升模型性能。此外,论文还开发了一个开源的 LoRaWAN 网络仿真器 LoRaWANSim,用于支持实验验证。仿真结果表明,与现有算法的最佳表现相比,HEAT 将数据包成功率提升了 15%,并将能量效率提高了 95%。
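
"共享 Transformer 的 Actor-Critic"在结构上即一个 Transformer 编码器同时为策略头与价值头提供表征;以下为结构层面的最小示意(维度、头数与层数均为本文假设的示例超参数,并非论文配置):

```python
import torch
import torch.nn as nn

class SharedTransformerAC(nn.Module):
    def __init__(self, obs_dim=16, d_model=64, n_actions=8):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # 共享主干
        self.actor = nn.Linear(d_model, n_actions)   # 输出参数配置的动作分布
        self.critic = nn.Linear(d_model, 1)          # 输出状态价值

    def forward(self, history):                      # history: (B, T, obs_dim)
        h = self.encoder(self.embed(history))[:, -1]  # 取最后一个时间步的表征
        return self.actor(h), self.critic(h)

model = SharedTransformerAC()
logits, value = model(torch.randn(4, 10, 16))
print(logits.shape, value.shape)  # torch.Size([4, 8]) torch.Size([4, 1])
```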

链接: https://arxiv.org/abs/2504.13193
作者: Hong Yang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:For a single-gateway LoRaWAN network, this study proposed a history-enhanced two-phase actor-critic algorithm with a shared transformer algorithm (HEAT) to improve network performance. HEAT considers uplink parameters and often neglected downlink parameters, and effectively integrates offline and online reinforcement learning, using historical data and real-time interaction to improve model performance. In addition, this study developed an open source LoRaWAN network simulator LoRaWANSim. The simulator considers the demodulator lock effect and supports multi-channel, multi-demodulator and bidirectional communication. Simulation experiments show that compared with the best results of all compared algorithms, HEAT improves the packet success rate and energy efficiency by 15% and 95%, respectively.
zh

[AI-44] CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)赋能的推荐系统(Recommender System, RecSys)在安全性和隐私性方面的脆弱性问题,特别是针对黑盒推荐系统的攻击挑战。传统基于强化学习(Reinforcement Learning, RL)代理的攻击方法因处理复杂文本输入、规划及推理能力的限制,在应对LLM-Empowered RecSys时效果不佳。为克服这一难题,论文提出了一种名为CheatAgent的新颖攻击框架,利用LLMs的人类模拟决策能力作为攻击代理。

解决方案的关键在于开发了一个基于LLM的攻击代理,首先通过最小化输入修改来确定插入位置以最大化影响;随后设计LLM代理生成对抗扰动并在目标位置插入;最后借助提示调优(Prompt Tuning)技术,通过从受害推荐系统获得的反馈迭代优化攻击策略,从而显著提升生成对抗扰动的质量。实验结果表明,该方法在三个真实数据集上的有效性得到了验证。

链接: https://arxiv.org/abs/2504.13192
作者: Liang-bo Ning,Shijie Wang,Wenqi Fan,Qing Li,Xin Xu,Hao Chen,Feiran Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system’s inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
zh

[AI-45] Factors That Influence the Adoption of AI-enabled Conversational Agents (AICAs) as an Augmenting Therapeutic Tool by Frontline Healthcare Workers: From Technology Acceptance Model 3 (TAM3) Lens – A Systematic Mapping Review

【速读】:该论文旨在探讨人工智能(AI)对话代理在心理健康领域的可行性,并试图解决如何通过系统性分析心理健康专业人士的观点,理解其对AI对话代理的态度以及影响其采纳和推荐的关键因素。论文的关键在于采用TAM3框架,从心理健康专业人员的角度出发,综合评估AI对话代理带来的机遇、关注点及其潜在影响,从而为该技术的开发与部署提供指导框架,确保其在增强心理健康服务中的有效应用。

链接: https://arxiv.org/abs/2504.13183
作者: Rawan AlMakinah
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligent (AI) conversational agents hold a promising future in the field of mental health, especially in helping marginalized communities that lack access to mental health support services. It is tempting to have a 24/7 mental health companion that can be accessed anywhere using mobile phones to provide therapist-like advice. Yet, caution should be taken, and studies around their feasibility need to be surveyed. Before adopting such a rapidly changing technology, studies on its feasibility should be explored, summarized, and synthesized to gain a solid understanding of the status quo and to enable us to build a framework that can guide us throughout the development and deployment processes. Different perspectives must be considered when investigating the feasibility of AI conversational agents, including the mental healthcare professional perspective. The literature can provide insights into their perspectives in terms of opportunities, concerns, and implications. Mental health professionals, the subject-matter experts in this field, have their points of view that should be understood and considered. This systematic literature review will explore mental health practitioners’ attitudes toward AI conversational agents and the factors that affect their adoption and recommendation of the technology to augment their services and treatments. The TAM3 Framework will be the lens through which this systematic literature review will be conducted.
zh

[AI-46] Near-optimal algorithms for private estimation and sequential testing of collision probability

【速读】:该论文致力于解决离散分布扩散的核心度量——碰撞概率(collision probability)的估计与假设检验问题。在满足 (\alpha, \beta)-局部差分隐私(Local Differential Privacy, LDP)约束的同时,论文提出了一种新算法,能够以误差 (\epsilon) 估计碰撞概率,所需样本复杂度为 (\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)),当 (\alpha \leq 1) 时,这一结果较先前工作提升了 (\frac{1}{\alpha^2}) 倍。此外,论文还设计了一个顺序测试算法,能够在未知 (\epsilon) 的情况下,仅需 (\tilde{O}\left(\frac{1}{\epsilon^2}\right)) 样本区分分离度为 (\epsilon) 的碰撞概率值。关键在于通过优化样本复杂度与隐私保护之间的权衡,显著减少了所需的样本数量,同时保持了接近最优的理论性能。
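
碰撞概率定义为 \sum_i p_i^2,即两次独立抽样结果相同的概率;其标准无偏估计是样本中"碰撞对"占全部样本对的比例。下面给出非私有情形下该估计量的最小示意(论文的贡献在于在此之上叠加本地差分隐私机制与序贯检验,此处不涉及):

```python
import numpy as np
from collections import Counter

def collision_probability(samples: np.ndarray) -> float:
    """无偏估计 sum_i p_i^2:匹配样本对数 / 总样本对数 C(n,2)。"""
    n = len(samples)
    counts = np.array(list(Counter(samples.tolist()).values()))
    matching_pairs = (counts * (counts - 1) // 2).sum()
    return float(matching_pairs / (n * (n - 1) / 2))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
samples = rng.choice(3, size=10000, p=p)
print(collision_probability(samples), (p ** 2).sum())  # 两者应接近 0.38
```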

链接: https://arxiv.org/abs/2504.13804
作者: Robert Busa-Fekete,Umar Syed
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies (\alpha, \beta)-local differential privacy and estimates collision probability with error at most \epsilon using \tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right) samples for \alpha \le 1, which improves over previous work by a factor of \frac{1}{\alpha^2}. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by \epsilon using \tilde{O}(\frac{1}{\epsilon^2}) samples, even when \epsilon is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.
zh

[AI-47] Adaptive Non-local Observable on Quantum Neural Networks

【速读】:该论文试图解决传统变分量子电路(Variational Quantum Circuits, VQCs)在量子机器学习中的局限性,特别是固定Hermitian可观测量导致的模型复杂度不足问题。解决方案的关键在于提出了一种基于海森堡绘景的自适应非局域测量框架,通过引入具有动态参数的可变Hermitian可观测量,将变分旋转优化视为可观测量空间中的轨迹追踪过程。这一视角揭示了标准VQC只是海森堡表示的一种特殊情况,并进一步证明了结合变分旋转与非局域可观测量能够增强量子比特间的相互作用和信息混合,从而实现更灵活的电路设计。论文提出了两种非局域测量方案,并通过分类任务的数值模拟验证了该方法相较于传统VQC的优越性能,提供了一种更强大且资源高效的量子神经网络实现方式。
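
"带可变参数的非局域可观测量"可写成非局域 Pauli 项的参数化线性组合 H(θ) = Σ_k θ_k P_k。下面用 NumPy 演示在两比特 Bell 态上计算这类可观测量的期望值(Pauli 项的选取为本文示例):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def observable(theta):
    """H(theta) = theta0 * Z⊗Z + theta1 * X⊗X:两比特非局域可观测量。"""
    return theta[0] * np.kron(Z, Z) + theta[1] * np.kron(X, X)

# Bell 态 |Phi+> = (|00> + |11>) / sqrt(2)
psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)

theta = np.array([0.7, 0.3])           # 可训练的观测量参数(示例值)
expval = np.real(psi.conj() @ observable(theta) @ psi)
print(expval)  # 0.7*1 + 0.3*1 = 1.0(Bell 态对 ZZ 与 XX 的期望均为 +1)
```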

链接: https://arxiv.org/abs/2504.13414
作者: Hsin-Yi Lin,Huan-Hsin Tseng,Samuel Yen-Chi Chen,Shinjae Yoo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conventional Variational Quantum Circuits (VQCs) for Quantum Machine Learning typically rely on a fixed Hermitian observable, often built from Pauli operators. Inspired by the Heisenberg picture, we propose an adaptive non-local measurement framework that substantially increases the model complexity of the quantum circuits. Our introduction of dynamical Hermitian observables with evolving parameters shows that optimizing VQC rotations corresponds to tracing a trajectory in the observable space. This viewpoint reveals that standard VQCs are merely a special case of the Heisenberg representation. Furthermore, we show that properly incorporating variational rotations with non-local observables enhances qubit interaction and information mixture, admitting flexible circuit designs. Two non-local measurement schemes are introduced, and numerical simulations on classification tasks confirm that our approach outperforms conventional VQCs, yielding a more powerful and resource-efficient approach as a Quantum Neural Network.
zh

[AI-48] Addressing the Minor-Embedding Problem in Quantum Annealing and Evaluating State-of-the-Art Algorithm Performance

【速读】:本文旨在解决量子退火处理器中的变量嵌入问题,即如何将Ising模型的变量映射到量子退火硬件上。论文的核心动机源于量子退火器在处理适合其架构的问题与非硬件原生拓扑问题时表现出的性能差异。研究的两个主要目标是:一是分析嵌入质量对D-Wave Systems量子退火器性能的影响;二是评估D-Wave提供的标准嵌入算法Minorminer生成嵌入的质量。针对第一目标,实验表明嵌入平均链长与采样解的相对误差之间存在明确关联,这凸显了嵌入质量对量子退火性能的关键影响。对于第二目标,论文聚焦于Minorminer技术,评估其嵌入能力、生成嵌入的质量以及结果的鲁棒性,并将其与另一种确定性的Clique Embedding算法进行比较,后者作为最坏情况下的基准。结果表明,Minorminer仍有显著改进空间,尚未始终优于最坏情况下的基准。因此,论文的关键在于揭示嵌入质量对量子退火性能的重要性,并通过对比不同嵌入算法,强调了优化嵌入方法的必要性。
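
Minorminer 以 Python 包形式开源(pip install minorminer),接口为 minorminer.find_embedding(源图边集, 目标图边集),返回"逻辑变量 → 物理比特链"的映射;论文分析的"平均链长"可据此直接计算。以下目标图用小规模随机正则图代替真实 QPU 拓扑,仅作示意:

```python
import networkx as nx
import minorminer

source = nx.complete_graph(5)                         # 待嵌入的全连接 Ising 问题
target = nx.random_regular_graph(d=4, n=60, seed=1)   # 示意用的稀疏"硬件"图

embedding = minorminer.find_embedding(source.edges, target.edges, random_seed=1)
if embedding:  # 嵌入可能失败,此时返回空 dict
    chain_lengths = [len(chain) for chain in embedding.values()]
    print("平均链长:", sum(chain_lengths) / len(chain_lengths))
else:
    print("未找到嵌入")
```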

链接: https://arxiv.org/abs/2504.13376
作者: Aitor Gómez-Tejedor,Eneko Osaba,Esther Villar-Rodriguez
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Paper submitted for review in the Future Generation Computer Systems journal

点击查看摘要

Abstract:This study addresses the minor-embedding problem, which involves mapping the variables of an Ising model onto a quantum annealing processor. The primary motivation stems from the observed performance disparity of quantum annealers when solving problems suited to the processor’s architecture versus those with non-hardware-native topologies. Our research has two main objectives: i) to analyze the impact of embedding quality on the performance of D-Wave Systems quantum annealers, and ii) to evaluate the quality of the embeddings generated by Minorminer, an algorithm provided by D-Wave and widely recognized as the standard minor-embedding technique in the literature. Regarding the first objective, our experiments reveal a clear correlation between the average chain length of embeddings and the relative errors of the solutions sampled. This underscores the critical influence of embedding quality on quantum annealing performance. For the second objective, we focus on the Minorminer technique, assessing its capacity to embed problems, the quality of the embeddings produced, and the robustness of the results. We also compare its performance with Clique Embedding, another algorithm developed by D-Wave, which is deterministic and designed to embed fully connected Ising models into quantum annealing processors, serving as a worst-case scenario. The results demonstrate that there is significant room for improvement for Minorminer, as it has not consistently outperformed the worst-case scenario.
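For readers unfamiliar with the chain-length metric discussed above, a toy sketch of how it is computed (the embedding values are made up):

```python
# A minor embedding maps each logical Ising variable to a "chain" of physical
# qubits; the paper correlates longer average chains with larger solution error.
embedding = {0: [3, 7, 12], 1: [4], 2: [9, 10]}   # logical var -> physical qubits (toy)
avg_chain_len = sum(len(chain) for chain in embedding.values()) / len(embedding)
print(avg_chain_len)   # 2.0
```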

[AI-49] Pricing AI Model Accuracy

【Quick Read】: This paper examines how AI model providers in a competitive market should optimize their investment in improving model accuracy. Its central focus is consumers' heterogeneous preferences for accuracy and how market competition shapes firms' incentives to improve model error rates. The study finds that improvements in overall accuracy do not necessarily increase profits; instead, the optimal choice is further investment along the dimension where a firm holds a competitive advantage (its superior dimension). The key to the solution is decomposing model error into the false positive rate and the false negative rate, with targeted investment reducing the error along each dimension. The paper shows that investing in a firm's superior dimension significantly increases its payoff, whereas investing in its inferior dimension leads to losses; although such profitable investments may harm consumers, they raise overall welfare.

Link: https://arxiv.org/abs/2504.13375
Authors: Nikhil Kumar
Affiliation: Unknown
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper examines the market for AI models in which firms compete to provide accurate model predictions and consumers exhibit heterogeneous preferences for model accuracy. We develop a consumer-firm duopoly model to analyze how competition affects firms’ incentives to improve model accuracy. Each firm aims to minimize its model’s error, but this choice can often be suboptimal. Counterintuitively, we find that in a competitive market, firms that improve overall accuracy do not necessarily improve their profits. Rather, each firm’s optimal decision is to invest further on the error dimension where it has a competitive advantage. By decomposing model errors into false positive and false negative rates, firms can reduce errors in each dimension through investments. Firms are strictly better off investing on their superior dimension and strictly worse off with investments on their inferior dimension. Profitable investments adversely affect consumers but increase overall welfare.
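The error decomposition at the heart of the model is elementary; a toy illustration with assumed confusion-matrix counts:

```python
# Decompose a model's error into the two rates the paper's firms invest in.
tp, fp, tn, fn = 80, 10, 95, 15           # toy confusion-matrix counts
fpr = fp / (fp + tn)                       # false positive rate
fnr = fn / (fn + tp)                       # false negative rate
print(f"FPR={fpr:.3f}, FNR={fnr:.3f}")     # each firm invests in its stronger dimension
```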

[AI-50] Adaptive AI decision interface for autonomous electronic material discovery

【Quick Read】: This paper addresses the problem that the effectiveness of AI-powered autonomous experimentation (AI/AE) in electronic materials discovery is limited by data scarcity, particularly the shortage of data caused by lengthy and complex design-fabricate-test-analyze cycles. Unlike experienced human scientists, even the most advanced AI algorithms lack the adaptability to make informative real-time decisions from limited datasets. To tackle this challenge, the paper develops and implements an AI decision interface to augment the AI/AE system. At the heart of the interface is an AI advisor that performs real-time progress monitoring, data analysis, and interactive human-AI collaboration, actively adapting to experiments of different stages and types. The key to the solution is the adaptive AI/AE architecture realized through this platform: using organic electrochemical transistors (OECTs) as the test device to optimize the mixed-conducting figure of merit μC*, it achieved a 150% performance improvement over the conventional spin-coating method, revealed two key structural factors for higher volumetric capacitance (larger crystalline lamellar spacing and higher specific surface area), and uncovered a new polymer polymorph of the material.

Link: https://arxiv.org/abs/2504.13344
Authors: Yahao Dai, Henry Chan, Aikaterini Vriza, Fredrick Kim, Yunfei Wang, Wei Liu, Naisong Shan, Jing Xu, Max Weires, Yukun Wu, Zhiqiang Cao, C. Suzanne Miller, Ralu Divan, Xiaodan Gu, Chenhui Zhu, Sihong Wang, Jie Xu
Affiliation: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI-powered autonomous experimentation (AI/AE) can accelerate materials discovery but its effectiveness for electronic materials is hindered by data scarcity from lengthy and complex design-fabricate-test-analyze cycles. Unlike experienced human scientists, even advanced AI algorithms in AI/AE lack the adaptability to make informative real-time decisions with limited datasets. Here, we address this challenge by developing and implementing an AI decision interface on our AI/AE system. The central element of the interface is an AI advisor that performs real-time progress monitoring, data analysis, and interactive human-AI collaboration for actively adapting to experiments in different stages and types. We applied this platform to an emerging type of electronic materials-mixed ion-electron conducting polymers (MIECPs) – to engineer and study the relationships between multiscale morphology and properties. Using organic electrochemical transistors (OECT) as the testing-bed device for evaluating the mixed-conducting figure-of-merit – the product of charge-carrier mobility and the volumetric capacitance (μC*), our adaptive AI/AE platform achieved a 150% increase in μC* compared to the commonly used spin-coating method, reaching 1,275 F cm^-1 V^-1 s^-1 in just 64 autonomous experimental trials. A study of 10 statistically selected samples identifies two key structural factors for achieving higher volumetric capacitance: larger crystalline lamellar spacing and higher specific surface area, while also uncovering a new polymer polymorph in this material.
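The figure of merit is a simple product; a quick sanity check with assumed component values (only the 1,275 F cm^-1 V^-1 s^-1 product is from the abstract):

```python
# Mixed-conducting figure of merit: charge-carrier mobility (mu) times
# volumetric capacitance (C*). Component values below are assumed for
# illustration; their product matches the reported figure.
mu = 5.1        # cm^2 V^-1 s^-1 (assumed)
c_star = 250.0  # F cm^-3 (assumed)
print(mu * c_star)   # 1275.0 F cm^-1 V^-1 s^-1
```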

Machine Learning

[LG-0] Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework

Link: https://arxiv.org/abs/2504.13811
Authors: Feijiang Han, Jiaming Zhang, Chuyi Deng, Jianheng Tang, Yunhuai Liu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Under Review

Click to view abstract

Abstract:WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat. Traditional machine learning and deep learning methods are hampered by issues such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models (LLMs) have gained attention for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two major contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling (WBFP) that enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off. However, all models lag behind previous State-Of-The-Art (SOTA) methods. With BFAD, the performance of all LLMs improved, with an average F1 score increase of 13.82%. Larger models such as GPT-4, LLaMA 3.1 70B, and Qwen 2.5 14B outperform SOTA methods, while smaller models such as Qwen 2.5 3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection, and provides solutions to address the challenges in this task.
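As a rough illustration of the Critical Function Filter component, the sketch below flags calls to PHP functions commonly abused in WebShells; the function list, regex, and weighting are illustrative, not the paper's actual filter.

```python
import re

# Flag PHP snippets that call functions frequently abused by WebShells
# (eval, system, base64_decode, ...). The real BFAD filter is more elaborate.
SUSPICIOUS = ["eval", "assert", "system", "exec", "shell_exec", "passthru",
              "base64_decode", "preg_replace"]
CALL = re.compile(r"\b(" + "|".join(SUSPICIOUS) + r")\s*\(", re.IGNORECASE)

def critical_calls(php_source: str):
    """Return the suspicious function calls found in a PHP script."""
    return [m.group(1).lower() for m in CALL.finditer(php_source)]

print(critical_calls('<?php eval(base64_decode($_POST["x"])); ?>'))
# ['eval', 'base64_decode']
```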

[LG-1] Transformer Encoder and Multi-features Time2Vec for Financial Prediction

Link: https://arxiv.org/abs/2504.13801
Authors: Nguyen Kim Hai Bui, Nguyen Duy Chien, Péter Kovács, Gergő Bognár
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 5 pages, currently under review at Eusipco 2025

Click to view abstract

Abstract:Financial prediction is a complex and challenging task of time series analysis and signal processing, expected to model both short-term fluctuations and long-term temporal dependencies. Transformers have remarkable success mostly in natural language processing using attention mechanism, which also influenced the time series community. The ability to capture both short and long-range dependencies helps to understand the financial market and to recognize price patterns, leading to successful applications of Transformers in stock prediction. However, previous research predominantly focuses on individual features and singular predictions, which limits the model’s ability to understand broader market trends. In reality, within sectors such as finance and technology, companies belonging to the same industry often exhibit correlated stock price movements. In this paper, we develop a novel neural network architecture by integrating Time2Vec with the Encoder of the Transformer model. Based on the study of different markets, we propose a novel correlation feature selection method. Through a comprehensive fine-tuning of multiple hyperparameters, we conduct a comparative analysis of our results against benchmark models. We conclude that our method outperforms other state-of-the-art encoding methods such as positional encoding, and we also conclude that selecting correlation features enhances the accuracy of predicting multiple stock prices.
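For context, here is a minimal PyTorch implementation of the standard Time2Vec embedding that the paper combines with a Transformer encoder; sizes and initialization are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec: one linear component plus k-1 periodic (sine) components."""
    def __init__(self, k: int):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(k - 1))
        self.b = nn.Parameter(torch.randn(k - 1))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len, 1) scalar timestamps
        linear = self.w0 * t + self.b0                # (batch, seq, 1)
        periodic = torch.sin(self.w * t + self.b)     # (batch, seq, k-1) by broadcasting
        return torch.cat([linear, periodic], dim=-1)  # (batch, seq, k)

emb = Time2Vec(k=8)
print(emb(torch.arange(5.0).reshape(1, 5, 1)).shape)  # torch.Size([1, 5, 8])
```

The resulting embedding can be concatenated with price features before feeding the Transformer encoder, which is the kind of integration the abstract describes.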

[LG-2] The Binary and Ternary Quantization Can Improve Feature Discrimination

Link: https://arxiv.org/abs/2504.13792
Authors: Weizhi Lu, Mingrui Chen, Weiyu Li
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In machine learning, quantization is widely used to simplify data representation and facilitate algorithm deployment on hardware. Given the fundamental role of classification in machine learning, it is crucial to investigate the impact of quantization on classification. Current research primarily focuses on quantization errors, operating under the premise that higher quantization errors generally result in lower classification performance. However, this premise lacks a solid theoretical foundation and often contradicts empirical findings. For instance, certain extremely low bit-width quantization methods, such as {0,1}-binary quantization and {0,±1}-ternary quantization, can achieve comparable or even superior classification accuracy compared to the original non-quantized data, despite exhibiting high quantization errors. To more accurately evaluate classification performance, we propose to directly investigate the feature discrimination of quantized data, instead of analyzing its quantization error. Interestingly, it is found that both binary and ternary quantization methods can improve, rather than degrade, the feature discrimination of the original data. This remarkable performance is validated through classification experiments across various data types, including images, speech, and texts.
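The two quantizers under study are simple to state; a minimal NumPy sketch with illustrative thresholds (the paper's threshold choices may differ):

```python
import numpy as np

def binary_quantize(x, tau=None):
    """{0,1} quantization: 1 where x exceeds a threshold (default: the mean)."""
    tau = x.mean() if tau is None else tau
    return (x > tau).astype(np.float32)

def ternary_quantize(x, tau=0.5):
    """{0,+1,-1} quantization: sign of x where |x| exceeds tau, else 0."""
    return np.sign(x) * (np.abs(x) > tau)

x = np.array([-1.2, -0.1, 0.3, 0.9])
print(binary_quantize(x))    # [0. 0. 1. 1.]
print(ternary_quantize(x))   # [-1. -0.  0.  1.]
```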

[LG-3] On the Relationship Between Robustness and Expressivity of Graph Neural Networks

Link: https://arxiv.org/abs/2504.13786
Authors: Lorenz Kummer, Wilfried N. Gansterer, Nils M. Kriege
Subjects: Machine Learning (cs.LG)
Comments: Accepted at AISTATS 2025; DOI will be added when available

Click to view abstract

Abstract:We investigate the vulnerability of Graph Neural Networks (GNNs) to bit-flip attacks (BFAs) by introducing an analytical framework to study the influence of architectural features, graph properties, and their interaction. The expressivity of GNNs refers to their ability to distinguish non-isomorphic graphs and depends on the encoding of node neighborhoods. We examine the vulnerability of neural multiset functions commonly used for this purpose and establish formal criteria to characterize a GNN’s susceptibility to losing expressivity due to BFAs. This enables an analysis of the impact of homophily, graph structural variety, feature encoding, and activation functions on GNN robustness. We derive theoretical bounds for the number of bit flips required to degrade GNN expressivity on a dataset, identifying ReLU-activated GNNs operating on highly homophilous graphs with low-dimensional or one-hot encoded features as particularly susceptible. Empirical results using ten real-world datasets confirm the statistical significance of our key theoretical insights and offer actionable results to mitigate BFA risks in expressivity-critical applications.
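As a concrete picture of what a single bit flip can do, the snippet below flips one exponent bit of a float32 weight; a toy demonstration of the fault model, not the paper's attack procedure.

```python
import numpy as np

# Flip one bit of a float32 weight in place: view the buffer as uint32 and
# XOR the target bit (bit 30 sits in the exponent, so the change is drastic).
w = np.array([0.125], dtype=np.float32)
w.view(np.uint32)[0] ^= np.uint32(1 << 30)
print(w[0])   # roughly 4.25e37: orders of magnitude away from 0.125
```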

[LG-4] Equi-Euler GraphNet: An Equivariant Temporal-Dynamics Informed Graph Neural Network for Dual Force and Trajectory Prediction in Multi-Body Systems

Link: https://arxiv.org/abs/2504.13768
Authors: Vinay Sharma, Rémi Tanguy Oddon, Pietro Tesini, Jens Ravesloot, Cees Taal, Olga Fink
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Accurate real-time modeling of multi-body dynamical systems is essential for enabling digital twin applications across industries. While many data-driven approaches aim to learn system dynamics, jointly predicting internal loads and system trajectories remains a key challenge. This dual prediction is especially important for fault detection and predictive maintenance, where internal loads-such as contact forces-act as early indicators of faults, reflecting wear or misalignment before affecting motion. These forces also serve as inputs to degradation models (e.g., crack growth), enabling damage prediction and remaining useful life estimation. We propose Equi-Euler GraphNet, a physics-informed graph neural network (GNN) that simultaneously predicts internal forces and global trajectories in multi-body systems. In this mesh-free framework, nodes represent system components and edges encode interactions. Equi-Euler GraphNet introduces two inductive biases: (1) an equivariant message-passing scheme, interpreting edge messages as interaction forces consistent under Euclidean transformations; and (2) a temporal-aware iterative node update mechanism, based on Euler integration, to capture influence of distant interactions over time. Tailored for cylindrical roller bearings, it decouples ring dynamics from constrained motion of rolling elements. Trained on high-fidelity multiphysics simulations, Equi-Euler GraphNet generalizes beyond the training distribution, accurately predicting loads and trajectories under unseen speeds, loads, and configurations. It outperforms state-of-the-art GNNs focused on trajectory prediction, delivering stable rollouts over thousands of time steps with minimal error accumulation. Achieving up to a 200x speedup over conventional solvers while maintaining comparable accuracy, it serves as an efficient reduced-order model for digital twins, design, and maintenance.

[LG-5] Predictors of Childhood Vaccination Uptake in England: An Explainable Machine Learning Analysis of Longitudinal Regional Data (2021-2024)

Link: https://arxiv.org/abs/2504.13755
Authors: Amin Noroozi, Sidratul Muntaha Esha, Mansoureh Ghari
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Childhood vaccination is a cornerstone of public health, yet disparities in vaccination coverage persist across England. These disparities are shaped by complex interactions among various factors, including geographic, demographic, socioeconomic, and cultural (GDSC) factors. Previous studies mostly rely on cross-sectional data and traditional statistical approaches that assess individual or limited sets of variables in isolation. Such methods may fall short in capturing the dynamic and multivariate nature of vaccine uptake. In this paper, we conducted a longitudinal machine learning analysis of childhood vaccination coverage across 150 districts in England from 2021 to 2024. Using vaccination data from NHS records, we applied hierarchical clustering to group districts by vaccination coverage into low- and high-coverage clusters. A CatBoost classifier was then trained to predict districts’ vaccination clusters using their GDSC data. Finally, the SHapley Additive exPlanations (SHAP) method was used to interpret the predictors’ importance. The classifier achieved high accuracies of 92.1%, 90.6%, and 86.3% in predicting districts’ vaccination clusters for the years 2021-2022, 2022-2023, and 2023-2024, respectively. SHAP revealed that geographic, cultural, and demographic variables, particularly rurality, English language proficiency, the percentage of foreign-born residents, and ethnic composition, were the most influential predictors of vaccination coverage, whereas socioeconomic variables, such as deprivation and employment, consistently showed lower importance, especially in 2023-2024. Surprisingly, rural districts were significantly more likely to have higher vaccination rates. Additionally, districts with lower vaccination coverage had higher populations whose first language was not English, who were born outside the UK, or who were from ethnic minority groups.

[LG-6] Dynamic Regularized CBDT: Variance-Calibrated Causal Boosting for Interpretable Heterogeneous Treatment Effects

Link: https://arxiv.org/abs/2504.13733
Authors: Yichen Liu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint version. 13 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Heterogeneous treatment effect estimation in high-stakes applications demands models that simultaneously optimize precision, interpretability, and calibration. Many existing tree-based causal inference techniques, however, exhibit high estimation errors when applied to observational data because they struggle to capture complex interactions among factors and rely on static regularization schemes. In this work, we propose Dynamic Regularized Causal Boosted Decision Trees (CBDT), a novel framework that integrates variance regularization and average treatment effect calibration into the loss function of gradient boosted decision trees. Our approach dynamically updates the regularization parameters using gradient statistics to better balance the bias-variance tradeoff. Extensive experiments on standard benchmark datasets and real-world clinical data demonstrate that the proposed method significantly improves estimation accuracy while maintaining reliable coverage of true treatment effects. In an intensive care unit patient triage study, the method successfully identified clinically actionable rules and achieved high accuracy in treatment effect estimation. The results validate that dynamic regularization can effectively tighten error bounds and enhance both predictive performance and model interpretability.

[LG-7] MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL

Link: https://arxiv.org/abs/2504.13691
Authors: Jinhui Pang, Changqing Lin, Hao Lin, Jinglin He, Zhengjun Li, Zhihui Zhang, Xiaoshuai Hao
Subjects: Machine Learning (cs.LG)
Comments: Under Review

Click to view abstract

Abstract:Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting for GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we endow the model to learn high-quality priors that enhance incremental learning by aligning its behaviors across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.

[LG-8] Efficient algorithms for the Hadamard decomposition

Link: https://arxiv.org/abs/2504.13633
Authors: Samuel Wertz, Arnaud Vandaele, Nicolas Gillis
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 7 pages, code available from this https URL

Click to view abstract

Abstract:The Hadamard decomposition is a powerful technique for data analysis and matrix compression, which decomposes a given matrix into the element-wise product of two or more low-rank matrices. In this paper, we develop an efficient algorithm to solve this problem, leveraging an alternating optimization approach that decomposes the global non-convex problem into a series of convex sub-problems. To improve performance, we explore advanced initialization strategies inspired by the singular value decomposition (SVD) and incorporate acceleration techniques by introducing momentum-based updates. Beyond optimizing the two-matrix case, we also extend the Hadamard decomposition framework to support more than two low-rank matrices, enabling approximations with higher effective ranks while preserving computational efficiency. Finally, we conduct extensive experiments to compare our method with the existing gradient descent-based approaches for the Hadamard decomposition and with traditional low-rank approximation techniques. The results highlight the effectiveness of our proposed method across diverse datasets.
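To fix ideas, here is a minimal NumPy sketch of the two-factor Hadamard decomposition objective, using plain gradient descent rather than the paper's alternating convex sub-problems, SVD-inspired initialization, or momentum; the step size is illustrative and may need tuning.

```python
import numpy as np

def hadamard_decomposition(M, rank=2, iters=2000, lr=1e-3, seed=0):
    """Approximate M as (A1 @ B1.T) * (A2 @ B2.T), the elementwise product of
    two low-rank matrices, by minimizing 0.5 * ||residual||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    A1, A2 = rng.standard_normal((m, rank)), rng.standard_normal((m, rank))
    B1, B2 = rng.standard_normal((n, rank)), rng.standard_normal((n, rank))
    for _ in range(iters):
        L1, L2 = A1 @ B1.T, A2 @ B2.T
        R = L1 * L2 - M                   # residual of the Hadamard model
        # gradients of 0.5 * ||R||_F^2 with respect to each factor
        A1 -= lr * (R * L2) @ B1
        B1 -= lr * (R * L2).T @ A1
        A2 -= lr * (R * L1) @ B2
        B2 -= lr * (R * L1).T @ A2
    return A1, B1, A2, B2
```

Because the effective rank of an elementwise product can exceed that of either factor, even small ranks here can approximate matrices that a single low-rank factorization of the same size cannot.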

[LG-9] Fairness and Robustness in Machine Unlearning

Link: https://arxiv.org/abs/2504.13610
Authors: Khoa Tran, Simon S. Woo
Subjects: Machine Learning (cs.LG)
Comments: 5 pages

Click to view abstract

Abstract:Machine unlearning poses the challenge of "how to eliminate the influence of specific data from a pretrained model" in regard to privacy concerns. While prior research on approximated unlearning has demonstrated accuracy and efficiency in time complexity, we claim that it falls short of achieving exact unlearning, and we are the first to focus on fairness and robustness in machine unlearning algorithms. Our study presents fairness Conjectures for a well-trained model, based on the variance-bias trade-off characteristic, and considers their relevance to robustness. Our Conjectures are supported by experiments conducted on the two most widely used model architectures, ResNet and ViT, demonstrating the correlation between fairness and robustness: the higher the fairness gap, the more sensitive and vulnerable the model. In addition, our experiments demonstrate the vulnerability of current state-of-the-art approximated unlearning algorithms to adversarial attacks, where their unlearned models suffer a significant drop in accuracy compared to the exact-unlearned models. We claim that our fairness-gap measurement and robustness metric should be used to evaluate the unlearning algorithm. Furthermore, we demonstrate that unlearning in the intermediate and last layers is sufficient and cost-effective for time and memory complexity.

[LG-10] Bitcoin's Edge: Embedded Sentiment in Blockchain Transactional Data

Link: https://arxiv.org/abs/2504.13598
Authors: Charalampos Kleitsikas, Nikolaos Korfiatis, Stefanos Leonardos, Carmine Ventre
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: Published in IEEE International Conference on Blockchain and Cryptocurrency 2025

Click to view abstract

Abstract:Cryptocurrency blockchains, beyond their primary role as distributed payment systems, are increasingly used to store and share arbitrary content, such as text messages and files. Although often non-financial, this hidden content can impact price movements by conveying private information, shaping sentiment, and influencing public opinion. However, current analyses of such data are limited in scope and scalability, primarily relying on manual classification or hand-crafted heuristics. In this work, we address these limitations by employing Natural Language Processing techniques to analyze, detect patterns, and extract public sentiment encoded within blockchain transactional data. Using a variety of Machine Learning techniques, we showcase for the first time the predictive power of blockchain-embedded sentiment in forecasting cryptocurrency price movements on the Bitcoin and Ethereum blockchains. Our findings shed light on a previously underexplored source of freely available, transparent, and immutable data and introduce blockchain sentiment analysis as a novel and robust framework for enhancing financial predictions in cryptocurrency markets. Incidentally, we discover an asymmetry between cryptocurrencies; Bitcoin has an informational advantage over Ethereum in that the sentiment embedded into transactional data is sufficient to predict its price movement.

[LG-11] Towards End-to-End Network Intent Management with Large Language Models

Link: https://arxiv.org/abs/2504.13589
Authors: Lam Dinh, Sihem Cherrared, Xiaofeng Huang, Fabrice Guillemin
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Full paper is accepted at IFIP Networking 2025

Click to view abstract

Abstract:Large Language Models (LLMs) are likely to play a key role in Intent-Based Networking (IBN) as they show remarkable performance in interpreting human language as well as code generation, enabling the translation of high-level intents expressed by humans into low-level network configurations. In this paper, we leverage closed-source language models (i.e., Google Gemini 1.5 pro, ChatGPT-4) and open-source models (i.e., LLama, Mistral) to investigate their capacity to generate E2E network configurations for radio access networks (RANs) and core networks in 5G/6G mobile networks. We introduce a novel performance metric, known as FEACI, to quantitatively assess the format (F), explainability (E), accuracy (A), cost (C), and inference time (I) of the generated answer; existing general metrics are unable to capture these features. The results of our study demonstrate that open-source models can achieve translation performance comparable or even superior to that of the closed-source models, which require costly hardware setups and are not accessible to all users.

[LG-12] How to Achieve Higher Accuracy with Less Training Points?

Link: https://arxiv.org/abs/2504.13586
Authors: Jinghan Yang, Anupam Pani, Yunchao Zhang
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which training samples should be included in the training set. We conducted empirical evaluations of our method on binary classification tasks utilizing logistic regression models. Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data. Furthermore, we found that our method achieved even higher accuracy when trained with just 60% of the data.
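A sketch of influence-function scoring for L2-regularized logistic regression, the setting the paper evaluates; the damping term and the sign convention are assumptions, and the paper's exact selection rule may differ.

```python
import numpy as np

def influence_scores(X, y, theta, X_val, y_val, damp=1e-3):
    """Classic I(z) = -grad_val^T H^{-1} grad_z scores for logistic regression
    with labels y in {0,1}; higher-magnitude scores mark more informative points."""
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    # Hessian of the mean log-loss: X^T diag(p(1-p)) X / n, plus damping for stability
    H = (X.T * (p * (1 - p))) @ X / n + damp * np.eye(d)
    # gradient of the validation loss at theta
    p_val = 1.0 / (1.0 + np.exp(-X_val @ theta))
    g_val = X_val.T @ (p_val - y_val) / len(y_val)
    H_inv_g = np.linalg.solve(H, g_val)
    # per-sample training gradients: (p_i - y_i) * x_i
    grads = X * (p - y)[:, None]
    return -grads @ H_inv_g
```

Ranking training points by these scores and keeping the top fraction is one way to realize the "train on 10% of the data" experiment the abstract describes.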

[LG-13] Hysteresis-Aware Neural Network Modeling and Whole-Body Reinforcement Learning Control of Soft Robots

Link: https://arxiv.org/abs/2504.13582
Authors: Zongyuan Chen, Yan Xia, Jiayuan Liu, Jijia Liu, Wenhao Tang, Jiayu Chen, Feng Gao, Longfei Ma, Hongen Liao, Yu Wang, Chao Yu, Boyu Zhang, Fei Xing
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Soft robots exhibit inherent compliance and safety, which makes them particularly suitable for applications requiring direct physical interaction with humans, such as surgical procedures. However, their nonlinear and hysteretic behavior, resulting from the properties of soft materials, presents substantial challenges for accurate modeling and control. In this study, we present a soft robotic system designed for surgical applications and propose a hysteresis-aware whole-body neural network model that accurately captures and predicts the soft robot’s whole-body motion, including its hysteretic behavior. Building upon the high-precision dynamic model, we construct a highly parallel simulation environment for soft robot control and apply an on-policy reinforcement learning algorithm to efficiently train whole-body motion control strategies. Based on the trained control policy, we developed a soft robotic system for surgical applications and validated it through phantom-based laser ablation experiments in a physical environment. The results demonstrate that the hysteresis-aware modeling reduces the Mean Squared Error (MSE) by 84.95 percent compared to traditional modeling methods. The deployed control algorithm achieved a trajectory tracking error ranging from 0.126 to 0.250 mm on the real soft robot, highlighting its precision in real-world conditions. The proposed method showed strong performance in phantom-based surgical experiments and demonstrates its potential for complex scenarios, including future real-world clinical applications.

[LG-14] MSTIM: A MindSpore-Based Model for Traffic Flow Prediction

Link: https://arxiv.org/abs/2504.13576
Authors: Weiqi Qin, Yuxin Liu, Dongze Wu, Zhenkai Qin, Qining Luo
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Aiming at the problems of low accuracy and large error fluctuation that traditional traffic flow prediction models face when dealing with multi-scale temporal features and dynamic change patterns, this paper proposes MSTIM, a multi-scale time series information modelling model based on the MindSpore framework, which integrates long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and an attention mechanism to improve modelling accuracy and stability. The Metropolitan Interstate Traffic Volume (MITV) dataset was used for the experiments, with comparative analysis against typical LSTM-attention, CNN-attention and LSTM-CNN models. The experimental results show that the MSTIM model achieves better results on Mean Absolute Error (MAE), Mean Square Error (MSE), and Root Mean Square Error (RMSE), significantly improving the accuracy and stability of traffic volume prediction.

[LG-15] Bayesian continual learning and forgetting in neural networks

Link: https://arxiv.org/abs/2504.13569
Authors: Djohan Bonnet, Kellian Cottart, Tifenn Hirtzlin, Tarcisius Januel, Thomas Dalgaty, Elisa Vianello, Damien Querlioz
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Biological synapses effortlessly balance memory retention and flexibility, yet artificial neural networks still struggle with the extremes of catastrophic forgetting and catastrophic remembering. Here, we introduce Metaplasticity from Synaptic Uncertainty (MESU), a Bayesian framework that updates network parameters according to their uncertainty. This approach allows a principled combination of learning and forgetting that ensures that critical knowledge is preserved while unused or outdated information is gradually released. Unlike standard Bayesian approaches, which risk becoming overly constrained, and popular continual-learning methods, which rely on explicit task boundaries, MESU seamlessly adapts to streaming data. It further provides reliable epistemic uncertainty estimates, allowing out-of-distribution detection, the only computational cost being to sample the weights multiple times to provide proper output statistics. Experiments on image-classification benchmarks demonstrate that MESU mitigates catastrophic forgetting, while maintaining plasticity for new tasks. When training 200 sequential permuted MNIST tasks, MESU outperforms established continual learning techniques in terms of accuracy, capability to learn additional tasks, and out-of-distribution data detection. Additionally, due to its non-reliance on task boundaries, MESU outperforms conventional learning techniques on the incremental training of CIFAR-100 tasks consistently in a wide range of scenarios. Our results unify ideas from metaplasticity, Bayesian inference, and Hessian-based regularization, offering a biologically-inspired pathway to robust, perpetual learning.

[LG-16] Irregular Sampling of High-Dimensional Functions in Reproducing Kernel Hilbert Spaces

Link: https://arxiv.org/abs/2504.13543
Authors: Armin Iske, Lennart Ohlsen
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:We develop sampling formulas for high-dimensional functions in reproducing kernel Hilbert spaces, where we rely on irregular samples that are taken at determining sequences of data points. We place particular emphasis on sampling formulas for tensor product kernels, where we show that determining irregular samples in lower dimensions can be composed to obtain a tensor of determining irregular samples in higher dimensions. This in turn reduces the computational complexity of sampling formulas for high-dimensional functions quite significantly.
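Schematically, the tensor-product construction for two factor dimensions reads as follows, with (p_i) and (q_j) the determining irregular sample sequences in each factor; the notation here is ours, chosen to match the abstract's description rather than the paper's exact statement.

```latex
% Two-factor tensor product kernel and the induced sampling expansion:
\[
  k\bigl((x_1,x_2),(y_1,y_2)\bigr) = k_1(x_1,y_1)\,k_2(x_2,y_2),
  \qquad
  f(x_1,x_2) \approx \sum_{i}\sum_{j} c_{ij}\, k_1(x_1,p_i)\, k_2(x_2,q_j).
\]
```

Composing determining samples per factor in this way is what reduces the cost of the high-dimensional sampling formula: one only needs sample sequences in each low-dimensional factor, not in the full product domain.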

[LG-17] Can Local Representation Alignment RNNs Solve Temporal Tasks?

Link: https://arxiv.org/abs/2504.13531
Authors: Nikolay Manchev, Luis C. Garcia-Peraza-Herrera
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recurrent Neural Networks (RNNs) are commonly used for real-time processing, streaming data, and cases where the amount of training samples is limited. Backpropagation Through Time (BPTT) is the predominant algorithm for training RNNs; however, it is frequently criticized for being prone to exploding and vanishing gradients and being biologically implausible. In this paper, we present and evaluate a target propagation-based method for RNNs, which uses local updates and seeks to reduce the said instabilities. Having stable RNN models increases their practical use in a wide range of fields such as natural language processing, time-series forecasting, anomaly detection, control systems, and robotics. The proposed solution uses local representation alignment (LRA). We thoroughly analyze the performance of this method, experiment with normalization and different local error functions, and invalidate certain assumptions about the behavior of this type of learning. Namely, we demonstrate that despite the decomposition of the network into sub-graphs, the model still suffers from vanishing gradients. We also show that gradient clipping as proposed in LRA has little to no effect on network performance. This results in an LRA RNN model that is very difficult to train due to vanishing gradients. We address this by introducing gradient regularization in the direction of the update and demonstrate that this modification promotes gradient flow and meaningfully impacts convergence. We compare and discuss the performance of the algorithm, and we show that the regularized LRA RNN considerably outperforms the unregularized version on three landmark tasks: temporal order, 3-bit temporal order, and random permutation.

[LG-18] Risk-aware black-box portfolio construction using Bayesian optimization with adaptive weighted Lagrangian estimator

Link: https://arxiv.org/abs/2504.13529
Authors: Zinuo You, John Cartlidge, Karen Elliott, Menghan Ge, Daniel Gold
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Existing portfolio management approaches are often black-box models due to safety and commercial issues in the industry. However, their performance can vary considerably whenever market conditions or internal trading strategies change. Furthermore, evaluating these non-transparent systems is expensive, where certain budgets limit observations of the systems. Therefore, optimizing performance while controlling the potential risk of these financial systems has become a critical challenge. This work presents a novel Bayesian optimization framework to optimize black-box portfolio management models under limited observations. In conventional Bayesian optimization settings, the objective function is to maximize the expectation of performance metrics. However, simply maximizing performance expectations leads to erratic optimization trajectories, which exacerbate risk accumulation in portfolio management. Meanwhile, this can lead to misalignment between the target distribution and the actual distribution of the black-box model. To mitigate this problem, we propose an adaptive weighted Lagrangian estimator with a dual objective, which incorporates maximizing model performance and minimizing the variance of model observations. Extensive experiments demonstrate the superiority of our approach over five backtest settings with three black-box stock portfolio management models. Ablation studies further verify the effectiveness of the proposed estimator.

[LG-19] Designing a reliable lateral movement detector using a graph foundation model

Link: https://arxiv.org/abs/2504.13527
Authors: Corentin Larroche
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Foundation models have recently emerged as a new paradigm in machine learning (ML). These models are pre-trained on large and diverse datasets and can subsequently be applied to various downstream tasks with little or no retraining. This allows people without advanced ML expertise to build ML applications, accelerating innovation across many fields. However, the adoption of foundation models in cybersecurity is hindered by their inability to efficiently process data such as network traffic captures or binary executables. The recent introduction of graph foundation models (GFMs) could make a significant difference, as graphs are well-suited to representing these types of data. We study the usability of GFMs in cybersecurity through the lens of one specific use case, namely lateral movement detection. Using a pre-trained GFM, we build a detector that reaches state-of-the-art performance without requiring any training on domain-specific data. This case study thus provides compelling evidence of the potential of GFMs for cybersecurity.

[LG-20] Cross-Modal Temporal Fusion for Financial Market Forecasting

Link: https://arxiv.org/abs/2504.13522
Authors: Yunhua Pei, John Cartlidge, Anandadeep Mandal, Daniel Gold, Enrique Marcilio, Riccardo Mazzon
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Finance (q-fin.CP)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Accurate financial market forecasting requires diverse data sources, including historical price trends, macroeconomic indicators, and financial news, each contributing unique predictive signals. However, existing methods often process these modalities independently or fail to effectively model their interactions. In this paper, we introduce Cross-Modal Temporal Fusion (CMTF), a novel transformer-based framework that integrates heterogeneous financial data to improve predictive accuracy. Our approach employs attention mechanisms to dynamically weight the contribution of different modalities, along with a specialized tensor interpretation module for feature extraction. To facilitate rapid model iteration in industry applications, we incorporate a mature auto-training scheme that streamlines optimization. When applied to real-world financial datasets, CMTF demonstrates improvements over baseline models in forecasting stock price movements and provides a scalable and effective solution for cross-modal integration in financial market prediction.

[LG-21] Monitor and Recover: A Paradigm for Future Research on Distribution Shift in Learning-Enabled Cyber-Physical Systems

Link: https://arxiv.org/abs/2504.13484
Authors: Vivian Lin, Insup Lee
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted to ICCPS 2025

Click to view abstract

Abstract:With the known vulnerability of neural networks to distribution shift, maintaining reliability in learning-enabled cyber-physical systems poses a salient challenge. In response, many existing methods adopt a detect and abstain methodology, aiming to detect distribution shift at inference time so that the learning-enabled component can abstain from decision-making. This approach, however, has limited use in real-world applications. We instead propose a monitor and recover paradigm as a promising direction for future research. This philosophy emphasizes 1) robust safety monitoring instead of distribution shift detection and 2) distribution shift recovery instead of abstention. We discuss two examples from our recent work.

[LG-22] Latent Tensor Factorization with Nonlinear PID Control for Missing Data Recovery in Non-Intrusive Load Monitoring

Link: https://arxiv.org/abs/2504.13483
Authors: Yiran Wang, Tangtang Xie, Hao Wu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Non-Intrusive Load Monitoring (NILM) has emerged as a key smart grid technology, identifying electrical devices and providing detailed energy consumption data for precise demand response management. Nevertheless, NILM data suffers from missing values due to inescapable factors like sensor failure, leading to inaccuracies in non-intrusive load monitoring. A stochastic gradient descent (SGD)-based latent factorization of tensors model has proven to be effective in estimating missing data; however, it updates a latent factor solely based on the current stochastic gradient, without considering past information, which leads to slow convergence of an LFT model. To address this issue, this paper proposes a Nonlinear Proportional-integral-derivative (PID)-Incorporated Latent factorization of tensors (NPIL) model with two-fold ideas: a) rebuilding the instant learning error according to the principle of a nonlinear PID controller, so that past update information is efficiently incorporated into the learning scheme, and b) implementing gain parameter adaptation by utilizing the particle swarm optimization (PSO) algorithm, hence effectively improving the model's computational efficiency. Experimental results on real-world NILM datasets demonstrate that the proposed NPIL model surpasses state-of-the-art models in convergence rate and accuracy when predicting the missing NILM data.
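A minimal sketch of the PID-rebuilt instant error idea: the raw per-entry error is replaced by a weighted combination of its proportional, integral, and derivative terms before the factor update. The gains here are fixed by hand, whereas the paper adapts them with PSO.

```python
class PIDError:
    """Rebuild the instant learning error with PID terms (a sketch of the idea)."""
    def __init__(self, kp=1.0, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev = 0.0, 0.0

    def adjust(self, e_t: float) -> float:
        """Return the PID-adjusted error used in place of the raw SGD error."""
        self.integral += e_t              # accumulated past errors (I term)
        deriv = e_t - self.prev           # error trend (D term)
        self.prev = e_t
        return self.kp * e_t + self.ki * self.integral + self.kd * deriv

pid = PIDError()
for raw_error in [0.8, 0.5, 0.3]:
    print(pid.adjust(raw_error))          # adjusted errors fed to the factor update
```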

[LG-23] SFL-LEO: Asynchronous Split-Federated Learning Design for LEO Satellite-Ground Network Framework

Link: https://arxiv.org/abs/2504.13479
Authors: Jiasheng Wu, Jingjing Zhang, Zheng Lin, Zhe Chen, Xiong Wang, Wenjun Zhu, Yue Gao
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 13 pages, 14 figures

Click to view abstract

Abstract:Recently, the rapid development of LEO satellite networks has spurred another widespread concern: data processing at satellites. However, achieving efficient computation at LEO satellites in highly dynamic satellite networks is challenging and remains an open problem when considering the constrained computation capability of LEO satellites. For the first time, we propose a novel distributed learning framework named SFL-LEO by combining Federated Learning (FL) with Split Learning (SL) to accommodate the high dynamics of LEO satellite networks and the constrained computation capability of LEO satellites by leveraging the periodical orbit traveling feature. The proposed scheme allows training locally by introducing an asynchronous training strategy, i.e., performing local updates when LEO satellites disconnect from the ground station, to provide much more training space and thus increase the training performance. Meanwhile, it aggregates client-side sub-models at the ground station and then distributes them to LEO satellites by borrowing the idea from the federated learning scheme. Experiment results driven by satellite-ground bandwidth measured in Starlink demonstrate that SFL-LEO provides accuracy similar to the conventional SL scheme because it can perform local training even within the disconnection duration.

[LG-24] Safety Monitoring for Learning-Enabled Cyber-Physical Systems in Out-of-Distribution Scenarios

Link: https://arxiv.org/abs/2504.13478
Authors: Vivian Lin, Ramneet Kaur, Yahan Yang, Souradeep Dutta, Yiannis Kantaros, Anirban Roy, Susmit Jha, Oleg Sokolsky, Insup Lee
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICCPS 2025

Click to view abstract

Abstract:The safety of learning-enabled cyber-physical systems is compromised by the well-known vulnerabilities of deep neural networks to out-of-distribution (OOD) inputs. Existing literature has sought to monitor the safety of such systems by detecting OOD data. However, such approaches have limited utility, as the presence of an OOD input does not necessarily imply the violation of a desired safety property. We instead propose to directly monitor safety in a manner that is itself robust to OOD data. To this end, we predict violations of signal temporal logic safety specifications based on predicted future trajectories. Our safety monitor additionally uses a novel combination of adaptive conformal prediction and incremental learning. The former obtains probabilistic prediction guarantees even on OOD data, and the latter prevents overly conservative predictions. We evaluate the efficacy of the proposed approach in two case studies on safety monitoring: 1) predicting collisions of an F1Tenth car with static obstacles, and 2) predicting collisions of a race car with multiple dynamic obstacles. We find that adaptive conformal prediction obtains theoretical guarantees where other uncertainty quantification methods fail to do so. Additionally, combining adaptive conformal prediction and incremental learning for safety monitoring achieves high recall and timeliness while reducing loss in precision. We achieve these results even in OOD settings and outperform alternative methods.
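A compact sketch of the adaptive conformal inference update behind the probabilistic guarantees (in the style of Gibbs and Candès); the step size gamma and the score stream are illustrative, not the paper's exact monitor.

```python
import numpy as np

def aci_stream(cal_scores, stream_scores, alpha=0.1, gamma=0.02):
    """Adjust the working miscoverage level alpha_t online so that empirical
    coverage tracks 1 - alpha even when the data drift out of distribution."""
    alpha_t = alpha
    covered = []
    for s in stream_scores:
        # threshold at the (1 - alpha_t) empirical quantile of calibration scores
        q = np.quantile(cal_scores, min(max(1 - alpha_t, 0.0), 1.0))
        err = float(s > q)                  # 1 if the new point falls outside the set
        covered.append(1.0 - err)
        alpha_t += gamma * (alpha - err)    # raise alpha when over-covering, lower when missing
    return np.mean(covered)

rng = np.random.default_rng(0)
print(aci_stream(rng.normal(size=500), rng.normal(loc=0.5, size=200)))  # shifted stream
```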

[LG-25] Are you SURE? Enhancing Multimodal Pretraining with Missing Modalities through Uncertainty Estimation

Link: https://arxiv.org/abs/2504.13465
Authors: Duy A. Nguyen, Quan Huu Do, Khoa D. Doan, Minh N. Do
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multimodal learning has demonstrated incredible successes by integrating diverse data sources, yet it often relies on the availability of all modalities - an assumption that rarely holds in real-world applications. Pretrained multimodal models, while effective, struggle when confronted with small-scale and incomplete datasets (i.e., missing modalities), limiting their practical applicability. Previous studies on reconstructing missing modalities have overlooked the reconstruction’s potential unreliability, which could compromise the quality of the final outputs. We present SURE (Scalable Uncertainty and Reconstruction Estimation), a novel framework that extends the capabilities of pretrained multimodal models by introducing latent space reconstruction and uncertainty estimation for both reconstructed modalities and downstream tasks. Our method is architecture-agnostic, reconstructs missing modalities, and delivers reliable uncertainty estimates, improving both interpretability and performance. SURE introduces a unique Pearson Correlation-based loss and applies statistical error propagation in deep networks for the first time, allowing precise quantification of uncertainties from missing data and model predictions. Extensive experiments across tasks such as sentiment analysis, genre classification, and action recognition show that SURE consistently achieves state-of-the-art performance, ensuring robust predictions even in the presence of incomplete data.
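The Pearson-correlation-based loss is easy to state; a minimal PyTorch sketch follows (the exact SURE formulation may differ in detail):

```python
import torch

def pearson_loss(x_hat, x_ref, eps=1e-8):
    """1 - Pearson correlation between a reconstructed and a reference feature
    vector, averaged over the batch; perfectly correlated reconstructions give 0."""
    xc = x_hat - x_hat.mean(dim=-1, keepdim=True)
    yc = x_ref - x_ref.mean(dim=-1, keepdim=True)
    corr = (xc * yc).sum(-1) / (xc.norm(dim=-1) * yc.norm(dim=-1) + eps)
    return (1.0 - corr).mean()

print(pearson_loss(torch.randn(4, 16), torch.randn(4, 16)))  # near 1 for random pairs
```

Unlike a plain MSE, this loss is invariant to the scale and offset of the reconstruction, which is one plausible reason to prefer it when reconstructing latent features of a missing modality.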

[LG-26] Stratify: Rethinking Federated Learning for Non-IID Data through Balanced Sampling

Link: https://arxiv.org/abs/2504.13462
Authors: Hui Yeok Wong, Chee Kau Lim, Chee Seng Chan
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) on non-independently and identically distributed (non-IID) data remains a critical challenge, as existing approaches struggle with severe data heterogeneity. Current methods primarily address symptoms of non-IID by applying incremental adjustments to Federated Averaging (FedAvg), rather than directly resolving its inherent design limitations. Consequently, performance significantly deteriorates under highly heterogeneous conditions, as the fundamental issue of imbalanced exposure to diverse class and feature distributions remains unresolved. This paper introduces Stratify, a novel FL framework designed to systematically manage class and feature distributions throughout training, effectively tackling the root cause of non-IID challenges. Inspired by classical stratified sampling, our approach employs a Stratified Label Schedule (SLS) to ensure balanced exposure across labels, significantly reducing bias and variance in aggregated gradients. Complementing SLS, we propose a label-aware client selection strategy, restricting participation exclusively to clients possessing data relevant to scheduled labels. Additionally, Stratify incorporates a fine-grained, high-frequency update scheme, accelerating convergence and further mitigating data heterogeneity. To uphold privacy, we implement a secure client selection protocol leveraging homomorphic encryption, enabling precise global label statistics without disclosing sensitive client information. Extensive evaluations on MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet, COVTYPE, PACS, and Digits-DG demonstrate that Stratify attains performance comparable to IID baselines, accelerates convergence, and reduces client-side computation compared to state-of-the-art methods, underscoring its practical effectiveness in realistic federated learning scenarios.

[LG-27] Using Machine Learning and Neural Networks to Analyze and Predict Chaos in Multi-Pendulum and Chaotic Systems

Link: https://arxiv.org/abs/2504.13453
Authors: Vasista Ramachandruni, Sai Hruday Reddy Nara, Geo Lalu, Sabrina Yang, Mohit Ramesh Kumar, Aarjav Jain, Pratham Mehta, Hankyu Koo, Jason Damonte, Marx Akl
Subjects: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments: 35 pages, approximately 20 figures

Click to view abstract

Abstract:A chaotic system is a highly volatile system characterized by its sensitive dependence on initial conditions and outside factors. Chaotic systems are prevalent throughout the world today: in weather patterns, disease outbreaks, and even financial markets. Chaotic systems are seen in every field of science and humanities, so being able to predict these systems is greatly beneficial to society. In this study, we evaluate 10 different machine learning models and neural networks [1] based on Root Mean Squared Error (RMSE) and R^2 values for their ability to predict one of these systems, the multi-pendulum. We begin by generating synthetic data representing the angles of the pendulum over time using the fourth-order Runge-Kutta method for solving ordinary differential equations (ODE-RK4) [2]. At first, we used the single-step sliding window approach, predicting the 50th step after training for steps 0-49 and so forth. However, to more accurately cover chaotic motion and behavior in these systems, we transitioned to a time-step based approach. Here, we trained the model/network on many initial angles and tested it on a completely new set of initial angles, or ‘in-between’ to capture chaotic motion to its fullest extent. We also evaluated the stability of the system using Lyapunov exponents. We concluded that for a double pendulum, the best model was the Long Short-Term Memory network (LSTM) [3] for the sliding window and time step approaches in both friction and frictionless scenarios. For the triple pendulum, the Vanilla Recurrent Neural Network (VRNN) [4] was the best for the sliding window approach and the Gated Recurrent Unit (GRU) [5] was the best for the time step approach, but with friction, LSTM was the best.
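For reference, here is the classical RK4 step used to generate the synthetic angle data, shown on a single damped pendulum for brevity; the paper's systems are double and triple pendulums, whose equations of motion are considerably longer.

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def pendulum(t, y, g=9.81, L=1.0, damping=0.1):
    # y = [theta, omega]; the damping term plays the role of friction
    theta, omega = y
    return np.array([omega, -(g / L) * np.sin(theta) - damping * omega])

# generate a trajectory of angles, the kind of sequence the models are trained on
y, h, angles = np.array([1.0, 0.0]), 0.01, []
for step in range(1000):
    y = rk4_step(pendulum, step * h, y, h)
    angles.append(y[0])
```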

[LG-28] Simplifying Graph Convolutional Networks with Redundancy-Free Neighbors

Link: https://arxiv.org/abs/2504.13426
Authors: Jielong Lu, Zhihao Wu, Zhiling Cai, Yueyang Pi, Shiping Wang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In recent years, Graph Convolutional Networks (GCNs) have gained popularity for their exceptional ability to process graph-structured data. Existing GCN-based approaches typically employ a shallow model architecture due to the over-smoothing phenomenon. Current approaches to mitigating over-smoothing primarily involve adding supplementary components to GCN architectures, such as residual connections and random edge-dropping strategies. However, these improvements toward deep GCNs have achieved only limited success. In this work, we analyze the intrinsic message passing mechanism of GCNs and identify a critical issue: messages originating from high-order neighbors must traverse through low-order neighbors to reach the target node. This repeated reliance on low-order neighbors leads to redundant information aggregation, a phenomenon we term over-aggregation. Our analysis demonstrates that over-aggregation not only introduces significant redundancy but also serves as the fundamental cause of over-smoothing in GCNs.

[LG-29] Equilibrium Conserving Neural Operators for Super-Resolution Learning

Link: https://arxiv.org/abs/2504.13422
Authors: Vivek Oommen, Andreas E. Robertson, Daniel Diaz, Coleman Alleman, Zhen Zhang, Anthony D. Rollett, George E. Karniadakis, Rémi Dingreville
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Neural surrogate solvers can estimate solutions to partial differential equations in physical problems more efficiently than standard numerical methods, but require extensive high-resolution training data. In this paper, we break this limitation; we introduce a framework for super-resolution learning in solid mechanics problems. Our approach allows one to train a high-resolution neural network using only low-resolution data. Our Equilibrium Conserving Operator (ECO) architecture embeds known physics directly into the network to make up for missing high-resolution information during training. We evaluate this ECO-based super-resolution framework that strongly enforces conservation laws in the predicted solutions on two working examples: embedded pores in a homogenized matrix and randomly textured polycrystalline materials. ECO eliminates the reliance on high-fidelity data and reduces the upfront cost of data collection by two orders of magnitude, offering a robust pathway for resource-efficient surrogate modeling in materials modeling. ECO is readily generalizable to other physics-based problems.

[LG-30] A Model-Based Approach to Imitation Learning through Multi-Step Predictions

Link: https://arxiv.org/abs/2504.13413
Authors: Haldun Balim, Yang Hu, Yuyang Zhang, Na Li
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Imitation learning is a widely used approach for training agents to replicate expert behavior in complex decision-making tasks. However, existing methods often struggle with compounding errors and limited generalization, due to the inherent challenge of error correction and the distribution shift between training and deployment. In this paper, we present a novel model-based imitation learning framework inspired by model predictive control, which addresses these limitations by integrating predictive modeling through multi-step state predictions. Our method outperforms traditional behavior cloning on numerical benchmarks, demonstrating superior robustness to distribution shift and measurement noise both in available data and during execution. Furthermore, we provide theoretical guarantees on the sample complexity and error bounds of our method, offering insights into its convergence properties.

[LG-31] OpCode-Based Malware Classification Using Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2504.13408
Authors: Varij Saini, Rudraksh Gupta, Neel Soni
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 11 pages

Click to view abstract

Abstract:This technical report presents a comprehensive analysis of malware classification using OpCode sequences. Two distinct approaches are evaluated: traditional machine learning using n-gram analysis with Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers; and a deep learning approach employing a Convolutional Neural Network (CNN). The traditional machine learning approach establishes a baseline using handcrafted 1-gram and 2-gram features from disassembled malware samples. The deep learning methodology builds upon the work proposed in “Deep Android Malware Detection” by McLaughlin et al. and evaluates the performance of a CNN model trained to automatically extract features from raw OpCode data. Empirical results are compared using standard performance metrics (accuracy, precision, recall, and F1-score). While the SVM classifier outperforms other traditional techniques, the CNN model demonstrates competitive performance with the added benefit of automated feature extraction.

[LG-32] Denoising and Reconstruction of Nonlinear Dynamics using Truncated Reservoir Computing

Link: https://arxiv.org/abs/2504.13355
Authors: Omid Sedehi, Manish Yadav, Merten Stender, Sebastian Oberst
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Chaotic Dynamics (nlin.CD)
Comments:

Click to view abstract

Abstract:Measurements acquired from distributed physical systems are often sparse and noisy. Therefore, signal processing and system identification tools are required to mitigate noise effects and reconstruct unobserved dynamics from limited sensor data. However, this process is particularly challenging because the fundamental equations governing the dynamics are largely unavailable in practice. Reservoir Computing (RC) techniques have shown promise in efficiently simulating dynamical systems through an unstructured and efficient computation graph comprising a set of neurons with random connectivity. However, the potential of RC to operate in noisy regimes and distinguish noise from the primary dynamics of the system has not been fully explored. This paper presents a novel RC method for noise filtering and reconstructing nonlinear dynamics, offering a novel learning protocol associated with hyperparameter optimization. The performance of the RC in terms of noise intensity, noise frequency content, and drastic shifts in dynamical parameters are studied in two illustrative examples involving the nonlinear dynamics of the Lorenz attractor and adaptive exponential integrate-and-fire system (AdEx). It is shown that the denoising performance improves via truncating redundant nodes and edges of the computing reservoir, as well as properly optimizing the hyperparameters, e.g., the leakage rate, the spectral radius, the input connectivity, and the ridge regression parameter. Furthermore, the presented framework shows good generalization behavior when tested for reconstructing unseen attractors from the bifurcation diagram. Compared to the Extended Kalman Filter (EKF), the presented RC framework yields competitive accuracy at low signal-to-noise ratios (SNRs) and high-frequency ranges.
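A minimal echo-state-network sketch of the reservoir update and ridge readout described above; all hyperparameters are illustrative, and the paper additionally truncates redundant nodes and edges and optimizes the hyperparameters.

```python
import numpy as np

def esn_denoise(u, target, n_res=200, rho=0.9, leak=0.3, ridge=1e-6, seed=0):
    """Fit a leaky echo state network mapping a noisy signal u to a clean
    target via a ridge-regression readout; returns the filtered signal."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, n_res)
    W = rng.standard_normal((n_res, n_res))
    W *= rho / np.abs(np.linalg.eigvals(W)).max()   # rescale spectral radius
    x, states = np.zeros(n_res), []
    for ut in u:
        # leaky-tanh reservoir update driven by the noisy input
        x = (1 - leak) * x + leak * np.tanh(W @ x + W_in * ut)
        states.append(x.copy())
    S = np.array(states)
    # ridge readout maps reservoir states to the clean signal
    W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ target)
    return S @ W_out

t = np.linspace(0, 20, 2000)
clean = np.sin(t)
noisy = clean + 0.3 * np.random.default_rng(1).standard_normal(t.size)
filtered = esn_denoise(noisy, clean)
```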

[LG-33] Training Autoencoders Using Stochastic Hessian-Free Optimization with LSMR

Link: https://arxiv.org/abs/2504.13302
Authors: Ibrahim Emirahmetoglu, David E. Stewart
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view the abstract

Abstract:Hessian-free (HF) optimization has been shown to effectively train deep autoencoders (Martens, 2010). In this paper, we aim to accelerate HF training of autoencoders by reducing the amount of data used in training. HF utilizes the conjugate gradient algorithm to estimate update directions. Instead, we propose using the LSMR method, which is known for effectively solving large sparse linear systems. We also incorporate Chapelle & Erhan (2011)'s improved preconditioner for HF optimization. In addition, we introduce a new mini-batch selection algorithm to mitigate overfitting. Our algorithm starts with a small subset of the training data and gradually increases the mini-batch size based on (i) variance estimates obtained during the computation of a mini-batch gradient (Byrd et al., 2012) and (ii) the relative decrease in objective value for the validation data. Our experimental results demonstrate that our stochastic Hessian-free optimization, using the LSMR method and the new sample selection algorithm, leads to rapid training of deep autoencoders with improved generalization error.
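
For intuition, the inner linear solve that the paper replaces with LSMR can be sketched as a damped least-squares problem: find a step p minimizing ||Jp - r||, where J plays the role of the network's Jacobian. The sketch below runs SciPy's LSMR on random stand-in matrices; the real method wraps this inside Hessian-free training with preconditioning and growing mini-batches.

```python
# LSMR inner solve for a Gauss-Newton-style step (stand-in matrices).
import numpy as np
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
J = rng.standard_normal((500, 50))   # stand-in Jacobian (outputs x params)
r = rng.standard_normal(500)         # stand-in residual vector

# `damp` acts like a Tikhonov / Levenberg-Marquardt regularizer.
step = lsmr(J, r, damp=1e-2)[0]
print(step.shape)                    # (50,) update direction
```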

[LG-34] Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model ICLR2025

Link: https://arxiv.org/abs/2504.13292
Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: ICLR 2025

Click to view the abstract

Abstract:“Grokking” is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.
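
A hedged sketch of the transfer step in PyTorch: copy the input embedding learned by a small model into a larger target model before training it. The model sizes and the random projection used to bridge differing widths are illustrative assumptions, not the paper's exact recipe.

```python
# GrokTransfer-style embedding transfer (dimensions and projection assumed).
import torch
import torch.nn as nn

vocab, d_weak, d_strong = 100, 16, 64

weak_emb = nn.Embedding(vocab, d_weak)
# ... train the weak model (containing weak_emb) to nontrivial accuracy ...

strong_emb = nn.Embedding(vocab, d_strong)
with torch.no_grad():
    # One simple way to map across widths: a fixed random projection.
    proj = torch.randn(d_weak, d_strong) / d_weak ** 0.5
    strong_emb.weight.copy_(weak_emb.weight @ proj)
# strong_emb now initializes the target model's embedding layer.
```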

[LG-35] Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs

Link: https://arxiv.org/abs/2504.13266
Authors: Zichao Yue, Chenhui Deng, Zhiru Zhang
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15× over the PP-GNN baselines, with speedups of up to two orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at this https URL.
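
The pre-propagation idea itself is easy to state: multiply node features by powers of the normalized adjacency once, offline, and then train an ordinary MLP on the concatenated results, so no message passing happens during training. A toy numpy sketch (graph, sizes, and hop count are assumptions):

```python
# Offline feature pre-propagation for a PP-GNN-style model (toy graph).
import numpy as np

rng = np.random.default_rng(0)
n, d, hops = 6, 4, 2
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T) + np.eye(n)           # undirected graph + self-loops
A = np.minimum(A, 1.0)                       # keep entries binary
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt          # symmetric normalization

X = rng.standard_normal((n, d))
feats, cur = [X], X
for _ in range(hops):
    cur = A_hat @ cur                        # one propagation step, offline
    feats.append(cur)
X_pp = np.concatenate(feats, axis=1)         # (n, d * (hops + 1))
print(X_pp.shape)  # feed into any MLP; no graph access needed at train time
```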

[LG-36] NNTile: a machine learning framework capable of training extremely large GPT language models on a single node

Link: https://arxiv.org/abs/2504.13236
Authors: Aleksandr Mikhalev, Aleksandr Katrutsa, Konstantin Sozykin, Ivan Oseledets
Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*Comments:

Click to view the abstract

Abstract:This study presents NNTile, a framework for training large deep neural networks on heterogeneous clusters. NNTile is based on the StarPU library, which implements task-based parallelism and schedules all provided tasks onto all available processing units (CPUs and GPUs). This means that a particular operation, necessary to train a large neural network, can be performed on any of the CPU cores or GPU devices, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human being to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.

[LG-37] Auto-FEDUS: Autoregressive Generative Modeling of Doppler Ultrasound Signals from Fetal Electrocardiograms AAAI2025

Link: https://arxiv.org/abs/2504.13233
Authors: Alireza Rafiei, Gari D. Clifford, Nasim Katebi
Subjects: Machine Learning (cs.LG)
*Comments: AAAI 2025 Workshop on Large Language Models and Generative AI for Health

Click to view the abstract

Abstract:Fetal health monitoring through one-dimensional Doppler ultrasound (DUS) signals offers a cost-effective and accessible approach that is increasingly gaining interest. Despite its potential, the development of machine learning based techniques to assess the health condition of mothers and fetuses using DUS signals remains limited. This scarcity is primarily due to the lack of extensive DUS datasets with a reliable reference for interpretation and data imbalance across different gestational ages. In response, we introduce a novel autoregressive generative model designed to map fetal electrocardiogram (FECG) signals to corresponding DUS waveforms (Auto-FEDUS). By leveraging a neural temporal network based on dilated causal convolutions that operate directly on the waveform level, the model effectively captures both short and long-range dependencies within the signals, preserving the integrity of generated data. Cross-subject experiments demonstrate that Auto-FEDUS outperforms conventional generative architectures across both time and frequency domain evaluations, producing DUS signals that closely resemble the morphology of their real counterparts. The realism of these synthesized signals was further gauged using a quality assessment model, which classified all of them as good quality, and a heart rate estimation model, which produced comparable results for generated and real data, with a Bland-Altman limit of 4.5 beats per minute. This advancement offers a promising solution for mitigating limited data availability and enhancing the training of DUS-based fetal models, making them more effective and generalizable.
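
The dilated-causal-convolution backbone that the abstract mentions can be sketched briefly in PyTorch. Channel counts, kernel size, and the dilation schedule below are illustrative, not Auto-FEDUS's actual configuration.

```python
# Dilated causal Conv1d stack (illustrative sizes, not the paper's).
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Conv1d that only sees past samples, via left padding."""
    def __init__(self, ch_in, ch_out, kernel, dilation):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

layers = []
for i, dil in enumerate([1, 2, 4, 8]):   # receptive field doubles per layer
    layers += [CausalConv1d(1 if i == 0 else 32, 32, 3, dil), nn.ReLU()]
net = nn.Sequential(*layers, nn.Conv1d(32, 1, 1))

fecg = torch.randn(8, 1, 1024)           # batch of toy FECG waveforms
print(net(fecg).shape)                   # torch.Size([8, 1, 1024])
```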

[LG-38] PSG-MAE: Robust Multitask Sleep Event Monitoring using Multichannel PSG Reconstruction and Inter-channel Contrastive Learning

Link: https://arxiv.org/abs/2504.13229
Authors: Yifei Wang, Qi Liu, Fuli Min, Honghao Wang
Subjects: Machine Learning (cs.LG)
*Comments: 11 pages, 5 figures

Click to view the abstract

Abstract:Polysomnography (PSG) signals are essential for studying sleep processes and diagnosing sleep disorders. Analyzing PSG data through deep neural networks (DNNs) for automated sleep monitoring has become increasingly feasible. However, the limited availability of datasets for certain sleep events often leads to DNNs focusing on a single task with a single-sourced training dataset. As a result, these models struggle to transfer to new sleep events and lack robustness when applied to new datasets. To address these challenges, we propose PSG-MAE, a masked autoencoder (MAE) based pre-training framework. By performing self-supervised learning on a large volume of unlabeled PSG data, PSG-MAE develops a robust feature extraction network that can be broadly applied to various sleep event monitoring tasks. Unlike conventional MAEs, PSG-MAE generates complementary masks across PSG channels, integrates a multichannel signal reconstruction method, and employs a self-supervised inter-channel contrastive learning (ICCL) strategy. This approach enables the encoder to capture temporal features from each channel while simultaneously learning latent relationships between channels, thereby enhancing the utilization of multichannel information. Experimental results show that PSG-MAE effectively captures both temporal details and inter-channel information from PSG signals. When the encoder pre-trained through PSG-MAE is fine-tuned with downstream feature decomposition networks, it achieves an accuracy of 83.7% for sleep staging and 90.45% for detecting obstructive sleep apnea, which highlights the framework's robustness and broad applicability.
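
The complementary-masking idea can be illustrated in a few lines of PyTorch: two views whose boolean masks are exact complements, so together they cover the entire multichannel recording. Channel count, epoch length, and mask ratio are toy assumptions.

```python
# Complementary channel masking for two MAE views (toy PSG epoch).
import torch

C, T = 6, 3000                        # channels x samples
signal = torch.randn(C, T)
mask = torch.rand(C, T) < 0.5         # random boolean mask
view_a = signal * mask                # one masked view
view_b = signal * ~mask               # the complementary view
assert torch.allclose(view_a + view_b, signal)
```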

[LG-39] Modelling Mean-Field Games with Neural Ordinary Differential Equations

Link: https://arxiv.org/abs/2504.13228
Authors: Anna C.M. Thöni, Yoram Bachrach, Tal Kachman
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:

Click to view the abstract

Abstract:Mean-field game theory relies on approximating games that would otherwise have been intractable to model. While the games can be solved analytically via the associated system of partial differential equations, this approach is not model-free, can lead to the loss of the existence or uniqueness of solutions and may suffer from modelling bias. To reduce the dependency between the model and the game, we combine mean-field game theory with deep learning in the form of neural ordinary differential equations. The resulting model is data-driven, lightweight and can learn extensive strategic interactions that are hard to capture using mean-field theory alone. In addition, the model is based on automatic differentiation, making it more robust and objective than approaches based on finite differences. We highlight the efficiency and flexibility of our approach by solving three mean-field games that vary in their complexity, observability and the presence of noise. Using these results, we show that the model is flexible, lightweight and requires few observations to learn the distribution underlying the data.
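
A minimal neural-ODE sketch in PyTorch, using fixed-step Euler integration so no external solver library is needed: a small network parameterizes dx/dt and is trained by backpropagating through the unrolled steps. The mean-field-game structure from the paper is omitted; the network width and step size are assumptions.

```python
# Neural ODE via differentiable Euler unrolling (solver details assumed).
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))

def odeint_euler(x0, steps=50, dt=0.1):
    xs, x = [x0], x0
    for _ in range(steps):
        x = x + dt * f(x)        # gradient flows through every step
        xs.append(x)
    return torch.stack(xs)

traj = odeint_euler(torch.randn(16, 2))
print(traj.shape)                # (51, 16, 2); a loss on traj trains f
```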

[LG-40] Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI

Link: https://arxiv.org/abs/2504.13201
Authors: Jirui Yang, Zheyu Lin, Shuhan Yang, Zhihui Lu, Xin Du
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments:

Click to view the abstract

Abstract:Embodied Intelligence (EI) systems integrated with large language models (LLMs) face significant security risks, particularly from jailbreak attacks that manipulate models into generating harmful outputs or executing unsafe physical actions. Traditional defense strategies, such as input filtering and output monitoring, often introduce high computational overhead or interfere with task performance in real-time embodied scenarios. To address these challenges, we propose Concept Enhancement Engineering (CEE), a novel defense framework that leverages representation engineering to enhance the safety of embodied LLMs by dynamically steering their internal activations. CEE operates by (1) extracting multilingual safety patterns from model activations, (2) constructing control directions based on safety-aligned concept subspaces, and (3) applying subspace concept rotation to reinforce safe behavior during inference. Our experiments demonstrate that CEE effectively mitigates jailbreak attacks while maintaining task performance, outperforming existing defense methods in both robustness and efficiency. This work contributes a scalable and interpretable safety mechanism for embodied AI, bridging the gap between theoretical representation engineering and practical security applications. Our findings highlight the potential of latent-space interventions as a viable defense paradigm against emerging adversarial threats in physically grounded AI systems.
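
To ground the representation-engineering vocabulary, here is a hedged sketch of inference-time activation steering with a PyTorch forward hook: hidden states are nudged along a precomputed "safe" direction. The direction, the layer choice, and the strength are placeholders; CEE's actual subspace-rotation procedure is more involved than a single additive shift.

```python
# Additive activation steering via a forward hook (toy stand-in layer).
import torch
import torch.nn as nn

hidden = 64
layer = nn.Linear(hidden, hidden)      # stand-in for one transformer block
safe_dir = torch.randn(hidden)
safe_dir = safe_dir / safe_dir.norm()  # unit-norm concept direction (assumed)
alpha = 2.0                            # steering strength (assumed)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module output.
    return output + alpha * safe_dir

layer.register_forward_hook(steer)
x = torch.randn(4, hidden)
print(layer(x).shape)                  # steered activations, shape (4, 64)
```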

[LG-41] On the Convergence of Irregular Sampling in Reproducing Kernel Hilbert Spaces

Link: https://arxiv.org/abs/2504.13623
Authors: Armin Iske
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

Click to view the abstract

Abstract:We analyse the convergence of sampling algorithms for functions in reproducing kernel Hilbert spaces (RKHS). To this end, we discuss approximation properties of kernel regression under minimalistic assumptions on both the kernel and the input data. We first prove error estimates in the kernel’s RKHS norm. This leads us to new results concerning uniform convergence of kernel regression on compact domains. For Lipschitz continuous and Hölder continuous kernels, we prove convergence rates.
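
As a concrete instance of the estimators analysed here, a few lines of numpy implement kernel regression with a Gaussian kernel (solved in ridge form for numerical stability); the data, bandwidth, and regularization are toy assumptions.

```python
# Kernel regression with a Gaussian kernel (numpy sketch, toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(40)

def k(a, b, ell=0.3):
    # Gaussian kernel matrix between point sets a and b.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

alpha = np.linalg.solve(k(x, x) + 1e-6 * np.eye(len(x)), y)
x_test = np.linspace(-1, 1, 5)
print(k(x_test, x) @ alpha)   # kernel regression predictions
```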

[LG-42] Quantum repeaters enhanced by vacuum beam guides

Link: https://arxiv.org/abs/2504.13397
Authors: Yu Gan, Mohadeseh Azar, Nitish Kumar Chandra, Xin Jin, Jinglei Cheng, Kaushik P. Seshadreesan, Junyu Liu
Subjects: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: 10 pages

Click to view the abstract

Abstract:The development of large-scale quantum communication networks faces critical challenges due to photon loss and decoherence in optical fiber channels. These fundamentally limit transmission distances and demand dense networks of repeater stations. This work investigates using vacuum beam guides (VBGs), a promising ultra-low-loss transmission platform, as an alternative to traditional fiber links. By incorporating VBGs into repeater-based architectures, we demonstrate that the inter-repeater spacing can be substantially extended, resulting in fewer required nodes and significantly reducing hardware and operational complexity. We perform a cost-function analysis to quantify performance trade-offs across first-, second-, and third-generation repeaters. Our results show that first-generation repeaters reduce costs dramatically by eliminating entanglement purification. Third-generation repeaters benefit from improved link transmission success, which is crucial for quantum error correction. In contrast, second-generation repeaters exhibit a more nuanced response; although transmission loss is reduced, their performance remains primarily limited by logical gate errors rather than channel loss. These findings highlight that while all repeater generations benefit from reduced photon loss, the magnitude of improvement depends critically on the underlying error mechanisms. Vacuum beam guides thus emerge as a powerful enabler for scalable, high-performance quantum networks, particularly in conjunction with near-term quantum hardware capabilities.

[LG-43] On the minimax optimality of Flow Matching through the connection to kernel density estimation

Link: https://arxiv.org/abs/2504.13336
Authors: Lea Kunkel, Mathias Trabs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Flow Matching has recently gained attention in generative modeling as a simple and flexible alternative to diffusion models, the current state of the art. While existing statistical guarantees adapt tools from the analysis of diffusion models, we take a different perspective by connecting Flow Matching to kernel density estimation. We first verify that the kernel density estimator matches the optimal rate of convergence in Wasserstein distance up to logarithmic factors, improving existing bounds for the Gaussian kernel. Based on this result, we prove that for sufficiently large networks, Flow Matching also achieves the optimal rate up to logarithmic factors, providing a theoretical foundation for the empirical success of this method. Finally, we provide a first justification of Flow Matching’s effectiveness in high-dimensional settings by showing that rates improve when the target distribution lies on a lower-dimensional linear subspace.
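
For context, the kernel density estimator that the analysis connects Flow Matching to has the standard form (generic notation, not the paper's):

```latex
% Kernel density estimator with bandwidth h and kernel K:
\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right),
\qquad X_1, \dots, X_n \sim p \ \text{i.i.d.}
```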

[LG-44] Predicting Forced Responses of Probability Distributions via the Fluctuation-Dissipation Theorem and Generative Modeling

Link: https://arxiv.org/abs/2504.13333
Authors: Ludovico T. Giorgini, Fabrizio Falasca, Andre N. Souza
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*Comments:

Click to view the abstract

Abstract:We present a novel data-driven framework for estimating the response of higher-order moments of nonlinear stochastic systems to small external perturbations. The classical Generalized Fluctuation-Dissipation Theorem (GFDT) links the unperturbed steady-state distribution to the system's linear response. Standard implementations rely on Gaussian approximations, which can often accurately predict the mean response but usually introduce significant biases in higher-order moments, such as variance, skewness, and kurtosis. To address this limitation, we combine GFDT with recent advances in score-based generative modeling, which enable direct estimation of the score function from data without requiring full density reconstruction. Our method is validated on three reduced-order stochastic models relevant to climate dynamics: a scalar stochastic model for low-frequency climate variability, a slow-fast triad model mimicking key features of the El Niño-Southern Oscillation (ENSO), and a six-dimensional stochastic barotropic model capturing atmospheric regime transitions. In all cases, the approach captures strongly nonlinear and non-Gaussian features of the system's response, outperforming traditional Gaussian approximations.
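
For reference, the linear-response relation behind the GFDT can be written in one standard score-based form (generic notation, not necessarily the paper's):

```latex
% Response of an observable A to a small perturbation \delta f of a
% system with steady-state density \rho; the score \nabla_x \log \rho
% is exactly what score-based generative models estimate from data.
\delta \langle A \rangle (t) = \int_0^t R_A(t - s)\, \delta f(s)\, ds,
\qquad
R_A(t) = -\,\mathbb{E}\left[ A(x_t)\, \nabla_x \log \rho(x_0) \right].
```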

[LG-45] Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems

Link: https://arxiv.org/abs/2504.13320
Authors: Robert Gruhlke, Matei Hanu, Claudia Schillings, Philipp Wacker
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*Comments:

Click to view the abstract

Abstract:We introduce a gradient-free framework for Bayesian Optimal Experimental Design (BOED) in sequential settings, aimed at complex systems where gradient information is unavailable. Our method combines Ensemble Kalman Inversion (EKI) for design optimization with the Affine-Invariant Langevin Dynamics (ALDI) sampler for efficient posterior sampling, both of which are derivative-free and ensemble-based. To address the computational challenges posed by nested expectations in BOED, we propose variational Gaussian and parametrized Laplace approximations that provide tractable upper and lower bounds on the Expected Information Gain (EIG). These approximations enable scalable utility estimation in high-dimensional spaces and PDE-constrained inverse problems. We demonstrate the performance of our framework through numerical experiments ranging from linear Gaussian models to PDE-based inference tasks, highlighting the method's robustness, accuracy, and efficiency in information-driven experimental design.
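
A single Ensemble Kalman Inversion step is compact enough to sketch in numpy; it updates an ensemble of parameter candidates using only forward-model evaluations, which is what makes the framework gradient-free. The forward map, ensemble size, and noise covariance below are toy assumptions.

```python
# One derivative-free EKI update (toy forward model and data).
import numpy as np

rng = np.random.default_rng(0)
J, d = 50, 2                            # ensemble size, parameter dimension
theta = rng.standard_normal((J, d))     # parameter ensemble
y = np.array([1.0, -0.5, 2.0])          # observed data (toy)
Gamma = 0.01 * np.eye(3)                # observation noise covariance

def G(th):                              # toy nonlinear forward map
    return np.array([th[0] ** 2, th[1], th[0] * th[1]])

g = np.array([G(t) for t in theta])
th_c, g_c = theta - theta.mean(0), g - g.mean(0)
C_tg = th_c.T @ g_c / J                 # parameter-output cross-covariance
C_gg = g_c.T @ g_c / J                  # output covariance

K = C_tg @ np.linalg.inv(C_gg + Gamma)  # Kalman-style gain, no gradients of G
theta = theta + (y - g) @ K.T
print(theta.mean(0))                    # updated ensemble mean
```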

[LG-46] A Quantum of Learning: Using Quaternion Algebra to Model Learning on Quantum Devices

Link: https://arxiv.org/abs/2504.13232
Authors: Sayed Pouria Talebi, Clive Cheong Took, Danilo P. Mandic
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Quantum Algebra (math.QA); Machine Learning (stat.ML)
*Comments:

Click to view the abstract

Abstract:This article considers the problem of designing adaptation and optimisation techniques for training quantum learning machines. To this end, the division algebra of quaternions is used to derive an effective model for representing computation and measurement operations on qubits. In turn, the derived model serves as the foundation for formulating an adaptive learning problem on principal quantum learning units, thereby establishing quantum information processing units akin to neurons in classical approaches. Then, leveraging the modern HR-calculus, a comprehensive training framework for learning on quantum machines is developed. The quaternion-valued model accommodates mathematical tractability and the establishment of performance criteria, such as convergence conditions.
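
As background for the quaternion-valued model, the sketch below implements the Hamilton product, the non-commutative multiplication of the quaternion division algebra; this is generic background, not the paper's learning machinery.

```python
# Hamilton product of quaternions q = (w, x, y, z) in numpy; the basic
# operation behind quaternion-valued models of qubit computation.
import numpy as np

def hamilton(q1, q2):
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

i, j = np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0])
print(hamilton(i, j))  # [0 0 0 1] = k, the defining relation ij = k
```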

Information Retrieval

[IR-0] Consensus-aware Contrastive Learning for Group Recommendation

Link: https://arxiv.org/abs/2504.13703
Authors: Soyoung Kim, Dongjun Lee, Jaekwang Kim
Subjects: Information Retrieval (cs.IR)
*Comments: 10 pages, 5 figures

Click to view the abstract

Abstract:Group recommendation aims to provide personalized item suggestions to a group of users by reflecting their collective preferences. A fundamental challenge in this task is deriving a consensus that adequately represents the diverse interests of individual group members. Despite advancements made by deep learning-based models, existing approaches still struggle in two main areas: (1) Capturing consensus in small-group settings, which are more prevalent in real-world applications, and (2) Balancing individual preferences with overall group performance, particularly in hypergraph-based methods that tend to emphasize group accuracy at the expense of personalization. To address these challenges, we introduce a Consensus-aware Contrastive Learning for Group Recommendation (CoCoRec) that models group consensus through contrastive learning. CoCoRec utilizes a transformer encoder to jointly learn user and group representations, enabling richer modeling of intra-group dynamics. Additionally, the contrastive objective helps reduce overfitting from high-frequency user interactions, leading to more robust and representative group embeddings. Experiments conducted on four benchmark datasets show that CoCoRec consistently outperforms state-of-the-art baselines in both individual and group recommendation scenarios, highlighting the effectiveness of consensus-aware contrastive learning in group recommendation tasks.
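
The contrastive component can be pictured as a standard InfoNCE objective between a group's embedding and an aggregate of its members' embeddings; the sketch below is a generic recipe in that spirit, not CoCoRec's exact loss, and the batch size, dimensionality, and temperature are assumptions.

```python
# InfoNCE-style alignment of group and member-aggregate embeddings.
import torch
import torch.nn.functional as F

B, d, tau = 32, 64, 0.1
group_emb = F.normalize(torch.randn(B, d), dim=-1)
member_emb = F.normalize(torch.randn(B, d), dim=-1)  # aggregated members

logits = group_emb @ member_emb.T / tau  # (B, B) similarity matrix
labels = torch.arange(B)                 # matching pairs on the diagonal
loss = F.cross_entropy(logits, labels)
print(loss.item())
```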

[IR-1] Contextualizing Spotify's Audiobook List Recommendations with Descriptive Shelves ECIR'25

Link: https://arxiv.org/abs/2504.13572
Authors: Gustavo Penha, Alice Wang, Martin Achenbach, Kristen Sheets, Sahitya Mantravadi, Remi Galvez, Nico Guetta-Jeanrenaud, Divya Narayanan, Ofeliya Kalaydzhyan, Hugues Bouchard
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted for publication in the 47th European Conference on Information Retrieval (ECIR'25)

Click to view the abstract

Abstract:In this paper, we propose a pipeline to generate contextualized list recommendations with descriptive shelves in the domain of audiobooks. By creating several shelves for topics the user has an affinity to, e.g. Uplifting Women’s Fiction, we can help them explore their recommendations according to their interests and at the same time recommend a diverse set of items. To do so, we use Large Language Models (LLMs) to enrich each item’s metadata based on a taxonomy created for this domain. Then we create diverse descriptive shelves for each user. A/B tests show improvements in user engagement and audiobook discovery metrics, demonstrating benefits for users and content creators.

[IR-2] Improving Sequential Recommenders through Counterfactual Augmentation of System Exposure SIGIR2025

Link: https://arxiv.org/abs/2504.13482
Authors: Ziqi Zhao, Zhaochun Ren, Jiyuan Yang, Zuming Yan, Zihan Wang, Liu Yang, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Xin Xin
Subjects: Information Retrieval (cs.IR)
*Comments: accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)

Click to view the abstract

Abstract:In sequential recommendation (SR), system exposure refers to items that are exposed to the user. Typically, only a few of the exposed items would be interacted with by the user. Although SR has achieved great success in predicting future user interests, existing SR methods still fail to fully exploit system exposure data. Most methods only model items that have been interacted with, while the large volume of exposed but non-interacted items is overlooked. Even methods that consider the whole system exposure typically train the recommender using only the logged historical system exposure, without exploring unseen user interests. In this paper, we propose counterfactual augmentation over system exposure for sequential recommendation (CaseRec). To better model historical system exposure, CaseRec introduces reinforcement learning to account for different exposure rewards. CaseRec uses a decision transformer-based sequential model to take an exposure sequence as input and assigns different rewards according to the user feedback. To further explore unseen user interests, CaseRec proposes to perform counterfactual augmentation, where exposed original items are replaced with counterfactual items. Then, a transformer-based user simulator is proposed to predict the user feedback reward for the augmented items. Augmentation, together with the user simulator, constructs counterfactual exposure sequences to uncover new user interests. Finally, CaseRec jointly uses the logged exposure sequences with the counterfactual exposure sequences to train a decision transformer-based sequential model for generating recommendation. Experiments on three real-world benchmarks show the effectiveness of CaseRec. Our code is available at this https URL.

Attachments

Click to download the complete list of today's papers