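This post lists the latest papers retrieved from arXiv.org on 2025-04-09. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-04-09)
477 papers were updated today, including:
- Natural language processing: 66 papers (Computation and Language (cs.CL))
- Artificial intelligence: 129 papers (Artificial Intelligence (cs.AI))
- Computer vision: 120 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 133 papers (Machine Learning (cs.LG))
Natural Language Processing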
[NLP-0] Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
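[Quick Read]: This paper addresses the long inference time of Large Language Models (LLMs) on complex tasks. Existing approaches parallelize LLMs through explicit cooperation frameworks (such as voting mechanisms or the creation of independent sub-tasks), but these frameworks may not suit every type of task. The paper proposes a different design: multiple LLM instances ("workers") run in parallel, synchronize through a shared attention cache, and decide for themselves how to collaborate. The key is the Hogwild! Inference engine, which gives the instances immediate access to each other's generated tokens and uses Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. With this design, modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
Link: https://arxiv.org/abs/2504.06261
Authors: Gleb Rodionov,Roman Garipov,Alina Shutova,George Yakushev,Vage Egiazarian,Anton Sinitsin,Denis Kuznedelev,Dan Alistarh
Institutions: Yandex; HSE University, Yandex; IST Austria; Yandex; Yandex; IST Austria
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint, work in progress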
Abstract:Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel , allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s generated tokens. Hogwild! inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.
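The abstract's remark that RoPE lets the engine avoid recomputation can be illustrated with a small sketch. It is not the authors' engine; it only demonstrates the general RoPE property that attention scores depend on relative position, so a key cached at one position can be moved to another by one extra rotation. The function names `rotate` and `dot` and the toy vectors are assumptions made for illustration.

```python
import math

def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to a vector of even length."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of coordinates is rotated by its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy query and key vectors.
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 positions) at two different absolute positions.
score_a = dot(rotate(query, 10), rotate(key, 7))
score_b = dot(rotate(query, 110), rotate(key, 107))

# A key cached at position 7 can be shifted to position 107 by rotating again,
# instead of recomputing it from the unrotated key.
shifted_key = rotate(rotate(key, 7), 100)
recomputed_key = rotate(key, 107)
```

Here `score_a` and `score_b` agree up to floating-point error, and `shifted_key` matches `recomputed_key`, which is the property that makes rearranging cached blocks cheap.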
[NLP-1] FEABench: Evaluating Language Models on Multiphysics Reasoning Ability NEURIPS2024
[Quick Read]: This paper addresses how to evaluate the ability of Large Language Models (LLMs) and LLM agents to simulate and solve physics, mathematics, and engineering problems using Finite Element Analysis (FEA). The key contribution is a comprehensive evaluation scheme in which LLMs reason over natural language problem descriptions and operate COMSOL Multiphysics® to compute answers, solving problems end-to-end. The paper also designs a language model agent that interacts with the software through its Application Programming Interface (API), examines its outputs, and improves its solutions over multiple iterations. The best-performing strategy generates executable API calls 88% of the time. By combining the reasoning skills of LLMs with the precision of numerical solvers, this approach pushes the frontier of engineering automation and lays a foundation for autonomous systems that can tackle complex real-world problems.
Link: https://arxiv.org/abs/2504.06260
Authors: Nayantara Mudur,Hao Cui,Subhashini Venugopalan,Paul Raccuglia,Michael P. Brenner,Peter Norgaard
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA)
Comments: 39 pages. Accepted at the NeurIPS 2024 Workshops on Mathematical Reasoning and AI and Open-World Agents
Abstract:Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs’ reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at this https URL
[NLP-2] LExT: Towards Evaluating Trustworthiness of Natural Language Explanations
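[Quick Read]: This paper addresses the lack of effective evaluation frameworks for natural language explanations generated by Large Language Models (LLMs). Existing natural language generation metrics such as BLEU and ROUGE capture syntactic and semantic accuracy but overlook factual accuracy, consistency, and faithfulness. The key contribution is a general framework, the Language Explanation Trustworthiness Score (LExT), which quantifies the trustworthiness of natural language explanations by balancing Plausibility and Faithfulness. Applying the framework to the healthcare domain, the paper evaluates several models and finds significant differences between general-purpose and domain-specific models in their ability to generate trustworthy explanations, underscoring the importance of tailored evaluation frameworks for sensitive domains.
Link: https://arxiv.org/abs/2504.06227
Authors: Krithi Shailya,Shreya Rajpal,Gokul S Krishnan,Balaraman Ravindran
Institutions: Centre for Responsible AI, IIT Madras; Chennai, India
Categories: Computation and Language (cs.CL)
Comments: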
Abstract:As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, there have been several approaches proposed toward generating natural language explanations. These explanations are crucial for enhancing the interpretability of a model, especially in sensitive domains like healthcare, where transparency and reliability are key. In light of such explanations being generated by LLMs and its known concerns, there is a growing need for robust evaluation frameworks to assess model-generated explanations. Natural Language Generation metrics like BLEU and ROUGE capture syntactic and semantic accuracies but overlook other crucial aspects such as factual accuracy, consistency, and faithfulness. To address this gap, we propose a general framework for quantifying trustworthiness of natural language explanations, balancing Plausibility and Faithfulness, to derive a comprehensive Language Explanation Trustworthiness Score (LExT) (The code and set up to reproduce our experiments are publicly available at this https URL). Applying our domain-agnostic framework to the healthcare domain using public medical datasets, we evaluate six models, including domain-specific and general-purpose models. Our findings demonstrate significant differences in their ability to generate trustworthy explanations. On comparing these explanations, we make interesting observations such as inconsistencies in Faithfulness demonstrated by general-purpose models and their tendency to outperform domain-specific fine-tuned models. This work further highlights the importance of using a tailored evaluation framework to assess natural language explanations in sensitive fields, providing a foundation for improving the trustworthiness and transparency of language models in healthcare and beyond.
[NLP-3] Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation
[Quick Read]: This paper studies how to adapt pretrained decoder-only Large Language Models (LLMs) into encoder-decoder models, aiming to combine the strengths of both approaches for a more favorable quality-efficiency trade-off. The key idea is that adaptation both inherits the capabilities of decoder-only LLMs and requires less computation than pretraining from scratch. To this end, the authors systematically explore different pretraining objectives and parameter initialization and optimization techniques. Experiments show that, under a similar inference budget, the adapted encoder-decoder LLMs achieve comparable pretraining performance and substantially better finetuning performance than their decoder-only counterparts. For example, after instruction tuning, Gemma 2B-2B outperforms Gemma 2B by about 7%, and the larger Gemma 9B-2B surpasses Gemma 2B-2B by 3%. The adapted encoder representation also yields better results on SuperGLUE.
Link: https://arxiv.org/abs/2504.06225
Authors: Biao Zhang,Fedor Moiseev,Joshua Ainslie,Paul Suganthan,Min Ma,Surya Bhupatiraju,Fede Lebron,Orhan Firat,Armand Joulin,Zhe Dong
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representation. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterpart. For example, Gemma 2B-2B outperforms Gemma 2B by ~7% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by 3%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.
[NLP-4] Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
[Quick Read]: This paper studies how web crawling opt-outs affect Large Language Model (LLM) performance, in particular how data compliance affects model capabilities in pretraining and later training stages as copyright holders increasingly adopt such opt-outs. It defines a new metric, the Data Compliance Gap (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and models trained on datasets that do not. The gap is measured in two settings: pretraining models from scratch and continual pretraining from existing compliant models. The results show that general knowledge acquisition is not significantly affected (close to 0% DCG), but in specialized domains such as biomedical research, excluding major publishers leads to performance declines. This suggests that general-purpose LLMs can be trained to perform well on fully open data, while performance in specialized domains may benefit from high-quality copyrighted sources later in training. The paper thus provides empirical evidence on the long-debated trade-off between data compliance and downstream model performance and informs future AI training practices and policy decisions.
Link: https://arxiv.org/abs/2504.06219
Authors: Dongyang Fan,Vinko Sabolčec,Matin Ansaripour,Ayush Kumar Tarun,Martin Jaggi,Antoine Bosselut,Imanol Schlag
Institutions: EPFL; ETH Zürich
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the data compliance gap (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.
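To make the data compliance gap concrete, here is a minimal sketch. The abstract only says that DCG quantifies a performance difference and reports it in percent; the relative-difference formula, the function name `data_compliance_gap`, and the example scores below are assumptions for illustration, not the paper's definition or results.

```python
def data_compliance_gap(noncompliant_score, compliant_score):
    """Relative performance drop (in percent) of the compliant model.

    Assumed formula: 100 * (noncompliant - compliant) / noncompliant.
    A value near 0 means compliance costs little on this benchmark.
    """
    if noncompliant_score == 0:
        raise ValueError("noncompliant_score must be non-zero")
    return 100.0 * (noncompliant_score - compliant_score) / noncompliant_score

# Hypothetical benchmark accuracies, not taken from the paper.
general_gap = data_compliance_gap(0.50, 0.50)
specialized_gap = data_compliance_gap(0.80, 0.72)
```

With these made-up scores, the general benchmark shows no gap and the specialized one shows a gap of roughly 10%, mirroring the qualitative pattern the abstract describes.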
[NLP-5] From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
[Quick Read]: This paper addresses the need for long-context capabilities across applications such as document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over very long sequences of text and multimodal data. It proposes an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, extending the context length from 128K tokens to 1M, 2M, and 4M tokens. The key is to combine efficient continued pretraining, which extends the context window, with effective instruction tuning, which preserves instruction-following and reasoning abilities. The resulting UltraLong-8B model achieves state-of-the-art performance on a range of long-context benchmarks while remaining competitive on standard benchmarks, giving balanced improvements on long and short context tasks. The paper also analyzes key design choices, in particular scaling strategies and data composition, and provides a robust framework for efficiently scaling context length while preserving general model capabilities.
Link: https://arxiv.org/abs/2504.06214
Authors: Chejian Xu,Wei Ping,Peng Xu,Zihan Liu,Boxin Wang,Mohammad Shoeybi,Bo Li,Bryan Catanzaro
Institutions: UIUC; NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: this https URL.
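[NLP-6] TxGemma: Efficient and Agentic LLMs for Therapeutics
[Quick Read]: This paper addresses the high cost, high risk, and high failure rates of therapeutic development. The key contribution is TxGemma, a suite of efficient, generalist Large Language Models (LLMs) that can predict therapeutic properties and also support interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline. The suite includes 2B, 9B, and 27B parameter models fine-tuned from Gemma-2 on a comprehensive dataset of small molecules, proteins, nucleic acids, diseases, and cell lines. Across 66 therapeutic development tasks, TxGemma is superior or comparable to the state-of-the-art generalist model on 64 (superior on 45) and to state-of-the-art specialist models on 50 (superior on 26). Fine-tuning TxGemma on downstream therapeutic tasks such as clinical trial adverse event prediction requires less training data than fine-tuning base LLMs, which makes TxGemma suitable for data-limited applications. Beyond prediction, TxGemma includes conversational models that bridge the gap between general LLMs and specialized property predictors, letting scientists interact in natural language, obtain mechanistic reasoning based on molecular structure, and engage in scientific discussion. Building on this, the paper introduces Agentic-Tx, a generalist therapeutic agentic system powered by Gemini 2.5 that reasons, acts, manages diverse workflows, and acquires external domain knowledge. On the Humanity's Last Exam benchmark (Chemistry and Biology portion), Agentic-Tx achieves a 52.3% relative improvement over o3-mini (high); it also improves over o3-mini (high) by 26.7% on GPQA (Chemistry), 6.3% on ChemBench-Preference, and 2.4% on ChemBench-Mini.
Link: https://arxiv.org/abs/2504.06196
Authors: Eric Wang,Samuel Schmidgall,Paul F. Jaeger,Fan Zhang,Rory Pilgrim,Yossi Matias,Joelle Barral,David Fleet,Shekoofeh Azizi
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: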
Abstract:Therapeutic development is a costly and high-risk endeavor that is often plagued by high failure rates. To address this, we introduce TxGemma, a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline. The suite includes 2B, 9B, and 27B parameter models, fine-tuned from Gemma-2 on a comprehensive dataset of small molecules, proteins, nucleic acids, diseases, and cell lines. Across 66 therapeutic development tasks, TxGemma achieved superior or comparable performance to the state-of-the-art generalist model on 64 (superior on 45), and against state-of-the-art specialist models on 50 (superior on 26). Fine-tuning TxGemma models on therapeutic downstream tasks, such as clinical trial adverse event prediction, requires less training data than fine-tuning base LLMs, making TxGemma suitable for data-limited applications. Beyond these predictive capabilities, TxGemma features conversational models that bridge the gap between general LLMs and specialized property predictors. These allow scientists to interact in natural language, provide mechanistic reasoning for predictions based on molecular structure, and engage in scientific discussions. Building on this, we further introduce Agentic-Tx, a generalist therapeutic agentic system powered by Gemini 2.5 that reasons, acts, manages diverse workflows, and acquires external domain knowledge. Agentic-Tx surpasses prior leading models on the Humanity’s Last Exam benchmark (Chemistry Biology) with 52.3% relative improvement over o3-mini (high) and 26.7% over o3-mini (high) on GPQA (Chemistry) and excels with improvements of 6.3% (ChemBench-Preference) and 2.4% (ChemBench-Mini) over o3-mini (high).
[NLP-7] SkillFlow: Efficient Skill and Code Transfer Through Communication in Adapting AI Agents
[Quick Read]: This paper addresses how AI agents can extend their functionality in dynamic environments by acquiring new skills, while improving task completion efficiency and cost. The key is the SkillFlow framework, a modular, technology-agnostic system that lets agents learn new skills from their environment or from other agents in an ad-hoc fashion. Through a theoretical model and a real-world application, the study shows that SkillFlow yields considerable gains in time and cost (24.8%, p-value = 6.4×10⁻³), especially when the communication cost is high. The approach is compared to lateral gene transfer in biological systems, a significant process of adaptation and evolution in novel environments.
Link: https://arxiv.org/abs/2504.06188
Authors: Pagkratios Tagkopoulos,Fangzhou Li,Ilias Tagkopoulos
Institutions: University of California, Davis; USDA/NSF AI Institute for Next Generation Food Systems
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:AI agents are autonomous systems that can execute specific tasks based on predefined programming. Here, we present SkillFlow, a modular, technology-agnostic framework that allows agents to expand their functionality in an ad-hoc fashion by acquiring new skills from their environment or other agents. We present a theoretical model that examines under which conditions this framework would be beneficial, and we then explore SkillFlow’s ability to accelerate task completion and lead to lower cumulative costs in a real-world application, namely scheduling agents for calendar events. We demonstrate that within a few iterations, SkillFlow leads to considerable (24.8%, p-value = 6.4×10⁻³) gains in time and cost, especially when the communication cost is high. Finally, we draw analogies from well-studied biological systems and compare this framework to that of lateral gene transfer, a significant process of adaptation and evolution in novel environments.
[NLP-8] Assessing how hyperparameters impact Large Language Models sarcasm detection performance
[Quick Read]: This paper studies how model characteristics affect the sarcasm detection performance of OpenAI's GPT models and Meta's Llama-2 models. It evaluates fine-tuned and zero-shot models across sizes, releases, and hyperparameter settings, focusing on how model size, fine-tuning, and release-to-release changes affect sarcasm detection accuracy. Experiments use the pol-bal portion of the SARC2.0 dataset. In the fine-tuning scenario, full-precision Llama-2-13b achieves state-of-the-art accuracy and F1-score (both 0.83), comparable to average human performance; in the zero-shot setting, one GPT-4 model is competitive, with an accuracy of 0.70 and an F1-score of 0.75. A model's performance may also rise or fall between releases, which highlights the need to reassess performance after each release.
Link: https://arxiv.org/abs/2504.06166
Authors: Montgomery Gole,Andriy Miranskyy
Institutions: Toronto Metropolitan University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI’s GPT, and Meta’s Llama-2 models, given their strong natural language understanding, and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, while hyperparameter tuning also impacts performance. In the fine-tuning scenario, full precision Llama-2-13b achieves state-of-the-art accuracy and F1-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves competitive performance to prior attempts, yielding an accuracy of 0.70 and an F1-score of 0.75. Furthermore, a model’s performance may increase or decline with each release, highlighting the need to reassess performance after each release.
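The two metrics reported for the sarcasm classifiers, accuracy and F1-score, can be computed as in the sketch below. It uses the standard binary definitions; the function name `accuracy_and_f1` and the toy labels are illustrative assumptions and are not data from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as sarcastic."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = fp = fn = correct = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct += 1
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(y_true), f1

# Hypothetical gold labels and predictions (1 = sarcastic, 0 = not sarcastic).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
acc, f1 = accuracy_and_f1(gold, predicted)
```

On a balanced set such as pol-bal, accuracy and F1 tend to move together, but they diverge when a model over-predicts one class, which is why both are usually reported.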
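[NLP-9] Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups
[Quick Read]: This paper addresses unprovoked targeted attacks by Large Language Models (LLMs) on at-risk populations, in particular their impact on mental health groups. Its contributions are: (1) an explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework for studying the propagation of relative biases; and (3) an assessment of the degree of stigmatization that emerges from these attacks. Analyzing a large-scale bias audit dataset, the paper finds that mental health entities occupy central positions in attack narrative networks. Drawing on stigmatization theory, it also finds increased labeling components for mental health disorder-related targets relative to initial targets. These findings expose the structural tendency of LLMs to heighten harmful discourse and highlight the need for suitable mitigation approaches.
Link: https://arxiv.org/abs/2504.06160
Authors: Rijul Magu,Arka Dutta,Sean Kim,Ashiqur R. KhudaBukhsh,Munmun De Choudhury
Institutions: College of Computing, Georgia Institute of Technology, Georgia, USA; Rochester Institute of Technology, Rochester, New York, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: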
Abstract:Large Language Models (LLMs) have been shown to demonstrate imbalanced biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as revealed by a significantly higher mean centrality of closeness (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing from sociological foundations of stigmatization theory, our stigmatization analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural predilections of large language models to heighten harmful discourse and highlight the need for suitable approaches for mitigation.
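The abstract reports a Gini coefficient as a measure of how unevenly something is distributed across the network. The sketch below shows the standard Gini computation on a list of non-negative values; which quantity the authors fed into it is not stated in the abstract, so the function name `gini` and the inputs here are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    # Rank-weighted form: sum((2i - n - 1) * x_i) / (n * sum(x)), with i starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Hypothetical per-node counts: evenly spread versus concentrated on one node.
even = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 1])
```

A value near 0 means the quantity is spread evenly, while values approaching 1 mean it is concentrated on a few nodes.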
[NLP-10] QGen Studio: An Adaptive Question-Answer Generation Training and Evaluation Platform
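[Quick Read]: This paper addresses adaptive question-answer generation, training, and evaluation, in particular using Large Language Models (LLMs) to build high-quality Question-Answer (QA) datasets and improve model performance. The key contribution is the QGen Studio platform, whose dataset viewer and model explorer provide an interactive, end-to-end workflow covering data generation, model fine-tuning, and performance evaluation. The dataset viewer reports key data quality metrics and visualizes the context from which QA pairs are generated, while the model explorer supports model comparison and performance benchmarking, which helps keep the generated datasets and trained models high-quality and scalable.
Link: https://arxiv.org/abs/2504.06136
Authors: Movina Moses,Mohab Elkaref,James Barry,Shinnosuke Tanaka,Vishnudev Kuruvanthodi,Nathan Herr,Campbell D Watson,Geeth De Mel
Institutions: IBM; unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: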
Abstract:We present QGen Studio: an adaptive question-answer generation, training, and evaluation platform. QGen Studio enables users to leverage large language models (LLMs) to create custom question-answer datasets and fine-tune models on this synthetic data. It features a dataset viewer and model explorer to streamline this process. The dataset viewer provides key metrics and visualizes the context from which the QA pairs are generated, offering insights into data quality. The model explorer supports model comparison, allowing users to contrast the performance of their trained LLMs against other models, supporting performance benchmarking and refinement. QGen Studio delivers an interactive, end-to-end solution for generating QA datasets and training scalable, domain-adaptable models. The studio will be open-sourced soon, allowing users to deploy it locally.
[NLP-11] Confidence Regularized Masked Language Modeling using Text Length
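[Quick Read]: This paper addresses a problem in Masked Language Modeling (MLM): when the input text is short, the entropy of the distribution over words that could fill the masked position can be high, yet the loss is computed against a single word, which may make the model overconfident in that single answer. The key contribution is a novel confidence regularizer whose strength is controlled dynamically by the input text length. Experiments on the GLUE and SQuAD datasets show higher accuracy and lower Expected Calibration Error (ECE).
Link: https://arxiv.org/abs/2504.06037
Authors: Seunghyun Ji,Soowon Lee
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure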
Abstract:Masked language modeling, which is a task to predict a randomly masked word in the input text, is an efficient language representation learning method. Masked language modeling ignores various words which people can think of for filling in the masked position and calculates the loss with a single word. Especially when the input text is short, the entropy of the word distribution that can fill in the masked position can be high. This may cause the model to be overconfident in the single answer. To address this issue, we propose a novel confidence regularizer that controls regularizing strength dynamically by the input text length. Experiments with GLUE and SQuAD datasets showed that our method achieves better accuracy and lower expected calibration error.
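Expected calibration error, the calibration metric named in the abstract, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below follows that common equal-width-bin recipe; the bin count, the function name, and the toy inputs are assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions: 90% confident each time, but only 3 of 4 are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

An overconfident model, the failure mode the abstract describes for short inputs, shows up here as mean confidence exceeding accuracy within a bin.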
[NLP-12] Multi-Sense Embeddings for Language Models and Knowledge Distillation
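[Quick Read]: This paper addresses how to reduce the storage and inference-time cost of Large Language Models (LLMs) while maintaining performance. Transformer-based LLMs rely on contextual embeddings, which produce a different continuous representation for the same token in each context, even though a token typically has only a limited number of senses. The paper proposes multi-sense embeddings as a replacement: it applies a clustering algorithm to LLM-generated embeddings and takes the cluster centers as representative sense embeddings, forming a sense dictionary. It also designs a novel knowledge distillation method that uses this dictionary to train a smaller student model that mimics the senses of the much larger base LLM, giving significant space and inference-time savings while maintaining competitive performance.
Link: https://arxiv.org/abs/2504.06036
Authors: Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis
Institutions: Rensselaer Polytechnic Institute; IBM Research
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 4 figures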
Abstract:Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at this https URL
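The sense-dictionary step, clustering a token's contextual embeddings and keeping the cluster centers, can be sketched with plain k-means. The abstract does not say which clustering algorithm the authors use, so k-means, the deterministic initialization, and the two-dimensional toy embeddings below are all assumptions chosen for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with a simple deterministic initialization (evenly spaced points)."""
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Hypothetical 2-D contextual embeddings of one token used in two senses.
embeddings = [
    (0.0, 0.1), (0.1, 0.0), (-0.1, 0.0),   # contexts using the first sense
    (5.0, 5.1), (5.1, 5.0), (4.9, 5.0),    # contexts using the second sense
]
# The cluster centers act as the token's entries in the sense dictionary.
sense_dictionary = {"bank": kmeans(embeddings, k=2)}
```

Each token then stores only a handful of center vectors rather than one vector per context, which is where the space savings described in the abstract would come from.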
[NLP-13] Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
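[Quick Read]: This paper addresses key challenges in developing high-quality Large Language Models (LLMs) for moderately resourced languages, Hindi in particular: data availability, model adaptation, and evaluation. The key contribution is Nanda (Llama-3-Nanda-10B-Chat), a Hindi-centric, instruction-tuned generative LLM. Built on Llama-3-8B, it uses continuous pre-training with expanded transformer blocks following the Llama Pro methodology. To overcome the scarcity of high-quality Hindi text, the team applies rigorous data curation, augmentation, and bilingual training that balances Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda ranks among the top-performing open-source Hindi and multilingual models of similar scale and shows significant advantages over many existing models. The paper discusses the training strategies, fine-tuning techniques, safety alignment, and evaluation metrics behind these results.
Link: https://arxiv.org/abs/2504.06011
Authors: Monojit Choudhury,Shivam Chauhan,Rocktim Jyoti Das,Dhruv Sahnan,Xudong Han,Haonan Li,Aaryamonvikram Singh,Alok Anil Jadhav,Utkarsh Agarwal,Mukund Choudhary,Debopriyo Banerjee,Fajri Koto,Junaid Bhat,Awantika Shukla,Samujjwal Ghosh,Samta Kamboj,Onkar Pandit,Lalit Pradhan,Rahul Pal,Sunil Sahu,Soundar Doraiswamy,Parvez Mullah,Ali El Filali,Neha Sengupta,Gokul Ramakrishnan,Rituraj Joshi,Gurpreet Gosal,Avraham Sheinin,Natalia Vassilieva,Preslav Nakov
Institutions: Mohamed Bin Zayed University of Artificial Intelligence; Inception; Cerebras Systems
Categories: Computation and Language (cs.CL)
Comments: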
Abstract:Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
[NLP-14] NativQA Framework: Enabling LLMs with Native Local and Everyday Knowledge
[Quick Read]: This paper addresses concerns about cultural bias, fairness, and applicability when Large Language Models (LLMs) are used in multilingual, local, and cultural contexts. Enhancing and benchmarking LLMs in these settings requires large-scale resources focused on multilingual, local, and cultural contexts. The paper proposes the NativQA framework, which uses user-defined seed queries and search engines to collect location-specific, everyday information and thereby seamlessly constructs large-scale, culturally and regionally aligned question-answer (QA) datasets in native languages. The framework has been evaluated across 39 locations in 24 countries and in 7 languages, ranging from extremely low-resource to high-resource, producing over 300K QA pairs. The resources can be used for LLM benchmarking and further fine-tuning, and the framework is publicly available.
Link: https://arxiv.org/abs/2504.05995
Authors: Firoj Alam,Md Arid Hasan,Sahinur Rahman Laskar,Mucahid Kutlu,Shammur Absar Chowdhury
Institutions: Qatar Computing Research Institute; University of New Brunswick; UPES; Qatar University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: LLMs, Native, Multilingual, Language Diversity, Contextual Understanding, Minority Languages, Culturally Informed, Foundation Models, Large Language Models
Abstract:The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose a framework, NativQA, that can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages, ranging from extremely low-resource to high-resource languages, which resulted in over 300K Question Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (this https URL).
[NLP-15] Unsupervised Location Mapping for Narrative Corpora
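[Quick read]: This paper tackles unsupervised location mapping: placing the trajectory of an individual narrative on a spatial map of the locations in which a large set of narratives take place. It proposes a fully unsupervised pipeline that needs no predefined label set. The key is to exploit recent advances in extending the context length of large language models, in two steps: (1) inducing a map of the locations mentioned in a set of texts, and (2) extracting a trajectory from a single narrative and positioning it on that map. The method is tested on two domains, Holocaust testimonies and English Lake District writing, with encouraging intrinsic and extrinsic evaluation results. The work also sets a benchmark and evaluation practices for the task and highlights its challenges.
Link: https://arxiv.org/abs/2504.05954
Authors: Eitan Wagner, Renana Keydar, Omri Abend
Affiliations: Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: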
Abstract:This work presents the task of unsupervised location mapping, which seeks to map the trajectory of an individual narrative on a spatial map of locations in which a large set of narratives take place. Despite the fundamentality and generality of the task, very little work addressed the spatial mapping of narrative texts. The task consists of two parts: (1) inducing a “map” with the locations mentioned in a set of texts, and (2) extracting a trajectory from a single narrative and positioning it on the map. Following recent advances in increasing the context length of large language models, we propose a pipeline for this task in a completely unsupervised manner without predefining the set of labels. We test our method on two different domains: (1) Holocaust testimonies and (2) Lake District writing, namely multi-century literature on travels in the English Lake District. We perform both intrinsic and extrinsic evaluations for the task, with encouraging results, thereby setting a benchmark and evaluation practices for the task, as well as highlighting challenges.
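[NLP-16] High-Resource Translation: Turning Abundance into Accessibility
[Quick read]: This paper addresses data scarcity in English-to-Telugu translation, a low-resource language setting, and uses transfer learning to improve model performance. The key is to take the Bharat Parallel Corpus Collection (BPCC) as the primary dataset and apply iterative backtranslation to generate synthetic parallel data that augments the training set. This is combined with tuning of training parameters and effective use of pre-trained models, with the aim of a robust translation system that handles diverse sentence structures and linguistic nuances in both English and Telugu. The work highlights the potential of data handling techniques and transfer learning for overcoming sparse data in low-resource languages.
Link: https://arxiv.org/abs/2504.05914
Authors: Abhiram Reddy Yanampally
Affiliations: ABV-IIITM Gwalior
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures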
Abstract:This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques and addressing the challenges associated with low-resource languages. Utilizing the Bharat Parallel Corpus Collection (BPCC) as the primary dataset, the model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model’s translation capabilities. The research focuses on a comprehensive strategy for improving model performance through data augmentation, optimization of training parameters, and the effective use of pre-trained models. These methodologies aim to create a robust translation system that can handle diverse sentence structures and linguistic nuances in both English and Telugu. This work highlights the significance of innovative data handling techniques and the potential of transfer learning in overcoming limitations posed by sparse datasets in low-resource languages. The study contributes to the field of machine translation and seeks to improve communication between English and Telugu speakers in practical contexts.
[NLP-17] Defending Deep Neural Networks against Backdoor Attacks via Module Switching
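[Quick read]: This paper addresses defense against backdoor attacks when merging models. Rapid growth in the parameter counts of deep neural networks (DNNs) has raised the cost of independent training, especially for resource-constrained parties, and has increased reliance on open-source models. Opaque training processes in turn heighten security risks such as backdoor attacks and complicate defense. Merging homogeneous models has attracted attention as a cost-effective post-training defense, but existing strategies such as weight averaging only partially mitigate poisoned parameters and fail to disrupt the spurious correlations spread across model parameters. The key contribution is a module-switching strategy that breaks these spurious correlations within the model's propagation path, with evolutionary algorithms used to optimize the fusion strategy. The approach is validated against backdoor attacks in the text and vision domains. It mitigates backdoors even when a couple of compromised models are merged, for example reducing the average attack success rate (ASR) on SST-2 to 22%, compared with 31.9% for the best-performing baseline.
Link: https://arxiv.org/abs/2504.05902
Authors: Weijun Li, Ansh Arora, Xuanli He, Mark Dras, Qiongkai Xu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 20 pages, 12 figures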
Abstract:The exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training, particularly for resource-constrained entities. As a result, there is a growing reliance on open-source models. However, the opacity of training processes exacerbates security risks, making these models more vulnerable to malicious threats, such as backdoor attacks, while simultaneously complicating defense mechanisms. Merging homogeneous models has gained attention as a cost-effective post-training defense. However, we notice that existing strategies, such as weight averaging, only partially mitigate the influence of poisoned parameters and remain ineffective in disrupting the pervasive spurious correlations embedded across model parameters. We propose a novel module-switching strategy to break such spurious correlations within the model’s propagation path. By leveraging evolutionary algorithms to optimize fusion strategies, we validate our approach against backdoor attacks targeting text and vision domains. Our method achieves effective backdoor mitigation even when incorporating a couple of compromised models, e.g., reducing the average attack success rate (ASR) to 22% compared to 31.9% with the best-performing baseline on SST-2.
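To make the contrast between weight averaging and module switching concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: models are represented as plain dictionaries from a module name to a flat list of weights, the function names `average_weights` and `switch_modules` are hypothetical, and the `assignment` mapping is supplied by hand, whereas the paper searches for fusion strategies with evolutionary algorithms.

```python
def average_weights(models):
    """Weight averaging: every module is the element-wise mean over all source models."""
    merged = {}
    for name in models[0]:
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(values) / len(models) for values in columns]
    return merged


def switch_modules(models, assignment):
    """Module switching: each module is copied whole from exactly one source model.

    `assignment` maps a module name to the index of the model it is taken from.
    This hand-written mapping is an assumption made for illustration only.
    """
    return {name: list(models[index][name]) for name, index in assignment.items()}
```

With two toy models, averaging blends every module, while switching keeps each module intact but mixes their origins, so a correlation that spans several modules of one source model is no longer preserved end to end.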
[NLP-18] Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation
[Quick read]: This paper addresses the robustness and consistency of large language models (LLMs) on Thai local dialects. Existing benchmarks focus on main dialects and neglect how well LLMs handle local dialect text. The key contribution is a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, used to evaluate LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. The paper also proposes a human evaluation guideline and metric for assessing generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly on local dialects compared with standard Thai, and only proprietary models such as GPT-4o and Gemini2 show some fluency.
Link: https://arxiv.org/abs/2504.05898
Authors: Peerat Limkonchotiwat, Kanruethai Masuk, Surapon Nonesung, Chalermpun Mai-On, Sarana Nutanong, Wuttikorn Ponwitayarat, Potsawee Manakul
Affiliations: AI Singapore; Vidyasirimedhi Institute of Science and Technology; SCB 10X; National University of Singapore; School of Information Science and Technology, VISTEC; Department of Computer Engineering, Chulalongkorn University
Categories: Computation and Language (cs.CL)
Comments: Datasets and codes are available at this https URL
Abstract:Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs’ ability on local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.
[NLP-19] Are Generative AI Agents Effective Personalized Financial Advisors?
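[Quick read]: This paper investigates how effective advisors based on large language models (LLMs) are in the finance domain, focusing on three challenges: (1) eliciting user preferences when users themselves are unsure of their needs, (2) providing personalized guidance for diverse investment preferences, and (3) using advisor personality to build relationships and foster trust. In a lab-based user study with 64 participants, LLM-advisors often match human advisors at eliciting preferences but can struggle to resolve conflicting user needs. When giving personalized advice, the LLM positively influenced user behavior but also showed clear failure modes. The key finding is that accurate preference elicitation is essential: without it the LLM-advisor has little impact or can even steer investors toward unsuitable assets. More worryingly, users appear insensitive to advice quality, and satisfaction can even be inversely related to it. Users reported a preference for, greater satisfaction with, and more emotional trust in LLMs with an extroverted persona, even though those agents gave worse advice.
Link: https://arxiv.org/abs/2504.05862
Authors: Takehiro Takayanagi, Kiyoshi Izumi, Javier Sanz-Cruzado, Richard McCreadie, Iadh Ounis
Affiliations: The University of Tokyo, Tokyo, Japan; University of Glasgow, Glasgow, United Kingdom
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
Comments: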
Abstract:Large language model-based agents are becoming increasingly popular as a low-cost mechanism to provide personalized, conversational advice, and have demonstrated impressive capabilities in relatively simple scenarios, such as movie recommendations. But how do these agents perform in complex high-stakes domains, where domain expertise is essential and mistakes carry substantial risk? This paper investigates the effectiveness of LLM-advisors in the finance domain, focusing on three distinct challenges: (1) eliciting user preferences when users themselves may be unsure of their needs, (2) providing personalized guidance for diverse investment preferences, and (3) leveraging advisor personality to build relationships and foster trust. Via a lab-based user study with 64 participants, we show that LLM-advisors often match human advisor performance when eliciting preferences, although they can struggle to resolve conflicting user needs. When providing personalized advice, the LLM was able to positively influence user behavior, but demonstrated clear failure modes. Our results show that accurate preference elicitation is key, otherwise, the LLM-advisor has little impact, or can even direct the investor toward unsuitable assets. More worryingly, users appear insensitive to the quality of advice being given, or worse these can have an inverse relationship. Indeed, users reported a preference for and increased satisfaction as well as emotional trust with LLMs adopting an extroverted persona, even though those agents provided worse advice.
[NLP-20] Enhancing Coreference Resolution with Pretrained Language Models: Bridging the Gap Between Syntax and Semantics ACL
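[Quick read]: This paper addresses the weakness of traditional coreference resolution methods at distinguishing referential relationships, which stems from poor integration of syntactic and semantic information. The key contribution is a framework built on pretrained language models that combines syntax parsing with semantic role labeling to capture finer distinctions in referential relationships. Specifically, it uses state-of-the-art pretrained models to obtain contextual embeddings and applies an attention mechanism during fine-tuning. Experiments on several datasets show that the method surpasses conventional coreference systems and disambiguates references more accurately. The authors note that this also benefits other natural language processing tasks that depend on precise referential understanding.
Link: https://arxiv.org/abs/2504.05855
Authors: Xingzu Liu, Songhang deng, Mingbang Wang, Zhang Dong, Le Dai, Jiyuan Li, Ruilin Nong
Affiliations: Tianjin University; University of Florida; UCLA; Amazon; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: acl submission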
Abstract:Large language models have made significant advancements in various natural language processing tasks, including coreference resolution. However, traditional methods often fall short in effectively distinguishing referential relationships due to a lack of integration between syntactic and semantic information. This study introduces an innovative framework aimed at enhancing coreference resolution by utilizing pretrained language models. Our approach combines syntax parsing with semantic role labeling to accurately capture finer distinctions in referential relationships. By employing state-of-the-art pretrained models to gather contextual embeddings and applying an attention mechanism for fine-tuning, we improve the performance of coreference tasks. Experimental results across diverse datasets show that our method surpasses conventional coreference resolution systems, achieving notable accuracy in disambiguating references. This development not only improves coreference resolution outcomes but also positively impacts other natural language processing tasks that depend on precise referential understanding.
[NLP-21] Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
[Quick read]: This paper addresses a limitation of preference alignment for large language models (LLMs): high-quality human-annotated data is scarce. Synthetic data offers a scalable alternative, but it can introduce distribution shifts that compromise the nuanced human preferences needed for desirable outputs. To handle this, the paper proposes a distribution-aware optimization framework. The key is to first estimate likelihood ratios between the target and training distributions with a learned classifier, and then minimize the worst-case loss over data regions that reflect the target human-preferred distribution. By explicitly prioritizing the target distribution during optimization, the method mitigates the adverse effects of distributional variation and improves how faithfully generated responses reflect human values.
Link: https://arxiv.org/abs/2504.05831
Authors: Mingye Zhu, Yi Liu, Junbo Guo, Quan Wang, Yongdong Zhang, Zhendong Mao
Affiliations: University of Science and Technology of China; State Key Laboratory of Communication Content Cognition, People’s Daily Online; Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) increasingly rely on preference alignment methods to steer outputs toward human values, yet these methods are often constrained by the scarcity of high-quality human-annotated data. To tackle this, recent approaches have turned to synthetic data generated by LLMs as a scalable alternative. However, synthetic data can introduce distribution shifts, compromising the nuanced human preferences that are essential for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment in the presence of such shifts. Our approach first estimates the likelihood ratios between the target and training distributions leveraging a learned classifier, then it minimizes the worst-case loss over data regions that reflect the target human-preferred distribution. By explicitly prioritizing the target distribution during optimization, our method mitigates the adverse effects of distributional variation and enhances the generation of responses that faithfully reflect human values.
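The two steps named in the abstract can be sketched in Python as follows. This is a rough illustration under explicit assumptions, not the paper's objective: the likelihood ratio uses the standard classifier-based density-ratio identity, "data regions" are approximated by discrete group labels, and the worst case is taken as the maximum ratio-weighted mean loss over groups. The function names and the grouping are hypothetical.

```python
from collections import defaultdict


def likelihood_ratio(p_target, n_train, n_target):
    """Estimate p_target(x) / p_train(x) from a classifier probability.

    `p_target` is the classifier's probability that x came from the target set.
    The factor n_train / n_target corrects for unequal sample sizes.
    """
    if not 0.0 < p_target < 1.0:
        raise ValueError("p_target must lie strictly between 0 and 1")
    return (p_target / (1.0 - p_target)) * (n_train / n_target)


def worst_case_weighted_loss(losses, ratios, groups):
    """Return the largest ratio-weighted mean loss across groups.

    Treating regions as discrete groups is an assumption made for this sketch.
    """
    weighted_sum = defaultdict(float)
    weight_mass = defaultdict(float)
    for loss, ratio, group in zip(losses, ratios, groups):
        weighted_sum[group] += ratio * loss
        weight_mass[group] += ratio
    return max(weighted_sum[g] / weight_mass[g] for g in weighted_sum if weight_mass[g] > 0)
```

In a training loop one would minimize the returned value instead of the plain average loss, so that the group that looks most like the target distribution and is currently worst served drives the update.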
[NLP-22] End-to-End Dialog Neural Coreference Resolution: Balancing Efficiency and Accuracy in Large-Scale Systems ACL2025
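[Quick read]: This paper addresses the trade-off between efficiency and accuracy in large-scale coreference resolution. The key contribution is an End-to-End Neural Coreference Resolution system that uses neural network architectures with various contextual embeddings and attention mechanisms to improve the quality of coreference-pair predictions while keeping computational overhead low. Optimization strategies further accelerate processing so that the system is suitable for real-world deployment. According to the authors, evaluations on benchmark datasets show improved accuracy over existing approaches while inference remains fast.
Link: https://arxiv.org/abs/2504.05824
Authors: Zhang Dong, Songhang deng, Mingbang Wang, Le Dai, Jiyuan Li, Xingzu Liu, Ruilin Nong
Affiliations: Amazon; University of Florida; UCLA; Huazhong University of Science and Technology; Tianjin University
Categories: Computation and Language (cs.CL)
Comments: submission of acl 2025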
Abstract:Large-scale coreference resolution presents a significant challenge in natural language processing, necessitating a balance between efficiency and accuracy. In response to this challenge, we introduce an End-to-End Neural Coreference Resolution system tailored for large-scale applications. Our system efficiently identifies and resolves coreference links in text, ensuring minimal computational overhead without compromising on performance. By utilizing advanced neural network architectures, we incorporate various contextual embeddings and attention mechanisms, which enhance the quality of predictions for coreference pairs. Furthermore, we apply optimization strategies to accelerate processing speeds, making the system suitable for real-world deployment. Extensive evaluations conducted on benchmark datasets demonstrate that our model achieves improved accuracy compared to existing approaches, while effectively maintaining rapid inference times. Rigorous testing confirms the ability of our system to deliver precise coreference resolutions efficiently, thereby establishing a benchmark for future advancements in this field.
[NLP-23] Cross-Document Contextual Coreference Resolution in Knowledge Graphs ACL2025
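[Quick read]: This paper addresses cross-document coreference resolution, particularly in the context of knowledge graphs. Traditional methods struggle to identify and link mentions in different texts that refer to the same entity, which weakens the coherence of the combined information. The key contribution is a dynamic linking mechanism that associates entities in the knowledge graph with their corresponding textual mentions, combined with contextual embeddings and graph-based inference to capture relationships and interactions among entities. The authors report substantial improvements in both precision and recall on several benchmark datasets.
Link: https://arxiv.org/abs/2504.05767
Authors: Zhang Dong, Mingbang Wang, Songhang deng, Le Dai, Jiyuan Li, Xingzu Liu, Ruilin Nong
Affiliations: Amazon; University of Florida; UCLA; Huazhong University of Science and Technology; Tianjin University
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: ACL 2025 Submission Version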
Abstract:Coreference resolution across multiple documents poses a significant challenge in natural language processing, particularly within the domain of knowledge graphs. This study introduces an innovative method aimed at identifying and resolving references to the same entities that appear across differing texts, thus enhancing the coherence and collaboration of information. Our method employs a dynamic linking mechanism that associates entities in the knowledge graph with their corresponding textual mentions. By utilizing contextual embeddings along with graph-based inference strategies, we effectively capture the relationships and interactions among entities, thereby improving the accuracy of coreference resolution. Rigorous evaluations on various benchmark datasets highlight notable advancements in our approach over traditional methodologies. The results showcase how the contextual information derived from knowledge graphs enhances the understanding of complex relationships across documents, leading to better entity linking and information extraction capabilities in applications driven by knowledge. Our technique demonstrates substantial improvements in both precision and recall, underscoring its effectiveness in the area of cross-document coreference resolution.
[NLP-24] Probabilistic Process Discovery with Stochastic Process Trees
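[Quick read]: This paper addresses ambiguities in the usual way of building a stochastic business process model from an event log. The usual procedure obtains a process tree from the log with a process discovery algorithm, transforms it into a Petri net that generates the same set of sequences, and then assigns weights to the Petri net's transitions to capture sequence frequencies, which yields a stochastic Petri net. The paper identifies two problems. First, the role of the transition weights in the resulting stochastic language is unclear, since a single weight can affect sequence probabilities in multiple, ambiguous ways. Second, several Petri nets with different numbers of transitions can correspond to the same process tree, so the number of parameters that determine the stochastic language is not well-defined. To avoid these ambiguities, the paper adds stochasticity directly to process trees. The result is a new formalism, stochastic process trees, in which the number of parameters and their role in the associated stochastic language are clear and well-defined.
Link: https://arxiv.org/abs/2504.05765
Authors: András Horváth, Paolo Ballarini (MICS), Pierre Cry (MICS)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: EAI VALUESTOOLS 2024, Dec 2024, Milan, Italy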
Abstract:In order to obtain a stochastic model that accounts for the stochastic aspects of the dynamics of a business process, usually the following steps are taken. Given an event log, a process tree is obtained through a process discovery algorithm, i.e., a process tree that is aimed at reproducing, as accurately as possible, the language of the log. The process tree is then transformed into a Petri net that generates the same set of sequences as the process tree. In order to capture the frequency of the sequences in the event log, weights are assigned to the transitions of the Petri net, resulting in a stochastic Petri net with a stochastic language in which each sequence is associated with a probability. In this paper we show that this procedure has unfavorable properties. First, the weights assigned to the transitions of the Petri net have an unclear role in the resulting stochastic language. We will show that a weight can have multiple, ambiguous impact on the probability of the sequences generated by the Petri net. Second, a number of different Petri nets with different number of transitions can correspond to the same process tree. This means that the number of parameters (the number of weights) that determines the stochastic language is not well-defined. In order to avoid these ambiguities, in this paper, we propose to add stochasticity directly to process trees. The result is a new formalism, called stochastic process trees, in which the number of parameters and their role in the associated stochastic language is clear and well-defined.
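[NLP-25] Layer-Aware Embedding Fusion for LLMs in Text Classifications
[Quick read]: This paper addresses the lack of systematic guidance for embedding fusion with large language models (LLMs), in particular how to select the best layers and how to design effective fusion strategies. The key contribution is a layer-aware embedding selection method, together with a study of how to evaluate layers quantitatively in order to find those most important for downstream NLP tasks. The critical layers turn out to depend on the dataset. The paper also explores combining embeddings from multiple LLMs without fine-tuning any model. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) show that different layers have varying representational strength for classification and that combining embeddings from different models helps when the models have complementary characteristics. Resource overhead (memory and inference time) is discussed to give a balanced view of practical feasibility. Future work will cover multilingual and domain-specific datasets and automated layer selection.
Link: https://arxiv.org/abs/2504.05764
Authors: Jiho Gwak, Yuchul Jung
Affiliations: Department of Computer Engineering, Kumoh National Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, Preprint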
Abstract:Embedding fusion has emerged as an effective approach for enhancing performance across various NLP tasks. However, systematic guidelines for selecting optimal layers and developing effective fusion strategies for the integration of LLMs remain underexplored. In this study, we propose a layer-aware embedding selection method and investigate how to quantitatively evaluate different layers to identify the most important ones for downstream NLP tasks, showing that the critical layers vary depending on the dataset. We also explore how combining embeddings from multiple LLMs, without requiring model fine-tuning, can improve performance. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) demonstrate that different layers in LLMs exhibit varying degrees of representational strength for classification, and that combining embeddings from different models can enhance performance if the models exhibit complementary characteristics. Additionally, we discuss resources overhead (memory and inference time) to provide a balanced perspective on the real world feasibility of embedding fusion. Future work will explore multilingual and domain specific datasets, as well as techniques for automating layer selection, to improve both performance and scalability.
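[NLP-26] RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation
[Quick read]: This paper addresses the rising computational demands and overfitting risk that come with increasing model size and data volume to gain performance in code generation. The key contribution is RETROcode, an adaptation of the RETRO architecture to sequence-to-sequence models that uses a large code database as an auxiliary scaling method. Rather than simply enlarging the model or the dataset, RETROcode draws on this database at prediction time, which improves efficiency by integrating extensive memory. The results indicate that RETROcode outperforms similar-sized traditional architectures on test sets and approaches the effectiveness of the much larger Codex model, despite being trained from scratch on a substantially smaller dataset.
Link: https://arxiv.org/abs/2504.05759
Authors: Nathanaël Beau, Benoît Crabbé
Affiliations: Université de Paris, LLF, CNRS
Categories: Computation and Language (cs.CL)
Comments: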
Abstract:As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning with problem statement-program pairs. However, increasing model size and data volume for performance gains also raises computational demands and risks of overfitting. Addressing these challenges, we present RETROcode, a novel adaptation of the RETRO architecture for sequence-to-sequence models, utilizing a large code database as an auxiliary scaling method. This approach, diverging from simply enlarging model and dataset sizes, allows RETROcode to leverage a vast code database for prediction, enhancing the model’s efficiency by integrating extensive memory. Our findings indicate that RETROcode not only outperforms similar-sized traditional architectures on test sets but also approaches the effectiveness of the much larger Codex model, despite being trained from scratch on a substantially smaller dataset.
[NLP-27] SEA-LION: Southeast Asian Languages in One Network
[Quick read]: This paper addresses the under-representation of low-resource languages, such as those of Southeast Asia (SEA), in large language model (LLM) research and development. It introduces Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two multilingual LLMs designed for SEA languages. The models support 11 languages: English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. The key is large-scale multilingual continued pre-training followed by a comprehensive post-training regime with multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation on multilingual benchmarks indicates state-of-the-art performance among LLMs that support SEA languages, and the models are open-sourced for the wider SEA community.
Link: https://arxiv.org/abs/2504.05747
Authors: Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, Leslie Teo
Affiliations: AI Singapore; National University of Singapore; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: We released our model at this https URL
Abstract:Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
[NLP-28] Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring
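[Quick read]: This paper addresses the underexplored potential of large language models (LLMs) for Automated Essay Scoring (AES), and in particular the comparatively underdeveloped methods for Chinese data. It proposes Rank-Then-Score (RTS), an LLM-based fine-tuning framework. The key is to first fine-tune a ranking model (Ranker) on feature-enriched data, and then feed the Ranker's output, in the form of a candidate score set, together with the essay content into a scoring model (Scorer) that produces the final score. Experiments on the HSK and ASAP benchmark datasets show that RTS consistently outperforms direct prompting (Vanilla) in average quadratic weighted kappa (QWK) across all LLMs and datasets, and that it achieves the best performance on Chinese essay scoring with the HSK dataset.
Link: https://arxiv.org/abs/2504.05736
Authors: Yida Cai, Kun Liang, Sanwoo Lee, Qinghan Wang, Yunfang Wu
Affiliations: MOE Key Laboratory of Computational Linguistics, Peking University; School of Software and Microelectronics, Peking University; School of Computer Science, Peking University; School of Artificial Intelligence, Beijing Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages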
Abstract:In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared to English data, the methods for Chinese AES are not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.
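QWK, the metric reported above, is quadratic weighted kappa, an agreement measure that penalizes a disagreement by the square of its distance on the score scale. A minimal pure-Python sketch of the standard definition follows. It assumes integer scores on a known range; the function name and argument layout are illustrative and are not taken from the paper's code.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    n = max_score - min_score + 1
    total = len(human)

    # Confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    human_hist = [sum(row) for row in observed]
    model_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if the two raters were independent.
            expected = human_hist[i] * model_hist[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # both raters gave the same single score everywhere
    return 1.0 - numerator / denominator
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.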
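[NLP-29] LLM×MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
[Quick read]: This paper addresses the generation of long texts from extremely long input resources, which remains difficult for current large language models (LLMs). The main difficulty is integrating and analyzing the relevant information spread across extensive inputs. The key contribution is LLM×MapReduce-V2, a test-time scaling strategy inspired by convolutional neural networks: it stacks convolutional scaling layers to progressively expand the understanding of the input materials. Quantitative and qualitative experiments show that the approach substantially improves the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines.
Link: https://arxiv.org/abs/2504.05732
Authors: Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, Maosong Sun
Affiliations: Tsinghua University; Beijing University of Posts and Telecommunications; Beijing Jiaotong University; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: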
Abstract:Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generations have received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLM×MapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLM×MapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines.
[NLP-30] Retrieval Augmented Generation with Collaborative Filtering for Personalized Text Generation SIGIR2025
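[Quick read]: This paper addresses the fact that personalized large language models (LLMs) do not make use of the histories of similar users when generating content. Existing personalized Retrieval-Augmented Generation (RAG) methods retrieve only from the current user's own history and ignore collaborative information between users. The paper proposes CFRAG, which adapts collaborative filtering to the RAG framework for personalized text generation.
The solution tackles two challenges. First, how can collaborative information be incorporated without explicit user similarity labels? The paper trains user embeddings with contrastive learning and uses them to retrieve similar users. Second, how can documents that support personalized LLM generation be retrieved from those users' histories? The paper designs a personalized retriever and reranker that take the user's preference into account, and fine-tunes both with feedback from the LLM so that the retrieved documents meet the needs of personalized generation. Experiments on the LaMP benchmark validate the effectiveness of CFRAG, and further analysis confirms the importance of collaborative information.
Link: https://arxiv.org/abs/2504.05731
Authors: Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li
Affiliations: Renmin University of China (Beijing, China); Kuaishou Technology Co., Ltd. (Beijing, China)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by SIGIR 2025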
Abstract:Recently, the personalization of Large Language Models (LLMs) to generate content that aligns with individual user preferences has garnered widespread attention. Personalized Retrieval-Augmented Generation (RAG), which retrieves relevant documents from the user’s history to reflect their preferences and enhance LLM generation, is one commonly used approach for personalization. However, existing personalized RAG methods do not consider that the histories of similar users can also assist in personalized generation for the current user, meaning that collaborative information between users can also benefit personalized generation. Inspired by the application of collaborative filtering in recommender systems, we propose a method called CFRAG, which adapts Collaborative Filtering to RAG for personalized text generation. However, this presents two challenges: (1) how to incorporate collaborative information without explicit user similarity labels? (2) how to retrieve documents that support personalized LLM generation? For Challenge 1, we use contrastive learning to train user embeddings to retrieve similar users and introduce collaborative information. For Challenge 2, we design a personalized retriever and reranker to retrieve the top-k documents from these users’ histories. We take into account the user’s preference during retrieval and reranking. Then we leverage feedback from the LLM to fine-tune the personalized retriever and reranker, enabling them to retrieve documents that meet the personalized generation needs of the LLM. Experimental results on the Language Model Personalization (LaMP) benchmark validate the effectiveness of CFRAG. Further analysis confirms the importance of incorporating collaborative information.
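The "retrieve similar users, then pool their histories" step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: user embeddings are taken as given (the paper trains them with contrastive learning), similarity is plain cosine similarity, and the personalized retriever and reranker are not modeled. All function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_users(target, embeddings, k):
    """Return the k users whose embeddings are closest to the target user's."""
    scored = [
        (cosine(embeddings[target], vector), user)
        for user, vector in embeddings.items()
        if user != target
    ]
    # Highest similarity first; ties are broken by user id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [user for _, user in scored[:k]]


def candidate_documents(target, embeddings, histories, k):
    """Pool the histories of the k most similar users as retrieval candidates."""
    documents = []
    for user in similar_users(target, embeddings, k):
        documents.extend(histories.get(user, []))
    return documents
```

The pooled candidates would then be passed to a retriever and reranker, which in the paper are personalized and tuned with LLM feedback.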
[NLP-31] Evaluating Speech-to-Text Systems with PennSound
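[Quick read]: This paper evaluates several commercial and open-source speech-to-text systems, using a random sample of nearly 10 hours of speech from PennSound, the world's largest online collection of poetry readings and discussions, as the benchmark. PennSound varies widely in recording conditions and speech styles, which makes it a good stand-in for many other untranscribed audio collections. Trained annotators created reference transcripts, and system transcripts were produced with AWS, Azure, Google, IBM, NeMo, this http URL, Whisper, and this http URL. By word error rate (WER), this http URL was the top performer and Whisper was the top open-source performer, provided hallucinations were avoided. AWS had the best diarization error rate (DER) among the three systems compared on that measure. The WER and DER differences were slim, so different trade-offs may favor different systems for different end users. The paper also examines hallucinations in Whisper and cautions users to be aware of runtime options and of the trade-off between speed and accuracy.
Link: https://arxiv.org/abs/2504.05702
Authors: Jonathan Wright, Mark Liberman, Neville Ryant, James Fiumara
Affiliations: Linguistic Data Consortium
Categories: Computation and Language (cs.CL)
Comments: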
Abstract:A random sample of nearly 10 hours of speech from PennSound, the world’s largest online collection of poetry readings and discussions, was used as a benchmark to evaluate several commercial and open-source speech-to-text systems. PennSound’s wide variation in recording conditions and speech styles makes it a good representative for many other untranscribed audio collections. Reference transcripts were created by trained annotators, and system transcripts were produced from AWS, Azure, Google, IBM, NeMo, this http URL, Whisper, and this http URL. Based on word error rate, this http URL was the top performer, and Whisper was the top open source performer (as long as hallucinations were avoided). AWS had the best diarization error rates among three systems. However, WER and DER differences were slim, and various tradeoffs may motivate choosing different systems for different end users. We also examine the issue of hallucinations in Whisper. Users of Whisper should be cautioned to be aware of runtime options, and whether the speed vs accuracy trade off is acceptable.
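The word error rate used for the ranking above can be illustrated with a minimal sketch. It assumes the common definition (word-level edit distance divided by the number of reference words) and whitespace tokenization; the paper’s exact text normalization is not described here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are produced by simple whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when a system emits much more text than was spoken, which is one way hallucinated output shows up in this metric.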
[NLP-32] STRIVE: A Think Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation
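Quick read: This paper addresses the automatic assessment of question quality, a key educational challenge, in order to reduce educators’ workload, ensure consistent evaluation, and provide immediate feedback for refining teaching materials. It proposes STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation), which uses a series of large language models (LLMs) to evaluate questions automatically. The core of the method is to generate multiple evaluations based on the strengths and weaknesses of the given question and select the best one, improving the accuracy and depth of the assessment. An iterative review-and-response process with another LLM then refines the evaluation until the metric values converge. The main contributions are the automated evaluation pipeline and significant improvements on metrics such as relevance and appropriateness, yielding higher correlation with human judgments.
Link: https://arxiv.org/abs/2504.05693
Authors: Aniket Deroy,Subhankar Maity
Affiliations: IIT Kharagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 6 figures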
Abstract:Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. Then the process is improved by iterative review and response with another LLM until the evaluation metric values converge. This sophisticated method of evaluating question quality improves the estimation of question quality by automating the task of question quality evaluation. Correlation scores show that using this proposed method helps to improve correlation with human judgments compared to the baseline method. Error analysis shows that metrics like relevance and appropriateness improve significantly relative to human judgments by using STRIVE.
[NLP-33] Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators
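Quick read: This paper examines the security problems that role separators introduce into dialogue systems, in particular their effect on the behavior of instruction-following large language models (LLMs). Although many prompt injection attacks have been proposed, few studies have looked at the systemic impact of role separators on safety. The paper identifies a modeling weakness caused by role separators: a strong positional bias tied to the separator’s position, which is inherent in the dialogue modeling format and can be triggered by inserting role separators. Building on this, it proposes the Separators Injection Attack (SIA), a new attack (described in the abstract as “orthometric”) designed around role separators. Experiments show that SIA yields an average gain of 18.2% for manual methods and raises the attack success rate to 100% with automatic methods.
Link: https://arxiv.org/abs/2504.05689
Authors: Xitao Li,Haijun Wang,Jiang Wu,Ting Liu
Affiliations: Xi’an Jiaotong University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: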
Abstract:Conversational large language models (LLMs) have gained widespread attention due to their instruction-following capabilities. To ensure conversational LLMs follow instructions, role separators are employed to distinguish between different participants in a conversation. However, incorporating role separators introduces potential vulnerabilities. Misusing roles can lead to prompt injection attacks, which can easily misalign the model’s behavior with the user’s intentions, raising significant security concerns. Although various prompt injection attacks have been proposed, recent research has largely overlooked the impact of role separators on safety. This highlights the critical need to thoroughly understand the systemic weaknesses in dialogue systems caused by role separators. This paper identifies modeling weaknesses caused by role separators. Specifically, we observe a strong positional bias associated with role separators, which is inherent in the format of dialogue modeling and can be triggered by the insertion of role separators. We further develop the Separators Injection Attack (SIA), a new orthometric attack based on role separators. The experiment results show that SIA is efficient and extensive in manipulating model behavior with an average gain of 18.2% for manual methods and enhances the attack success rate to 100% with automatic methods.
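[NLP-34] Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?
Quick read: This paper asks whether pre-trained large language models (LLMs) can match expert human evaluators at scoring, error identification, feedback, and improvement suggestions in mock human resources (HR) interviews. It introduces HURIT (Human Resource Interview Transcripts), a dataset of 3,890 transcripts from real-world HR interviews, and analyzes a range of LLMs including GPT-4 Turbo and GPT-3.5 Turbo. The key finding is that although state-of-the-art models such as GPT-4 Turbo and GPT-3.5 Turbo come close to human experts on scoring, they fall short at identifying errors and at giving specific, actionable advice. The paper therefore recommends a human-in-the-loop approach, with manual checks for inconsistencies and provisions for improving feedback quality, rather than fully automatic deployment.
Link: https://arxiv.org/abs/2504.05683
Authors: Subhankar Maity,Aniket Deroy,Sudeshna Sarkar
Affiliations: Indian Institute of Technology Kharagpur (IIT-Kgp)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 24 figures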
Abstract:This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that the current state-of-the-art pre-trained LLMs are not fully conducive for automatic deployment in an HR interview assessment. Instead, our findings advocate for a human-in-the-loop approach, to incorporate manual checks for inconsistencies and provisions for improving feedback quality as a more suitable strategy.
[NLP-35] Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking
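Quick read: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks. By analyzing how attention weights shift while a model generates benign content, it reveals a new weakness termed Defense Threshold Decay (DTD). The key contribution is a new jailbreak method, Sugar-Coated Poison (SCP), which uses benign input and adversarial reasoning to induce the model to generate a large amount of benign content, weakening its defenses against malicious output and ultimately eliciting malicious content. To mitigate such attacks, the paper introduces a simple but effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model’s generalization capabilities.
Link: https://arxiv.org/abs/2504.05652
Authors: Yu-Hang Wu,Yu-Jie Xiong,Jie-Zhang
Affiliations: Shanghai University of Engineering Science; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: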
Abstract:Large Language Models (LLMs) have become increasingly integral to a wide range of applications. However, they remain vulnerable to jailbreak attacks, in which attackers craft prompts that make the models produce malicious outputs. Analyzing jailbreak methods can help us probe the weaknesses of LLMs and address them. In this paper, we reveal a vulnerability in LLMs, which we term Defense Threshold Decay (DTD), by analyzing the attention weights of the model’s output on input and subsequent output on prior output: as the model generates substantial benign content, its attention weights shift from the input to prior output, making it more susceptible to jailbreak attacks. To demonstrate the exploitability of DTD, we propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which induces the model to generate substantial benign content through benign input and adversarial reasoning, subsequently producing malicious content. To mitigate such attacks, we introduce a simple yet effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model’s generalization capabilities.
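[NLP-36] Leveraging Prompt-Tuning for Bengali Grammatical Error Explanation Using Large Language Models
Quick read: This paper tackles Bengali Grammatical Error Explanation (BGEE): identifying and categorizing grammatical errors in Bengali sentences, generating corrected versions, and providing a natural language explanation for each error. The key contribution is a three-step prompt-tuning method that uses state-of-the-art large language models (LLMs) such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b. With this method, GPT-4, the best-performing LLM, improves on the baseline in automated metrics (F1 score up 5.26%, exact match up 6.95%) and reduces wrong error types by 25.51% and wrong error explanations by 26.27%, although the results still lag behind the human baseline.
Link: https://arxiv.org/abs/2504.05642
Authors: Subhankar Maity,Aniket Deroy
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures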
Abstract:We propose a novel three-step prompt-tuning method for Bengali Grammatical Error Explanation (BGEE) using state-of-the-art large language models (LLMs) such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b. Our approach involves identifying and categorizing grammatical errors in Bengali sentences, generating corrected versions of the sentences, and providing natural language explanations for each identified error. We evaluate the performance of our BGEE system using both automated evaluation metrics and human evaluation conducted by experienced Bengali language experts. Our proposed prompt-tuning approach shows that GPT-4, the best performing LLM, surpasses the baseline model in automated evaluation metrics, with a 5.26% improvement in F1 score and a 6.95% improvement in exact match. Furthermore, compared to the previous baseline, GPT-4 demonstrates a decrease of 25.51% in wrong error type and a decrease of 26.27% in wrong error explanation. However, the results still lag behind the human baseline.
[NLP-37] DBOT: Artificial Intelligence for Systematic Long-Term Investing
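Quick read: This paper addresses the reliance of long-term investing on human judgment and explores whether generative AI can make automated, systematic long-term investing feasible. Its key contribution is DBOT, a system that aims to reason about company valuation like Aswath Damodaran, whose thousands of published valuations and numerous writings provide ready training data for an AI system. DBOT can value any publicly traded company and can be back-tested, which makes its behavior and performance amenable to scientific inquiry. The paper compares DBOT with Damodaran’s own analysis, discusses the research challenges in raising DBOT to his level, and examines the implications of DBOT-like AI agents for the financial industry, especially the role of human analysts in valuation.
Link: https://arxiv.org/abs/2504.05639
Authors: Vasant Dhar,João Sedoc
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Pricing of Securities (q-fin.PR)
Comments: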
Abstract:Long-term investing was previously seen as requiring human judgment. With the advent of generative artificial intelligence (AI) systems, automated systematic long-term investing is now feasible. In this paper, we present DBOT, a system whose goal is to reason about valuation like Aswath Damodaran, who is a unique expert in the investment arena in terms of having published thousands of valuations on companies in addition to his numerous writings on the topic, which provide ready training data for an AI system. DBOT can value any publicly traded company. DBOT can also be back-tested, making its behavior and performance amenable to scientific inquiry. We compare DBOT to its analytic parent, Damodaran, and highlight the research challenges involved in raising its current capability to that of Damodaran’s. Finally, we examine the implications of DBOT-like AI agents for the financial industry, especially how they will impact the role of human analysts in valuation.
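[NLP-38] Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning
Quick read: This paper investigates the relationship between the reasoning ability of large language models (LLMs) and fairness, asking whether improved reasoning can mitigate harmful stereotypical responses, especially those that arise from shallow or flawed reasoning. The key contribution is ReGiFT (Reasoning Guided Fine-Tuning), which extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities. The approach uses only general-purpose reasoning and requires no fairness-specific supervision. Models fine-tuned with ReGiFT are fairer than their non-reasoning counterparts and also outperform advanced reasoning models on fairness benchmarks. The paper further analyzes how the correctness and length of reasoning traces affect fairness and overall performance. The results indicate that strengthening reasoning is an effective, fairness-agnostic strategy for mitigating stereotypical bias caused by reasoning flaws.
Link: https://arxiv.org/abs/2504.05632
Authors: Sanchit Kabra,Akshita Jha,Chandan Reddy
Affiliations: Virginia Tech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages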
Abstract:Recent advances in large-scale generative language models have shown that reasoning capabilities can significantly improve model performance across a variety of tasks. However, the impact of reasoning on a model’s ability to mitigate stereotypical responses remains largely underexplored. In this work, we investigate the crucial relationship between a model’s reasoning ability and fairness, and ask whether improved reasoning capabilities can mitigate harmful stereotypical responses, especially those arising due to shallow or flawed reasoning. We conduct a comprehensive evaluation of multiple open-source LLMs, and find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias on existing fairness benchmarks. Building on this insight, we introduce ReGiFT – Reasoning Guided Fine-Tuning, a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities. We use only general-purpose reasoning and do not require any fairness-specific supervision for bias mitigation. Notably, we see that models fine-tuned using ReGiFT not only improve fairness relative to their non-reasoning counterparts but also outperform advanced reasoning models on fairness benchmarks. We also analyze how variations in the correctness of the reasoning traces and their length influence model fairness and their overall performance. Our findings highlight that enhancing reasoning capabilities is an effective, fairness-agnostic strategy for mitigating stereotypical bias caused by reasoning flaws.
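[NLP-39] Two Intermediate Translations Are Better Than One: Fine-tuning LLMs for Document-level Translation Refinement
Quick read: This paper extends translation refinement from the sentence level to the document level, focusing on document-to-document (Doc2Doc) refinement with large language models (LLMs). Because sentence-to-sentence (Sent2Sent) and Doc2Doc translation address different aspects of the translation process, the key idea is to fine-tune LLMs for refinement using two intermediate translations, combining the strengths of both. The paper also introduces a quality-aware enhanced fine-tuning method that assigns lower weights to easier translations and higher weights to more difficult ones, so that the model focuses on challenging cases. Experimental results confirm the method’s effectiveness.
Link: https://arxiv.org/abs/2504.05614
Authors: Yichen Dong,Xinglin Lyu,Junhui Li,Daimeng Wei,Min Zhang,Shimin Tao,Hao Yang
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, China; Huawei Translation Services Center, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: Under Review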
Abstract:Recent research has shown that large language models (LLMs) can enhance translation quality through self-refinement. In this paper, we build on this idea by extending the refinement from sentence-level to document-level translation, specifically focusing on document-to-document (Doc2Doc) translation refinement. Since sentence-to-sentence (Sent2Sent) and Doc2Doc translation address different aspects of the translation process, we propose fine-tuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc. Additionally, recognizing that the quality of intermediate translations varies, we introduce an enhanced fine-tuning method with quality awareness that assigns lower weights to easier translations and higher weights to more difficult ones, enabling the model to focus on challenging translation cases. Experimental results across ten translation tasks with LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct demonstrate the effectiveness of our approach.
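The quality-aware weighting described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors’ formulation: it assumes each training example comes with a quality score in [0, 1] for its intermediate translation (higher meaning easier), and the names `difficulty_weights` and `weighted_loss` are hypothetical.

```python
def difficulty_weights(quality_scores):
    """Turn quality scores in [0, 1] into loss weights with mean 1.

    Assumption: weight is proportional to (1 - quality), so easier
    (higher-quality) examples get lower weights.
    """
    raw = [1.0 - q for q in quality_scores]
    total = sum(raw)
    if total == 0:
        # Every example is rated perfect: fall back to uniform weights.
        return [1.0] * len(raw)
    n = len(raw)
    return [r * n / total for r in raw]


def weighted_loss(losses, quality_scores):
    """Average of per-example losses, reweighted by difficulty."""
    weights = difficulty_weights(quality_scores)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

Normalizing the weights to mean 1 keeps the overall loss scale comparable to unweighted fine-tuning, so the learning rate would not need retuning under this scheme.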
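[NLP-40] FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction
Quick read: This paper addresses the difficulty that extractive reading comprehension systems have, on long contexts, in answering answerable questions accurately while reliably recognizing unanswerable ones. Despite significant progress by large language models (LLMs) in reading comprehension, the problem remains critical, particularly as supported context lengths keep growing. The paper proposes a data augmentation method built on a multi-agent collaborative framework. Its key feature is that it autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions, which cuts the high cost of manual annotation and helps models reason about unanswerable questions instead of producing plausible but incorrect answers. Using this method, the authors build the FactGuard-Bench dataset and show its value for training and optimizing LLMs.
Link: https://arxiv.org/abs/2504.05607
Authors: Qian-Wen Zhang,Fang Li,Jie Wang,Lingfeng Qiao,Yifei Yu,Di Yin,Xing Sun
Affiliations: Tencent YouTu Lab (Beijing, China)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: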
Abstract:Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard-Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model’s ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi-agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.
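[NLP-41] ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs
Quick read: This paper addresses the new security problems that arise when Chain-of-Thought (CoT) is used to strengthen complex reasoning in large language models (LLMs). It presents ShadowCoT, a backdoor attack framework that targets the internal reasoning mechanism of LLMs. The key idea is to condition on the model’s internal reasoning states so that ShadowCoT can recognize and selectively disrupt key reasoning steps, mounting a self-reflective cognitive attack within the target model. Its lightweight multi-stage injection pipeline selectively rewires attention pathways and perturbs intermediate representations while updating only 0.15% of the parameters. ShadowCoT also combines reinforcement learning with reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that evade advanced defenses. Across diverse reasoning benchmarks, it achieves a high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance, revealing an emerging class of cognition-level threats and the urgent need for defenses beyond shallow surface-level consistency.
Link: https://arxiv.org/abs/2504.05605
Authors: Gejian Zhao,Hanzhou Wu,Xinpeng Zhang,Athanasios V. Vasilakos
Affiliations: School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China; College of Computer Science and Information Technology, IAU, Saudi Arabia, and the Center for AI Research (CAIR), University of Agder (UiA), Grimstad, Norway
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Zhao et al., 16 pages, 2025, uploaded by Hanzhou Wu, Shanghai University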
Abstract:Chain-of-Thought (CoT) enhances an LLM’s ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model’s cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.
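[NLP-42] On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing Sarcasm and Emojis
Quick read: This paper addresses the drop in sentiment analysis accuracy that large language models (LLMs) suffer because of data quality, especially textual nuances such as emojis and sarcasm in social media data. The key solutions are: building a human-labeled sarcasm dataset to assess LLMs in different sarcasm contexts; applying external interventions such as sarcasm removal to improve the accuracy of models fine-tuned on topic-specific data; using adversarial text augmentation to generate synthetic text variants that make models more robust and accurate on sarcastic text; and paraphrasing tweets with fragmented language, which turns around 40% of low-confidence labels into high-confidence ones and raises sentiment analysis accuracy by 6%. Together, these results indicate that diverse, high-quality data and better handling of complex text are central to the problem.
Link: https://arxiv.org/abs/2504.05603
Authors: Naman Bhargava,Mohammed I. Radaideh,O Hwang Kwon,Aditi Verma,Majdi I. Radaideh
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 10 Tables, 5 figures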
Abstract:Large Language Models (LLMs) have demonstrated impressive performance across various tasks, including sentiment analysis. However, data quality, particularly when the data is sourced from social media, can significantly impact their accuracy. This research explores how textual nuances, including emojis and sarcasm, affect sentiment analysis, with a particular focus on improving data quality through text paraphrasing techniques. To address the lack of labeled sarcasm data, the authors created a human-labeled dataset of 5929 tweets that enabled the assessment of LLMs in various sarcasm contexts. The results show that when topic-specific datasets, such as those related to nuclear power, are used to fine-tune LLMs, these models cannot accurately identify sentiment in the presence of sarcasm because the text is less diverse, requiring external interventions like sarcasm removal to boost model accuracy. Sarcasm removal led to up to 21% improvement in sentiment accuracy, as LLMs trained on nuclear power-related content struggled with sarcastic tweets, achieving only 30% accuracy. In contrast, LLMs trained on general tweet datasets, covering a broader range of topics, showed considerable improvements in predicting sentiment for sarcastic tweets (60% accuracy), indicating that incorporating general text data can enhance sarcasm detection. The study also utilized adversarial text augmentation, showing that creating synthetic text variants by making minor changes significantly increased model robustness and accuracy for sarcastic tweets (approximately 85%). Additionally, text paraphrasing of tweets with fragmented language transformed around 40% of the tweets with low-confidence labels into high-confidence ones, improving LLMs’ sentiment analysis accuracy by 6%.
[NLP-43] Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
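Quick read: This paper addresses how to extend a large language model (LLM) to multimodal settings, particularly the visual modality, efficiently while preserving reasoning performance and parameter efficiency. The key contribution is an efficient multimodal transfer method in which a lightweight visual projector adapts the language model to the vision encoder without retraining either one. A hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) significantly improves the efficiency of cross-modal integration, and an adaptive-length Chain-of-Thought distillation method dynamically optimizes reasoning chain length, improving inference efficiency and avoiding the redundancy of overly long reasoning. Together these form the core strengths of the Skywork R1V model.
Link: https://arxiv.org/abs/2504.05599
Authors: Yi Peng,Chris,Xiaokun Wang,Yichen Wei,Jiangbo Pei,Weijie Qiu,Ai Jian,Yunzhuo Hao,Jiachun Pan,Tianyidan Xie,Li Ge,Rongxian Zhuang,Xuchen Song,Yang Liu,Yahui Zhou
Affiliations: Skywork AI; Kunlun Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: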
Abstract:We introduce Skywork R1V, a multimodal reasoning model that extends an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
[NLP-44] DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
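Quick read: This paper addresses the performance limits that Speculative Decoding (SD) methods face when the exit layer and speculation length are chosen statically across tasks and sequence contexts. The key contribution is DEL (Dynamic Exit Layer), a plug-and-play method that dynamically tracks the token acceptance rate at each layer of a large language model (LLM) and adaptively selects the exit layer and speculation length. This overcomes the limits of static hyperparameter tuning, achieving overall speedups of 2.16× to 2.50× over standard auto-regressive decoding and improving on existing SD methods by up to 0.27× across multiple models and downstream tasks.
Link: https://arxiv.org/abs/2504.05598
Authors: Hossein Entezari Zarch,Lei Gao,Chaoyi Jiang,Murali Annavaram
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: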
Abstract:Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate that would result if tokens were drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of 2.16× to 2.50× over vanilla auto-regressive decoding and improves upon the state-of-the-art SD methods by up to 0.27×.
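To make the tradeoff between exit layer and speculation length concrete, here is a minimal sketch of one way such a choice could be scored. It is not DEL’s actual heuristic. It assumes an independent per-token acceptance rate `alpha` per exit layer, uses the standard speculative-decoding estimate of `(1 - alpha**(k + 1)) / (1 - alpha)` tokens per round for `k` drafted tokens, and adopts a simplified, hypothetical cost model in which each drafted token costs `layer / num_layers` of a full forward pass and verification costs one full pass.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced in one round with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def pick_exit_layer_and_length(acceptance, num_layers, max_k):
    """Pick the (exit_layer, k) pair with the best tokens-per-cost score.

    acceptance: dict mapping a candidate exit layer to its estimated
    acceptance rate (how it is estimated is outside this sketch).
    """
    best = None
    for layer, alpha in acceptance.items():
        for k in range(1, max_k + 1):
            # Hypothetical cost: k partial passes for drafting plus one
            # full pass for verification.
            cost = k * layer / num_layers + 1.0
            score = expected_tokens(alpha, k) / cost
            if best is None or score > best[0]:
                best = (score, layer, k)
    return best[1], best[2]
```

Under this toy model, a low acceptance rate favors short drafts from a shallow layer, while a high acceptance rate justifies longer drafts, which is the qualitative behavior that motivates choosing both values dynamically.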
[NLP-45] Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
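Quick read: This paper addresses the lack of domain-specific, new, or niche information in pre-trained large language models (LLMs). Conventional Continual Pre-Training (CPT) suffers from catastrophic forgetting and from inefficiency in low-data regimes. The paper proposes Knowledge-Instruct, which injects knowledge from limited corpora efficiently through pure instruction-tuning. The key idea is to generate information-dense synthetic instruction data, which integrates new knowledge while preserving general reasoning and instruction-following abilities, and which scales because the synthetic data can come from relatively small language models. The method also improves contextual understanding, including complex multi-hop reasoning, which facilitates integration with retrieval systems. Its effectiveness is validated on diverse benchmarks, including Companies, a new dataset the authors release to measure knowledge injection capabilities.
Link: https://arxiv.org/abs/2504.05571
Authors: Oded Ovadia,Meni Brief,Rachel Lemberg,Eitam Sheetrit
Affiliations: Microsoft Industry AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: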
Abstract:While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
[NLP-46] Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
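Quick read: This paper asks whether large language models (LLMs) can replicate the adaptivity of intelligent tutoring systems (ITS), in which student knowledge and pedagogical strategies are explicitly modeled. The key contribution is a prompt variation framework: key context components (such as student errors and knowledge components) are systematically removed from prompts to create variations of each scenario, and the instructional moves generated by three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) are evaluated for adaptivity and pedagogical soundness. The approach quantifies how omitting each context feature affects LLM outputs and uses a validated tutor-training classifier to assess response quality, giving a systematic analysis of the gap between LLMs and ITS.
Link: https://arxiv.org/abs/2504.05570
Authors: Conrad Borchers,Tianze Shou
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Accepted as full paper to the 26th International Conference on Artificial Intelligence in Education (AIED 2025)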
Abstract:Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS)–where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess LLM-generated instructional moves’ adaptivity and pedagogical soundness across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) from prompts to create variations of each scenario. Three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) generate 1,350 instructional moves. We use text embeddings and randomization tests to measure how the omission of each context feature impacts the LLMs’ outputs (adaptivity) and a validated tutor-training classifier to evaluate response quality (pedagogical soundness). Surprisingly, even the best-performing model only marginally mimics the adaptivity of ITS. Specifically, Llama3-70B demonstrates statistically significant adaptivity to student errors. Although Llama3-8B’s recommendations receive higher pedagogical soundness scores than the other models, it struggles with instruction-following behaviors, including output formatting. By contrast, GPT-4o reliably adheres to instructions but tends to provide overly direct feedback that diverges from effective tutoring, prompting learners with open-ended questions to gauge knowledge. Given these results, we discuss how current LLM-based tutoring is unlikely to produce learning benefits rivaling known-to-be-effective ITS tutoring. Through our open-source benchmarking code, we contribute a reproducible method for evaluating LLMs’ instructional adaptivity and fidelity.
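The prompt variation idea (dropping one context component at a time) can be sketched in a few lines. This is an illustrative sketch only: the prompt template, the component names, and the functions `render_prompt` and `prompt_variants` are assumptions, not taken from the authors’ benchmarking code.

```python
def render_prompt(instruction: str, context: dict[str, str]) -> str:
    """Join the instruction with one 'name: value' line per component."""
    lines = [instruction]
    for name, value in context.items():
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


def prompt_variants(instruction: str, context: dict[str, str]) -> dict[str, str]:
    """Return the full prompt plus one variant per omitted component."""
    variants = {"full": render_prompt(instruction, context)}
    for omitted in context:
        kept = {k: v for k, v in context.items() if k != omitted}
        variants[f"without_{omitted}"] = render_prompt(instruction, kept)
    return variants
```

Comparing model outputs for `full` against each `without_...` variant is what would let an evaluation attribute a change in the response to one specific piece of context.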
[NLP-47] COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
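[Quick Read]: This paper addresses the small scale, narrow domain coverage, and lack of rigorous validation of existing Chinese preference datasets, and aims to reduce the heavy reliance on human annotation. To that end, it designs an LLM-based annotation pipeline for Chinese preference data that requires no human intervention. The key is to take crawled, high-quality Chinese queries and use 15 mainstream LLMs to generate and score chosen-rejected response pairs, yielding COIG-P, a large-scale, high-quality dataset of 1,009k Chinese preference pairs spanning six domains: Chat, Code, Math, Logic, Novel, and Role. To further cut the cost of LLM-based scoring, the authors train an 8B-parameter Chinese Reward Model (CRM) and construct a Chinese Reward Benchmark (CRBench). Evaluations show that COIG-P significantly outperforms other Chinese preference datasets and yields performance gains of 2% to 12% across several model series, while the CRM is comparable to GPT-4o at identifying low-quality samples and remains efficient and cost-effective.
Link: https://arxiv.org/abs/2504.05535
Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: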
Abstract:Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at this https URL.
[NLP-48] Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents
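[Quick Read]: This paper tackles knowledge transfer in industrial environments, in particular shortcomings in worker training efficiency, remote assistance, and operational guidance. The proposed solution integrates Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) into Extended Reality (XR) technologies, embedding domain-specific industrial knowledge through a natural language interface to deliver hands-free, context-aware expert guidance. At its core are an LLM Chat Engine with dynamic tool orchestration and an XR application that supports voice-driven interaction. Performance evaluation indicates that semantic chunking, balanced embedding models, and efficient vector stores are the key strategies for efficient retrieval of industrial knowledge. Early implementations in robotic assembly, smart infrastructure maintenance, and aerospace component servicing demonstrate the system's potential and its fit with Industry 5.0's human-centric and resilient approach to industrial development.
Link: https://arxiv.org/abs/2504.05527
Authors: Despina Tomkou, George Fatouros, Andreas Andreou, Georgios Makridis, Fotis Liarokapis, Dimitrios Dardanis, Athanasios Kiourtis, John Soldatos, Dimosthenis Kyriazis
Affiliations: Innov-Acts Ltd.; CYENS Centre of Excellence; University of Piraeus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures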
Abstract:This paper introduces a novel integration of Retrieval-Augmented Generation (RAG) enhanced Large Language Models (LLMs) with Extended Reality (XR) technologies to address knowledge transfer challenges in industrial environments. The proposed system embeds domain-specific industrial knowledge into XR environments through a natural language interface, enabling hands-free, context-aware expert guidance for workers. We present the architecture of the proposed system consisting of an LLM Chat Engine with dynamic tool orchestration and an XR application featuring voice-driven interaction. Performance evaluation of various chunking strategies, embedding models, and vector databases reveals that semantic chunking, balanced embedding models, and efficient vector stores deliver optimal performance for industrial knowledge retrieval. The system’s potential is demonstrated through early implementation in multiple industrial use cases, including robotic assembly, smart infrastructure maintenance, and aerospace component servicing. Results indicate potential for enhancing training efficiency, remote assistance capabilities, and operational guidance in alignment with Industry 5.0’s human-centric and resilient approach to industrial development.
[NLP-49] Pretraining Language Models for Diachronic Linguistic Change Discovery
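[Quick Read]: This paper addresses the use of Large Language Models (LLMs) in humanistic disciplines such as historical linguistics, specifically how to restrict a model's inference to a particular domain without incurring heavy data and compute costs. The key contribution is to show that efficient pretraining techniques can produce useful models over corpora that are too large for easy manual inspection yet too small for typical LLM approaches. The authors develop a novel date-attribution pipeline to build a temporally segmented dataset and train models in two ways: efficient pretraining, and parameter-efficient finetuning of Llama3-8B. They find that the pretrained models train faster and better respect the historical divisions of the corpus. Emphasizing speed and precision over ahistorical comprehensiveness opens new routes to hypothesis discovery and testing. In short, the solution pairs efficient pretraining with a tailored data-processing pipeline to build domain-specific language models.
Link: https://arxiv.org/abs/2504.05523
Authors: Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshsem, Craig Messner
Affiliations: University of Hamburg, Germany; Center for Digital Humanities, Johns Hopkins University, USA; IBM Research, MIT, USA
Categories: Computation and Language (cs.CL)
Comments: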
Abstract:Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining – typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for “typical” LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments: one efficiently pretrained, and one obtained by parameter-efficient finetuning of Llama3-8B. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over a-historical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.
[NLP-50] Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
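[Quick Read]: This paper targets the low sample and compute efficiency of conventional Reinforcement Finetuning (RFT) when used to improve the mathematical reasoning of Large Language Models (LLMs). The key solution is AdaRFT (Adaptive Curriculum Reinforcement Finetuning), which uses adaptive curriculum learning to significantly improve both the efficiency and the final accuracy of RFT. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, so that the model keeps training on tasks that are challenging but solvable and avoids wasting computation on problems that are too easy or too hard. This adaptive sampling strategy needs no change to the reward function or model architecture, only a lightweight extension to standard RFT algorithms such as Proximal Policy Optimization (PPO). Its effectiveness is validated across multiple data distributions and model sizes, where it reduces the number of training steps by up to 2x and improves reasoning performance.
Link: https://arxiv.org/abs/2504.05520
Authors: Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao
Affiliations: University of Southern California; University of Maryland, College Park
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages, 4 figures, 2 tables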
Abstract:Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model’s recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
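To make the idea of difficulty tracking concrete, here is a minimal sketch of an adaptive curriculum loop. The tanh-based update rule, the parameter names (`goal_reward`, `step_size`, `sensitivity`), the 0-100 difficulty scale, and the nearest-difficulty sampler are all assumptions chosen for illustration; the abstract does not specify AdaRFT's actual update.

```python
import math


def update_target_difficulty(target, avg_reward, goal_reward=0.5,
                             step_size=50.0, sensitivity=2.0,
                             lo=0.0, hi=100.0):
    """Raise the target difficulty when recent reward is above the goal,
    lower it when reward is below (assumed update rule, not the paper's)."""
    shift = step_size * math.tanh(sensitivity * (avg_reward - goal_reward))
    return min(hi, max(lo, target + shift))


def sample_batch(problems, target, batch_size):
    """Pick the problems whose difficulty is closest to the current target.

    Each problem is a dict with a hypothetical 'difficulty' field.
    """
    ranked = sorted(problems, key=lambda p: abs(p["difficulty"] - target))
    return ranked[:batch_size]
```

In a training loop, one would sample a batch near the target, run the usual RFT step, average the batch rewards, and then call `update_target_difficulty` before the next step.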
[NLP-51] Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
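[Quick Read]: This paper examines how well the code reasoning abilities of Large Language Models (LLMs) generalize across different kinds of programs. Its key contribution is a systematic way to obtain in-distribution and out-of-distribution programs with different characteristics (code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs), together with an experimental methodology for comparing model performance on them. An evaluation of 10 state-of-the-art models from the past year shows that earlier models exhibit behavior consistent with pattern matching, whereas the latest models generalize strongly on code reasoning.
Link: https://arxiv.org/abs/2504.05518
Authors: Rem Yang, Julian Dai, Nikos Vasilakis, Martin Rinard
Affiliations: MIT CSAIL; Brown University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: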
Abstract:We assess how the code reasoning abilities of large language models (LLMs) generalize to different kinds of programs. We present techniques for obtaining in- and out-of-distribution programs with different characteristics: code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs. We also present an experimental methodology for evaluating LLM generalization by comparing their performance on these programs. We perform an extensive evaluation across 10 state-of-the-art models from the past year, obtaining insights into their generalization capabilities over time and across different classes of programs. Our results highlight that while earlier models exhibit behavior consistent with pattern matching, the latest models exhibit strong generalization abilities on code reasoning.
[NLP-52] ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
[Quick Read]: This paper addresses the lack of real-world diversity and the performance saturation of existing Chart Question Answering (CQA) benchmarks. It introduces ChartQAPro, a new benchmark of 1,341 charts from 157 diverse sources, covering various chart types and 1,948 questions of different kinds (multiple-choice, conversational, hypothetical, and unanswerable) to better reflect real-world challenges. The key is to build a more complex and diverse dataset for evaluating the chart understanding and reasoning of Large Vision-Language Models (LVLMs), exposing the limitations of current models and thereby driving further progress in the field.
Link: https://arxiv.org/abs/2504.05506
Authors: Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty
Affiliations: York University, Canada; Dialpad Inc., Canada; RBC, Canada; MILA - Quebec AI Institute, Canada; Qatar Computing Research Institute (QCRI), Qatar; Nanyang Technological University, Singapore; Salesforce Research, USA
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types, including infographics and dashboards, and featuring 1,948 questions in various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at this https URL.
[NLP-53] A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models
[Quick Read]: This paper addresses the challenges facing hypothesis generation in scientific discovery, notably information overload and disciplinary fragmentation. It offers a comprehensive survey of hypothesis generation with Large Language Models (LLMs): it reviews existing methods, from simple prompting techniques to more complex frameworks, and proposes a taxonomy to categorize them; analyzes techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; gives an overview of evaluation strategies; and discusses key challenges and future directions, including multimodal integration and human-AI collaboration. Its main value lies in systematically consolidating how LLMs are applied to hypothesis generation, as a reference for researchers.
Link: https://arxiv.org/abs/2504.05496
Authors: Atilla Kaan Alkan, Shashwat Sourav, Maja Jablonska, Simone Astarita, Rishabh Chakrabarty, Nikhil Garuda, Pranav Khetarpal, Maciej Pióro, Dimitrios Tanoglidis, Kartheik G. Iyer, Mugdha S. Polimera, Michael J. Smith, Tirthankar Ghosal, Marc Huertas-Company, Sandor Kruk, Kevin Schawinski, Ioana Ciucă
Affiliations: Center for Astrophysics, Harvard & Smithsonian; Washington University in St. Louis; Australian National University; European Commission, Joint Research Centre (JRC); Intelligent Internet Inc.; University of Arizona; Indian Institute of Technology, Delhi; Institute of Fundamental Technological Research, Polish Academy of Sciences; Walgreens Boots Alliance AI Lab; Columbia University; UniverseTBD; Oak Ridge National Laboratory; Instituto de Astrofísica de Canarias; European Space Agency; Modulos AG; Stanford University
Categories: Computation and Language (cs.CL)
Comments: 9 pages (+2 pages of references), 2 figures
Abstract:Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in their potential to enhance and automate this process. This paper presents a comprehensive survey of hypothesis generation with LLMs by (i) reviewing existing methods, from simple prompting techniques to more complex frameworks, and proposing a taxonomy that categorizes these approaches; (ii) analyzing techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; (iii) providing an overview of evaluation strategies; and (iv) discussing key challenges and future directions, including multimodal integration and human-AI collaboration. Our survey aims to serve as a reference for researchers exploring LLMs for hypothesis generation.
[NLP-54] GraphRAFT: Retrieval Augmented Fine-Tuning for Knowledge Graphs on Graph Databases
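[Quick Read]: This paper addresses the tendency of Large Language Models (LLMs) to hallucinate when asked about private data, focusing on answering questions about entities multiple hops apart in structured Knowledge Graphs (KGs). Most existing GraphRAG methods either overlook the retrieval step or rely on ad hoc retrieval processes that are abstract or inefficient, which prevents their adoption when KGs are stored in graph databases that support graph query languages. The paper proposes GraphRAFT, a framework that finetunes LLMs to generate provably correct Cypher queries, retrieve high-quality subgraph contexts, and produce accurate answers. The key is a solution that can be used off the shelf on KGs stored in native graph databases, while being sample-efficient and scaling with the amount of training data. Experiments show that GraphRAFT significantly outperforms all state-of-the-art models on all four standard metrics.
Link: https://arxiv.org/abs/2504.05478
Authors: Alfred Clemedtson, Borun Shi
Affiliations: Neo4j
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: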
Abstract:Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM’s context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&As on large text-attributed KGs.
[NLP-55] PreSumm: Predicting Summarization Performance Without Summarizing
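[Quick Read]: This paper investigates why automatic summarization systems perform unevenly across documents, exploring how document characteristics affect summary quality. It poses two research questions: (1) do documents exhibit consistent summarization quality across different systems, and (2) can a document's summarization performance be predicted from the source document alone, without generating a summary? The paper answers both affirmatively and introduces PreSumm, a task in which a system predicts summarization performance based solely on the source document. The analysis shows that documents with low PreSumm scores often share properties such as coherence issues, complex content, or the lack of a clear main theme. The paper also demonstrates PreSumm's practical value for improving hybrid summarization workflows and enhancing dataset quality. Overall, the findings highlight the role of document properties in summarization performance, exposing the limitations of current systems and pointing to future improvements.
Link: https://arxiv.org/abs/2504.05420
Authors: Steven Koniaev, Ori Ernst, Jackie Chi Kit Cheung
Affiliations: Mila – Quebec Artificial Intelligence Institute; McGill University; Canada CIFAR AI Chair, Mila
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: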
Abstract:Despite recent advancements in automatic summarization, state-of-the-art models do not summarize all documents equally well, raising the question: why? While prior research has extensively analyzed summarization models, little attention has been given to the role of document characteristics in influencing summarization performance. In this work, we explore two key research questions. First, do documents exhibit consistent summarization quality across multiple systems? If so, can we predict a document’s summarization performance without generating a summary? We answer both questions affirmatively and introduce PreSumm, a novel task in which a system predicts summarization performance based solely on the source document. Our analysis sheds light on common properties of documents with low PreSumm scores, revealing that they often suffer from coherence issues, complex content, or a lack of a clear main theme. In addition, we demonstrate PreSumm’s practical utility in two key applications: improving hybrid summarization workflows by identifying documents that require manual summarization and enhancing dataset quality by filtering outliers and noisy documents. Overall, our findings highlight the critical role of document properties in summarization performance and offer insights into the limitations of current systems that could serve as the basis for future improvements.
[NLP-56] Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification
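[Quick Read]: This paper addresses overthinking in reasoning models on math and logical reasoning tasks, where models carry out unnecessary reasoning steps even after reaching the correct answer. Although these models achieve strong performance by searching during reasoning, they fail to exploit the information about intermediate-answer correctness that is encoded in their hidden states. The key solution is a probe over the model's hidden states that verifies intermediate answers with high accuracy and produces highly calibrated scores. The study also finds that hidden states encode the correctness of future answers, enabling prediction of correctness before an intermediate answer is fully formulated. Used as a verifier to decide whether to exit reasoning at an intermediate answer, the probe reduces the number of inference tokens by 24% without compromising performance. This indicates that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential for efficiency gains.
Link: https://arxiv.org/abs/2504.05419
Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He
Affiliations: New York University; NYU Shanghai
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: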
Abstract:Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model’s hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models’ hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
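The probe-as-verifier idea can be pictured with a toy early-exit rule. The sketch below assumes a plain logistic (linear) probe and a fixed confidence threshold; the probe architecture, the threshold value, and the function names are illustrative assumptions rather than details given in the abstract.

```python
import math


def probe_confidence(hidden_state, weights, bias):
    """Score one hidden state with a logistic probe (assumed probe form).

    Returns an estimated probability that the intermediate answer is correct.
    """
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def first_exit_index(hidden_states, weights, bias, threshold=0.85):
    """Return the index of the first intermediate answer whose probe score
    reaches the threshold, or None if reasoning should run to the end."""
    for i, state in enumerate(hidden_states):
        if probe_confidence(state, weights, bias) >= threshold:
            return i
    return None
```

Here `hidden_states` stands for one vector per intermediate answer; stopping at the returned index is what would save inference tokens.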
[NLP-57] Less but Better: Parameter-Efficient Fine-Tuning of Large Language Models for Personality Detection
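[Quick Read]: This paper addresses the rising computational cost and fine-tuning complexity that come with ever-larger language models, which make it hard to justify the effort and to predict outcomes reliably. It introduces PersLLM, a novel parameter-efficient fine-tuning framework. The key idea is to have a large language model extract high-dimensional representations from raw data and store them in a dynamic memory layer, while a replaceable output network updates the downstream layers, allowing flexible adaptation to various personality detection scenarios. This removes the need for repeated expensive computation by the language model, and the lightweight output network serves as a proxy for evaluating the framework's overall effectiveness, improving the predictability of results. Experiments show that PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.
Link: https://arxiv.org/abs/2504.05411
Authors: Lingzhi Shen, Yunfei Long, Xiaohao Cai, Guanming Chen, Imran Razzak, Shoaib Jameel
Affiliations: University of Southampton; Queen Mary University of London; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: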
Abstract:Personality detection automatically identifies an individual’s personality from various data sources, such as social media texts. However, as the parameter scale of language models continues to grow, the computational cost becomes increasingly difficult to manage. Fine-tuning also grows more complex, making it harder to justify the effort and reliably predict outcomes. We introduce a novel parameter-efficient fine-tuning framework, PersLLM, to address these challenges. In PersLLM, a large language model (LLM) extracts high-dimensional representations from raw data and stores them in a dynamic memory layer. PersLLM then updates the downstream layers with a replaceable output network, enabling flexible adaptation to various personality detection scenarios. By storing the features in the memory layer, we eliminate the need for repeated complex computations by the LLM. Meanwhile, the lightweight output network serves as a proxy for evaluating the overall effectiveness of the framework, improving the predictability of results. Experimental results on key benchmark datasets like Kaggle and Pandora show that PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.
[NLP-58] Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
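[Quick Read]: This paper addresses two key problems in generating from language models under constraints: (i) evaluating the constraint on every token in the vocabulary can be prohibitively expensive (vocabularies often exceed 100,000 tokens); and (ii) locally constrained decoding (LCD) can distort the global distribution over strings, because it samples tokens using only local information, even when they lead down dead-end paths. The paper proposes a new algorithm that addresses both. First, an adaptive rejection sampling algorithm avoids evaluating the constraint on the full vocabulary at each generation step, typically requiring orders of magnitude fewer constraint evaluations. Second, the algorithm is extended to produce low-variance, unbiased estimates of importance weights at very small additional cost; these estimates can be used within previously proposed sequential Monte Carlo algorithms to correct the myopic behavior of local constraint enforcement. Extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains shows that the approach outperforms state-of-the-art baselines, supports a broader class of constraints, and improves both runtime and performance.
Link: https://arxiv.org/abs/2504.05410
Authors: Benjamin Lipkin, Benjamin LeBrun, Jacob Hoover Vigly, João Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Timothy J. O’Donnell, Alexander K. Lew, Tim Vieira
Affiliations: MIT; ETH Zürich; Mila; Johns Hopkins; Yale; CHI FRO
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: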
Abstract:The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive – LM vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost – estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method’s runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
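The contrast with full-vocabulary masking can be seen in a small sketch: instead of checking the constraint on every token, draw a token from the model distribution, check only that token, and on rejection remove it and redraw from the rest. This covers only the basic rejection step for a single token; the importance-weight estimator and the sequential Monte Carlo correction described in the abstract are not implemented here, and the function and argument names are hypothetical.

```python
import random


def sample_constrained_token(probs, is_valid, rng):
    """Draw one token from `probs` restricted to tokens accepted by `is_valid`.

    probs: dict mapping token -> positive probability (assumed input format).
    Returns (token, n_checks), where n_checks counts constraint evaluations.
    """
    remaining = dict(probs)
    n_checks = 0
    while remaining:
        tokens = list(remaining)
        weights = [remaining[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        n_checks += 1
        if is_valid(token):
            return token, n_checks
        # Rejected tokens are excluded, so each token is checked at most once.
        del remaining[token]
    raise ValueError("no token satisfies the constraint")
```

Because rejected tokens are all invalid, the accepted token follows the model distribution renormalized over valid tokens, which is the same distribution token masking would give. When most of the probability mass already satisfies the constraint, the loop usually stops after one or two checks.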
[NLP-59] Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression
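[Quick Read]: This paper addresses the memory footprint and computational cost of Large Language Models (LLMs) while keeping accuracy largely intact. The key solution is Thanos, a novel weight-pruning algorithm that uses a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats such as n:m sparsity that are optimized for hardware acceleration. Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning, offering an efficient and adaptable compression approach for deploying large models in resource-constrained environments.
Link: https://arxiv.org/abs/2504.05346
Authors: Ivan Ilin, Peter Richtarik
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Comments: 8 pages, 3 Figures, 3 Tables, 2 Algorithms, paper comes with Appendix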
Abstract:This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as n:m sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
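For readers unfamiliar with n:m sparsity, the following sketch shows the pattern itself using plain magnitude pruning: in every group of m consecutive weights, keep n and zero the rest. This is not the Thanos algorithm (its block-wise adaptive masks are not described in enough detail here), and the convention that n counts the kept weights is an assumption.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m; zero the rest.

    Illustrates the n:m sparsity layout with simple magnitude pruning.
    """
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned
```

With the defaults this produces a 2:4 pattern, the kind of regular layout that sparse hardware kernels can exploit.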
[NLP-60] Unequal Opportunities: Examining the Bias in Geographical Recommendations by Large Language Models
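[Quick Read]: This paper examines representation biases in the recommendations that Large Language Models (LLMs) give for U.S. cities and towns across three domains: relocation, tourism, and starting a business. Such biases may stem from the statistical training methods of LLMs and could influence real-world decisions and opportunities. The study focuses on the consistency of LLM responses and their tendency to over-represent or under-represent specific locations, showing how consistent biases in recommendations could reinforce existing economic disparities (a "rich-get-richer" effect).
Link: https://arxiv.org/abs/2504.05325
Authors: Shiran Dudy, Thulasi Tholeti, Resmi Ramachandranpillai, Muhammad Ali, Toby Jia-Jun Li, Ricardo Baeza-Yates
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: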
Abstract:Recent advancements in Large Language Models (LLMs) have made them a popular information-seeking tool among end users. However, the statistical training methods for LLMs have raised concerns about their representation of under-represented topics, potentially leading to biases that could influence real-world decisions and opportunities. These biases could have significant economic, social, and cultural impacts as LLMs become more prevalent, whether through direct interactions–such as when users engage with chatbots or automated assistants–or through their integration into third-party applications (as agents), where the models influence decision-making processes and functionalities behind the scenes. Our study examines the biases present in LLMs’ recommendations of U.S. cities and towns across three domains: relocation, tourism, and starting a business. We explore two key research questions: (i) how similar LLMs’ responses are, and (ii) how this similarity might favor areas with certain characteristics over others, introducing biases. We focus on the consistency of LLMs’ responses and their tendency to over-represent or under-represent specific locations. Our findings point to consistent demographic biases in these recommendations, which could perpetuate a “rich-get-richer” effect that widens existing economic disparities.
[NLP-61] Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis
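[Quick Read]: Large Language Models (LLMs) excel at language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. This paper studies how Retrieval Augmented Generation (RAG), which grounds LLM responses in external knowledge, can reduce them.
The key to the solution is a hybrid retrieval module. The study compares three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and the proposed hybrid retrieval, which incorporates query expansion and combines sparse and dense results through a dynamically weighted Reciprocal Rank Fusion (RRF) score. Experiments show that the hybrid approach achieves better relevance scores than either sparse or dense retrieval alone, lowers the hallucination rate of LLM-generated answers, and improves their accuracy and reliability. These findings underline the importance of advanced retrieval techniques for mitigating hallucinations and improving response accuracy.
Link: https://arxiv.org/abs/2504.05324
Authors: Chandana Sree Mala, Gizem Gezici, Fosca Giannotti
Affiliations: Department of Computer Science, University of Pisa; Department of Computer Science, Scuola Normale Superiore
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: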
Abstract:Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and a proposed hybrid retrieval module. The hybrid module incorporates query expansion and combines the results of sparse and dense retrievers through a dynamically weighted Reciprocal Rank Fusion score. Using the HaluBench dataset, a benchmark for hallucinations in question answering tasks, we assess retrieval performance with metrics such as mean average precision and normalised discounted cumulative gain, focusing on the relevance of the top three retrieved documents. Results show that the hybrid retriever achieves better relevance scores, outperforming both sparse and dense retrievers. Further evaluation of LLM-generated answers against ground truth using metrics such as accuracy, hallucination rate, and rejection rate reveals that the hybrid retriever achieves the highest accuracy on fails, the lowest hallucination rate, and the lowest rejection rate. These findings highlight the hybrid retriever’s ability to enhance retrieval relevance, reduce hallucination rates, and improve LLM reliability, emphasising the importance of advanced retrieval techniques in mitigating hallucinations and improving response accuracy.
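As a rough picture of how sparse and dense rankings can be merged, the sketch below implements a weighted form of Reciprocal Rank Fusion, where each document scores the sum of weight / (k + rank) over the rankings that contain it. The smoothing constant k = 60 is the value commonly used for RRF; the fixed per-retriever weights stand in for the paper's dynamic weighting, whose exact scheme is not given in the abstract, and the function name is hypothetical.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    rankings: list of ranked lists, best document first.
    weights: one weight per ranking (fixed here; an assumption).
    Returns document ids sorted by fused score, best first.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Break ties by document id so the output order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, `weighted_rrf([bm25_ranking, dense_ranking], [0.5, 0.5])` treats both retrievers equally, while shifting weight toward one list lets its top results dominate the fused order.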
[NLP-62] Multi-Perspective Attention Mechanism for Bias-Aware Sequential Recommendation
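Quick read: This paper addresses the limitations of recommender systems in capturing how user behavior evolves, in particular their neglect of the amplification of prevalent biases, which leaves recommendations susceptible to the Matthew Effect and limits the system's ability to track shifts in user preferences. The authors propose Multi-Perspective Attention Bias Sequential Recommendation (MABSRec), a recommender built on sequential information and an attention mechanism. The key steps are to reconstruct each user sequence into three types of short sequences and weight items with a graph neural network, and then to apply an adaptive multi-bias perspective attention module to improve recommendation accuracy. According to the reported experiments, MABSRec shows significant advantages on all evaluation metrics.
Link: https://arxiv.org/abs/2504.05323
Authors: Mingjian Fu, Hengsheng Chen, Dongchun Jiang, Yanchao Tan
Affiliations: Fuzhou University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 10 figures, 4 tables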
Abstract:In the era of advancing information technology, recommender systems have emerged as crucial tools for dealing with information overload. However, traditional recommender systems still have limitations in capturing the dynamic evolution of user behavior. To better understand and predict user behavior, especially taking into account the complexity of temporal evolution, sequential recommender systems have gradually become the focus of research. Currently, many sequential recommendation algorithms ignore the amplification effects of prevalent biases, which leads to recommendation results being susceptible to the Matthew Effect. Additionally, it will impose limitations on the recommender system’s ability to deeply perceive and capture the dynamic shifts in user preferences, thereby diminishing the extent of its recommendation reach. To address this issue effectively, we propose a recommendation system based on sequential information and attention mechanism called Multi-Perspective Attention Bias Sequential Recommendation (MABSRec). Firstly, we reconstruct user sequences into three short types and utilize graph neural networks for item weighting. Subsequently, an adaptive multi-bias perspective attention module is proposed to enhance the accuracy of recommendations. Experimental results show that the MABSRec model exhibits significant advantages in all evaluation metrics, demonstrating its excellent performance in the sequence recommendation task.
[NLP-63] On Synthesizing Data for Context Attribution in Question Answering
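Quick read: This paper addresses hallucinations, i.e. false or misleading answers that large language models (LLMs) produce in question answering, and studies context attribution as a way to make models more trustworthy. Its key contribution is SynQA, a generative strategy for synthesizing context attribution data: given selected context sentences, an LLM generates question-answer pairs supported by those sentences, so the synthetic training data has clear attribution paths. Data synthesized this way is shown to be effective for fine-tuning small LMs for context attribution across QA tasks and domains, and a user study validates the usefulness of the fine-tuned small LMs.
Link: https://arxiv.org/abs/2504.05317
Authors: Gorjan Radevski, Kiril Gashteovski, Shahbaz Syed, Christopher Malon, Sebastien Nicolas, Chia-Chien Hung, Timo Sztyler, Verena Heußer, Wiem Ben Rim, Masafumi Enomoto, Kunihiro Takeoka, Masafumi Oyamada, Goran Glavaš, Carolin Lawrence
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: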
Abstract:Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as “hallucinations”. Therefore, grounding the generated answers in contextually provided information – i.e., providing evidence for the generated text – is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.
[NLP-64] Multimodal Quantitative Language for Generative Recommendation
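Quick read: This paper addresses two shortcomings of existing generative recommendation methods: they fail to reconcile the general linguistic knowledge of Pre-trained Language Models (PLMs) with the specific needs of recommender systems, and they rarely exploit the complementary knowledge across item modalities. The authors propose Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). The key idea is to transform items from different domains and modalities into a unified language that serves as a bridge for knowledge transfer. Quantitative translators first convert item text and images from different domains into a new language with a shared vocabulary (the quantitative language); a series of quantitative language generation tasks then enrich it with semantic information and prior knowledge; finally, pre-training and fine-tuning transfer recommendation knowledge across domains and modalities to the target recommendation task.
Link: https://arxiv.org/abs/2504.05314
Authors: Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, Yonghong Tian
Affiliations: Sun Yat-sen University; Pengcheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; Xiamen University; Peking University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: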
Abstract:Generative recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve the recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning. We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.
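The improvements above are reported on NDCG. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k using the common log2 discount. The function names are illustrative, and the ideal ranking is computed from the same relevance list, which assumes every relevant item appears somewhere in that list; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same relevances."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Binary relevance: the single target item is ranked first vs. third.
hit_at_1 = ndcg_at_k([1, 0, 0], k=3)
hit_at_3 = ndcg_at_k([0, 0, 1], k=3)
```

With one relevant item, NDCG reduces to the discount at the position of the hit, so moving the target from rank 3 to rank 1 doubles the score in this example.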
[NLP-65] Dr Web: a modern query-based web data retrieval engine
Quick read: This paper introduces the Data Retrieval Web Engine (also referred to as doctor web), a flexible and modular tool for extracting structured data from web pages with a simple query language. It discusses the engineering challenges addressed during development, such as dynamic content handling and messy data extraction, and covers the steps taken to make the DR Web Engine public, highlighting its open-source potential.
Link: https://arxiv.org/abs/2504.05311
Authors: Ylli Prifti, Alessandro Provetti, Pasquale de Meo
Affiliations: Birkbeck, University of London; DICAM, University of Messina
Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 1 figure, 1 table, 7 listings
Abstract:This article introduces the Data Retrieval Web Engine (also referred to as doctor web), a flexible and modular tool for extracting structured data from web pages using a simple query language. We discuss the engineering challenges addressed during its development, such as dynamic content handling and messy data extraction. Furthermore, we cover the steps for making the DR Web Engine public, highlighting its open source potential.
Computer Vision
[CV-0] D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes
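Quick read: This paper tackles 3D reconstruction in dynamic scenes. Pointmap regression methods such as DUSt3R work well on static scenes but struggle when objects move, because motion disrupts alignment based solely on camera poses. The authors propose D²USt3R, which regresses 4D pointmaps that capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating spatial and temporal aspects, the 4D pointmaps encode spatio-temporal dense correspondence, which benefits downstream tasks. Experiments on datasets with complex motion report consistently better reconstruction.
Link: https://arxiv.org/abs/2504.06264
Authors: Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
Affiliations: KAIST AI; Sony AI; Korea University; Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL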
Abstract:We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D^2USt3R that regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates spatio-temporal dense correspondence to the proposed 4D pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior reconstruction performance across various datasets featuring complex motions.
[CV-1] OmniSVG: A Unified Scalable Vector Graphics Generation Model
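Quick read: This paper targets the generation of high-quality, complex Scalable Vector Graphics (SVG). Existing methods either produce unstructured outputs at huge computational cost or are limited to monochrome icons with over-simplified structures. The authors propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. The key is to parameterize SVG commands and coordinates into discrete tokens, which decouples structural logic from low-level geometry for efficient training while preserving the expressiveness of complex SVG structure. The paper also introduces the MMSVG-2M dataset and a standardized evaluation protocol. Experiments show that OmniSVG outperforms existing methods and could be integrated into professional SVG design workflows.
Link: https://arxiv.org/abs/2504.06263
Authors: Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Affiliations: Fudan University; StepFun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages; Project Page: this https URL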
Abstract:Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons of over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
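To make "parameterizing SVG commands and coordinates into discrete tokens" concrete, the sketch below tokenizes a simple absolute-coordinate path string. Everything specific here is a hypothetical choice for illustration and is not taken from the paper: the four-command vocabulary, the 200x200 quantization grid, and the scheme that merges each (x, y) pair into a single token id.

```python
import re

# Hypothetical vocabulary: a few absolute path commands, then one token per grid cell.
COMMANDS = ["M", "L", "C", "Z"]
GRID = 200                    # assumed canvas resolution after quantization
COORD_OFFSET = len(COMMANDS)  # coordinate token ids start after the command ids


def quantize(value):
    """Round a coordinate to the nearest grid index and clamp it to the canvas."""
    return min(GRID - 1, max(0, round(value)))


def tokenize_path(d):
    """Turn an SVG path 'd' string (only M/L/C/Z, absolute) into a list of token ids."""
    tokens = []
    for command, args in re.findall(r"([MLCZ])([^MLCZ]*)", d):
        tokens.append(COMMANDS.index(command))
        numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", args)]
        if len(numbers) % 2 != 0:
            raise ValueError(f"odd number of coordinates after {command!r}")
        for x, y in zip(numbers[0::2], numbers[1::2]):
            # Merge the quantized (x, y) pair into a single discrete token.
            tokens.append(COORD_OFFSET + quantize(x) * GRID + quantize(y))
    return tokens


tokens = tokenize_path("M 10 20 L 30 40 Z")
```

Command tokens carry the structure of the drawing and coordinate tokens carry the geometry, which is one way to read the decoupling the abstract describes.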
[CV-2] PainNet: Statistical Relation Network with Episode-Based Training for Pain Estimation
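Quick read: This paper addresses sequence-level self-reported pain estimation, which has received limited attention despite extensive work on estimating pain from facial expressions. It proposes PainNet, a Statistical Relation Network. PainNet's embedding and relation modules compare pairs of pain videos and output a relation score indicating whether each pair belongs to the same pain category. At the core of the embedding module is a statistical layer on top of an RNN that extracts compact video-level features; building this layer into the deep architecture merges the multiple training stages of previous work into a single end-to-end stage. PainNet is trained with an episode-based scheme that compares a query video with a set of videos representing the different pain categories. Experiments show the benefit of both the statistical layer and episode-based training, and PainNet outperforms the state of the art on self-reported pain estimation.
Link: https://arxiv.org/abs/2504.06257
Authors: Mina Bishay, Graham Page, Mohammad Mavadati
Affiliations: Smart Eye AB
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the ACII 2024 Workshops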
Abstract:Despite the span in estimating pain from facial expressions, limited works have focused on estimating the sequence-level pain, which is reported by patients and used commonly in clinics. In this paper, we introduce a novel Statistical Relation Network, referred to as PainNet, designed for the estimation of the sequence-level pain. PainNet employs two key modules, the embedding and the relation modules, for comparing pairs of pain videos, and producing relation scores indicating if each pair belongs to the same pain category or not. At the core of the embedding module is a statistical layer mounted on the top of a RNN for extracting compact video-level features. The statistical layer is implemented as part of the deep architecture. Doing so, allows combining multiple training stages used in previous research, into a single end-to-end training stage. PainNet is trained using the episode-based training scheme, which involves comparing a query video with a set of videos representing the different pain categories. Experimental results show the benefit of using the statistical layer and the episode-based training in the proposed model. Furthermore, PainNet outperforms the state-of-the-art results on self-reported pain estimation.
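The statistical layer is described only as a layer on top of an RNN that extracts compact video-level features. One plausible reading is temporal pooling with summary statistics, sketched below. The choice of statistics (mean, standard deviation, min, max) and the name `statistical_pool` are assumptions for illustration; the abstract does not list which statistics PainNet uses.

```python
import statistics


def statistical_pool(frame_features):
    """Pool T per-frame feature vectors (each of length D) into one vector of length 4 * D.

    The output concatenates, per feature dimension, the mean, population standard
    deviation, minimum, and maximum over time (an assumed set of statistics).
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose from T x D to D columns, each holding one feature over time.
    columns = list(zip(*frame_features))
    pooled = []
    for stat in (statistics.fmean, statistics.pstdev, min, max):
        pooled.extend(stat(column) for column in columns)
    return pooled


# Two frames with two hypothetical RNN output features each.
video_feature = statistical_pool([[1.0, 2.0], [3.0, 2.0]])
```

The output length is independent of the number of frames, so videos of different durations map to feature vectors of the same size.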
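[CV-3] Transfer between Modalities with MetaQueries
Quick read: This paper addresses the difficulty of aligning understanding (text output) and generation (pixel output) in unified multimodal models, which usually demands complex training recipes and careful data balancing. It proposes MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connect the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation that draws on the MLLM's understanding and reasoning. Training needs only paired image-caption data and standard diffusion objectives. The approach remains effective when the MLLM backbone is frozen, which preserves its multimodal understanding while achieving strong generative performance, and it can be instruction-tuned for applications such as image editing and subject-driven generation.
Link: https://arxiv.org/abs/2504.06256
Authors: Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie
Affiliations: Meta; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL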
Abstract:Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM’s latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM’s deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
[CV-4] Monitoring Viewer Attention During Online Ads ECCV2024
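Quick read: This paper addresses skewed data in online video ad testing caused by inattentive participants. It proposes an architecture for monitoring viewer attention during online ads. Using two behavior analysis toolkits, AFFDEX 2.0 and SmartEye SDK, it extracts low-level facial features (facial expressions, head pose, and gaze direction) and combines them into high-level features (estimated gaze on the screen plane, yawning, speaking, and so on) to identify four primary distractors: off-screen gaze, drowsiness, speaking, and unattended screen. The gaze settings are tailored to the device type (desktop or mobile). The architecture is validated on datasets annotated for specific distractors and on a real-world ad testing dataset, with promising results on both desktop and mobile devices.
Link: https://arxiv.org/abs/2504.06237
Authors: Mina Bishay, Graham Page, Waleed Emad, Mohammad Mavadati
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the ECCV 2024 Workshops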
Abstract:Nowadays, video ads spread through numerous online platforms, and are being watched by millions of viewers worldwide. Big brands gauge the liking and purchase intent of their new ads, by analyzing the facial responses of viewers recruited online to watch the ads from home or work. Although this approach captures naturalistic responses, it is susceptible to distractions inherent in the participants’ environments, such as a movie playing on TV, a colleague speaking, or mobile notifications. Inattentive participants should get flagged and eliminated to avoid skewing the ad-testing process. In this paper we introduce an architecture for monitoring viewer attention during online ads. Leveraging two behavior analysis toolkits; AFFDEX 2.0 and SmartEye SDK, we extract low-level facial features encompassing facial expressions, head pose, and gaze direction. These features are then combined to extract high-level features that include estimated gaze on the screen plane, yawning, speaking, etc – this enables the identification of four primary distractors; off-screen gaze, drowsiness, speaking, and unattended screen. Our architecture tailors the gaze settings according to the device type (desktop or mobile). We validate our architecture first on datasets annotated for specific distractors, and then on a real-world ad testing dataset with various distractors. The proposed architecture shows promising results in detecting distraction across both desktop and mobile devices.
[CV-5] HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Quick read: This paper addresses high-resolution image synthesis with text-to-image (T2I) diffusion/flow models, which is difficult because high-resolution content is scarce and complex. It presents HiFlow, a training-free and model-agnostic framework for unlocking the resolution potential of pre-trained flow models. HiFlow builds a virtual reference flow in the high-resolution space that captures the characteristics of low-resolution flow information and guides high-resolution generation in three ways: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. With this guidance HiFlow improves the high-resolution output of T2I models and extends to their personalized variants; experiments report better image quality than current state-of-the-art methods.
Link: https://arxiv.org/abs/2504.06232
Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Affiliations: Shanghai Jiao Tong University; University of Science and Technology of China; The Chinese University of Hong Kong; Stanford University; Shanghai AI Laboratory; Shanghai Innovation Institute; CPII under InnoHK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging this flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate HiFlow’s superiority in achieving superior high-resolution image quality over current state-of-the-art methods.
[CV-6] Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Quick read: This paper addresses the performance drop of existing Parameter-Efficient Fine-Tuning (PEFT) methods in Remote Sensing (RS) scenarios, which stems from their inability to handle artifacts. It introduces Earth-Adapter, a PEFT method designed for RS artifacts. Earth-Adapter uses a Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with the Discrete Fourier Transformation (DFT): the DFT decomposes features into frequency components to separate artifacts from the original features, and the MoA dynamically assigns weights to each adapter expert to combine features across frequency domains. Compared with the baseline Rein, Earth-Adapter improves mIoU by 9.0% on Domain Adaptation and 3.1% on Domain Generalization semantic segmentation benchmarks.
Link: https://arxiv.org/abs/2504.06220
Authors: Xiaoxing Hu, Ziyang Gong, Yupei Wang, Yuru Jia, Gen Luo, Xue Yang
Affiliations: Beijing Institute of Technology; Shanghai AI Laboratory; KU Leuven; KTH; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed for RS artifacts conquering. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs’ performance on RS scenarios. Experiments on Domain Adaptation (DA), and Domain Generalization (DG) semantic segmentation benchmarks showcase the Earth-Adapter’s effectiveness. Compared with baseline Rein, Earth-Adapter significantly improves 9.0% mIoU in DA and 3.1% mIoU in DG benchmarks. Our code will be released at this https URL.
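The frequency decomposition step can be illustrated on a toy 1D signal. The sketch below splits a signal into low- and high-frequency parts with a naive DFT from the standard library. Earth-Adapter operates on image feature maps and feeds the components to adapter experts; the 1D setting, the hard cutoff mask, and the function names here are simplifying assumptions, not the paper's design.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Inverse of dft(), including the 1/n normalization."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def split_frequencies(x, cutoff):
    """Split x into (low, high) parts; bins with |frequency| <= cutoff count as low."""
    n = len(x)
    spectrum = dft(x)
    # Bin f and bin n - f are conjugate partners, so mask them together to keep outputs real.
    is_low = [min(f, n - f) <= cutoff for f in range(n)]
    low = idft([c if keep else 0 for c, keep in zip(spectrum, is_low)])
    high = idft([0 if keep else c for c, keep in zip(spectrum, is_low)])
    return [v.real for v in low], [v.real for v in high]


signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
low, high = split_frequencies(signal, cutoff=1)
```

Because the two masks are complementary, the low and high parts always sum back to the original signal, so a downstream module can reweight them without losing information.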
[CV-7] HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation CVPR2025
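Quick read: This paper tackles monocular dynamic 3D reconstruction. It proposes Hierarchical Motion Representation (HiMoR), a deformation representation for 3D Gaussian primitives. The key is a tree structure that decomposes scene motion into levels of detail: shallower nodes model coarse motion for temporal smoothness and deeper nodes capture finer motion. HiMoR also uses a few shared motion bases to represent the motions of different sets of nodes, in line with the assumption that motion tends to be smooth and simple. This gives the Gaussians a more structured deformation and makes fuller use of temporal relationships. Because pixel-level metrics can fail to reflect reconstruction quality, the paper also proposes a more reliable perceptual metric as an alternative. Experiments show superior novel view synthesis from monocular videos with complex motion.
Link: https://arxiv.org/abs/2504.06210
Authors: Yiming Liang, Tianhan Xu, Yuta Kikuchi
Affiliations: Waseda University; Preferred Networks, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project Page: this https URL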
Abstract:We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives capable of achieving high-quality monocular dynamic 3D reconstruction. The insight behind HiMoR is that motions in everyday scenes can be decomposed into coarser motions that serve as the foundation for finer details. Using a tree structure, HiMoR’s nodes represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing finer motion. Additionally, our model uses a few shared motion bases to represent motions of different sets of nodes, aligning with the assumption that motion tends to be smooth and simple. This motion representation design provides Gaussians with a more structured deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as an alternative, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to accurately reflect the true quality of reconstruction. Extensive experiments demonstrate our method’s efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.
[CV-8] HRMedSeg: Unlocking High-resolution Medical Image segmentation via Memory-efficient Attention Modeling
Quick read: This paper addresses the high memory cost of existing transformer-based encoder-decoder frameworks for high-resolution medical image segmentation, which limits their real-world use. It proposes HRMedSeg, a memory-efficient framework. Its key components are: (1) a lightweight gated vision transformer (LGViT) as the image encoder, which models long-range dependencies with linear complexity; (2) an efficient cross-multiscale decoder (ECM-Decoder) that generates high-resolution segmentation masks; and (3) feature distillation during pretraining. HRMedSeg uses only 0.59GB of GPU memory per batch during fine-tuning while outperforming state-of-the-art methods on diverse high-resolution medical image segmentation tasks.
Link: https://arxiv.org/abs/2504.06205
Authors: Qing Xu, Zhenye Lou, Chenxin Li, Xiangjian He, Rong Qu, Tesema Fiseha Berhanu, Yi Wang, Wenting Duan, Zhen Chen
Affiliations: UNNC; Sichuan University; CUHK; University of Nottingham; Dalian University of Technology; University of Lincoln; HKISI, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:High-resolution segmentation is critical for precise disease diagnosis by extracting micro-imaging information from medical images. Existing transformer-based encoder-decoder frameworks have demonstrated remarkable versatility and zero-shot performance in medical segmentation. While beneficial, they usually require huge memory costs when handling large-size segmentation mask predictions, which are expensive to apply to real-world scenarios. To address this limitation, we propose a memory-efficient framework for high-resolution medical image segmentation, called HRMedSeg. Specifically, we first devise a lightweight gated vision transformer (LGViT) as our image encoder to model long-range dependencies with linear complexity. Then, we design an efficient cross-multiscale decoder (ECM-Decoder) to generate high-resolution segmentation masks. Moreover, we utilize feature distillation during pretraining to unleash the potential of our proposed model. Extensive experiments reveal that HRMedSeg outperforms state-of-the-arts in diverse high-resolution medical image segmentation tasks. In particular, HRMedSeg uses only 0.59GB GPU memory per batch during fine-tuning, demonstrating low training costs. Besides, when HRMedSeg meets the Segment Anything Model (SAM), our HRMedSegSAM takes 0.61% parameters of SAM-H. The code is available at this https URL.
[CV-9] WoundAmbit: Bridging State-of-the-Art Semantic Segmentation and Real-World Wound Care ECML KDD2025
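Quick read: This paper addresses the under-representation of wound semantic segmentation in medical imaging research, even though segmentation is key to remote wound-size tracking from mobile image capture. It benchmarks state-of-the-art deep learning models from general-purpose vision, medical imaging, and top methods from public wound challenges, standardizing training, data augmentation, and evaluation and using cross-validation to minimize partitioning bias. It also proposes a reference object-based approach that converts AI-generated masks into clinically relevant wound size estimates, and evaluates this, together with mask quality, for the best models through physician assessments. The transformer-based TransNeXt generalizes best, while VWFormer and the ConvNeXtS backbone perform best in the expert evaluation of masks. The resulting framework, WoundAmbit, can be integrated into a custom telehealth system.
Link: https://arxiv.org/abs/2504.06185
Authors: Vanessa Borst, Timo Dittus, Tassilo Dege, Astrid Schmieder, Samuel Kounev
Affiliations: Julius-Maximilians-University Würzburg, 97070 Würzburg, Germany; University Hospital Würzburg, 97070 Würzburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Main paper: 17 pages; supplementary material: 16 pages; paper submitted to the application track of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025)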
Abstract:Chronic wounds affect a large population, particularly the elderly and diabetic patients, who often exhibit limited mobility and co-existing health conditions. Automated wound monitoring via mobile image capture can reduce in-person physician visits by enabling remote tracking of wound size. Semantic segmentation is key to this process, yet wound segmentation remains underrepresented in medical imaging research. To address this, we benchmark state-of-the-art deep learning models from general-purpose vision, medical imaging, and top methods from public wound challenges. For fair comparison, we standardize training, data augmentation, and evaluation, conducting cross-validation to minimize partitioning bias. We also assess real-world deployment aspects, including generalization to an out-of-distribution wound dataset, computational efficiency, and interpretability. Additionally, we propose a reference object-based approach to convert AI-generated masks into clinically relevant wound size estimates, and evaluate this, along with mask quality, for the best models based on physician assessments. Overall, the transformer-based TransNeXt showed the highest levels of generalizability. Despite variations in inference times, all models processed at least one image per second on the CPU, which is deemed adequate for the intended application. Interpretability analysis typically revealed prominent activations in wound regions, emphasizing focus on clinically relevant features. Expert evaluation showed high mask approval for all analyzed models, with VWFormer and ConvNeXtS backbone performing the best. Size retrieval accuracy was similar across models, and predictions closely matched expert annotations. Finally, we demonstrate how our AI-driven wound size estimation framework, WoundAmbit, can be integrated into a custom telehealth system. Our code will be made available on GitHub upon publication.
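The reference object-based size estimate can be sketched as a pixel-to-area conversion. The abstract does not describe the actual procedure, so the version below is an assumption: it supposes that a reference object of known physical area lies in the same plane as the wound and that both have been segmented into binary masks. The function names and the toy masks are hypothetical.

```python
def count_pixels(mask):
    """Count foreground pixels in a binary mask given as a list of rows of 0/1 values."""
    return sum(sum(row) for row in mask)


def wound_area_cm2(wound_mask, reference_mask, reference_area_cm2):
    """Estimate wound area by scaling its pixel count with the reference object's known area.

    Assumes the wound and the reference object are roughly coplanar and viewed
    without strong perspective distortion, so one pixel covers the same area in both.
    """
    reference_pixels = count_pixels(reference_mask)
    if reference_pixels == 0:
        raise ValueError("reference object not found in the image")
    return count_pixels(wound_mask) * reference_area_cm2 / reference_pixels


# Toy 4x4 masks: the wound covers 2 pixels, the reference marker covers 4 pixels.
wound = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
marker = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
area = wound_area_cm2(wound, marker, reference_area_cm2=1.0)
```

The estimate scales linearly with the mask's pixel count, so segmentation errors translate directly into size errors, which is consistent with the paper evaluating mask quality and size retrieval together.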
[CV-10] Flash Sculptor: Modular 3D Worlds from Objects
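Quick read: This paper addresses the difficulty existing text-to-3D and image-to-3D models have with scenes containing multiple objects and intricate interactions; recent compositional methods still require optimizing the entire layout, which is cumbersome or even infeasible. It proposes Flash Sculptor, a framework for compositional 3D scene/object reconstruction from a single image. At its core is a divide-and-conquer strategy that decouples reconstruction into sub-tasks handling the appearance, rotation, scale, and translation of each instance. For rotation it uses a coarse-to-fine scheme that balances efficiency and accuracy; for translation it uses an outlier-removal-based algorithm that yields robust, precise parameters in a single step without iterative optimization. Experiments report at least a 3x speedup over existing compositional 3D methods and new benchmarks in compositional 3D reconstruction performance. The code is available.
Link: https://arxiv.org/abs/2504.06178
Authors: Yujia Hu, Songhua Liu, Xingyi Yang, Xinchao Wang
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: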
Abstract:Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome if not infeasible at all. To overcome these challenges, we propose Flash Sculptor in this paper, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that brings the best of both worlds–efficiency and accuracy–while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3 times speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance. Codes are available at this https URL.
[CV-11] Action Valuation in Sports: A Survey
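Quick read: This paper addresses the lack of a comprehensive survey of Action Valuation (AV) in sports analytics. A few surveys cover related concepts such as Player Valuation, but none analyzes AV in depth across different sports. The survey introduces a taxonomy with nine dimensions covering data, methodological approaches, evaluation techniques, and practical applications. Through this analysis it aims to identify the essential characteristics of effective AV methods, highlight gaps in current research, and propose future directions.
Link: https://arxiv.org/abs/2504.06163
Authors: Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés
Affiliations: Universitat de Barcelona; Computer Vision Center; Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: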
Abstract:Action Valuation (AV) has emerged as a key topic in Sports Analytics, offering valuable insights by assigning scores to individual actions based on their contribution to desired outcomes. Despite a few surveys addressing related concepts such as Player Valuation, there is no comprehensive review dedicated to an in-depth analysis of AV across different sports. In this survey, we introduce a taxonomy with nine dimensions related to the AV task, encompassing data, methodological approaches, evaluation techniques, and practical applications. Through this analysis, we aim to identify the essential characteristics of effective AV methods, highlight existing gaps in research, and propose future directions for advancing the field.
[CV-12] Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion
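Quick read: This paper addresses two problems in medical image segmentation: feature extraction is limited by variations in morphology and staining, and, when data samples are scarce, existing end-to-end methods have difficulty transferring multiscale features effectively. The key innovation is a nested UNet architecture that uses Multiscale Feature Fusion and Attention Mechanisms to improve the integration of encoder features, highlight key channels and regions, and restore spatial details to enhance segmentation performance. Experiments show that the method surpasses current state-of-the-art approaches on four datasets.
Link: https://arxiv.org/abs/2504.06158
Authors: Saad Wazir, Daeyoung Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the Proceedings of the 2024 International Conference on Medical Imaging and Computer-Aided Diagnosis (MICAD 2024), Lecture Notes in Electrical Engineering (LNEE), Volume 1372, Springer Nature, Singapore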
Abstract:Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: this https URL
[CV-13] A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning CVPR’25
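Quick read: This paper addresses the absence of a standardized benchmark for self-supervised learning in the video domain and investigates five key aspects of self-supervised video representation learning: dataset size, model complexity, data distribution, data noise, and feature representations. It evaluates six self-supervised methods across six network architectures, runs large-scale experiments, and analyzes performance on two downstream tasks, revealing how pretraining strategies, dataset characteristics, pretext tasks, and model architectures interact. The findings are further extended to Video Foundation Models (ViFMs), demonstrating their relevance to large-scale video representation learning. The key outcome is a new approach that significantly reduces training data requirements while surpassing existing state-of-the-art methods that rely on more pretraining data.
Link: https://arxiv.org/abs/2504.06153
Authors: Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh S Rawat
Affiliations: CRCV, University of Central Florida; BITS Pilani; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR’25 Workshop: 6th Data-Efficient Workshop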
Abstract:Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.
[CV-14] V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
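Quick read: This paper addresses the shortcomings of existing game-based benchmarks for evaluating the visual reasoning capabilities of multimodal large language models (MLLMs). Current benchmarks lack visual-centric tasks and cannot fully assess the diverse reasoning skills needed for real-world decision-making. To address this, the paper proposes a game-based evaluation framework, Visual-centric Multiple Abilities Game Evaluation (V-MAGE). Its key element is an evaluation suite of five diverse games with more than 30 handcrafted levels, which tests core visual skills (such as positioning, trajectory tracking, timing, and visual memory) as well as higher-level reasoning (such as long-term planning and deliberation), giving a more effective measure of MLLM performance in open-world, dynamic environments.
Link: https://arxiv.org/abs/2504.06148
Authors: Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: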
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate because they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at this https URL.
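The abstract ranks models by Elo rating comparisons but does not state the update rule. As a minimal sketch, assuming the standard Elo formulas and a hypothetical K-factor of 32 (the function names and the K value are illustrative and not taken from the paper), pairwise game outcomes could be turned into ratings roughly like this:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    k is a hypothetical K-factor; the abstract does not give the paper's value.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two agents start level; the first one wins a single game.
    print(update_elo(1500.0, 1500.0, 1.0))
```

Because the update is zero-sum, the gap between a model's rating and the human rating can be read directly as a performance gap once enough comparisons have been played.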
[CV-15] A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model
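Quick read: This paper addresses two problems of existing large-scale text-to-image (T2I) generation models, particularly diffusion-based methods: style misalignment across generated image sets and slow inference, both of which limit practical use. It proposes three key components: initial feature replacement to ensure a consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency through a schedule function. Unlike previous methods that require fine-tuning or additional training, the approach keeps inference fast while preserving individual content details. Experiments show generation quality comparable to competing methods, significantly improved style alignment, and inference more than six times faster than the fastest model.
Link: https://arxiv.org/abs/2504.06144
Authors: Jihun Park, Jongmin Gim, Kyoungmin Lee, Minseok Oh, Minwoo Choi, Jaeyeul Kim, Woo Chool Park, Sunghoon Im
Affiliations: DGIST (Daegu Gyeongbuk Institute of Science and Technology); KETI (Korea Electronics Technology Institute)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 15 figures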
Abstract:We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.
[CV-16] FaceCloak: Learning to Protect Face Templates
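Quick read: This paper addresses the security and privacy concerns raised by generative models that reconstruct highly realistic face images from encoded representations (templates). It proposes FaceCloak, a neural network framework that proactively defends against inversion attacks by generating smart, renewable binary cloaks. The key idea is to synthesize unique disruptors on the fly from a single face template, protecting the template while retaining biometric utility and unlinkability. The cloaked templates also suppress sensitive attributes, generalize to novel feature extraction schemes, and outperform existing baselines in biometric matching performance and resilience to reconstruction attacks. The scheme is very fast (inference time of 0.28 ms) and lightweight (0.57 MB).
Link: https://arxiv.org/abs/2504.06131
Authors: Sudipta Banerjee, Anubhav Jain, Chinmay Hegde, Nasir Memon
Affiliations: New York University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)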
Abstract:Generative models can reconstruct face images from encoded representations (templates) bearing remarkable likeness to the original face, raising security and privacy concerns. We present FaceCloak, a neural network framework that protects face templates by generating smart, renewable binary cloaks. Our method proactively thwarts inversion attacks by cloaking face templates with unique disruptors synthesized from a single face template on the fly while provably retaining biometric utility and unlinkability. Our cloaked templates can suppress sensitive attributes while generalizing to novel feature extraction schemes and outperform leading baselines in terms of biometric matching and resiliency to reconstruction attacks. FaceCloak-based matching is extremely fast (inference time cost=0.28ms) and light-weight (0.57MB).
[CV-17] A Robust Real-Time Lane Detection Method with Fog-Enhanced Feature Fusion for Foggy Conditions
Quick read: This paper addresses the significant drop in lane detection performance under foggy conditions, a challenge that stems from the lack of datasets and methods designed specifically for fog. The key solution is a robust Fog-Enhanced Network with three core modules: a Global Feature Fusion Module (GFFM) that captures global relationships in foggy images, a Kernel Feature Fusion Module (KFFM) that models the structural and positional relationships of lane instances, and a Low-level Edge Enhanced Module (LEEM) that handles missing edge details in fog. With these modules, the method reaches F1-scores of 95.04, 79.85, and 96.95 on the FoggyLane, FoggyCULane, and FoggyTusimple datasets, respectively, and runs at 38.4 FPS on the NVIDIA Jetson AGX Orin, supporting its real-time capability and robustness.
Link: https://arxiv.org/abs/2504.06121
Authors: Ronghui Zhang, Yuhang Ma, Tengfei Li, Ziyu Lin, Yueying Wu, Junzhou Chen, Lin Zhang, Jia Hu, Tony Z. Qiu, Konghui Guo
Affiliations: Guangdong Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Sun Yat-sen University; College of Automotive Studies, Tongji University; College of Transportation Engineering, Tongji University; Department of Civil and Environmental Engineering, University of Alberta; State Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Lane detection is a critical component of Advanced Driver Assistance Systems (ADAS). Existing lane detection algorithms generally perform well under favorable weather conditions. However, their performance degrades significantly in adverse conditions, such as fog, which increases the risk of traffic accidents. This challenge is compounded by the lack of specialized datasets and methods designed for foggy environments. To address this, we introduce the FoggyLane dataset, captured in real-world foggy scenarios, and synthesize two additional datasets, FoggyCULane and FoggyTusimple, from existing popular lane detection datasets. Furthermore, we propose a robust Fog-Enhanced Network for lane detection, incorporating a Global Feature Fusion Module (GFFM) to capture global relationships in foggy images, a Kernel Feature Fusion Module (KFFM) to model the structural and positional relationships of lane instances, and a Low-level Edge Enhanced Module (LEEM) to address missing edge details in foggy conditions. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance, with F1-scores of 95.04 on FoggyLane, 79.85 on FoggyCULane, and 96.95 on FoggyTusimple. Additionally, with TensorRT acceleration, the method reaches a processing speed of 38.4 FPS on the NVIDIA Jetson AGX Orin, confirming its real-time capabilities and robustness in foggy environments.
[CV-18] Hyperbolic Category Discovery CVPR2025
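Quick read: This paper tackles Generalized Category Discovery (GCD), an open-world problem whose core challenge is to categorize all images in the unlabelled subset of a dataset, whether they belong to known or unknown classes. Existing methods typically apply a spherical projection operator after a self-supervised pretrained backbone and operate in Euclidean or spherical space, but these spaces are suboptimal for encoding samples with hierarchical structure. The key innovation is to use hyperbolic space instead: its volume grows exponentially with radius, so it can better capture the hierarchical structure of samples from both seen and unseen categories. To this end, the authors introduce the HypCD framework, which transforms the backbone's Euclidean embedding space into hyperbolic space and learns hierarchy-aware representations and classifiers by considering both the hyperbolic distance and the angle between samples, which helps knowledge transfer from known to unknown categories. Experiments show that HypCD consistently achieves significant improvements on multiple public benchmarks.
Link: https://arxiv.org/abs/2504.06120
Authors: Yuanpei Liu, Zhenqi He, Kai Han
Affiliations: Visual AI Lab, The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a conference paper at CVPR 2025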
Abstract:Generalized Category Discovery (GCD) is an intriguing open-world problem that has garnered increasing attention. Given a dataset that includes both labelled and unlabelled images, GCD aims to categorize all images in the unlabelled subset, regardless of whether they belong to known or unknown classes. In GCD, the common practice typically involves applying a spherical projection operator at the end of the self-supervised pretrained backbone, operating within Euclidean or spherical space. However, both of these spaces have been shown to be suboptimal for encoding samples that possess hierarchical structures. In contrast, hyperbolic space exhibits exponential volume growth relative to radius, making it inherently strong at capturing the hierarchical structure of samples from both seen and unseen categories. Therefore, we propose to tackle the category discovery challenge in the hyperbolic space. We introduce HypCD, a simple Hyperbolic framework for learning hierarchy-aware representations and classifiers for generalized Category Discovery. HypCD first transforms the Euclidean embedding space of the backbone network into hyperbolic space, facilitating subsequent representation and classification learning by considering both hyperbolic distance and the angle between samples. This approach is particularly helpful for knowledge transfer from known to unknown categories in GCD. We thoroughly evaluate HypCD on public GCD benchmarks, by applying it to various baseline and state-of-the-art methods, consistently achieving significant improvements.
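The abstract describes mapping Euclidean backbone embeddings into hyperbolic space and comparing samples by hyperbolic distance, without naming the model of hyperbolic space used. A minimal sketch, assuming the Poincaré ball model with curvature -c and the exponential map at the origin (both are assumptions for illustration, as are the function names), might look like this:

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v into the Poincare ball of curvature -c (assumed model)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    sqrt_c = math.sqrt(c)
    # tanh keeps the mapped point strictly inside the ball of radius 1/sqrt(c).
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_x) * (1.0 - c * sq_y))
    # Guard against rounding that could push the argument slightly below 1.
    return math.acosh(max(1.0, arg)) / math.sqrt(c)


if __name__ == "__main__":
    point = exp_map_origin([0.3, 0.4])
    print(point, poincare_distance([0.0, 0.0], point))
```

With c = 1, the distance from the origin to the mapped point is twice the Euclidean norm of the input, which illustrates how distances grow quickly towards the boundary of the ball and leave room for tree-like hierarchies.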
[CV-19] To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition CVPR
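Quick read: This paper addresses the problem that the re-ranking step in modern visual place recognition (VPR) retrieval systems can degrade performance. It points out that current VPR datasets are largely saturated, which makes re-ranking unnecessary or even harmful in some cases. The key to the solution is to use image matching as a verification step, where inlier counts reliably predict when re-ranking is needed, enabling more robust and adaptive VPR system designs.
Link: https://arxiv.org/abs/2504.06116
Authors: Davide Sferrazza, Gabriele Berton, Gabriele Trivigno, Carlo Masone
Affiliations: Politecnico di Torino; Focoos AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPRW 2025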
Abstract:Visual Place Recognition (VPR) is a critical task in computer vision, traditionally enhanced by re-ranking retrieval results with image matching. However, recent advancements in VPR methods have significantly improved performance, challenging the necessity of re-ranking. In this work, we show that modern retrieval systems often reach a point where re-ranking can degrade results, as current VPR datasets are largely saturated. We propose using image matching as a verification step to assess retrieval confidence, demonstrating that inlier counts can reliably predict when re-ranking is beneficial. Our findings shift the paradigm of retrieval pipelines, offering insights for more robust and adaptive VPR systems.
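The verification idea in the abstract can be sketched as a simple gate: keep the retrieval order when the top candidate has enough matching inliers, and re-rank only otherwise. The threshold `min_inliers`, the function name, and the choice to gate on the top-1 candidate are hypothetical illustrations, not the authors' procedure:

```python
def verify_and_rerank(candidates, inlier_counts, min_inliers=50):
    """Return candidates in final order.

    candidates: database image ids in retrieval order (best first).
    inlier_counts: image-matching inlier count for each candidate.
    min_inliers: hypothetical confidence threshold; not a value from the paper.
    """
    if len(candidates) != len(inlier_counts):
        raise ValueError("candidates and inlier_counts must have equal length")
    if not candidates:
        return []
    # High inlier count on the top retrieval: trust the retrieval order as-is.
    if inlier_counts[0] >= min_inliers:
        return list(candidates)
    # Otherwise re-rank by inlier count; sorted() is stable, so ties keep retrieval order.
    order = sorted(range(len(candidates)), key=lambda i: -inlier_counts[i])
    return [candidates[i] for i in order]


if __name__ == "__main__":
    print(verify_and_rerank(["a", "b", "c"], [80, 10, 90]))
    print(verify_and_rerank(["a", "b", "c"], [5, 10, 90]))
```

In practice the threshold would need to be chosen per matcher and per dataset, since inlier counts are not comparable across matching methods.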
[CV-20] Towards Varroa destructor mite detection using a narrow spectra illumination
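Quick read: This paper aims to develop and modify a beehive monitoring device and to detect Varroa destructor mites on bees with the help of hyperspectral imaging. The key steps are collecting a dataset of bees and mites and proposing a model based on the U-Net semantic segmentation architecture and conventional computer vision methods that can effectively distinguish and detect bees and mites.
Link: https://arxiv.org/abs/2504.06099
Authors: Samuel Bielik, Simon Bilik
Affiliations: Brno University of Technology; University of Ostrava
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: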
Abstract:This paper focuses on the development and modification of a beehive monitoring device and Varroa destructor detection on the bees with the help of hyperspectral imagery while utilizing a U-net, semantic segmentation architecture, and conventional computer vision methods. The main objectives were to collect a dataset of bees and mites, and propose the computer vision model which can achieve the detection between bees and mites.
[CV-21] MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer AAAI2025
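Quick read: This paper addresses inaccurate standard plane acquisition in fetal ultrasound (US) videos, which limits fetal growth assessment, anomaly detection, and adherence to clinical guidelines. Traditional practice relies on manually selecting standard frames, which is time-consuming and prone to intra- and inter-sonographer variability; existing methods are mainly image-based and ignore the dynamic nature of video acquisition and the complexity of classifying across anatomies. The key solution is the Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method. MCAT lets sonographers capture a quick US sweep and then, given a visual query of the anatomy to be analyzed, returns the video clip containing the standard frames, supporting thorough screening for potential anomalies. Its core strength is achieving better performance with fewer tokens: it outperforms existing methods by 10% and 13% mIoU on the two ultrasound datasets and by 5.35% mIoU on the Ego4D dataset while using 96% fewer tokens.
Link: https://arxiv.org/abs/2504.06088
Authors: Divyanshu Mishra, Pramit Saha, He Zhao, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, J. Alison Noble
Affiliations: University of Oxford; Imperial College London; University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in AAAI 2025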
Abstract:Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT’s efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening, diagnosis and allowing sonographers to examine more patients.
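The results above are reported in mIoU for clip localization. As a rough sketch, assuming clips are represented as (start, end) intervals and that mIoU is the mean temporal intersection-over-union across queries (the paper's exact evaluation protocol is not given in the abstract, and the function names are illustrative), the metric could be computed as follows:

```python
def temporal_iou(pred, gt):
    """IoU of two clips, each given as (start, end) with start <= end."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Mean temporal IoU over paired predicted and ground-truth clips."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Frame indices (or seconds) of predicted vs. annotated clips for two queries.
    print(mean_iou([(0, 10), (0, 4)], [(0, 10), (2, 4)]))
```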
[CV-22] MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos
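Quick read: This paper addresses the shortcomings of traditional data-driven approaches on robotic manipulation tasks that require fine-grained dexterous control. Existing methods often fail to capture the precise control that complex, dexterous skills demand. To fill this gap, the paper uses manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation. The key is the MAPLE method, which predicts hand-object contact points and detailed hand poses at the moment of contact and uses the learned features to train policies for downstream manipulation tasks. Experiments show that MAPLE performs well on existing simulation benchmarks, on a newly designed set of challenging simulation tasks, and in real-world experiments, particularly on tasks requiring fine-grained object control and complex dexterous skills.
Link: https://arxiv.org/abs/2504.06084
Authors: Alexey Gavryushin, Xi Wang, Robert J. S. Malate, Chenyu Yang, Xiangyi Jia, Shubh Goel, Davide Liconti, René Zurbrügg, Robert K. Katzschmann, Marc Pollefeys
Affiliations: ETH Zürich; Mimic Robotics; Microsoft Research
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: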
Abstract:Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise controls are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that exploits rich manipulation priors to enable efficient policy learning and better performance on diverse, complex manipulation tasks. Specifically, we predict hand-object contact points and detailed hand poses at the moment of hand-object contact and use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE across existing simulation benchmarks, as well as a newly designed set of challenging simulation tasks, which require fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a dexterous robotic hand, whereas simultaneous evaluation across both simulation and real-world experiments has remained underexplored in prior work.
[CV-23] Enhanced Anomaly Detection for Capsule Endoscopy Using Ensemble Learning Strategies
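Quick read: This paper addresses two main challenges for anomaly detection in video capsule endoscopy: the limited size of a video capsule means that embedding AI models directly requires careful consideration of model size, and available data in this domain is scarce. In response, it proposes an ensemble learning strategy that achieves effective anomaly detection with only a small number of individual neural networks during both training and inference. The key is to train each network with a different loss function drawn from the anomaly detection field rather than applying the same training algorithm to all of them. Validation on two public datasets, Galar and Kvasir-Capsule, shows that the approach outperforms current baselines with significantly fewer parameters, an important step towards incorporating artificial intelligence into capsule endoscopy.
Link: https://arxiv.org/abs/2504.06039
Authors: Julia Werner, Christoph Gerum, Jorg Nick, Maxime Le Floch, Franz Brinkmann, Jochen Hampe, Oliver Bringmann
Affiliations: Department of Computer Science, University of Tübingen; Seminar of Applied Mathematics, ETH Zürich; Else Kröner Fresenius Center for Digital Health, TU Dresden; Department of Medicine I, University Hospital Dresden, TU Dresden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS EMBC)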
Abstract:Capsule endoscopy is a method to capture images of the gastrointestinal tract and screen for diseases which might remain hidden if investigated with standard endoscopes. Due to the limited size of a video capsule, embedding AI models directly into the capsule demands careful consideration of the model size and thus complicates anomaly detection in this field. Furthermore, the scarcity of available data in this domain poses an ongoing challenge to achieving effective anomaly detection. Thus, this work introduces an ensemble strategy to address this challenge in anomaly detection tasks in video capsule endoscopies, requiring only a small number of individual neural networks during both the training and inference phases. Ensemble learning combines the predictions of multiple independently trained neural networks. This has been shown to be highly effective in enhancing both the accuracy and robustness of machine learning models. However, this comes at the cost of higher memory usage and increased computational effort, which quickly becomes prohibitive in many real-world applications. Instead of applying the same training algorithm to each individual network, we propose using various loss functions, drawn from the anomaly detection field, to train each network. The methods are validated on the two largest publicly available datasets for video capsule endoscopy images, the Galar and the Kvasir-Capsule dataset. We achieve an AUC score of 76.86% on the Kvasir-Capsule and an AUC score of 76.98% on the Galar dataset. Our approach outperforms current baselines with significantly fewer parameters across all models, which is a crucial step towards incorporating artificial intelligence into capsule endoscopies.
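The abstract says the ensemble combines the predictions of several independently trained networks and reports AUC, but it does not state the combination rule. The sketch below assumes a plain average of per-model anomaly scores (an assumption, since networks trained with different losses may need score normalization first) and computes AUC from its rank interpretation; all names are illustrative:

```python
def ensemble_scores(per_model_scores):
    """Average anomaly scores across models.

    per_model_scores[m][i] is model m's anomaly score for image i.
    A plain mean is assumed here; the paper's combination rule may differ.
    """
    n_models = len(per_model_scores)
    n_items = len(per_model_scores[0])
    return [sum(model[i] for model in per_model_scores) / n_models
            for i in range(n_items)]


def auc(scores, labels):
    """AUC as the chance that a random anomaly (label 1) outscores a random normal image (label 0)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties between an anomaly and a normal image count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    combined = ensemble_scores([[0.2, 0.9, 0.4], [0.4, 0.7, 0.6]])
    print(combined, auc(combined, [0, 1, 0]))
```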
[CV-24] CamContextI2V: Context-aware Controllable Video Generation
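Quick read: This paper addresses three limitations of image-to-video (I2V) diffusion models: they mainly animate static image conditions, they do not extend beyond the provided scene context, and their visual quality degrades when additional constraints such as camera trajectories are introduced. The key solution is CamContextI2V, which integrates multiple image conditions with 3D constraints and camera control to enrich both global semantics and fine-grained visual details, and adds temporal awareness for more coherent, context-aware video generation. Experiments show improvements in visual quality and camera controllability.
Link: https://arxiv.org/abs/2504.06022
Authors: Luis Denninger, Sina Mokhtarzadeh Azar, Juergen Gall
Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: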
Abstract:Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamContextI2V, an I2V model that integrates multiple image conditions with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability. We make our code and models publicly available at: this https URL.
[CV-25] Memory-Modular Classification: Learning to Generalize with Memory Replacement
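Quick read: This paper addresses the need to retrain traditional image classification models when new classes appear. It proposes a novel memory-modular learner whose core idea is to separate knowledge memorization from reasoning. The key is to store world knowledge, in the form of web-crawled image and text data, in an external memory instead of encoding both world knowledge and task-specific skills into the model weights as traditional models do. At inference time, the model dynamically selects relevant memory content based on the input image, so it can adapt to arbitrary new classes simply by replacing the memory contents, without retraining. This makes the model robust and flexible across diverse scenarios, including zero-shot/few-shot classification, fine-grained classification, and class-incremental classification.
Link: https://arxiv.org/abs/2504.06021
Authors: Dahyun Kang, Ahmet Iscen, Eunchan Jo, Sua Choi, Minsu Cho, Cordelia Schmid
Affiliations: POSTECH; Google; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to TMLR. Code available: this https URL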
Abstract:We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in the external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator is that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.
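To make the idea of adapting by memory replacement concrete, here is a deliberately simplified sketch. It swaps the paper's meta-learned model for a plain nearest-neighbour vote over an external memory, so it only illustrates the interface (same classifier, different memory), not the authors' method; the memory layout, the function names, and the value of `k` are all hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(query, memory, k=3):
    """Label a query feature by a similarity-weighted vote over its k nearest memory items.

    memory maps a class name to a list of feature vectors (hypothetical layout).
    """
    scored = [(cosine(query, item), label)
              for label, items in memory.items() for item in items]
    if not scored:
        raise ValueError("memory is empty")
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = {}
    for sim, label in scored[:k]:
        votes[label] = votes.get(label, 0.0) + sim
    return max(votes, key=votes.get)


if __name__ == "__main__":
    query = [0.9, 0.1]
    animals = {"cat": [[1.0, 0.0]], "dog": [[0.0, 1.0]]}
    vehicles = {"car": [[1.0, 0.0]], "bike": [[0.0, 1.0]]}
    # Same classifier code, different memory: the label set changes without retraining.
    print(classify(query, animals), classify(query, vehicles))
```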
[CV-26] Latent Multimodal Reconstruction for Misinformation Detection
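Quick read: This paper addresses the limited robustness of detection models in multimodal misinformation detection (MMD), which stems from the scarcity of large-scale annotated data. Existing methods synthesize training data (for example, out-of-context image-caption pairs or named entity manipulations), but the resulting misinformation is too simplistic to reflect real-world complexity. The key contributions are “MisCaption This!”, a training dataset of miscaptioned images generated by a large vision-language model (LVLM), and “Latent Multimodal Reconstruction” (LAMAR), a network that reconstructs the embeddings of truthful captions to provide a strong auxiliary signal for detection. The paper also explores several training strategies (end-to-end training and large-scale pre-training) and integration approaches (direct, mask, gate, and attention) to optimize LAMAR. Experiments show that training on “MisCaption This!” improves generalization to real-world misinformation, and that LAMAR sets a new state of the art on the NewsCLIPpings and VERITE benchmarks, highlighting the potential of LVLM-generated data and reconstruction-based methods for MMD.
Link: https://arxiv.org/abs/2504.06010
Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Affiliations: Information Technology Institute, Centre for Research & Technology, Hellas; Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: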
Abstract:Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image’s origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have been focusing on creating datasets and developing methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent studies leverage synthetic training data via out-of-context image-caption pairs or named entity manipulations; altering names, dates, and locations. However, these approaches often produce simplistic misinformation that fails to reflect real-world complexity, limiting the robustness of detection models trained on them. Meanwhile, despite recent advancements, Large Vision-Language Models (LVLMs) remain underutilized for generating diverse, realistic synthetic training data for MMD. To address this gap, we introduce “MisCaption This!”, a training dataset comprising LVLM-generated miscaptioned images. Additionally, we introduce “Latent Multimodal Reconstruction” (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to the detection process. To optimize LAMAR, we explore different training strategies (end-to-end training and large-scale pre-training) and integration approaches (direct, mask, gate, and attention). Extensive experiments show that models trained on “MisCaption This!” generalize better on real-world misinformation, while LAMAR sets new state-of-the-art on both NewsCLIPpings and VERITE benchmarks; highlighting the potential of LVLM-generated data and reconstruction-based approaches for advancing MMD. We release our code at: this https URL
[CV-27] FedFeat+: A Robust Federated Learning Framework Through Federated Aggregation and Differentially Private Feature-Based Classifier Retraining
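Quick read: This paper addresses insufficient model generalization and privacy protection in federated learning. The conventional federated averaging (FedAvg) method tends to lose performance on non-IID data, and transmitting features directly risks privacy leakage. To address these problems, the paper proposes the FedFeat+ framework, whose key idea is to separate feature extraction from classification through a two-tiered training process: after local training, clients upload their weights and some extracted features to the server; the server then aggregates the models with FedAvg and retrains the global classifier on the shared features to gain a more holistic view of the data distribution and thus better generalization. In addition, a differential privacy mechanism adds noise to the uploaded features to protect sensitive information. The approach achieves high classification accuracy and outperforms FedAvg on several benchmark datasets in both IID and non-IID settings.
Link: https://arxiv.org/abs/2504.06004
Authors: Mrityunjoy Gain, Kitae Kim, Avi Deb Raha, Apurba Adhikary, Eui-Nam Huh, Zhu Han, Choong Seon Hong
Affiliations: Department of Artificial Intelligence, School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea; Department of Computer Science and Engineering, School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea; Department of Information and Communication Engineering, Noakhali Science and Technology University, Noakhali-3814, Bangladesh; Electrical and Computer Engineering Department, University of Houston, Houston, TX 77004
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: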
Abstract:In this paper, we propose the FedFeat+ framework, which distinctively separates feature extraction from classification. We develop a two-tiered model training process: following local training, clients transmit their weights and some features extracted from the feature extractor from the final local epochs to the server. The server aggregates these models using the FedAvg method and subsequently retrains the global classifier utilizing the shared features. The classifier retraining process enhances the model’s understanding of the holistic view of the data distribution, ensuring better generalization across diverse datasets. This improved generalization enables the classifier to adaptively influence the feature extractor during subsequent local training epochs. We establish a balance between enhancing model accuracy and safeguarding individual privacy through the implementation of differential privacy mechanisms. By incorporating noise into the feature vectors shared with the server, we ensure that sensitive data remains confidential. We present a comprehensive convergence analysis, along with theoretical reasoning regarding performance enhancement and privacy preservation. We validate our approach through empirical evaluations conducted on benchmark datasets, including CIFAR-10, CIFAR-100, MNIST, and FMNIST, achieving high accuracy while adhering to stringent privacy guarantees. The experimental results demonstrate that the FedFeat+ framework, despite using only a lightweight two-layer CNN classifier, outperforms the FedAvg method in both IID and non-IID scenarios, achieving accuracy improvements ranging from 3.92% to 12.34% across CIFAR-10, CIFAR-100, and Fashion-MNIST datasets.
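Two steps in the abstract lend themselves to a small sketch: FedAvg aggregation on the server and noise added to feature vectors before they leave the client. The version below assumes sample-count weighting for FedAvg (the usual form) and a clip-then-Gaussian-noise mechanism; the clipping norm, the noise scale `sigma`, and the function names are hypothetical, and calibrating `sigma` to a formal privacy budget is left out:

```python
import math
import random


def fedavg(client_weights, client_sizes):
    """Weighted average of flat client weight vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]


def privatize_features(feature, clip_norm=1.0, sigma=0.5, rng=None):
    """Clip a feature vector to clip_norm in L2 norm, then add Gaussian noise.

    clip_norm and sigma are illustrative values, not taken from the paper.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(x * x for x in feature))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale + rng.gauss(0.0, sigma) for x in feature]


if __name__ == "__main__":
    # Two clients with 1 and 3 local samples respectively.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    print(privatize_features([3.0, 4.0], rng=random.Random(0)))
```

Clipping bounds how much any single feature vector can influence the shared data, which is what lets the added noise mask individual contributions; the server would then retrain its classifier on these noisy features.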
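[CV-28] econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians
[Quick read]: This paper targets two main problems in existing work on open-vocabulary neural fields: 1) over-reliance on SAM (Segment Anything Model) to regularize CLIP (Contrastive Language-Image Pretraining) without further refinement; and 2) reducing the dimensionality of 2D vision-language model (VLM) semantic features for efficiency, which causes multi-view inconsistency. To address them, the paper proposes econSG. Its key components are a Confidence-region Guided Regularization (CRR) that mutually refines SAM and CLIP to obtain more precise semantic features with complete boundaries, and a low-dimensional contextual space in which multi-view 2D features are fused first and dimensionality reduction is then applied directly to the fused 3D features, preserving multi-view consistency while improving computational efficiency. Experiments show that econSG achieves state-of-the-art performance on four benchmark datasets and has the highest training efficiency among the compared methods.
Link: https://arxiv.org/abs/2504.06003
Authors: Can Zhang, Gim Hee Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: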
Abstract:The primary focus of most recent works on open-vocabulary neural fields is extracting precise semantic features from the VLMs and then consolidating them efficiently into a multi-view consistent 3D neural field representation. However, most existing works over-trust SAM to regularize image-level CLIP without any further refinement. Moreover, several existing works improve efficiency by dimensionality reduction of semantic features from 2D VLMs before fusing with 3DGS semantic fields, which inevitably leads to multi-view inconsistency. In this work, we propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG consists of: 1) A Confidence-region Guided Regularization (CRR) that mutually refines SAM and CLIP to get the best of both worlds for precise semantic features with complete and precise boundaries. 2) A low dimensional contextual space to enforce 3D multi-view consistency while improving computational efficiency by fusing backprojected multi-view 2D features, followed by dimensionality reduction directly on the fused 3D features instead of operating on each 2D view separately. Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods. Furthermore, econSG is also the most efficient to train among all the methods.
[CV-29] An Empirical Study of GPT-4o Image Generation Capabilities
[Quick read]: This paper explores the overall image generation capabilities of GPT-4o and assesses whether text and image generation can be integrated within a unified framework. Benchmarking against leading open-source and commercial models, it systematically studies GPT-4o on more than 20 tasks spanning text-to-image, image-to-image, image-to-3D, and image-to-X generation, analyzes its strengths and limitations, and situates it within the evolution of generative modeling. The key takeaway is the importance of architectural design and data scaling for building future unified generative models and advancing high-fidelity multimodal generation.
Link: https://arxiv.org/abs/2504.05979
Authors: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Although recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o’s image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.
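[CV-30] Diffusion Based Ambiguous Image Segmentation
[Quick read]: This paper addresses the inherent uncertainty in medical image segmentation caused by variation among expert annotations, aiming to capture the full distribution of plausible expert annotations with diffusion models. The key is an exploration of the design space of diffusion models for generative segmentation, in particular the effects of the noise schedule, the prediction type, and the loss weighting. The study finds that making the noise schedule harder through input scaling significantly improves performance. In addition, x- and v-prediction outperform ε-prediction, and many loss weightings perform comparably as long as they give enough weight to the end of the diffusion process. Experiments on the LIDC-IDRI lung lesion dataset reach state-of-the-art (SOTA) performance; the paper also introduces a randomly cropped variant of the dataset that better reflects uncertainty in segmentation, on which the model likewise achieves SOTA.
Link: https://arxiv.org/abs/2504.05977
Authors: Jakob Lønborg Christensen, Morten Rieger Hannemose, Anders Bjorholm Dahl, Vedrana Andersen Dahl
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at SCIA25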
Abstract:Medical image segmentation often involves inherent uncertainty due to variations in expert annotations. Capturing this uncertainty is an important goal and previous works have used various generative image models for the purpose of representing the full distribution of plausible expert ground truths. In this work, we explore the design space of diffusion models for generative segmentation, investigating the impact of noise schedules, prediction types, and loss weightings. Notably, we find that making the noise schedule harder with input scaling significantly improves performance. We conclude that x- and v-prediction outperform epsilon-prediction, likely because the diffusion process is in the discrete segmentation domain. Many loss weightings achieve similar performance as long as they give enough weight to the end of the diffusion process. We base our experiments on the LIDC-IDRI lung lesion dataset and obtain state-of-the-art (SOTA) performance. Additionally, we introduce a randomly cropped variant of the LIDC-IDRI dataset that is better suited for uncertainty in image segmentation. Our model also achieves SOTA in this harder setting.
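The three prediction types compared in this abstract are related by simple algebra, which the sketch below checks numerically. It assumes the common variance-preserving parameterization x_t = alpha * x_0 + sigma * eps with alpha^2 + sigma^2 = 1, a cosine schedule, and v = alpha * eps - sigma * x_0; the abstract does not state which schedule the paper uses, and the `input_scale` argument is a hypothetical stand-in for the input scaling it mentions.

```python
import math
import random

def schedule(t):
    """Cosine schedule for t in [0, 1]; alpha**2 + sigma**2 == 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def snr(t, input_scale=1.0):
    """Signal-to-noise ratio when the clean input is multiplied by input_scale."""
    alpha, sigma = schedule(t)
    return (input_scale * alpha / sigma) ** 2

def forward_diffuse(x0, eps, t):
    alpha, sigma = schedule(t)
    return [alpha * a + sigma * e for a, e in zip(x0, eps)]

def v_target(x0, eps, t):
    alpha, sigma = schedule(t)
    return [alpha * e - sigma * a for a, e in zip(x0, eps)]

def x0_from_v(x_t, v, t):
    alpha, sigma = schedule(t)
    return [alpha * xt - sigma * vi for xt, vi in zip(x_t, v)]

def eps_from_v(x_t, v, t):
    alpha, sigma = schedule(t)
    return [sigma * xt + alpha * vi for xt, vi in zip(x_t, v)]

rng = random.Random(0)
mask = [1.0, -1.0, 1.0, -1.0]        # binary segmentation mapped to {-1, +1}
x0 = [0.5 * m for m in mask]         # input scaling with a factor of 0.5 (assumed value)
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = 0.3
x_t = forward_diffuse(x0, eps, t)
v = v_target(x0, eps, t)
x0_rec = x0_from_v(x_t, v, t)
eps_rec = eps_from_v(x_t, v, t)
```

Scaling the input by a factor below 1 lowers the signal-to-noise ratio at every timestep, which is one way to read "making the noise schedule harder"; `snr(t, 0.5)` is a quarter of `snr(t, 1.0)`.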
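[CV-31] Temporal Alignment-Free Video Matching for Few-shot Action Recognition CVPR2025
[Quick read]: This paper addresses a key challenge in Few-Shot Action Recognition (FSAR): handling divergent narrative trajectories of actions for precise video matching. Existing methods, such as frame- or tuple-level alignment, depend on pre-defined, length-dependent alignment units, which limits their flexibility for actions of varying lengths and speeds. The paper proposes TEmporal Alignment-free Matching (TEAM), whose key idea is to represent actions without temporal units and to avoid brute-force alignment during matching. TEAM uses a fixed set of pattern tokens to capture globally discriminative clues within a video instance, which keeps it flexible, and measures video similarity efficiently through token-wise comparisons rather than the pairwise temporal alignment used by existing methods. An additional adaptation process identifies and removes information shared across classes, establishing clear boundaries between novel categories. Experiments validate the effectiveness of TEAM.
Link: https://arxiv.org/abs/2504.05956
Authors: SuBeen Lee, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, 6 tables, Accepted to CVPR 2025 as Oral Presentation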
Abstract:Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While the frame- and tuple-level alignment approaches have been promising, their methods heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM. Codes are available at this http URL.
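[CV-32] CKGAN: Training Generative Adversarial Networks Using Characteristic Kernel Integral Probability Metrics
[Quick read]: This paper addresses mode collapse in generative adversarial networks (GANs) and proposes CKGAN, a new variant based on an integral probability metric framework with characteristic kernels (CKIPM). The key is to use CKIPM as the distance between two probability distributions; it is designed to optimize a lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS) and can therefore be used to train GANs. CKGAN also mitigates mode collapse by mapping generated images back to random noise, and it introduces a soft selection method that learns the characteristic kernel function automatically, avoiding tedious manual kernel selection. Experiments show that CKGAN generally outperforms other MMD-based GANs on synthetic and real image datasets (MNIST, CelebA, etc.), and that the automatically selected kernel comes close to the best manually tuned one on real image tasks while also improving other MMD-based GANs.
Link: https://arxiv.org/abs/2504.05945
Authors: Kuntian Zhang, Simin Yu, Yaoshu Wang, Makoto Onizuka, Chuan Xiao
Affiliations: Osaka University; Shenzhen University; Shenzhen Institute of Computing Sciences; Nagoya University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Source codes are available at this https URL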
Abstract:In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that at the cost of moderately more training time, the automatically selected kernel function delivers very close performance to the best manually fine-tuned one on real image benchmarks and is able to improve the performance of other MMD-based GANs.
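For readers unfamiliar with the quantity that CKIPM bounds, the following is a minimal sketch of the standard biased estimate of squared MMD with a fixed Gaussian kernel, which is a characteristic kernel. It is not CKIPM or the paper's soft kernel selection; the bandwidth value and function names are assumptions for illustration.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

real = [[0.0], [1.0], [0.5]]
fake_close = [[0.1], [0.9], [0.6]]
fake_far = [[5.0], [6.0], [5.5]]
print(mmd_squared(real, fake_close) < mmd_squared(real, fake_far))  # True
```

In an MMD-based GAN the samples would be features of real and generated images and the estimate would be used as a training signal; the quadratic cost in the batch size is a known practical limitation of this estimator.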
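[CV-33] SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR2025
[Quick read]: This paper addresses the challenges that vision-language temporal alignment faces in real-world scenarios, in particular the limitations of existing benchmarks caused by biased temporal distributions, imprecise annotations, and insufficient compositionality. Its core goal is to evaluate, from a temporal perspective, how well models align visual scenarios with linguistic context, and to enable fair evaluation and comprehensive exploration. To that end, the authors first present a statistical analysis of existing benchmarks, revealing their challenges from a decomposed perspective. The key to the solution is SVLTA (Synthetic Vision-Language Temporal Alignment), generated in a simulation environment with a well-designed and feasible controlled generation method that combines commonsense knowledge, manipulable actions, and constrained filtering to produce reasonable, diverse, and balanced data distributions for diagnostic evaluation. The benchmark supports diagnostic evaluation of temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
Link: https://arxiv.org/abs/2504.05925
Authors: Hao Du, Bo Wu, Yan Lu, Zhendong Mao
Affiliations: University of Science and Technology of China; MIT-IBM Watson AI Lab; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. The first two authors contributed equally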
Abstract:Vision-language temporal alignment is a crucial capability for human dynamic recognition and cognition in real-world scenarios. While existing research focuses on capturing vision-language relevance, it faces limitations due to biased temporal distributions, imprecise annotations, and insufficient compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present the statistical analysis of existing benchmarks and reveal the existing challenges from a decomposed perspective. To this end, we introduce SVLTA, the Synthetic Vision-Language Temporal Alignment derived via a well-designed and feasible control generation method within a simulation environment. The approach considers commonsense knowledge, manipulable action, and constrained filtering, which generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through the evaluations in temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
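[CV-34] Balancing long- and short-term dynamics for the modeling of saliency in videos
[Quick read]: This paper studies the role of long- and short-term dynamics in video salient object detection, an under-explored area. It proposes a Transformer-based approach that jointly learns a representation of video frames and past saliency information. The key is that the model embeds both long- and short-term information by decomposing the frame sequence into spatiotemporal tokens, so that it can use short-term information within a token while making long-term connections between tokens across the sequence. The model uses a dual-stream Transformer architecture that processes the two modalities separately before fusing them, and a saliency-guided masking scheme to learn an embedding that helps recognize deviations from previous outputs. The study finds that the prior information aids the first detection of the salient location and that the ratio of long- to short-term spatiotemporal features directly affects performance, with an expanded long-term context being especially beneficial.
Link: https://arxiv.org/abs/2504.05913
Authors: Theodor Wulff, Fares Abawi, Philipp Allgeuer, Stefan Wermter
Affiliations: Department of Computer Science, The University of Manchester; Department of Informatics, Universität Hamburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: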
Abstract:The role of long- and short-term dynamics towards salient object detection in videos is under-researched. We present a Transformer-based approach to learn a joint representation of video frames and past saliency information. Our model embeds long- and short-term information to detect dynamically shifting saliency in video. We provide our model with a stream of video frames and past saliency maps, which acts as a prior for the next prediction, and extract spatiotemporal tokens from both modalities. The decomposition of the frame sequence into tokens lets the model incorporate short-term information from within the token, while being able to make long-term connections between tokens throughout the sequence. The core of the system consists of a dual-stream Transformer architecture to process the extracted sequences independently before fusing the two modalities. Additionally, we apply a saliency-based masking scheme to the input frames to learn an embedding that facilitates the recognition of deviations from previous outputs. We observe that the additional prior information aids in the first detection of the salient location. Our findings indicate that the ratio of spatiotemporal long- and short-term features directly impacts the model’s performance. While increasing the short-term context is beneficial up to a certain threshold, the model’s performance greatly benefits from an expansion of the long-term context.
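The decomposition of a frame sequence into spatiotemporal tokens can be pictured with a small indexing sketch. The token size (2 frames by 16 by 16 pixels) and the function names are assumptions chosen for illustration; the abstract does not give the tokenization the paper uses. Each token spans `t` consecutive frames (short-term context), while relations between tokens at different frame offsets carry the long-term context.

```python
def tubelet_tokens(num_frames, height, width, t=2, p=16):
    """List the (frame, row, col) origin of each non-overlapping t x p x p token."""
    if num_frames % t or height % p or width % p:
        raise ValueError("clip dimensions must be divisible by the token size")
    return [
        (f, r, c)
        for f in range(0, num_frames, t)
        for r in range(0, height, p)
        for c in range(0, width, p)
    ]

def token_index(frame, row, col, height, width, t=2, p=16):
    """Map a pixel at (frame, row, col) to the index of the token containing it."""
    rows, cols = height // p, width // p
    return (frame // t) * rows * cols + (row // p) * cols + (col // p)

tokens = tubelet_tokens(num_frames=8, height=64, width=64)
idx = token_index(frame=3, row=20, col=40, height=64, width=64)
print(len(tokens), idx, tokens[idx])  # 64 22 (2, 16, 32)
```

Increasing `t` widens the short-term window inside each token, while feeding more frames lengthens the token sequence and therefore the long-term context; the abstract reports that the balance between the two matters for performance.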
[CV-35] PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario CVPR
[Quick read]: This paper addresses the failure of traditional autonomous-driving scene understanding methods to capture the probabilistic nature and uncertainty of real driving environments. It proposes PRIMEDrive-CoT, a new uncertainty-aware model whose key is to combine LiDAR-based 3D object detection with multi-view RGB references for interpretable and reliable scene understanding. Using Bayesian Graph Neural Networks (BGNNs) for probabilistic reasoning, the model captures uncertainty and risk, supports interpretable decisions through object dynamics and contextual cues, and uses Grad-CAM visualizations to highlight attention regions. Experiments on the DriveCoT dataset show that PRIMEDrive-CoT outperforms existing Chain-of-Thought (CoT) and risk-aware methods.
Link: https://arxiv.org/abs/2504.05908
Authors: Sriram Mandalika, Lalitha V, Athira Nambiar
Affiliations: Department of Computational Intelligence, SRM Institute of Science and Technology; Department of Electronics and Communication Engineering, Faculty of Engineering and Technology, SRM Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - CVPRW
Abstract:Driving scene understanding is a critical real-world problem that involves interpreting and associating various elements of a driving environment, such as vehicles, pedestrians, and traffic signals. Despite advancements in autonomous driving, traditional pipelines rely on deterministic models that fail to capture the probabilistic nature and inherent uncertainty of real-world driving. To address this, we propose PRIMEDrive-CoT, a novel uncertainty-aware model for object interaction and Chain-of-Thought (CoT) reasoning in driving scenarios. In particular, our approach combines LiDAR-based 3D object detection with multi-view RGB references to ensure interpretable and reliable scene understanding. Uncertainty and risk assessment, along with object interactions, are modelled using Bayesian Graph Neural Networks (BGNNs) for probabilistic reasoning under ambiguous conditions. Interpretable decisions are facilitated through CoT reasoning, leveraging object dynamics and contextual cues, while Grad-CAM visualizations highlight attention regions. Extensive evaluations on the DriveCoT dataset demonstrate that PRIMEDrive-CoT outperforms state-of-the-art CoT and risk-aware models.
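[CV-36] Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation
[Quick read]: This paper addresses two shortcomings of existing Unsupervised Video Object Segmentation (UVOS) methods: inadequate modeling of the relationship between motion and appearance features, and the effect of optical flow quality on segmentation results. Mainstream methods either use a two-encoder structure to process motion and appearance features separately or a single-encoder structure to encode them jointly, but neither balances the motion-appearance relationship well, so the suboptimal features they extract degrade overall performance. In addition, optical flow quality varies across scenarios, so relying on optical flow alone rarely yields high-quality segmentation.
To address these challenges, the paper proposes the Intrinsic Saliency guided Trunk-Collateral Network (ISTC-Net). Its key is a novel Trunk-Collateral structure in which a shared trunk backbone captures what motion and appearance have in common while a collateral branch learns what is unique to motion features. An Intrinsic Saliency guided Refinement Module (ISRM) uses the model’s intrinsic saliency information to refine high-level features and to provide pixel-level guidance for motion-appearance fusion, improving performance without additional input. Experiments show that ISTC-Net achieves state-of-the-art performance on three UVOS datasets and four standard video salient object detection benchmarks.
Link: https://arxiv.org/abs/2504.05904
Authors: Xiangyu Zheng, Wanyun Li, Songcheng He, Xiaoqiang Li, We Zhang
Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University; School of Computer Engineering and Science, Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: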
Abstract:Recent unsupervised video object segmentation (UVOS) methods predominantly adopt the motion-appearance paradigm. Mainstream motion-appearance approaches use either the two-encoder structure to separately encode motion and appearance features, or the single-encoder structure for joint encoding. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the extracted suboptimal features degrade the models’ overall performance. Moreover, the quality of optical flow varies across scenarios, making it insufficient to rely solely on optical flow to achieve high-quality segmentation results. To address these challenges, we propose the Intrinsic Saliency guided Trunk-Collateral Network (ISTC-Net), which better balances the motion-appearance relationship and incorporates the model’s intrinsic saliency information to enhance segmentation performance. Specifically, because optical flow maps are derived from RGB images, the two share both commonalities and differences, so we propose a novel Trunk-Collateral structure. The shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model’s intrinsic saliency information to refine high-level features, and provide pixel-level guidance for motion-appearance fusion, thereby enhancing performance without additional input. Experimental results show that ISTC-Net achieves state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and four standard video salient object detection (VSOD) benchmarks with notable gains, demonstrating its effectiveness and superiority over previous methods.
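[CV-37] UVG-VPC: Voxelized Point Cloud Dataset for Visual Volumetric Video-based Coding
[Quick read]: This paper addresses key challenges of point cloud compression in immersive visual media processing and streaming, with the core goal of advancing MPEG Visual Volumetric Video-based Coding (V3C) technology. Its key contribution is a new open dataset, UVG-VPC, containing 12 point cloud test video sequences with diverse characteristics in motion, RGB texture, 3D geometry, and surface occlusion. Each sequence is 10 seconds long and comprises 250 frames at 25 frames per second. The point clouds are voxelized with a geometry precision of 9 to 12 bits, voxel color attributes are stored as 8-bit RGB values, and associated normals are included to make the dataset better suited to evaluating point cloud compression schemes. The dataset is intended to foster research on and validation of V3C technology.
Link: https://arxiv.org/abs/2504.05888
Authors: Guillaume Gautier, Alexandre Mercat, Louis Fréneau, Mikko Pitkänen, Jarno Vanne
Affiliations: unknown
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Point cloud compression;Geometry;Visualization;Three-dimensional displays;Video sequences;Transform coding;Media;Open dataset;point cloud;Visual Volumetric Video-based Coding (V3C);Video-based Point Cloud Compression (V-PCC);Extended Reality (XR)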
Abstract:Point cloud compression has become a crucial factor in immersive visual media processing and streaming. This paper presents a new open dataset called UVG-VPC for the development, evaluation, and validation of MPEG Visual Volumetric Video-based Coding (V3C) technology. The dataset is distributed under its own non-commercial license. It consists of 12 point cloud test video sequences of diverse characteristics with respect to the motion, RGB texture, 3D geometry, and surface occlusion of the points. Each sequence is 10 seconds long and comprises 250 frames captured at 25 frames per second. The sequences are voxelized with a geometry precision of 9 to 12 bits, and the voxel color attributes are represented as 8-bit RGB values. The dataset also includes associated normals that make it more suitable for evaluating point cloud compression solutions. The main objective of releasing the UVG-VPC dataset is to foster the development of V3C technologies and thereby shape the future in this field.
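To make "voxelized with a geometry precision of 9 to 12 bits" concrete, the sketch below quantizes floating-point points onto an integer grid with 2^bits positions per axis and averages the 8-bit colours of points that fall into the same voxel. It illustrates the general idea only: the uniform scaling by the largest bounding-box extent, the colour averaging, and the function name are assumptions, since the abstract does not describe the dataset's actual voxelization pipeline.

```python
def voxelize(points, bits):
    """Quantize (x, y, z, (r, g, b)) points onto a grid of 2**bits cells per axis."""
    max_index = (1 << bits) - 1
    lows = [min(p[i] for p in points) for i in range(3)]
    highs = [max(p[i] for p in points) for i in range(3)]
    # One scale for all axes keeps the aspect ratio (assumption).
    extent = max(h - l for h, l in zip(highs, lows)) or 1.0
    voxels = {}
    for x, y, z, rgb in points:
        key = tuple(
            round((c - l) / extent * max_index) for c, l in zip((x, y, z), lows)
        )
        voxels.setdefault(key, []).append(rgb)
    # Average the colours of all points that share a voxel.
    return {
        key: tuple(round(sum(channel) / len(channel)) for channel in zip(*colours))
        for key, colours in voxels.items()
    }

points = [
    (0.0, 0.0, 0.0, (255, 0, 0)),
    (1.0, 1.0, 1.0, (0, 0, 255)),
    (1.0, 1.0, 1.0, (0, 0, 253)),
]
grid = voxelize(points, bits=10)
frames_per_sequence = 10 * 25  # 10 seconds at 25 frames per second
print(grid)
```

With 10-bit precision each coordinate lies in 0..1023, so the two coincident points collapse into the voxel (1023, 1023, 1023) with colour (0, 0, 254), and the frame count matches the 250 frames per sequence stated in the abstract.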
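[CV-38] Turin3D: Evaluating Adaptation Strategies under Label Scarcity in Urban LiDAR Segmentation with Semi-Supervised Techniques CVPR
[Quick read]: This paper addresses the challenges of point cloud semantic segmentation in urban environments, in particular the lack of large-scale annotated data. The key to its solution is Turin3D, a new aerial LiDAR dataset, combined with semi-supervised learning to improve point cloud semantic segmentation models. Because full annotation is complex and time-consuming, the team manually annotated only the validation and test sets and used the unlabelled training set to improve the models through semi-supervised methods. The dataset will be made publicly available to support research on self-supervised and semi-supervised learning for outdoor point cloud segmentation.
Link: https://arxiv.org/abs/2504.05882
Authors: Luca Barco, Giacomo Blanco, Gaetano Chiriaco, Alessia Intini, Luigi La Riccia, Vittorio Scolamiero, Piero Boccardo, Paolo Garza, Fabrizio Dominici
Affiliations: Politecnico di Torino; LINKS Foundation; Sapienza Università di Roma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPRW2025 - USM3D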
Abstract:3D semantic segmentation plays a critical role in urban modelling, enabling detailed understanding and mapping of city environments. In this paper, we introduce Turin3D: a new aerial LiDAR dataset for point cloud semantic segmentation covering an area of around 1.43 km² in the city centre of Turin with almost 70M points. We describe the data collection process and compare Turin3D with other datasets previously proposed in the literature. We did not fully annotate the dataset due to the complexity and time-consuming nature of the process; however, a manual annotation process was performed on the validation and test sets, to enable a reliable evaluation of the proposed techniques. We first benchmark the performance of several point cloud semantic segmentation models, trained on the existing datasets, when tested on Turin3D, and then improve their performance by applying a semi-supervised learning technique leveraging the unlabelled training set. The dataset will be publicly available to support research in outdoor point cloud segmentation, with particular relevance for self-supervised and semi-supervised learning approaches given the absence of ground truth annotations for the training set.
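[CV-39] KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection ICME2025
[Quick read]: This paper addresses the limited generalization of existing RGB-thermal (RGB-T) salient object detection (SOD) methods in complex scenes, which stems from the constrained diversity of datasets and inefficient construction of multi-modal representations. The key is KAN-SAM, a new prompt learning-based RGB-T SOD method. Specifically, it extends Segment Anything Model 2 (SAM2) by introducing thermal features as guiding prompts through efficient Kolmogorov-Arnold Network (KAN) adapters, which enhances RGB representations and improves robustness. A mutually exclusive random masking strategy further reduces reliance on RGB data and improves generalization. Experiments show that the method outperforms state-of-the-art methods on benchmark datasets.
Link: https://arxiv.org/abs/2504.05878
Authors: Xingyuan Li, Ruichao Hou, Tongwei Ren, Gangshan Wu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by ICME2025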
Abstract:Existing RGB-thermal salient object detection (RGB-T SOD) methods aim to identify visually significant objects by leveraging both RGB and thermal modalities to enable robust performance in complex scenarios, but they often suffer from limited generalization due to the constrained diversity of available datasets and the inefficiencies in constructing multi-modal representations. In this paper, we propose a novel prompt learning-based RGB-T SOD method, named KAN-SAM, which reveals the potential of visual foundational models for RGB-T SOD tasks. Specifically, we extend Segment Anything Model 2 (SAM2) for RGB-T SOD by introducing thermal features as guiding prompts through efficient and accurate Kolmogorov-Arnold Network (KAN) adapters, which effectively enhance RGB representations and improve robustness. Furthermore, we introduce a mutually exclusive random masking strategy to reduce reliance on RGB data and improve generalization. Experimental results on benchmarks demonstrate superior performance over the state-of-the-art methods.
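[CV-40] On the Importance of Conditioning for Privacy-Preserving Data Augmentation
[Quick read]: This paper addresses the insufficient privacy protection offered by data anonymization with conditioned latent diffusion models. Its key finding is that when the diffusion process is guided by features such as depth maps or edges, the anonymized images remain vulnerable to identification by a model trained with contrastive learning, so the method does not protect privacy effectively. The key insight is that the vulnerability comes from the conditioning itself: because the diffusion model follows a guidance signal (for example, it reproduces similar edge patterns), a trained model can learn these recognizable patterns and identify people.
Link: https://arxiv.org/abs/2504.05849
Authors: Julian Lorenz, Katja Ludwig, Valentin Haug, Rainer Lienhart
Affiliations: University of Augsburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: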
Abstract:Latent diffusion models can be used as a powerful augmentation method to artificially extend datasets for enhanced training. To the human eye, these augmented images look very different to the originals. Previous work has suggested using this data augmentation technique for data anonymization. However, we show that latent diffusion models that are conditioned on features like depth maps or edges to guide the diffusion process are not suitable as a privacy-preserving method. We use a contrastive learning approach to train a model that can correctly identify people out of a pool of candidates. Moreover, we demonstrate that anonymization using conditioned diffusion models is susceptible to black-box attacks. We attribute the success of the described methods to the conditioning of the latent diffusion model in the anonymization process. The diffusion model is instructed to produce similar edges for the anonymized images. Hence, a model can learn to recognize these patterns for identification.
[CV-41] Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking CVPR2025
[Quick read]: This paper reveals a new attack threat, the hijacking attack, that arises when text-to-image diffusion models (T2I-DMs) are equipped with the Image Prompt Adapter (IP-Adapter). By uploading imperceptible image-space adversarial examples (AEs), an adversary can hijack large numbers of benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public into discrediting the service provider. The IP-Adapter’s dependency on open-source image encoders also lowers the knowledge needed to craft AEs, which aggravates the threat. The authors verify the technical feasibility of the attack, examine existing defenses, and explore combining the IP-Adapter with adversarially trained models to overcome the limitations of those defenses. The key contributions are exposing the new attack and its technical causes, and the defense idea of pairing the IP-Adapter with adversarially trained models.
Link: https://arxiv.org/abs/2504.05838
Authors: Junxi Chen, Junhao Dong, Xiaohua Xie
Affiliations: School of Computer Science and Engineering, Sun Yat-Sen University, China; Nanyang Technological University, Singapore; Guangdong Province Key Laboratory of Information Security Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by CVPR2025 as Highlight
Abstract:Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack massive benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter’s dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses’ limitations. Our code is available at this https URL.
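[CV-42] Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset AAAI2024
[Quick read]: This paper addresses the drop in human activity recognition (HAR) performance that traditional RGB cameras suffer in real-world scenarios because of factors such as insufficient lighting and rapid motion. To tackle this, it combines the strengths of RGB and event cameras and builds HARDVS 2.0, a large-scale multi-modal RGB-Event HAR benchmark dataset covering 300 categories of everyday real-world actions. The key to the solution is the multi-modal heat conduction operation framework MMHCO-HAR. Its core is the multi-modal Heat Conduction block, and in particular the multi-modal Heat Conduction Operation layer, which fuses the feature embeddings of RGB frames and event streams through a physics-inspired heat conduction model; an adaptive fusion module based on a policy routing strategy then performs high-performance classification, improving recognition accuracy and robustness.
Link: https://arxiv.org/abs/2504.05830
Authors: Shiao Wang, Xiao Wang, Bo Jiang, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian, Jin Tang
Affiliations: Anhui Medical University; Peking University; Institute of Automation, Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Journal Extension of HARDVS (AAAI 2024)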
Abstract:Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, the challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. To address these challenges, biologically inspired event cameras offer a promising solution to overcome the limitations of traditional RGB cameras. In this work, we rethink human activity recognition by combining the RGB and event cameras. The first contribution is the proposed large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gaps. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract the feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features, the key module of which is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient via FVEs into this module. After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method consistently performs well, validating its effectiveness and robustness. The source code and benchmark dataset will be released on this https URL.
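[CV-43] Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models
[Quick read]: This paper addresses backdoor attacks on diffusion models in image-to-image tasks, in particular the limited concealability and flexibility of existing methods. Traditional backdoor attacks mainly target noise-to-image or text-to-image tasks and usually rely on a single, conspicuous trigger to generate a fixed target image. They are also easily detected by existing defense frameworks.
To address these problems, the paper proposes “Parasite”, a new backdoor attack method designed for image-to-image tasks in diffusion models. Its key novelty is that it is the first to use steganography to hide the trigger, making the attack more concealed, and that it lets the attacker embed the target content as the backdoor trigger, making the attack more flexible. In the experiments, “Parasite” bypassed mainstream defense frameworks with a 0 percent detection rate. The paper also discusses how different hiding coefficients affect the attack results.
Link: https://arxiv.org/abs/2504.05815
Authors: Jiahao Chen, Yu Pan, Yi Du, Chunkai Wu, Lin Wang
Affiliations: School of Computer and Information Engineering, Shanghai Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: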
Abstract:Recently, the diffusion model has gained significant attention as one of the most successful image generation models, which can generate high-quality images by iteratively sampling noise. However, recent studies have shown that diffusion models are vulnerable to backdoor attacks, allowing attackers to enter input data containing triggers to activate the backdoor and generate their desired output. Existing backdoor attack methods have primarily focused on target noise-to-image and text-to-image tasks, with limited work on backdoor attacks in image-to-image tasks. Furthermore, traditional backdoor attacks often rely on a single, conspicuous trigger to generate a fixed target image, lacking concealability and flexibility. To address these limitations, we propose a novel backdoor attack method called “Parasite” for image-to-image tasks in diffusion models, which not only is the first to leverage steganography for trigger hiding, but also allows attackers to embed the target content as a backdoor trigger to achieve a more flexible attack. As a novel attack method, “Parasite” effectively bypasses existing detection frameworks to execute backdoor attacks. In our experiments, “Parasite” achieved a 0 percent backdoor detection rate against the mainstream defense frameworks. In addition, in the ablation study, we discuss the influence of different hiding coefficients on the attack results. You can find our code at this https URL.
[CV-44] PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
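【Quick Read】: This paper targets two weaknesses of existing Direct Preference Optimization (DPO) methods for Video Multimodal Large Language Models (VLLMs): their reliance on offline preference data limits adaptability, and they fail to capture true misalignment between video and response. It proposes Video Direct Preference Optimization (VDPO), an online preference learning framework that generates rejected samples through video augmentation while keeping the response fixed, so no preference annotation is needed. Choosing effective augmentations is non-trivial, however: under some prompts an augmented clip can be semantically identical to the original, which causes false rejections and disrupts alignment. To handle this, the paper introduces Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), whose key idea is to select augmentations based on the prompt context. Instead of a single rejected sample, it builds a candidate set of augmented clips and applies a close-to-far selection strategy that first keeps all clips semantically relevant and then prioritizes the clip that is most distinct with respect to the prompt. This helps the model capture meaningful visual differences and reduce hallucinations while avoiding false rejections. PaMi-VDPO integrates into existing VLLMs without additional parameters or GPT-4/human supervision. With only 10k SFT samples it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while keeping performance stable on general video benchmarks.
Link: https://arxiv.org/abs/2504.05810
Authors: Xinpeng Ding,Kui Zhang,Jinahua Han,Lanqing Hong,Hang Xu,Xiaomeng Li
Affiliations: The Hong Kong University of Science and Technology; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: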
Abstract:Direct Preference Optimization (DPO) helps reduce hallucinations in Video Multimodal Large Language Models (VLLMs), but its reliance on offline preference data limits adaptability and fails to capture true video-response misalignment. We propose Video Direct Preference Optimization (VDPO), an online preference learning framework that eliminates the need for preference annotation by leveraging video augmentations to generate rejected samples while keeping responses fixed. However, selecting effective augmentations is non-trivial, as some clips may be semantically identical to the original under specific prompts, leading to false rejections and disrupting alignment. To address this, we introduce Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), which selects augmentations based on prompt context. Instead of a single rejection, we construct a candidate set of augmented clips and apply a close-to-far selection strategy, initially ensuring all clips are semantically relevant while then prioritizing the most prompt-aware distinct clip. This allows the model to better capture meaningful visual differences, mitigating hallucinations, while avoiding false rejections, and improving alignment. PaMi-VDPO seamlessly integrates into existing VLLMs without additional parameters or GPT-4/human supervision. With only 10k SFT data, it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while maintaining stable performance on general video benchmarks.
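As a rough illustration of the preference objective described in this abstract, the sketch below evaluates a DPO-style loss in which the response is held fixed and only the video differs between the chosen and the rejected sample. The function name `vdpo_loss`, its arguments, and the default `beta` are assumptions made for this illustration. The abstract does not state the exact objective, so this is the standard DPO form adapted to that setup, not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def vdpo_loss(logp_orig: float, logp_aug: float,
              ref_logp_orig: float, ref_logp_aug: float,
              beta: float = 0.1) -> float:
    """DPO-style loss with a fixed response and a varied video (hypothetical).

    logp_orig / logp_aug: policy log-probability of the same response given
    the original clip (chosen) or the augmented clip (rejected).
    ref_logp_orig / ref_logp_aug: the same quantities under a frozen
    reference model.
    """
    chosen_margin = logp_orig - ref_logp_orig
    rejected_margin = logp_aug - ref_logp_aug
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# The loss is lower when the response is more likely under the original clip
# than under the augmented one, relative to the reference model.
print(vdpo_loss(-5.0, -9.0, -6.0, -6.0))  # policy favors the original clip
print(vdpo_loss(-9.0, -5.0, -6.0, -6.0))  # policy favors the augmented clip
```

A false rejection, in these terms, is an augmented clip that still supports the response: pushing `logp_aug` down for such a clip would penalize a correct answer, which is the failure the prompt-aware selection is meant to avoid.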
[CV-45] Fast Sphericity and Roundness approximation in 2D and 3D using Local Thickness CVPR2025
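【Quick Read】: This paper addresses efficient quantification of sphericity and roundness for many objects in large 2D and 3D microscopy datasets. Existing methods follow the strict definitions, which is computationally costly, and the need for faster algorithms grows with dataset size. The key idea is to build both measures on the output of a local thickness algorithm: for sphericity, the surface-area computation is simplified by modeling objects as spheroids/ellipses of varying lengths and widths of mean local thickness; for roundness, the complex corner-curvature determination is replaced by an approximation from local thickness values on the object's contour/surface. The resulting methods stay close to the exact measures while being significantly faster.
Link: https://arxiv.org/abs/2504.05808
Authors: Pawel Tomasz Pieta,Peter Winkel Rasumssen,Anders Bjorholm Dahl,Anders Nymark Christensen
Affiliations: Technical University of Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVMI (CVPR 2025 Workshop)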
Abstract:Sphericity and roundness are fundamental measures used for assessing object uniformity in 2D and 3D images. However, using their strict definition makes computation costly. As both 2D and 3D microscopy imaging datasets grow larger, there is an increased demand for efficient algorithms that can quantify multiple objects in large volumes. We propose a novel approach for extracting sphericity and roundness based on the output of a local thickness algorithm. For sphericity, we simplify the surface area computation by modeling objects as spheroids/ellipses of varying lengths and widths of mean local thickness. For roundness, we avoid a complex corner curvature determination process by approximating it with local thickness values on the contour/surface of the object. The resulting methods provide an accurate representation of the exact measures while being significantly faster than their existing implementations.
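To make the spheroid idea concrete, the sketch below computes the classical (Wadell) sphericity of a spheroid in closed form, that is, the surface area of the equal-volume sphere divided by the spheroid's own surface area. The function names and the choice of semi-axes `a` (equatorial) and `c` (polar) as inputs are assumptions for this illustration. How the paper derives these dimensions from local thickness is not specified in the abstract and is not reproduced here.

```python
import math


def spheroid_surface_area(a: float, c: float) -> float:
    """Surface area of a spheroid with equatorial semi-axis a and polar semi-axis c."""
    if math.isclose(a, c):
        return 4.0 * math.pi * a * a  # sphere
    if c < a:  # oblate spheroid
        e = math.sqrt(1.0 - (c * c) / (a * a))
        return (2.0 * math.pi * a * a
                + math.pi * c * c / e * math.log((1.0 + e) / (1.0 - e)))
    e = math.sqrt(1.0 - (a * a) / (c * c))  # prolate spheroid
    return 2.0 * math.pi * a * a * (1.0 + c / (a * e) * math.asin(e))


def spheroid_sphericity(a: float, c: float) -> float:
    """Area of the equal-volume sphere divided by the spheroid's surface area."""
    volume = 4.0 / 3.0 * math.pi * a * a * c
    equal_volume_sphere_area = math.pi ** (1.0 / 3.0) * (6.0 * volume) ** (2.0 / 3.0)
    return equal_volume_sphere_area / spheroid_surface_area(a, c)


print(spheroid_sphericity(1.0, 1.0))  # a sphere
print(spheroid_sphericity(1.0, 2.0))  # an elongated (prolate) spheroid
```

Because the area has a closed form, no surface mesh has to be extracted, which is where the cost of the strict definition usually comes from.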
[CV-46] SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity
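【Quick Read】: This paper addresses phoneme-viseme alignment ambiguity in speech-driven talking head synthesis, which arises when general acoustic features such as HuBERT and DeepSpeech are used: matching phonemes (speech) to visemes (lip shapes) becomes uncertain and imprecise. It proposes the Speech Encoder for Lip (SE4Lip), which encodes lip features directly from speech and aligns speech and lip features in a joint embedding space through a cross-modal alignment framework. The key design is an STFT spectrogram combined with a GRU-based model, which preserves fine-grained speech features and thereby improves lip-sync accuracy. Experiments show that SE4Lip achieves state-of-the-art performance with both NeRF and 3DGS rendering models, improving lip-sync accuracy over the best baseline by 13.7% and 14.2% respectively, and produces results close to the ground-truth videos.
Link: https://arxiv.org/abs/2504.05803
Authors: Yihuan Huang,Jiajun Liu,Yanzhen Ren,Wuyang Liu,Juhua Tang
Affiliations: School of Cyber Science and Engineering, Wuhan University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: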
Abstract:Speech-driven talking head synthesis tasks commonly use general acoustic features (such as HuBERT and DeepSpeech) as guided speech features. However, we discovered that these features suffer from phoneme-viseme alignment ambiguity, which refers to the uncertainty and imprecision in matching phonemes (speech) with visemes (lip). To address this issue, we propose the Speech Encoder for Lip (SE4Lip) to encode lip features from speech directly, aligning speech and lip features in the joint embedding space by a cross-modal alignment framework. The STFT spectrogram with the GRU-based model is designed in SE4Lip to preserve the fine-grained speech features. Experimental results show that SE4Lip achieves state-of-the-art performance in both NeRF and 3DGS rendering models. Its lip sync accuracy improves by 13.7% and 14.2% compared to the best baseline and produces results close to the ground truth videos.
[CV-47] Storybooth: Training-free Multi-Subject Consistency for Improved Visual Storytelling
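【Quick Read】: This paper addresses multi-character consistency in training-free text-to-image generation across frames. Existing methods rely mainly on cross-frame self-attention, which works well for a single subject but breaks down with multiple characters because of self-attention leakage between subjects. The paper proposes StoryBooth, which first uses multi-modal chain-of-thought reasoning and region-based generation to localize the different subjects in the desired story outputs a priori. It then generates the final outputs with a modified diffusion model containing two new layers: (1) a bounded cross-frame self-attention layer that reduces inter-character attention leakage, and (2) a token-merging layer that improves the consistency of fine-grained subject details. Experiments show that the method surpasses the prior state of the art in both multi-character consistency and fine-grained subject detail.
Link: https://arxiv.org/abs/2504.05800
Authors: Jaskirat Singh,Junshen Kevin Chen,Jonas Kohler,Michael Cohen
Affiliations: Meta GenAI; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: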
Abstract:Training-free consistent text-to-image generation depicting the same subjects across different images is a topic of widespread recent interest. Existing works in this direction predominantly rely on cross-frame self-attention, which improves subject-consistency by allowing tokens in each frame to pay attention to tokens in other frames during self-attention computation. While useful for single subjects, we find that it struggles when scaling to multiple characters. In this work, we first analyze the reason for these limitations. Our exploration reveals that the primary issue stems from self-attention-leakage, which is exacerbated when trying to ensure consistency across multiple characters. This happens when tokens from one subject pay attention to other characters, causing them to appear like each other (e.g., a dog appearing like a duck). Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency. In particular, we first leverage multi-modal chain-of-thought reasoning and region-based generation to a priori localize the different subjects across the desired story outputs. The final outputs are then generated using a modified diffusion model which consists of two novel layers: 1) a bounded cross-frame self-attention layer for reducing inter-character attention leakage, and 2) a token-merging layer for improving consistency of fine-grain subject details. Through both qualitative and quantitative results we find that the proposed approach surpasses prior state-of-the-art, exhibiting improved consistency across both multiple characters and fine-grain subject details.
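One way to picture attention leakage and a bounded alternative is an attention mask built from per-token subject labels. The sketch below is only an assumption about how such bounding could work: the function `bounded_attention_mask`, the label format, and the rule that background tokens stay unrestricted are all hypothetical, and the abstract does not describe the actual layer.

```python
def bounded_attention_mask(subject_ids):
    """Return mask[i][j], True when token i may attend to token j.

    subject_ids[k] is the subject label of token k, with the tokens of all
    frames concatenated, or None for a background token (assumed format).
    Tokens of one subject may not attend to tokens of a different subject.
    """
    n = len(subject_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            same_subject = subject_ids[i] == subject_ids[j]
            involves_background = subject_ids[i] is None or subject_ids[j] is None
            row.append(same_subject or involves_background)
        mask.append(row)
    return mask


# Two frames, flattened: a dog token, a duck token, background, another dog token.
labels = ["dog", "duck", None, "dog"]
for row in bounded_attention_mask(labels):
    print(row)
```

With this mask the dog tokens in different frames can still exchange information, which is what keeps a subject consistent, while dog and duck tokens cannot, which is the leakage the abstract describes.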
[CV-48] Robust Fusion Controller: Degradation-aware Image Fusion with Fine-grained Language Instructions
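【Quick Read】: This paper addresses the difficulty current image fusion methods have in adapting to real-world environments that contain diverse degradations with spatially varying characteristics. It proposes a Robust Fusion Controller (RFC) that achieves degradation-aware image fusion through fine-grained language instructions. RFC first parses the language instruction to derive a functional condition, which specifies the degradation type to remove, and a spatial condition, which defines its spatial coverage. A multi-condition coupling network then generates a composite control priori, turning abstract language instructions into latent control variables. A hybrid attention-based fusion network embeds this composite control priori to linearly modulate the intermediate fused features, and a language-feature alignment loss keeps the control outcome consistent with the instruction. Experiments show that RFC is robust under a variety of composite degradations, particularly in challenging flare scenarios.
Link: https://arxiv.org/abs/2504.05795
Authors: Hao Zhang,Yanping Zha,Qingwei Zhuang,Zhenfeng Shao,Jiayi Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: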
Abstract:Current image fusion methods struggle to adapt to real-world environments encompassing diverse degradations with spatially varying characteristics. To address this challenge, we propose a robust fusion controller (RFC) capable of achieving degradation-aware image fusion through fine-grained language instructions, ensuring its reliable application in adverse environments. Specifically, RFC first parses language instructions to innovatively derive the functional condition and the spatial condition, where the former specifies the degradation type to remove, while the latter defines its spatial coverage. Then, a composite control priori is generated through a multi-condition coupling network, achieving a seamless transition from abstract language instructions to latent control variables. Subsequently, we design a hybrid attention-based fusion network to aggregate multi-modal information, in which the obtained composite control priori is deeply embedded to linearly modulate the intermediate fused features. To ensure the alignment between language instructions and control outcomes, we introduce a novel language-feature alignment loss, which constrains the consistency between feature-level gains and the composite control priori. Extensive experiments on publicly available datasets demonstrate that our RFC is robust against various composite degradations, particularly in highly challenging flare scenarios.
[CV-49] DefMamba: Deformable Visual State Space Model CVPR2025
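【Quick Read】: This paper addresses the poor use of spatial structure in existing visual Mamba methods, which flatten images into 1D sequences with predefined scan orders during feature extraction. It proposes DefMamba, a visual foundation model built from a multi-scale backbone and deformable Mamba (DM) blocks. The DM blocks dynamically adjust the scanning path to prioritize important information, which improves the capture and processing of relevant input features. Combined with a deformable scanning (DS) strategy, DefMamba learns image structure better and detects changes in object details more reliably. Experiments show state-of-the-art performance on a range of vision tasks.
Link: https://arxiv.org/abs/2504.05794
Authors: Leiye Liu,Miao Zhang,Jihao Yin,Tingwei Liu,Wei Ji,Yongri Piao,Huchuan Lu
Affiliations: Dalian University of Technology; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025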
Abstract:Recently, state space models (SSM), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which results in the model being less capable of utilizing the spatial structural information of the image during the feature extraction process. To address this issue, we propose a novel visual foundation model called DefMamba. This model includes a multi-scale backbone structure and deformable mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By combining a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is open source on DefMamba.
[CV-50] Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation CVPR2025
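【Quick Read】: This paper addresses two key challenges in infant pose estimation: existing methods depend on fully supervised learning because large labeled datasets are lacking, and models generalize poorly under distribution shift. It proposes SHIFT, which uses the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and adapts to distribution shift by enforcing consistency between the student's and the teacher's pseudo-labels. It also adds an infant manifold pose prior that penalizes implausible predictions, and a new visibility consistency module that improves the handling of self-occlusion. Experiments on multiple benchmarks show that SHIFT outperforms existing unsupervised domain adaptation (UDA) pose estimation methods by 5% and supervised infant pose estimation methods by 16%.
Link: https://arxiv.org/abs/2504.05789
Authors: Sarosij Bose,Hannah Dela Cruz,Arindam Dutta,Elena Kokkoni,Konstantinos Karydis,Amit K. Roy-Chowdhury
Affiliations: University of California, Riverside
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ABAW@CVPR 2025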
Abstract:Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also struggle with poor generalizability under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the mean-teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT’s self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) pose estimation methods by 5% and supervised infant pose estimation methods by a margin of 16%. The project page is available at: this https URL.
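The Mean-Teacher framework mentioned in this abstract rests on two simple operations: the teacher's weights track the student's through an exponential moving average, and the student is trained to agree with the teacher's pseudo-labels. The sketch below shows both on plain lists of floats. The function names, the `momentum` default, and the use of mean squared error are assumptions for illustration, not details taken from SHIFT.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move the teacher weights toward the student by an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def consistency_loss(student_pred, teacher_pred):
    """Mean squared error between student predictions and teacher pseudo-labels."""
    diffs = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(diffs) / len(diffs)


teacher_weights = [0.0, 1.0]
student_weights = [1.0, 1.0]
teacher_weights = ema_update(teacher_weights, student_weights, momentum=0.9)
print(teacher_weights)                                   # teacher drifts toward the student
print(consistency_loss([0.2, 0.8], [0.25, 0.75]))        # small when the two agree
```

Only the student receives gradient updates. The teacher changes slowly, so its pseudo-labels are more stable than the student's own predictions on the unlabeled target data.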
[CV-51] How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM
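【Quick Read】: This survey examines how integrating Large Language Models (LLMs) can improve 3D spatial understanding. Its core contribution is a taxonomy that groups existing methods into three branches: image-based methods, which derive 3D understanding from 2D visual data; point cloud-based methods, which work directly with 3D representations; and hybrid modality-based methods, which combine multiple data streams. It systematically reviews representative methods in each branch, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. It also discusses current limitations, such as dataset scarcity and computational challenges, and outlines research directions in spatial perception, multi-modal fusion, and real-world applications.
Link: https://arxiv.org/abs/2504.05786
Authors: Jirong Zha,Yuxuan Fan,Xiao Yang,Chen Gao,Xinlei Chen
Affiliations: Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures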
Abstract:3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.
[CV-52] Video Flow as Time Series: Discovering Temporal Consistency and Variability for VideoQA
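【Quick Read】: This paper addresses insufficient modeling of temporal dynamics in Video Question Answering (VideoQA). Transformer-based architectures integrate multimodal data effectively, but they usually reduce temporal dynamics to positional encoding and fail to capture non-linear interactions within video sequences. The paper proposes the Temporal Trio Transformer (T3T), which combines three modules: the Temporal Smoothing (TS) module uses a Brownian Bridge to capture smooth, continuous temporal transitions; the Temporal Difference (TD) module identifies and encodes significant temporal variations and abrupt changes in the video content; and the Temporal Fusion (TF) module fuses these temporal features with textual cues to improve contextual understanding and answer accuracy. The method is evaluated on multiple VideoQA benchmark datasets.
Link: https://arxiv.org/abs/2504.05783
Authors: Zijie Song,Zhenzhen Hu,Yixiao Ma,Jia Li,Richang Hong
Affiliations: Hefei University of Technology, China; University of Science and Technology of China, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: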
Abstract:Video Question Answering (VideoQA) is a complex video-language task that demands a sophisticated understanding of both visual content and temporal dynamics. Traditional Transformer-style architectures, while effective in integrating multimodal data, often simplify temporal dynamics through positional encoding and fail to capture non-linear interactions within video sequences. In this paper, we introduce the Temporal Trio Transformer (T3T), a novel architecture that models time consistency and time variability. The T3T integrates three key components: Temporal Smoothing (TS), Temporal Difference (TD), and Temporal Fusion (TF). The TS module employs Brownian Bridge for capturing smooth, continuous temporal transitions, while the TD module identifies and encodes significant temporal variations and abrupt changes within the video content. Subsequently, the TF module synthesizes these temporal features with textual cues, facilitating a deeper contextual understanding and response accuracy. The efficacy of the T3T is demonstrated through extensive testing on multiple VideoQA benchmark datasets. Our results underscore the importance of a nuanced approach to temporal modeling in improving the accuracy and depth of video-based question answering.
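For readers unfamiliar with the Brownian Bridge used by the TS module, the sketch below computes the mean and variance of a standard Brownian bridge pinned at a start and an end vector. These are the textbook formulas. How T3T turns them into a training signal is not described in the abstract, and the function name and arguments are assumptions for illustration.

```python
def brownian_bridge_moments(z_start, z_end, t, total):
    """Mean vector and scalar variance of a Brownian bridge at time t in [0, total].

    The bridge is pinned to z_start at time 0 and to z_end at time total.
    """
    if not 0 <= t <= total:
        raise ValueError("t must lie in [0, total]")
    alpha = t / total
    mean = [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]
    variance = t * (total - t) / total
    return mean, variance


# Frame features pinned at the first and last frame of a 5-frame clip (t = 0..4).
for t in range(5):
    print(t, brownian_bridge_moments([0.0, 2.0], [4.0, 2.0], t, 4))
```

The mean moves linearly between the two endpoints, and the variance is zero at both ends and largest in the middle, which is why the process is a natural model for smooth transitions between known anchor frames.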
[CV-53] MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
【Quick Read】: This paper addresses shortcomings in how the multimodal reasoning of Multimodal Large Language Models (MLLMs) is evaluated: existing reasoning benchmarks have limited data size, narrow domain coverage, and unstructured knowledge distribution. It introduces MDK12-Bench, a multi-disciplinary benchmark built from real-world K-12 examination questions. The benchmark spans six disciplines (math, physics, chemistry, biology, geography, and information science) and contains 140K reasoning instances from primary school to 12th grade, with instance-level knowledge point annotations, a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions. The paper also presents a dynamic evaluation framework that mitigates data contamination by bootstrapping question forms, question types, and image styles during evaluation.
Link: https://arxiv.org/abs/2504.05782
Authors: Pengfei Zhou,Fanrui Zhang,Xiaopeng Peng,Zhaopan Xu,Jiaxin Ai,Yansheng Qiu,Chuanhao Li,Zhen Li,Ming Li,Yukang Feng,Jianwen Sun,Haoquan Zhang,Zizhen Li,Xiaofeng Mao,Wangbo Zhao,Kai Wang,Xiaojun Chang,Wenqi Shao,Yang You,Kaipeng Zhang
Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; USTC; RIT; HIT; WHU; MBZUAI; NUS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures
Abstract:Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiment on MDK12-Bench reveals the significant limitation of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of the next-generation models. Our data and codes are available at this https URL.
[CV-54] FASR-Net: Unsupervised Shadow Removal Leveraging Inherent Frequency Priors
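【Quick Read】: This paper addresses shadow removal, which is hard because geometry, lighting, and environmental factors interact in complex ways. Existing unsupervised methods often overlook shadow-specific priors and therefore recover shadowed regions incompletely. The paper proposes the unsupervised Frequency Aware Shadow Removal Network (FASR-Net), which exploits the inherent frequency characteristics of shadow regions. Its Wavelet Attention Downsampling Module (WADM) combines wavelet-based image decomposition with deformable attention, splitting the image into frequency components to enhance shadow details within specific frequency bands. Several new loss functions support precise shadow-free reconstruction: a frequency loss captures details of the image components, a brightness-chromaticity loss references the chromaticity of shadow-free regions, and an alignment loss ensures smooth transitions between shadowed and shadow-free regions. Experiments on the AISTD and SRD datasets show superior shadow removal performance.
Link: https://arxiv.org/abs/2504.05779
Authors: Tao Lin,Qingwang Wang,Qiwei Liang,Minghua Tang,Yuxuan Sun
Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, China; College of Mechatronics and Control Engineering, Shenzhen University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: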
Abstract:Shadow removal is challenging due to the complex interaction of geometry, lighting, and environmental factors. Existing unsupervised methods often overlook shadow-specific priors, leading to incomplete shadow recovery. To address this issue, we propose a novel unsupervised Frequency Aware Shadow Removal Network (FASR-Net), which leverages the inherent frequency characteristics of shadow regions. Specifically, the proposed Wavelet Attention Downsampling Module (WADM) integrates wavelet-based image decomposition and deformable attention, effectively breaking down the image into frequency components to enhance shadow details within specific frequency bands. We also introduce several new loss functions for precise shadow-free image reproduction: a frequency loss to capture image component details, a brightness-chromaticity loss that references the chromaticity of shadow-free regions, and an alignment loss to ensure smooth transitions between shadowed and shadow-free regions. Experimental results on the AISTD and SRD datasets demonstrate that our method achieves superior shadow removal performance.
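The wavelet-based decomposition that WADM builds on can be illustrated with a one-level 2D Haar transform, which splits an image into one low-frequency band and three detail bands at half resolution. The abstract does not say which wavelet the authors use, so the Haar choice, the function name `haar_dwt2`, and the band names are assumptions for this sketch.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution bands: approx (low frequency),
    col_diff (left minus right), row_diff (top minus bottom), and diag.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        bands = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]            # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]    # bottom-left, bottom-right
            bands[0].append((a + b + d + e) / 2.0)         # average of the 2x2 block
            bands[1].append((a - b + d - e) / 2.0)         # left column minus right column
            bands[2].append((a + b - d - e) / 2.0)         # top row minus bottom row
            bands[3].append((a - b - d + e) / 2.0)         # diagonal difference
        approx.append(bands[0])
        col_diff.append(bands[1])
        row_diff.append(bands[2])
        diag.append(bands[3])
    return approx, col_diff, row_diff, diag


# A hard vertical edge shows up only in the column-difference band.
print(haar_dwt2([[1.0, 0.0], [1.0, 0.0]]))
```

Flat regions contribute only to the low-frequency band, while edges such as shadow boundaries land in the detail bands, which is what makes a per-band treatment possible.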
[CV-55] Transferable Mask Transformer: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation
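【Quick Read】: This paper addresses the significant performance drop that pretrained Vision Transformers (ViTs) suffer from distribution shift when adapted to a new target domain in cross-domain semantic segmentation. When the source and target domains differ in texture, scale, or object co-occurrence patterns, self-attention may fail to attend to key objects, which degrades global attention. The paper argues for region-level adaptation to handle the spatial heterogeneity of transferability across image regions.
The key to the solution is the Transferable Mask Transformer (TMT), a region-level adaptation framework that aligns cross-domain representations through spatial transferability analysis. TMT has two core components: (1) an Adaptive Cluster-based Transferability Estimator (ACTE), which dynamically segments images into structurally and semantically coherent regions for localized transferability assessment; and (2) a Transferable Masked Attention (TMA) module, which integrates region-specific transferability maps into the ViT attention mechanism and prioritizes adaptation in regions with low transferability and high semantic uncertainty. Evaluations across 20 cross-domain pairs show an average 2% MIoU improvement over vanilla fine-tuning and a 1.28% gain over state-of-the-art baselines.
Link: https://arxiv.org/abs/2504.05774
Authors: Enming Zhang,Zhengyu Li,Yanru Wu,Jingge Wang,Yang Tan,Ruizhe Zhao,Guan Wang,Yang Li
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Shandong University; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: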
Abstract:Recent advances in Vision Transformers (ViTs) have set new benchmarks in semantic segmentation. However, when adapting pretrained ViTs to new target domains, significant performance degradation often occurs due to distribution shifts, resulting in suboptimal global attention. Since self-attention mechanisms are inherently data-driven, they may fail to effectively attend to key objects when source and target domains exhibit differences in texture, scale, or object co-occurrence patterns. While global and patch-level domain adaptation methods provide partial solutions, region-level adaptation with dynamically shaped regions is crucial due to spatial heterogeneity in transferability across different image areas. We present Transferable Mask Transformer (TMT), a novel region-level adaptation framework for semantic segmentation that aligns cross-domain representations through spatial transferability analysis. TMT consists of two key components: (1) An Adaptive Cluster-based Transferability Estimator (ACTE) that dynamically segments images into structurally and semantically coherent regions for localized transferability assessment, and (2) A Transferable Masked Attention (TMA) module that integrates region-specific transferability maps into ViTs’ attention mechanisms, prioritizing adaptation in regions with low transferability and high semantic uncertainty. Comprehensive evaluations across 20 cross-domain pairs demonstrate TMT’s superiority, achieving an average 2% MIoU improvement over vanilla fine-tuning and a 1.28% increase compared to state-of-the-art baselines. The source code will be publicly available.
[CV-56] A Lightweight Multi-Module Fusion Approach for Korean Character Recognition
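【Quick Read】: This paper addresses the poor real-world performance of existing Optical Character Recognition (OCR) models, caused by irregular text layouts, poor image quality, character variability, and high computational cost. It proposes SDA-Net (Stroke-Sensitive Attention and Dynamic Context Encoding Network), a lightweight and efficient architecture for robust single-character recognition. Its key elements are: (1) a dual attention mechanism that strengthens stroke-level and spatial feature extraction; (2) a dynamic context encoding module that adaptively refines semantic information through a learnable gating mechanism; (3) a U-Net-inspired feature fusion strategy that combines low-level and high-level features; and (4) a highly optimized lightweight backbone that reduces memory and compute requirements. Experiments show state-of-the-art accuracy on challenging OCR benchmarks with significantly faster inference, which suits real-time and edge-based OCR deployment.
Link: https://arxiv.org/abs/2504.05770
Authors: Inho Jake Park,Jaehoon Jay Jeong,Ho-Sang Jo
Affiliations: Among Solution; GIST Laboratory Autonomous Driving (GLAD), Gwangju Institute of Science and Technology (GIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 5 tables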
Abstract:Optical Character Recognition (OCR) is essential in applications such as document processing, license plate recognition, and intelligent surveillance. However, existing OCR models often underperform in real-world scenarios due to irregular text layouts, poor image quality, character variability, and high computational costs. This paper introduces SDA-Net (Stroke-Sensitive Attention and Dynamic Context Encoding Network), a lightweight and efficient architecture designed for robust single-character recognition. SDA-Net incorporates: (1) a Dual Attention Mechanism to enhance stroke-level and spatial feature extraction; (2) a Dynamic Context Encoding module that adaptively refines semantic information using a learnable gating mechanism; (3) a U-Net-inspired Feature Fusion Strategy for combining low-level and high-level features; and (4) a highly optimized lightweight backbone that reduces memory and computational demands. Experimental results show that SDA-Net achieves state-of-the-art accuracy on challenging OCR benchmarks, with significantly faster inference, making it well-suited for deployment in real-time and edge-based OCR systems.
[CV-57] InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation
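【Quick Read】: This paper addresses segmentation of 3D point clouds reconstructed from 2D RGB images, which matters for downstream tasks such as object counting, size estimation, and scene understanding. Deep-learning segmentation on raw 3D point clouds requires labor-intensive manual annotation, and training NeRF directly on binary masks fails because the masks lack the color and shading cues needed for geometry learning. The paper proposes Invariant NeRF for Segmentation (InvNeRFSeg), a two-step, zero-change fine-tuning strategy for 3D segmentation: first train a standard NeRF on RGB images, then fine-tune it with 2D segmentation masks without altering the model architecture or the loss function. This yields higher-quality, cleaner segmented point clouds with minimal computational overhead. Field density analysis shows consistent semantic refinement: densities in object regions increase while background densities are suppressed. Experiments show that InvNeRFSeg outperforms SA3D and FruitNeRF on both a synthetic fruit dataset and a real-world soybean dataset.
Link: https://arxiv.org/abs/2504.05751
Authors: Jiangsan Zhao,Jakob Geipel,Krzysztof Kusnierek,Xuean Cui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: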
Abstract:Neural Radiance Fields (NeRF) have been widely adopted for reconstructing high quality 3D point clouds from 2D RGB images. However, the segmentation of these reconstructed 3D scenes is more essential for downstream tasks such as object counting, size estimation, and scene understanding. While segmentation on raw 3D point clouds using deep learning requires labor intensive and time-consuming manual annotation, directly training NeRF on binary masks also fails due to the absence of color and shading cues essential for geometry learning. We propose Invariant NeRF for Segmentation (InvNeRFSeg), a two step, zero change fine tuning strategy for 3D segmentation. We first train a standard NeRF on RGB images and then fine tune it using 2D segmentation masks without altering either the model architecture or loss function. This approach produces higher quality, cleaner segmented point clouds directly from the refined radiance field with minimal computational overhead or complexity. Field density analysis reveals consistent semantic refinement: densities of object regions increase while background densities are suppressed, ensuring clean and interpretable segmentations. We demonstrate InvNeRFSeg's superior performance over both SA3D and FruitNeRF on both synthetic fruit and real world soybean datasets. This approach effectively extends 2D segmentation to high quality 3D segmentation.
[CV-58] When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning
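【Quick Read】: This paper addresses two challenges in predicting non-verbal facial motion: existing continuous-to-discrete representations struggle to capture the underlying non-verbal facial patterns, and the temporal variance and multi-modal nature of such motion make training listening-head models inefficient and yield low-fidelity generated motion. It proposes representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating the intermediate frames, the method preserves the temporal structure of the motion while increasing instance-wise diversity during learning. The sparse representation is then applied to listening head prediction, where it helps explain facial motion patterns.
Link: https://arxiv.org/abs/2504.05748
Authors: Tri Tung Nguyen Nguyen,Quang Tien Dam,Dinh Tuan Tran,Joo-Ho Lee
Affiliations: Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka, Japan; College of Information Science and Engineering, Ritsumeikan University, Osaka, Japan; Faculty of Data Science, Shiga University, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: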
Abstract:Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening head behavior during dyadic conversations employ continuous-to-discrete representations, where continuous facial motion sequence is converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representation struggles to capture underlying non-verbal facial patterns making training the listening head inefficient with low-fidelity generated motion. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this novel sparse representation to the task of listening head prediction, demonstrating its contribution to improving the explanation of facial motion patterns.
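The keyframe idea in this abstract can be sketched as rebuilding a dense motion sequence from a few stored frames. The example below fills the gaps by plain linear interpolation, which is a stand-in chosen for simplicity: the abstract does not say how the transition frames are produced, and the function name and the `(frame_index, values)` keyframe format are assumptions.

```python
def interpolate_keyframes(keyframes, num_frames):
    """Rebuild a dense sequence from sparse (frame_index, value_vector) keyframes.

    Frames between two consecutive keyframes are filled by linear interpolation.
    The first and last frame of the sequence must be keyframes.
    """
    keyframes = sorted(keyframes)
    if keyframes[0][0] != 0 or keyframes[-1][0] != num_frames - 1:
        raise ValueError("first and last frames must be keyframes")
    dense = []
    for (i0, v0), (i1, v1) in zip(keyframes, keyframes[1:]):
        for i in range(i0, i1):
            w = (i - i0) / (i1 - i0)  # 0 at the left keyframe, approaching 1 at the right
            dense.append([(1.0 - w) * x + w * y for x, y in zip(v0, v1)])
    dense.append(list(keyframes[-1][1]))
    return dense


# Three keyframes describe a 7-frame track of one facial coefficient.
print(interpolate_keyframes([(0, [0.0]), (4, [4.0]), (6, [0.0])], 7))
```

Storing three frames instead of seven is the sparsity the title refers to. The temporal structure survives because the keyframes keep their original frame indices.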
[CV-59] Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation
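【Quick Read】: This paper addresses the central challenge of audio-driven One-shot Talking Head Animation (ADOS-THA): capturing subtle, barely perceptible changes between adjacent video frames. It proposes the Temporal Audio-Visual Correlation Embedding (TAVCE) framework, which learns temporal audio-visual correlations and uses them to enhance feature representation and regularize the final generation. The method first learns an audio-visual temporal correlation metric so that the temporal relationship between adjacent audio clips aligns with that between the corresponding adjacent video frames. It then uses this aligned information, through a channel attention mechanism, to guide the learning of more representative features, and during training it adds the aligned correlation as an extra objective that supervises frame generation. Experiments on HDTF, LRW, VoxCeleb1, and VoxCeleb2 are reported to show its superiority over existing leading algorithms.
Link: https://arxiv.org/abs/2504.05746
Authors: Zhihua Xu,Tianshui Chen,Zhijing Yang,Siyuan Peng,Keze Wang,Liang Lin
Affiliations: School of Information Engineering, Guangdong University of Technology; School of Computer Science and Engineering, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at TMM 2025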
Abstract:The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
[CV-60] DDT: Decoupled Diffusion Transformer
【速读】:该论文旨在解决扩散Transformer(diffusion transformers)在生成高质量图像时面临的优化困境:即在去噪过程中,低频语义编码与高频解码之间的固有矛盾。这一矛盾源于编码低频语义需要抑制高频成分,从而导致语义提取与高频解码之间存在张力。为了解决此问题,论文提出了一种新的解耦扩散Transformer(Decoupled Diffusion Transformer, DDT),其核心创新在于设计了一个专门用于语义提取的条件编码器以及一个专用的速度解码器,实现了任务的解耦。此外,通过引入一种新颖的统计动态规划方法,优化了相邻去噪步骤间的自条件共享策略,以最小化性能下降。实验结果显示,提出的DDT架构不仅提升了生成质量,在ImageNet 256×256上的FID达到了1.31的新SOTA性能,并且训练收敛速度提高了近4倍;在ImageNet 512×512上也实现了1.28的FID新SOTA,同时显著加速了推理速度。
链接: https://arxiv.org/abs/2504.05741
作者: Shuai Wang,Zhi Tian,Weilin Huang,Limin Wang
机构: Nanjing University (南京大学); ByteDance Seed Vision (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: SOTA on ImageNet256 and ImageNet512
Abstract:Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing of self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
[CV-61] Micro-splatting: Maximizing Isotropic Constraints for Refined Optimization in 3D Gaussian Splatting
【速读】:该论文致力于解决现有3D Gaussian Splatting方法在捕捉细粒度细节方面的不足。传统方法因依赖较大的协方差参数而容易产生模糊表示,而直接减小协方差尺寸则会导致稀疏性问题。为克服这些限制,论文提出了一种名为Micro-splatting的新框架,其关键是通过引入协方差正则化项来惩罚过大的高斯分布,确保每个splat保持紧凑且各向同性,并结合自适应细化策略动态优化图像梯度高的区域,同时增强损失函数以提高细节密度,从而在不牺牲渲染效率的前提下显著提升3D重建中的细粒度细节表现。
链接: https://arxiv.org/abs/2504.05740
作者: Jee Won Lee,Hansol Lim,Sooyeun Yang,Jongseong Choi
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in 3D Gaussian Splatting have achieved impressive scalability and real-time rendering for large-scale scenes but often fall short in capturing fine-grained details. Conventional approaches that rely on relatively large covariance parameters tend to produce blurred representations, while directly reducing covariance sizes leads to sparsity. In this work, we introduce Micro-splatting (Maximizing Isotropic Constraints for Refined Optimization in 3D Gaussian Splatting), a novel framework designed to overcome these limitations. Our approach leverages a covariance regularization term to penalize excessively large Gaussians to ensure each splat remains compact and isotropic. This work implements an adaptive densification strategy that dynamically refines regions with high image gradients by lowering the splitting threshold, followed by loss function enhancement. This strategy yields denser and more detailed Gaussian means where needed, without sacrificing rendering efficiency. Quantitative evaluations using metrics such as L1, L2, PSNR, SSIM, and LPIPS, alongside qualitative comparisons, demonstrate that our method significantly enhances fine details in 3D reconstructions.
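A covariance regularizer of the kind described above might look like the following sketch. The exact loss used in Micro-splatting is not stated in the abstract; the function name, the hinge on the largest axis scale, the ratio-based isotropy term, and the default values of `max_scale` and `iso_weight` are all assumptions made for illustration.

```python
def covariance_penalty(scales, max_scale=0.05, iso_weight=0.1):
    """Mean penalty over Gaussians given as (sx, sy, sz) axis scales.

    Hypothetical form: a hinge on the largest scale discourages oversized
    splats, and a ratio term discourages elongated (anisotropic) ones.
    """
    total = 0.0
    for sx, sy, sz in scales:
        largest = max(sx, sy, sz)
        smallest = max(min(sx, sy, sz), 1e-8)  # guard against division by zero
        size_term = max(0.0, largest - max_scale) ** 2
        iso_term = (largest / smallest - 1.0) ** 2
        total += size_term + iso_weight * iso_term
    return total / len(scales)


compact = covariance_penalty([(0.01, 0.01, 0.01)])    # small and isotropic
stretched = covariance_penalty([(0.20, 0.01, 0.01)])  # large and elongated
```

A compact isotropic splat incurs no penalty under this form, while an oversized or elongated one does, which is the behaviour the abstract attributes to the regularization term.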
[CV-62] QEMesh: Employing A Quadric Error Metrics-Based Representation for Mesh Generation
【速读】:本文旨在解决3D网格生成中的几个关键问题,包括表面出现不真实的图案或凹坑、薄部分缺失以及结构不完整等。这些问题主要源于形状表示的选择或生成网络的能力限制。为了解决这些问题,论文提出了QEMesh模型,其核心创新在于扩展了基于Quadric Error Metrics (QEM) 的PoNQ表示方法,并设计了一种独特的潜在扩散模型,包含一个新颖的多解码器变分自编码器(VAE),用于生成PoNQ参数。通过该扩散模型产生的潜在代码,三个参数解码器在每个体素单元内生成多个PoNQ参数,同时占用解码器预测哪些体素单元包含参数以形成最终形状。实验结果表明,所提出的方法能够生成具有水密表面的高质量网格,并在多项关键指标上与最先进的方法相当。
链接: https://arxiv.org/abs/2504.05720
作者: Jiaqi Li,Ruowei Wang,Yu Liu,Qijun Zhao
机构: National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University (四川大学); College of Computer Science, Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by International Conference on Multimedia and Expo
Abstract:Mesh generation plays a crucial role in 3D content creation, as meshes are widely used in various industrial applications. Recent works have achieved impressive results but still face several issues, such as unrealistic patterns or pits on surfaces, missing thin parts, and incomplete structures. Most of these problems stem from the choice of shape representation or the capabilities of the generative network. To alleviate these, we extend PoNQ, a Quadric Error Metrics (QEM)-based representation, and propose a novel model, QEMesh, for high-quality mesh generation. PoNQ divides the shape surface into tiny patches, each represented by a point with its normal and QEM matrix, which preserves fine local geometry information. In our QEMesh, we regard these elements as generable parameters and design a unique latent diffusion model containing a novel multi-decoder VAE for PoNQ parameter generation. Given the latent code generated by the diffusion model, three parameter decoders produce several PoNQ parameters within each voxel cell, and an occupancy decoder predicts which voxel cells contain parameters, which together form the final shape. Extensive evaluations demonstrate that our method generates results with watertight surfaces and is comparable to state-of-the-art methods in several main metrics.
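For readers unfamiliar with quadric error metrics, the sketch below shows the classical formulation that QEM-based representations build on: the error of a point is the sum of squared distances to a set of planes, and the minimizer solves a 3×3 linear system. This is the textbook construction, not PoNQ or QEMesh code; the helper names are illustrative.

```python
def quadric_error(v, planes):
    """Sum of squared distances from v to planes given as (point, unit_normal)."""
    err = 0.0
    for p, n in planes:
        d = sum(n[k] * (v[k] - p[k]) for k in range(3))
        err += d * d
    return err


def det3(m):
    """Determinant of a 3x3 matrix stored as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def optimal_point(planes):
    """Minimize the quadric error by solving A v = b with Cramer's rule."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for p, n in planes:
        offset = sum(n[k] * p[k] for k in range(3))
        for i in range(3):
            b[i] += n[i] * offset
            for j in range(3):
                A[i][j] += n[i] * n[j]  # A accumulates the outer products n n^T
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("planes do not determine a unique point")
    result = []
    for col in range(3):
        m = [row[:] for row in A]
        for i in range(3):
            m[i][col] = b[i]
        result.append(det3(m) / d)
    return result


# Three axis-aligned planes x = 1, y = 2, z = 3 meet at a single corner.
corner_planes = [((1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
                 ((0.0, 2.0, 0.0), (0.0, 1.0, 0.0)),
                 ((0.0, 0.0, 3.0), (0.0, 0.0, 1.0))]
corner = optimal_point(corner_planes)
```

The example recovers the sharp corner where the three planes meet, which hints at why a per-patch QEM matrix can preserve fine local geometry.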
[CV-63] SEVERE: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
【速读】:该论文试图解决视频自监督学习(Video Self-Supervised Learning, VSSL)方法在实际应用中的泛化能力不足的问题。尽管现有方法在标准动作识别基准数据集上表现优异,但它们通常仅在狭窄的评估协议下进行测试,如在Kinetics-400数据集上预训练并在相似数据集上微调,这限制了对其在真实世界场景中泛化能力的理解。论文的关键在于提出了一种全面的评估框架,聚焦于四个关键下游因素:领域迁移(domain shift)、样本效率(sample efficiency)、动作粒度(action granularity)以及任务多样性(task diversity)。通过扩展先前基于CNN的对比学习基准研究,论文评估了12种基于Transformer的视频自监督模型(包括7种仅视频模型和5种视频-文本联合模型),并与10种基于CNN的方法进行了对比,覆盖了8个数据集和7个下游任务。研究表明,尽管Transformer架构有所改进,但这些模型仍对下游条件敏感,没有一种方法能在所有因素上保持一致的泛化能力。论文的解决方案之关键是通过广泛的实验揭示了当前视频自监督学习方法的优势与局限性,并提供了统一的基准来评估视频表示学习中的泛化性能。
链接: https://arxiv.org/abs/2504.05706
作者: Fida Mohammad Thoker,Letian Jiang,Chen Zhao,Piyush Bagad,Hazel Doughty,Bernard Ghanem,Cees G. M. Snoek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs outperform for fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
[CV-64] Pose-Aware Weakly-Supervised Action Segmentation
【速读】:该论文旨在解决在视觉智能领域中理解人类行为时面临的挑战,特别是长时指令视频中人体动作分割需要大量精确标注的问题。为应对这一难题,论文提出了一种弱监督框架,在训练过程中巧妙地融入姿态知识,而在推理阶段则无需使用姿态信息,从而提炼出与每个动作成分相关的关键姿态知识。该方法的关键在于引入了一种受姿态启发的对比损失(pose-inspired contrastive loss),用于更有效地区分动作边界。通过在代表性数据集上的广泛实验验证,所提方法在在线和离线设置下均超越了先前最先进的技术(SOTA),同时展示了其对不同分割主干网络和姿态提取器的适应性。
链接: https://arxiv.org/abs/2504.05700
作者: Seth Z. Zhao,Reza Ghoddoosian,Isht Dwivedi,Nakul Agarwal,Behzad Dariush
机构: Honda Research Institute (本田研究院), USA; UC Berkeley
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as a part of the whole weakly-supervised framework which is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework’s adaptability to various segmentation backbones and pose extractors across different datasets.
[CV-65] Point-based Instance Completion with Scene Constraints
【速读】:该论文试图解决部分观测物体在场景环境中完成的问题,现有方法主要关注独立物体的几何恢复,但无法有效利用场景约束(如其他已观测表面)且假设输入为规范坐标系,这在复杂场景中不成立。同时,已有场景完成方法虽能处理场景中的物体,但在完成质量及考虑场景约束方面仍显不足。论文的关键解决方案在于提出一种基于点云的实例级场景完成模型,通过引入稀疏场景约束点云,并利用跨注意机制将其整合到完成模型中,从而实现任意尺度和姿态下物体的鲁棒完成。此外,为了评估室内场景中的实例级场景完成任务,构建了一个名为ScanWCF的新数据集,包含标注的部分扫描数据以及对齐的地面真实场景完成结果。实验表明,该方法在部分扫描保真度、完成质量和合理性方面优于现有最先进方法。
链接: https://arxiv.org/abs/2504.05698
作者: Wesley Khademi,Li Fuxin
机构: Oregon State University (俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent point-based object completion methods have demonstrated the ability to accurately recover the missing geometry of partially observed objects. However, these approaches are not well-suited for completing objects within a scene, as they do not consider known scene constraints (e.g., other observed surfaces) in their completions and further expect the partial input to be in a canonical coordinate system, which does not hold for objects within scenes. While instance scene completion methods have been proposed for completing objects within a scene, they lag behind point-based object completion methods in terms of object completion quality and still do not consider known scene constraints during completion. To overcome these limitations, we propose a point cloud-based instance completion model that can robustly complete objects at arbitrary scales and pose in the scene. To enable reasoning at the scene level, we introduce a sparse set of scene constraints represented as point clouds and integrate them into our completion model via a cross-attention mechanism. To evaluate the instance scene completion task on indoor scenes, we further build a new dataset called ScanWCF, which contains labeled partial scans as well as aligned ground truth scene completions that are watertight and collision-free. Through several experiments, we demonstrate that our method achieves improved fidelity to partial scans, higher completion quality, and greater plausibility over existing state-of-the-art methods.
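The cross-attention step mentioned above, where object features attend to scene-constraint features, can be sketched with plain scaled dot-product attention. This is a generic illustration without learned projections or multiple heads; how the paper's model actually parameterizes the mechanism is not described in the abstract, and all names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Each query (object feature) takes a weighted mix of the values
    (scene-constraint features), weighted by scaled dot-product similarity."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


object_feats = [[1.0, 0.0]]
scene_keys = [[1.0, 0.0], [1.0, 0.0]]
scene_vals = [[1.0, 10.0], [3.0, 30.0]]
mixed = cross_attention(object_feats, scene_keys, scene_vals)
```

With two equally similar scene points the query receives the average of their values; in a trained model, nearby or geometrically relevant surfaces would presumably receive more weight.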
[CV-66] TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
【速读】:该论文旨在解决视频到音频合成中的高保真度和时间一致性问题。为实现这一目标,提出了一种名为Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO) 的新框架。TARO的关键创新在于两项核心技术:(1) Timestep-Adaptive Representation Alignment (TRA),它通过根据噪声调度动态调整对齐强度来动态对齐潜在表示,从而确保平滑演化和提升保真度;(2) Onset-Aware Conditioning (OAC),它利用起始线索作为与音频相关的视觉时刻的清晰事件驱动标记,以增强与动态视觉事件的同步性。实验结果表明,TARO在VGGSound和Landscape数据集上的表现优于现有方法,Frechet Distance (FD) 和 Frechet Audio Distance (FAD) 分别降低了53%和29%,同时达到了97.19%的对齐精度,凸显了其卓越的音频质量和同步精确性。
链接: https://arxiv.org/abs/2504.05684
作者: Tri Ton,Ji Woo Hong,Chang D. Yoo
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving relatively 53% lower Frechet Distance (FD), 29% lower Frechet Audio Distance (FAD), and a 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.
[CV-67] On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
【速读】:该论文试图解决的问题是如何评估和理解强化微调(Reinforcement Fine-Tuning, RFT)在视觉任务中的适用性和局限性。当前关于将RFT应用于多模态大语言模型(MLLMs)以增强其视觉理解能力的研究尚处于早期阶段,尚未充分考察RFT是否适合视觉任务。为此,论文通过实验分析与观察,从定量比较不同任务出发,发现RFT在视觉任务中总体上优于监督微调(Supervised Fine-Tuning, SFT),尤其是在训练样本数量有限的情况下。进一步地,为了探究RFT的优势是否源于推理过程,研究设计了一种新的奖励机制以鼓励模型进行更多“思考”,结果显示这种策略对复杂任务有益,但可能对简单任务产生负面影响。关键在于通过定量分析与针对性的奖励机制设计,揭示RFT在视觉任务中的潜在优势及限制条件。
链接: https://arxiv.org/abs/2504.05682
作者: Xiaxu Chen,Wei Li,Chunxu Liu,Chi Xie,Xiaoyan Hu,Chengqian Ma,Feng Zhu,Rui Zhao
机构: Beijing Institute of Technology (北京理工大学); SenseTime Research (商汤科技研究部); Nanjing University (南京大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement Fine-Tuning (RFT) has proved highly valuable for enhancing the reasoning ability of LLMs. Researchers have started to apply RFT to MLLMs, hoping it will also enhance their visual understanding capabilities. However, these works are at a very early stage and have not examined how suitable RFT actually is for visual tasks. In this work, we endeavor to understand the suitability and limitations of RFT for visual tasks through experimental analysis and observations. We start with quantitative comparisons on various tasks, which show that RFT is generally better than SFT on visual tasks. To check whether such advantages are brought about by the reasoning process, we design a new reward that encourages the model to “think” more, whose results show that more thinking can be beneficial for complicated tasks but harmful for simple tasks. We hope this study can provide more insight for the rapid advancements on this topic.
[CV-68] Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark
【速读】:该论文旨在解决传统基于帧的相机在小型无人飞行器(UAV)视觉检测中难以应对低光照或动态光照条件下的缺陷检测问题。解决方案的关键在于引入动态视觉传感器(Dynamic Vision Sensor, DVS),即事件驱动相机,其通过减少运动模糊、提高功率效率以及在多样化光照条件下保持高质量成像而不饱和或信息丢失,显著提升了在复杂光照环境中的性能。为填补现有研究空白,该研究构建了首个基于事件的民用基础设施缺陷检测数据集,不仅包含由DAVIS346相机捕获的空间-时间事件流,还同时记录了灰度强度图像帧。此数据集涵盖了两种主要缺陷类型(裂缝和剥落),并包含了来自现场与实验室环境的数据。此外,论文通过实时目标检测模型验证了该数据集的有效性,证明其能够在具有挑战性的光照条件下实现精确的缺陷检测与分类。
链接: https://arxiv.org/abs/2504.05679
作者: Udayanga G.W.K.N. Gamage,Xuanni Huo,Luca Zanatta,T Delbruck,Cesar Cadena,Matteo Fumagalli,Silvia Tolu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal paper submitted to the Sage SHM journal, currently under review. 25 pages, 19 figures, 5 tables. Keywords: Event-based vision, civil structural health monitoring, defect detection, crack, spalling, DVS, dataset, YOLOv6, SSD, 2D event histograms
Abstract:Small Unmanned Aerial Vehicle (UAV) based visual inspections are a more efficient alternative to manual methods for examining civil structural defects, offering safe access to hazardous areas and significant cost savings by reducing labor requirements. However, traditional frame-based cameras, widely used in UAV-based inspections, often struggle to capture defects under low or dynamic lighting conditions. In contrast, Dynamic Vision Sensors (DVS), or event-based cameras, excel in such scenarios by minimizing motion blur, enhancing power efficiency, and maintaining high-quality imaging across diverse lighting conditions without saturation or information loss. Despite these advantages, existing research lacks studies exploring the feasibility of using DVS for detecting civil structural defects, and there is no dedicated event-based dataset tailored for this purpose. Addressing this gap, this study introduces the first event-based civil infrastructure defect detection dataset, capturing defective surfaces as a spatio-temporal event stream using DVS. In addition to event-based data, the dataset includes grayscale intensity image frames captured simultaneously using an Active Pixel Sensor (APS). Both data types were collected using the DAVIS346 camera, which integrates DVS and APS. The dataset focuses on two types of defects: cracks and spalling, and includes data from both field and laboratory environments. The field dataset comprises 318 recording sequences, documenting 458 distinct cracks and 121 distinct spalling defects. The laboratory dataset includes 362 recording sequences, covering 220 distinct cracks and 308 spalling defects. Real-time object detection models were evaluated on it to validate the dataset. The results demonstrate the dataset's robustness in enabling accurate defect detection and classification, even under challenging lighting conditions.
[CV-69] Noisy Deep Ensemble: Accelerating Deep Ensemble Learning via Noise Injection
【速读】:该论文试图解决神经网络集成方法中训练时间随集成成员数量线性增加的问题。为了解决这一挑战,论文提出了一种新颖的“噪声深度集成(Noisy Deep Ensemble)”方法。其关键在于通过训练一个“父模型”至收敛后,以多种方式扰动父模型的权重来构建多个“子模型”。这种方法不仅能够探索不同的局部最优解,还显著减少了每个集成成员的训练时间,同时保持了与标准集成方法相当的测试精度。
链接: https://arxiv.org/abs/2504.05677
作者: Shunsuke Sakai,Shunsuke Tsuge,Tatsuhito Hasegawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural network ensembles are a simple yet effective approach for enhancing generalization capabilities. The most common method involves independently training multiple neural networks initialized with different weights and then averaging their predictions during inference. However, this approach increases training time linearly with the number of ensemble members. To address this issue, we propose the novel “Noisy Deep Ensemble” method, significantly reducing the training time required for neural network ensembles. In this method, a parent model is trained until convergence, and then the weights of the parent model are perturbed in various ways to construct multiple child models. This perturbation of the parent model weights facilitates the exploration of different local minima while significantly reducing the training time for each ensemble member. We evaluated our method using diverse CNN architectures on CIFAR-10 and CIFAR-100 datasets, surpassing conventional efficient ensemble methods and achieving test accuracy comparable to standard ensembles. Code is available at this https URL
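The parent/child construction above can be sketched as follows. The abstract says the parent weights are perturbed "in various ways"; additive Gaussian noise is only one assumed choice, a linear model stands in for a CNN, and any further training of the children is omitted. All function names are illustrative.

```python
import random


def make_children(parent_weights, num_children, noise_std, seed=0):
    """Create child weight vectors by adding Gaussian noise to the parent
    (assumed perturbation; the paper may use other schemes)."""
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise_std) for w in parent_weights]
            for _ in range(num_children)]


def predict(weights, x):
    """Toy linear model standing in for a trained network."""
    return sum(w * xi for w, xi in zip(weights, x))


def ensemble_predict(children, x):
    """Average the children's predictions, as in a standard ensemble."""
    return sum(predict(c, x) for c in children) / len(children)


parent = [0.5, -1.0]
children = make_children(parent, num_children=4, noise_std=0.1, seed=1)
averaged = ensemble_predict(children, [2.0, 1.0])
```

Only the parent is trained from scratch in this scheme, which is where the claimed saving over independently trained ensemble members would come from.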
[CV-70] VC-LLM: Automated Advertisement Video Creation from Raw Footage using Multi-modal LLMs
【速读】:该论文旨在解决广告短视频自动化高质量生成的问题,尤其针对同一产品创建多种不同视频内容时创意门槛高、人工效率低的挑战。论文提出的关键解决方案是VC-LLM框架,其核心在于利用大语言模型(Large Language Models)结合高分辨率空间输入和低分辨率时间输入来更有效地表征视频片段,同时通过重写真实文本生成补充信息,确保输出信息可直接追溯至输入,从而减少模型幻觉。此外,该方法通过构建预训练数据集和高质量微调数据集进一步优化模型性能,最终实现与人工创作质量相当甚至在叙事逻辑方面更优的广告短视频生成效果。
链接: https://arxiv.org/abs/2504.05673
作者: Dongjun Qian,Kai Su,Yiming Tan,Qishuai Diao,Xian Wu,Chang Liu,Bingyue Peng,Zehuan Yuan
机构: Bytedance Inc (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As short videos have risen in popularity, the role of video content in advertising has become increasingly significant. Typically, advertisers record a large amount of raw footage about the product and then create numerous different short-form advertisement videos based on this raw footage. Creating such videos mainly involves editing raw footage and writing advertisement scripts, which requires a certain level of creative ability. It is usually challenging to create many different video contents for the same product, and manual efficiency is often low. In this paper, we present VC-LLM, a framework powered by Large Language Models for the automatic creation of high-quality short-form advertisement videos. Our approach leverages high-resolution spatial input and low-resolution temporal input to represent video clips more effectively, capturing both fine-grained visual details and broader temporal dynamics. In addition, during training, we incorporate supplementary information generated by rewriting the ground truth text, ensuring that all key output information can be directly traced back to the input, thereby reducing model hallucinations. We also designed a benchmark to evaluate the quality of the created videos. Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans. Furthermore, we collected numerous high-quality short advertisement videos to create a pre-training dataset and manually cleaned a portion of the data to construct a high-quality fine-tuning dataset. Experiments indicate that, on the benchmark, the VC-LLM based on fine-tuned LLM can produce videos with superior narrative logic compared to those created by the VC-LLM based on GPT-4o.
[CV-71] Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation
【速读】:该论文旨在解决语音保留面部表情操控(Speech-preserving Facial Expression Manipulation, SPFEM)中的挑战,即在修改说话头以显示特定参考情感的同时,保持源输入语音内容的嘴部动画。论文指出,参考输入和源输入中存在的情感与内容信息虽然可以直接提供监督信号,但这些元素在说话过程中的内在交织限制了其作为监督信号的有效性。为此,论文提出通过创新的对比解耦表示学习(Contrastive Decoupled Representation Learning, CDRL)算法,学习内容和情感先验作为引导,实现内容和情感表示的解耦。关键在于设计了对比内容表示学习(Contrastive Content Representation Learning, CCRL)模块来提取音频特征中的内容信息,并结合对比情感表示学习(Contrastive Emotion Representation Learning, CERL)模块利用预训练的视觉-语言模型学习情感先验,同时引入情感感知和增强的对比学习策略,确保学习到的情感独立的内容表示和内容独立的情感表示。最终,解耦的内容和情感表示被用于指导SPFEM模型的生成过程,确保更准确的情感操控与音频-唇同步。
链接: https://arxiv.org/abs/2504.05672
作者: Tianshui Chen,Jianman Lin,Zhijing Yang,Chumei Qing,Yukai Shi,Liang Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
Abstract:Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio feature, which primarily contains content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.
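As background for the contrastive objectives mentioned above, here is a generic InfoNCE-style loss. It is not the CDRL, CCRL, or CERL loss, whose emotion-aware and emotion-augmented variants are not specified in the abstract; the function name and the temperature value are assumptions.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: pull the anchor toward its positive and away
    from the negatives. Inputs are plain feature vectors (lists of floats)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
    return log_sum - logits[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])     # positive matches
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])  # positive mismatched
```

The loss is near zero when the anchor matches its positive and large when a negative is the closer match, which is the signal a decoupling objective of this kind would rely on.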
[CV-72] Reconstruction-Free Anomaly Detection with Diffusion Models via Direct Latent Likelihood Evaluation
【速读】:该论文旨在解决基于扩散模型(Diffusion Models)的异常检测方法在实际应用中因依赖于资源密集型的图像重建过程而导致检测速度显著下降的问题。传统方法通过计算原始图像与去噪后图像之间的重构误差来衡量异常程度,这需要精细调整噪声强度,并对每个输入进行十次以上的网络评估。为克服这些局限性,论文提出了一种新的基于扩散的异常检测方法,其关键在于避免了复杂的图像重建步骤,而是直接推断输入图像对应的潜在变量,并在其高斯先验分布下的密度作为异常评分。令人惊讶的是,即使仅使用2到5步的短部分扩散过程,该先验密度也能有效作为异常分数。这种方法在MVTecAD数据集上的测试结果表明,它达到了AUC为0.991且帧速率为15 FPS的性能,从而在速度与AUC的权衡中实现了新的最先进水平。
链接: https://arxiv.org/abs/2504.05662
作者: Shunsuke Sakai,Tatsuhito Hasegawa
机构: University of Fukui (福井大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
Abstract:Diffusion models, with their robust distribution approximation capabilities, have demonstrated excellent performance in anomaly detection. However, conventional reconstruction-based approaches rely on computing the reconstruction error between the original and denoised images, which requires careful noise-strength tuning and over ten network evaluations per input, leading to significantly slower detection speeds. To address these limitations, we propose a novel diffusion-based anomaly detection method that circumvents the need for resource-intensive reconstruction. Instead of reconstructing the input image, we directly infer its corresponding latent variables and measure their density under the Gaussian prior distribution. Remarkably, the prior density proves effective as an anomaly score even when using a short partial diffusion process of only 2-5 steps. We evaluate our method on the MVTecAD dataset, achieving an AUC of 0.991 at 15 FPS, thereby setting a new state-of-the-art speed-AUC anomaly detection trade-off.
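The scoring step described above, measuring the density of an inferred latent under a Gaussian prior, reduces to a short formula. The sketch below covers only that step and assumes a standard normal prior N(0, I); how the latent is inferred with a 2-5 step partial diffusion process depends on the trained model and is not reproduced here. The function name is illustrative.

```python
import math


def gaussian_prior_score(latent):
    """Negative log-density of a latent vector under N(0, I).
    Higher values mean the latent is less likely, i.e. more anomalous."""
    d = len(latent)
    return 0.5 * sum(z * z for z in latent) + 0.5 * d * math.log(2.0 * math.pi)


typical = gaussian_prior_score([0.1, -0.2])  # close to the prior mode
outlier = gaussian_prior_score([3.0, 3.0])   # far out in the tails
```

Thresholding such a score would give a binary anomaly decision; how the threshold is chosen is left open here.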
[CV-73] Measuring Déjà vu Memorization Efficiently
【速读】:该论文试图解决预训练开放源代码图像表示和视觉-语言表示模型中意外记忆(memorization)的测量问题。传统方法需要训练两个模型来分别估计数据集级别的相关性和模型的意外记忆能力,但在大规模开放源代码模型上变得不可行。论文的关键解决方案是提出了一种替代的简单方法来估算数据集级别的相关性,并证明这些方法可以用来近似评估现成模型的意外记忆能力,而无需重新训练。这种方法首次实现了对预训练开放源代码模型中意外记忆能力的测量,并发现不同测量方式得到的汇总结果非常相似,同时开放源代码模型的整体意外记忆能力通常低于在子集数据上训练的类似模型。代码资源已提供用于视觉和视觉-语言模型。
链接: https://arxiv.org/abs/2504.05651
作者: Narine Kokhlikyan,Bargav Jayaraman,Florian Bordes,Chuan Guo,Kamalika Chaudhuri
机构: Meta (Facebook)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent research has shown that representation learning models may accidentally memorize their training data. For example, the déjà vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of the background - better than through dataset-level correlations. However, their measurement method requires training two models - one to estimate dataset-level correlations and the other to estimate memorization. This multiple model setup becomes infeasible for large open-source models. In this work, we propose alternative simple methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model’s memorization ability without any retraining. This enables, for the first time, the measurement of memorization in pre-trained open-source image representation and vision-language representation models. Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data. The code is available both for vision and vision language models.
[CV-74] POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud
【速读】:该论文旨在解决基于单帧 Frequency Modulated Continuous Wave (FMCW) 激光雷达点云的短时预测性目标检测问题。传统目标检测通常依赖多帧历史数据来捕捉目标的运动趋势,而该论文提出了一种新颖的预测性目标检测 (Predictive Object Detection, POD) 方法,通过利用 FMCW 激光雷达提供的径向速度信息,在仅基于当前帧观测的情况下预测目标的短期未来位置与尺寸。其解决方案的关键在于设计了一种基于射线投射机制生成虚拟未来点云的框架,构建包含当前帧与虚拟未来帧的两帧点云,并通过稀疏 4D 编码器提取两帧体素特征。随后,这些特征被分离并映射为两个鸟瞰图 (Bird’s Eye View, BEV) 特征:一个用于标准的目标检测,另一个用于未来的预测性检测,从而实现快速响应潜在危险的能力。
链接: https://arxiv.org/abs/2504.05649
作者: Yining Shi,Kun Jiang,Xin Zhao,Kangan Qian,Chuchu Xie,Tuopu Wen,Mengmeng Yang,Diange Yang
机构: School of Vehicle and Mobility, Tsinghua University (清华大学车辆与运载学院), State Key Laboratory of Intelligent Green Vehicle and Mobility (智能绿色车辆国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR-based 3D object detection is a fundamental task in the field of autonomous driving. This paper explores the unique advantage of Frequency Modulated Continuous Wave (FMCW) LiDAR in autonomous perception. Given a single frame FMCW point cloud with radial velocity measurements, we expect that our object detector can detect the short-term future locations of objects using only the current frame sensor data and demonstrate a fast ability to respond to intermediate danger. To achieve this, we extend the standard object detection task to a novel task named predictive object detection (POD), which aims to predict the short-term future location and dimensions of objects based solely on current observations. Typically, a motion prediction task requires historical sensor information to process the temporal contexts of each object, while our detector’s avoidance of multi-frame historical information enables a much faster response time to potential dangers. The core advantage of FMCW LiDAR lies in the radial velocity associated with every reflected point. We propose a novel POD framework, the core idea of which is to generate a virtual future point using a ray casting mechanism, create virtual two-frame point clouds with the current and virtual future frames, and encode these two-frame voxel features with a sparse 4D encoder. Subsequently, the 4D voxel features are separated by temporal indices and remapped into two Bird’s Eye View (BEV) features: one decoded for standard current frame object detection and the other for future predictive object detection. Extensive experiments on our in-house dataset demonstrate the state-of-the-art standard and predictive detection performance of the proposed POD framework.
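The role of per-point radial velocity can be illustrated with a simplified displacement rule: move each point along its line of sight by velocity times time. This is an assumption-laden stand-in for the paper's ray casting mechanism, which the abstract does not detail. The sketch places the sensor at the origin and treats positive radial velocity as motion away from the sensor; the function name is illustrative.

```python
import math


def virtual_future_point(point, radial_velocity, dt):
    """Shift a point along the sensor-to-point ray by radial_velocity * dt.
    Sensor assumed at the origin; positive velocity means moving away."""
    r = math.sqrt(sum(c * c for c in point))
    if r == 0.0:
        return tuple(point)  # no defined ray direction at the sensor itself
    step = radial_velocity * dt
    return tuple(c + step * c / r for c in point)


current = (3.0, 4.0, 0.0)  # 5 m from the sensor
future = virtual_future_point(current, radial_velocity=5.0, dt=1.0)
```

Stacking `current` and `future` points gives a two-frame cloud from a single scan, which is the kind of input the abstract says is passed to a sparse 4D encoder. Tangential motion is invisible to this rule, since only the radial velocity component is measured.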
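[CV-75] iEBAKER: Improved Remote Sensing Image-Text Retrieval Framework via Eliminate Before Align and Keyword Explicit Reasoning
[Quick read]: This paper addresses remote sensing image-text retrieval (RSITR), where existing Foundation Model (FM) based methods neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to a biased and superficial exploration of sample pairs. To address this, the paper proposes iEBAKER, whose core is an Eliminate Before Align (EBA) strategy that filters out weakly correlated sample pairs, with two specific schemes, distinguished by whether local and global similarity affect each other, for optimizing the embedding space. It also proposes a Sort After Reversed Retrieval (SAR) strategy that optimizes the similarity matrix via reverse retrieval, and a Keyword Explicit Reasoning (KER) module that strengthens the beneficial impact of subtle key concept distinctions. The method transfers directly from FMs to the RSITR task without additional pretraining on remote sensing data, and experiments on three popular benchmark datasets show that it surpasses state-of-the-art models while requiring less training data.
Link: https://arxiv.org/abs/2504.05644
Authors: Yan Zhang, Zhong Ji, Changxu Meng, Yanwei Pang, Jungong Han
Affiliations: School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University; Department of Automation, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: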
Abstract:Recent studies focus on the Remote Sensing Image-Text Retrieval (RSITR), which aims at searching for the corresponding targets based on the given query. Among these efforts, the application of Foundation Models (FMs), such as CLIP, to the domain of remote sensing has yielded encouraging outcomes. However, existing FM based methodologies neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose an approach named iEBAKER (an Improved Eliminate Before Align strategy with Keyword Explicit Reasoning framework) for RSITR. Specifically, we propose an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs, thereby mitigating their deviations from optimal embedding space during this http URL, two specific schemes are introduced from the perspective of whether local similarity and global similarity affect each other. On this basis, we introduce an alternative Sort After Reversed Retrieval (SAR) strategy, which aims to optimize the similarity matrix via reverse retrieval. Additionally, we incorporate a Keyword Explicit Reasoning (KER) module to facilitate the beneficial impact of subtle key concept distinctions. Without bells and whistles, our approach enables a direct transition from FM to RSITR task, eliminating the need for additional pretraining on remote sensing data. Extensive experiments conducted on three popular benchmark datasets demonstrate that our proposed iEBAKER method surpasses the state-of-the-art models while requiring less training data. Our source code will be released at this https URL.
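[CV-76] Time-Aware Auto White Balance in Mobile Photography
[Quick read]: This paper addresses auto white balance (AWB), the correction of color casts caused by scene illumination and the camera's spectral sensitivity. Conventional methods typically infer the global color cast solely from the color information in the camera's raw sensor image, ignoring other available contextual cues. The key contribution is a lightweight illuminant estimation method that combines contextual metadata provided by mobile devices (such as capture timestamp and geolocation), additional capture information, and image colors in a compact model (~5K parameters), matching or surpassing larger models. The method is validated on a newly introduced dataset of 3,224 smartphone images, which includes ground-truth illuminant colors determined using a color chart as well as user-preferred illuminants validated through a user study, providing a comprehensive benchmark for AWB evaluation.
Link: https://arxiv.org/abs/2504.05623
Authors: Mahmoud Afifi, Luxi Zhao, Abhijith Punnappurath, Mohammed A. Abdelsalam, Ran Zhang, Michael S. Brown
Affiliations: AI Center-Toronto, Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: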
Abstract:Cameras rely on auto white balance (AWB) to correct undesirable color casts caused by scene illumination and the camera’s spectral sensitivity. This is typically achieved using an illuminant estimator that determines the global color cast solely from the color information in the camera’s raw sensor image. Mobile devices provide valuable additional metadata, such as capture timestamp and geolocation, that offers strong contextual clues to help narrow down the possible illumination solutions. This paper proposes a lightweight illuminant estimation method that incorporates such contextual metadata, along with additional capture information and image colors, into a compact model (~5K parameters), achieving promising results, matching or surpassing larger models. To validate our method, we introduce a dataset of 3,224 smartphone images with contextual metadata collected at various times of day and under diverse lighting conditions. The dataset includes ground-truth illuminant colors, determined using a color chart, and user-preferred illuminants validated through a user study, providing a comprehensive benchmark for AWB evaluation.
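To make the role of the illuminant estimate concrete, the sketch below shows how an estimated illuminant color is commonly applied as per-channel gains (a diagonal correction) and how an estimate can be compared against a ground-truth illuminant with the angular error, a metric widely used for illuminant estimation. Neither function is taken from the paper; the names and the choice to normalize gains to the green channel are assumptions made for illustration.

```python
import math


def apply_white_balance(rgb, illuminant):
    """Apply a diagonal correction: divide each channel by the illuminant.

    Assumption: gains are normalized so that the green channel is unchanged.
    """
    r, g, b = rgb
    r_l, g_l, b_l = illuminant
    return (r * g_l / r_l, g, b * g_l / b_l)


def angular_error_deg(estimate, ground_truth):
    """Angle in degrees between two illuminant color vectors."""
    dot = sum(e * t for e, t in zip(estimate, ground_truth))
    norm_e = math.sqrt(sum(e * e for e in estimate))
    norm_t = math.sqrt(sum(t * t for t in ground_truth))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_e * norm_t)))
    return math.degrees(math.acos(cos))
```

A gray surface lit by an illuminant `(0.5, 1.0, 0.25)` is recorded with that same color, so correcting it with the true illuminant maps it back to neutral gray.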
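[CV-77] Technical Report: Full Version of Analyzing and Optimizing Perturbation of DP-SGD Geometrically ICDE2025
[Quick read]: This paper addresses the trade-off between differential privacy (DP) protection and model efficiency in deep learning. DP-SGD protects privacy by directly perturbing gradients, but it fails to mitigate the negative impact of noise on the gradient direction, which limits model performance. The paper identifies the root cause of DP-SGD's inefficiency and proposes a geometric perturbation strategy, GeoDP, which perturbs the direction and the magnitude of a gradient separately. By reducing the noise on the direction, GeoDP improves model efficiency under the same DP guarantee. Experiments on multiple datasets and models confirm the effectiveness and generality of GeoDP.
Link: https://arxiv.org/abs/2504.05618
Authors: Jiawei Duan, Haibo Hu, Qingqing Ye, Xinyue Sun
Affiliations: The Hong Kong Polytechnic University; Harbin Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: This is the full version of our paper “Analyzing and Optimizing Perturbation of DP-SGD Geometrically”, which will appear in ICDE 2025 as a regular research paper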
Abstract:Differential privacy (DP) has become a prevalent privacy model in a wide range of machine learning tasks, especially after the debut of DP-SGD. However, DP-SGD, which directly perturbs gradients in the training iterations, fails to mitigate the negative impacts of noise on gradient direction. As a result, DP-SGD is often inefficient. Although various solutions (e.g., clipping to reduce the sensitivity of gradients and amplifying privacy bounds to save privacy budgets) are proposed to trade privacy for model efficiency, the root cause of its inefficiency has yet to be unveiled. In this work, we first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on direction has a significant impact on the model efficiency while that on magnitude can be mitigated by optimization techniques, i.e., fine-tuning gradient clipping and learning rate. Besides, we confirm that traditional DP introduces biased noise on the direction when adding unbiased noise to the gradient itself. Overall, the perturbation of DP-SGD is actually sub-optimal from a geometric perspective. Motivated by this, we design a geometric perturbation strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient, respectively. By directly reducing the noise on the direction, GeoDP mitigates the negative impact of DP noise on model efficiency with the same DP guarantee. Extensive experiments on two public datasets (i.e., MNIST and CIFAR-10), one synthetic dataset and three prevalent models (i.e., Logistic Regression, CNN and ResNet) confirm the effectiveness and generality of our strategy.
Cite as: arXiv:2504.05618 [cs.LG] (or arXiv:2504.05618v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2504.05618. Journal reference: International Conference of Data Engineering (ICDE 2025)
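The abstract's argument rests on viewing a gradient as a magnitude plus a direction. The sketch below is not GeoDP; it only illustrates the baseline the paper analyzes, assuming the usual DP-SGD recipe of L2 clipping followed by Gaussian noise with standard deviation `noise_multiplier * clip_norm`, applied here to a single gradient vector for simplicity. The helper names are hypothetical. Comparing directions with cosine similarity shows how noise added to the gradient itself disturbs its direction.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def clip_gradient(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = l2_norm(grad)
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def perturb_gradient(grad, clip_norm, noise_multiplier, rng):
    """Clip, then add independent Gaussian noise to every coordinate.

    Simplification: real DP-SGD clips per-example gradients and adds noise
    to their sum; this sketch handles one gradient vector only.
    """
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clip_gradient(grad, clip_norm)]


def magnitude_and_direction(grad):
    """Split a nonzero gradient into its L2 norm and a unit vector."""
    norm = l2_norm(grad)
    return norm, [g / norm for g in grad]


def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))
```

Calling `cosine_similarity(grad, perturb_gradient(grad, 1.0, 1.0, random.Random(0)))` for a few seeds gives a quick feel for how far the noisy direction drifts from the true one.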
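[CV-78] Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
[Quick read]: This paper addresses the gap in speed and accuracy between unsupervised image segmentation algorithms and supervised methods. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, but they solve a relaxed Normalized Cut recursively via the Fiedler vector and so still lag behind supervised methods. The paper proposes a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut that does not rely on recursive eigenvector computations and substantially improves speed and accuracy. Falcon operates in two stages: the first solves a fast K-way Normalized Cut by extending it into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; the second refines the resulting masks using complementary low-level information to produce high-quality pixel-level segmentations. Experiments show that Falcon surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (up to 4.3% on Cityscapes) while reducing runtime by around 30% compared with prior graph-based approaches. These findings indicate that the semantic information in foundation-model attention can be harnessed effectively by a highly parallelizable graph cut framework, narrowing the gap between unsupervised and supervised segmentation and improving scalability in real-world applications.
Link: https://arxiv.org/abs/2504.05613
Authors: Xiao Zhang, Xiangyu Han, Xiwen Lai, Yao Sun, Pei Zhang, Konrad Kording
Affiliations: University of Pennsylvania; Hong Kong Polytechnic University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: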
Abstract:Today’s unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both mask generation speed and segmentation accuracy. We present a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut without relying on recursive eigenvector computations, achieving substantially improved speed and accuracy. Falcon operates in two stages: (1) a fast K-way Normalized Cut solved by extending into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; and (2) refinement of the resulting masks using complementary low-level information, producing high-quality pixel-level segmentations. Experiments show that Falcon not only surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (reaching up to 4.3% improvement on Cityscapes), but also reduces runtime by around 30% compared to prior graph-based approaches. These findings demonstrate that the semantic information within foundation-model attention can be effectively harnessed by a highly parallelizable graph cut framework. Consequently, Falcon can narrow the gap between unsupervised and supervised segmentation, enhancing scalability in real-world applications and paving the way for dense prediction-based vision pre-training in various downstream tasks. The code is released in this https URL.
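For readers unfamiliar with the objective behind these methods, the sketch below evaluates the classic two-way Normalized Cut, Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), on a toy affinity matrix and finds the best bipartition by brute force. It illustrates only the objective that both the Fiedler-vector relaxation and Falcon target; it is not Falcon's fractional alternating procedure, and the function names are hypothetical.

```python
from itertools import combinations


def normalized_cut(weights, group_a):
    """Two-way Normalized Cut value for a symmetric affinity matrix.

    Assumes every node has at least one positive edge, so that neither
    association term is zero.
    """
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


def best_bipartition(weights):
    """Exhaustive search; only feasible for tiny graphs."""
    n = len(weights)
    best = None
    for size in range(1, n // 2 + 1):
        for group in combinations(range(n), size):
            score = normalized_cut(weights, group)
            if best is None or score < best[0]:
                best = (score, set(group))
    return best


# Toy graph: two tightly linked pairs (0-1 and 2-3) joined by one weak edge.
TOY_AFFINITY = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

Exhaustive search grows exponentially with the number of nodes, which is why practical methods relax or otherwise approximate the problem.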
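[CV-79] PyTopo3D: A Python Framework for 3D SIMP-based Topology Optimization
[Quick read]: This paper addresses the limited availability of readily usable, open-source three-dimensional topology optimization (3D TO) tools in the popular Python scientific environment. To fill this gap, it introduces PyTopo3D, a software framework written in Python. PyTopo3D implements the well-established Solid Isotropic Material with Penalization (SIMP) method and an Optimality Criteria (OC) update scheme, adapted and significantly enhanced from the efficient MATLAB code by Liu and Tovar (2014), and gains performance from sparse matrix operations, optional parallel solvers, and accelerated KD-Tree sensitivity filtering. It also supports practical engineering workflows through direct import of complex design domains and non-design obstacles via STL files, integrated 3D visualization of the optimization process, and direct STL export of optimized geometries. These features make PyTopo3D an accessible, performance-aware tool that helps engineers, students, and researchers use 3D TO within their existing Python-based workflows.
Link: https://arxiv.org/abs/2504.05604
Authors: Jihoon Kim, Namwoo Kang
Affiliations: Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology; Narnia Labs; Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: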
Abstract:Three-dimensional topology optimization (TO) is a powerful technique in engineering design, but readily usable, open-source implementations remain limited within the popular Python scientific environment. This paper introduces PyTopo3D, a software framework developed to address this gap. PyTopo3D provides a feature-rich tool for 3D TO by implementing the well-established Solid Isotropic Material with Penalization (SIMP) method and an Optimality Criteria (OC) update scheme, adapted and significantly enhanced from the efficient MATLAB code by Liu and Tovar (2014). While building on proven methodology, PyTopo3D’s primary contribution is its integration and extension within Python, leveraging sparse matrix operations, optional parallel solvers, and accelerated KD-Tree sensitivity filtering for performance. Crucially, it incorporates functionalities vital for practical engineering workflows, including the direct import of complex design domains and non-design obstacles via STL files, integrated 3D visualization of the optimization process, and direct STL export of optimized geometries for manufacturing or further analysis. PyTopo3D is presented as an accessible, performance-aware tool and citable reference designed to empower engineers, students, and researchers to more easily utilize 3D TO within their existing Python-based workflows.
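The two named building blocks can be sketched compactly. The code below follows the textbook form of SIMP stiffness interpolation and of the Optimality Criteria update with bisection on the Lagrange multiplier, as popularized by the short educational topology optimization codes. It is not PyTopo3D's API: the function names, the default values, and the use of plain lists instead of arrays are assumptions chosen to keep the example dependency-free.

```python
import math


def simp_modulus(density, e0=1.0, e_min=1e-9, penal=3.0):
    """SIMP interpolation: E(x) = E_min + x^p * (E0 - E_min)."""
    return e_min + density ** penal * (e0 - e_min)


def oc_update(x, dc, dv, volfrac, move=0.2):
    """One Optimality Criteria update of the element densities.

    x  : current densities in [0, 1]
    dc : compliance sensitivities (expected to be non-positive)
    dv : volume sensitivities (positive)
    The Lagrange multiplier is found by bisection so that the mean of the
    updated densities matches the target volume fraction.
    """
    lo, hi = 0.0, 1e9
    x_new = list(x)
    while (hi - lo) / (hi + lo) > 1e-3:
        mid = 0.5 * (lo + hi)
        x_new = [
            max(0.0, max(xi - move,
                         min(1.0, min(xi + move,
                                      xi * math.sqrt(-dci / dvi / mid)))))
            for xi, dci, dvi in zip(x, dc, dv)
        ]
        if sum(x_new) > volfrac * len(x):
            lo = mid  # too much material: raise the multiplier
        else:
            hi = mid
    return x_new
```

Elements with larger sensitivity magnitude receive more material, while the `move` limit caps how far any density can change in a single iteration.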
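[CV-80] AD-Det: Boosting Object Detection in UAV Images with Focused Small Objects and Balanced Tail Classes
[Quick read]: This paper addresses two challenges of object detection in Unmanned Aerial Vehicle (UAV) images: complex scale variations and class imbalance. Existing methods often address these challenges separately, overlooking the intricate nature of UAV images and the potential synergy between the two. The paper proposes AD-Det, a framework built on a coherent coarse-to-fine strategy that integrates two components: Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE uses a high-resolution feature map to identify and cluster regions containing small objects, which are then enlarged and processed by a fine-grained detector. DCC performs object-level resampling by dynamically pasting tail-class objects around the cluster centers obtained by ASOE, maintaining a dynamic memory bank for each tail class. This design both extracts small-object regions for precise detection and resamples tail-class objects dynamically, addressing scale variation and class imbalance in UAV images jointly and adaptively.
Link: https://arxiv.org/abs/2504.05601
Authors: Zhenteng Li, Sheng Lian, Dengfeng Pan, Youlin Wang, Wei Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: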
Abstract:Object detection in Unmanned Aerial Vehicle (UAV) images poses significant challenges due to complex scale variations and class imbalance among objects. Existing methods often address these challenges separately, overlooking the intricate nature of UAV images and the potential synergy between them. In response, this paper proposes AD-Det, a novel framework employing a coherent coarse-to-fine strategy that seamlessly integrates two pivotal components: Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE utilizes a high-resolution feature map to identify and cluster regions containing small objects. These regions are subsequently enlarged and processed by a fine-grained detector. On the other hand, DCC conducts object-level resampling by dynamically pasting tail classes around the cluster centers obtained by ASOE, maintaining a dynamic memory bank for each tail class. This approach enables AD-Det to not only extract regions with small objects for precise detection but also dynamically perform reasonable resampling for tail-class objects. Consequently, AD-Det enhances the overall detection performance by addressing the challenges of scale variations and class imbalance in UAV images through a synergistic and adaptive framework. We extensively evaluate our approach on two public datasets, i.e., VisDrone and UAVDT, and demonstrate that AD-Det significantly outperforms existing competitive alternatives. Notably, AD-Det achieves a 37.5% Average Precision (AP) on the VisDrone dataset, surpassing its counterparts by at least 3.1%.
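[CV-81] Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
[Quick read]: This paper addresses the difficulty of balancing fidelity and editability in text-based image editing (TIE). Existing methods typically rely on attention injection to preserve structure and on the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack an explicit, unified mechanism for optimizing both objectives together. The paper proposes UnifyEdit, a tuning-free method that performs diffusion latent optimization to integrate fidelity and editability in a balanced way within a unified framework. The key innovation is a pair of attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity and a cross-attention (CA) alignment constraint that enhances text alignment for improved editability. Applying both constraints simultaneously can cause gradient conflicts, where the dominance of one constraint leads to over- or under-editing. The paper therefore introduces an adaptive time-step scheduler that dynamically adjusts the influence of the constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments show that UnifyEdit achieves a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods.
Link: https://arxiv.org/abs/2504.05594
Authors: Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang
Affiliations: State Key Laboratory of Media Convergence and Communication, Communication University of China; Show Lab, National University of Singapore; University of California at Merced; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review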
Abstract:Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at this https URL.
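[CV-82] CoA: Towards Real Image Dehazing via Compression-and-Adaptation
[Quick read]: This paper addresses the uncertain performance of learning-based dehazing algorithms in real scenes, which stems from computational resource constraints and the diversity of real-world scenes. Its key contribution is a computational flow named Compression-and-Adaptation (CoA), which takes a divide-and-conquer perspective: model compression is performed in the synthetic domain to build an efficient, compact parameter space, and a bilevel adaptation is introduced in the real domain that aggregates the synthetic dehazing capabilities during learning, enabling adaptation to unknown real environments. The method exhibits domain-irrelevant stability and model-agnostic flexibility, bridging the gap between synthetic and real domains and improving practical utility. The code is publicly available.
Link: https://arxiv.org/abs/2504.05590
Authors: Long Ma, Yuxin Feng, Yan Zhang, Jinyuan Liu, Weimin Wang, Guang-Yong Chen, Chengpei Xu, Zhuo Su
Affiliations: Dalian University of Technology; Sun Yat-sen University; Fuzhou University; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: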
Abstract:Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing is still in suspense due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to be fearless in unknown real environments by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach’s superiority and effectiveness. The code is publicly available at this https URL.
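[CV-83] Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification
[Quick read]: This paper addresses misclassification in visual classification with deep neural networks, which arises because existing attention mechanisms emphasize the representation of discriminative features over their precise localization, a weakness that is most pronounced on transfer or out-of-distribution datasets. The paper notes that humans can use prior knowledge to quickly localize and compare fine-grained attributes, an ability that existing methods struggle to match, particularly in complex, high-variance scenarios. The key solution is Gaze-CIFAR-10, a dataset of human gaze time series, together with a dual-sequence gaze encoder that models the precise sequential localization of human attention on local attributes. In parallel, a Vision Transformer (ViT) learns a sequential representation of the image content, and cross-modal fusion combines human gaze priors with the machine-derived visual sequences, correcting inaccurate localization in the image feature representations. Experiments show that gaze-guided cognitive cues significantly improve classification accuracy.
Link: https://arxiv.org/abs/2504.05583
Authors: Jiahang Li, Shibo Xue, Yong Su
Affiliations: Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables, URL: this https URL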
Abstract:Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.
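[CV-84] TAPNext: Tracking Any Point (TAP) as Next Token Prediction
[Quick read]: This paper addresses Tracking Any Point (TAP) in video, a challenging computer vision problem. Existing methods rely on complex tracking-specific inductive biases and heuristics, which limits their generality and potential for scaling. The paper proposes TAPNext, which recasts TAP as sequential masked token decoding. Its key properties are that the model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases, which lets TAPNext run with minimal latency and avoids the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers, and many widely used tracking heuristics emerge naturally through end-to-end training.
Link: https://arxiv.org/abs/2504.05579
Authors: Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: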
Abstract:Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves a new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
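[CV-85] SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding CVPR2025
[Quick read]: This paper addresses the generation of ambient sound for arbitrary scenes at novel viewpoints. Existing methods typically rely on prior knowledge of sound source details or on specific constraints, whereas the proposed method needs neither and adapts efficiently to diverse room layouts, reference microphone configurations, and unseen environments. The key is a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. These embeddings are first used to optimize the placement of reference microphones in a given scene; during synthesis, multiple embeddings extracted from reference locations yield adaptive weights for their contributions, conditioned on the target viewpoint. SoundVista shows significant improvements over existing methods on both publicly available data and real-world settings.
Link: https://arxiv.org/abs/2504.05576
Authors: Mingfei Chen, Israel D. Gebru, Ishwarya Ananthabhotla, Christian Richardt, Dejan Markovic, Jake Sandakly, Steven Krenn, Todd Keebler, Eli Shlizerman, Alexander Richard
Affiliations: University of Washington; Codec Avatars Lab, Pittsburgh, Meta; Reality Labs Research, Meta
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Highlight Accepted to CVPR 2025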
Abstract:We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint. The method learns the underlying acoustic transfer function that relates the signals acquired at the distributed microphones to the signal at the target viewpoint, using a limited number of known recordings. Unlike existing works, our method does not require constraints or prior knowledge of sound source details. Moreover, our method efficiently adapts to diverse room layouts, reference microphone configurations and unseen environments. To enable this, we introduce a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. We first leverage these embeddings to optimize the placement of reference microphones in any given scene. During synthesis, we leverage multiple embeddings extracted from reference locations to get adaptive weights for their contribution, conditioned on target viewpoint. We benchmark the task on both publicly available data and real-world settings. We demonstrate significant improvements over existing methods.
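[CV-86] A Lightweight Large Vision-language Model for Multimodal Medical Images
[Quick read]: This paper addresses the challenge of developing efficient, high-performance models for Medical Visual Question Answering (VQA), given the complexity of medical imagery and the diversity of modalities. It proposes a lightweight, multimodal VQA model that combines BiomedCLIP for image feature extraction with LLaMA-3 for text processing. With this architecture, the model achieves state-of-the-art performance on the OmniMedVQA dataset with approximately 8 billion parameters and requires only two NVIDIA 40 GB A100 GPUs. The model reaches 73.4% accuracy on open-ended questions, surpassing existing models and supporting its potential for real-world clinical applications. The main contributions are a multimodal VQA model specialized for medical tasks, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
Link: https://arxiv.org/abs/2504.05575
Authors: Belal Alsinglawi, Chris McCarthy, Sara Webb, Christopher Fluke, Navid Toosy Saidy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures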
Abstract:Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and diverse modalities. In this paper, we introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. Our results show 73.4% accuracy for open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
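[CV-87] Improved Stochastic Texture Filtering Through Sample Reuse SIGGRAPH
[Quick read]: This paper addresses the aliasing that arises when stochastic texture filtering (STF) is used for texture magnification, as well as the potentially undesirable appearance changes caused by the inability to smoothly interpolate material properties stored in textures (such as surface normals). The key solution is a novel method that improves the quality of stochastically filtered magnified textures and reduces the image difference relative to traditional texture filtering. Specifically, when textures are magnified, the method shares texel values among pixels, improving image quality at a small increase in cost (0.04–0.14 ms per frame). The paper also improves weighted importance sampling so that the method never increases error beyond that of single-sample stochastic texture filtering. Under high magnification, the method achieves 10 dB higher PSNR than single-sample STF, and results show greatly improved image quality both with and without spatiotemporal denoising.
Link: https://arxiv.org/abs/2504.05562
Authors: Bartlomiej Wronski, Matt Pharr, Tomas Akenine-Möller
Affiliations: NVIDIA (Brooklyn); NVIDIA (San Francisco); NVIDIA (Lund)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2025 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D 2025)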
Abstract:Stochastic texture filtering (STF) has re-emerged as a technique that can bring down the cost of texture filtering of advanced texture compression methods, e.g., neural texture compression. However, during texture magnification, the swapped order of filtering and shading with STF can result in aliasing. The inability to smoothly interpolate material properties stored in textures, such as surface normals, leads to potentially undesirable appearance changes. We present a novel method to improve the quality of stochastically-filtered magnified textures and reduce the image difference compared to traditional texture filtering. When textures are magnified, nearby pixels filter similar sets of texels and we introduce techniques for sharing texel values among pixels with only a small increase in cost (0.04–0.14 ms per frame). We propose an improvement to weighted importance sampling that guarantees that our method never increases error beyond single-sample stochastic texture filtering. Under high magnification, our method has 10 dB higher PSNR than single-sample STF. Our results show greatly improved image quality both with and without spatiotemporal denoising.
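As background for the single-sample baseline mentioned above, the sketch below contrasts ordinary bilinear filtering with a stochastic variant that fetches one texel, chosen with probability equal to its bilinear weight. On average the stochastic result equals the bilinear one, at the cost of noise. This is only an illustration of the baseline idea on a single-channel texture; it does not implement the paper's sample sharing or its weighted importance sampling, and the function names and the convention that texel centers sit at integer coordinates are assumptions.

```python
import math
import random


def bilinear_footprint(u, v):
    """Return the four texels around (u, v) with their bilinear weights.

    Assumption: texel centers lie at integer coordinates, and the caller
    keeps (u, v) far enough inside the texture that x0 + 1 and y0 + 1 exist.
    """
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    return [
        ((x0, y0), (1 - fx) * (1 - fy)),
        ((x0 + 1, y0), fx * (1 - fy)),
        ((x0, y0 + 1), (1 - fx) * fy),
        ((x0 + 1, y0 + 1), fx * fy),
    ]


def bilinear_filter(texture, u, v):
    """Deterministic filter: weighted average of the four texels."""
    return sum(w * texture[y][x] for (x, y), w in bilinear_footprint(u, v))


def stochastic_filter(texture, u, v, rng):
    """Single-sample estimate: pick one texel with probability = its weight."""
    texels, weights = zip(*bilinear_footprint(u, v))
    x, y = rng.choices(texels, weights=weights, k=1)[0]
    return texture[y][x]
```

Averaging many calls to `stochastic_filter` at a fixed `(u, v)` converges to the value returned by `bilinear_filter`, which is the property that makes the single fetch a valid estimator.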
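[CV-88] Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
[Quick read]: This paper addresses the shortcomings of existing video captioning methods in fine-grained, object-level description: they either produce overly abstract descriptions or lack object-level precision. It proposes CAT-V (Caption AnyThing in Video), a framework that requires no additional training data. The key is the integration of three core components: a Segmenter based on SAMURAI for precise object segmentation across frames; a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis; and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, CAT-V generates detailed, temporally aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts. It also supports flexible user interaction and maintains temporal sensitivity and spatial accuracy by tracking object states and interactions.
Link: https://arxiv.org/abs/2504.05541
Authors: Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu
Affiliations: University of Rochester; Sony Group Corporation; CMU; Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: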
Abstract:We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects’ attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at this https URL
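[CV-89] Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling
[Quick read]: This paper addresses bandwidth optimization for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion, it uses the First Order Motion Model (FOMM), which identifies the keypoints of dynamic objects and their local affine transformations with a self-supervised keypoint detector and arranges the keypoints into a time series corresponding to successive frames. The key to the solution is integrating two advanced generative time series models, the Variational Recurrent Neural Network (VRNN) and the Gated Recurrent Unit with Normalizing Flow (GRU-NF), into the motion transfer pipeline for keypoint forecasting. The predicted keypoints are then synthesized into realistic video frames by an optical flow estimator paired with a generator network, enabling accurate video forecasting and efficient low-frame-rate video transmission. The method is validated on multiple datasets: the VRNN-integrated FOMM shows superior reconstruction in multi-step-ahead forecasting tasks such as video conferencing, while the GRU-NF-based FOMM produces diverse future samples while maintaining high visual quality, which suits tasks such as real-time video-based anomaly detection.
Link: https://arxiv.org/abs/2504.05537
Authors: Tasmiah Haque, Md. Asif Bin Syed, Byungheon Jeong, Xue Bai, Sumit Mohan, Somdyuti Paul, Imtiaz Ahmed, Srinjoy Das
Affiliations: Department of Industrial and Management Systems Engineering, West Virginia University; Coupa Software; Lyda Hill Department of Bioinformatics, UT Southwestern Medical Center; Intel Corporation; Department of Artificial Intelligence, Indian Institute of Technology Kharagpur; School of Mathematical and Data Sciences, Department of Industrial and Management Systems Engineering, West Virginia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: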
Abstract:We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints and their associated local affine transformations. These keypoints are identified using a self-supervised keypoint detector and arranged into a time series corresponding to the successive frames. Forecasting is performed on these keypoints by integrating two advanced generative time series models into the motion transfer pipeline, namely the Variational Recurrent Neural Network (VRNN) and the Gated Recurrent Unit with Normalizing Flow (GRU-NF). The predicted keypoints are subsequently synthesized into realistic video frames using an optical flow estimator paired with a generator network, thereby facilitating accurate video forecasting and enabling efficient, low-frame-rate video transmission. We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement. Our results confirm that by utilizing the superior reconstruction property of the Variational Autoencoder, the VRNN integrated FOMM excels in applications involving multi-step ahead forecasts such as video conferencing. On the other hand, by leveraging the Normalizing Flow architecture for exact likelihood estimation, and enabling efficient latent space sampling, the GRU-NF based FOMM exhibits superior capabilities for producing diverse future samples while maintaining high visual quality for tasks like real-time video-based anomaly detection.
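[CV-90] PartStickers: Generating Parts of Objects for Rapid Prototyping CVPR
[Quick read]: This paper addresses the problem that tools used in design prototyping generate only whole objects and cannot flexibly generate object parts. Existing text-to-image methods tend to produce complete objects, which falls short when specific parts are needed, for example when constructing a novel creature for a video game. The paper proposes a new task and method called part sticker generation, whose key idea is to generate an isolated part of an object on a neutral background while preserving object-level generation capability. Experiments show that the method outperforms state-of-the-art baselines in realism and text alignment.
Link: https://arxiv.org/abs/2504.05508
Authors: Mo Zhou, Josh Myers-Dean, Danna Gurari
Affiliations: University of Colorado Boulder; The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR CVEU workshop 2025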
Abstract:Design prototyping involves creating mockups of products or concepts to gather feedback and iterate on ideas. While prototyping often requires specific parts of objects, such as when constructing a novel creature for a video game, existing text-to-image methods tend to only generate entire objects. To address this, we propose a novel task and method of “part sticker generation”, which entails generating an isolated part of an object on a neutral background. Experiments demonstrate our method outperforms state-of-the-art baselines with respect to realism and text alignment, while preserving object-level generation capabilities. We publicly share our code and models to encourage community-wide progress on this new task: this https URL.
[CV-91] SelfMAD: Enhancing Generalization and Robustness in Morphing Attack Detection via Self-Supervised Learning
[Quick read]: This paper addresses the threat that face morphing attacks pose to existing face verification systems, such as identity fraud. Supervised Morphing Attack Detection (MAD) methods rely on discriminative models trained on bona fide and morphed images; they perform well on morphs produced by techniques seen during training, but degrade markedly on unseen attack techniques. Unsupervised methods generalize better but have higher error rates, mainly because they struggle to capture subtle artifacts. The paper proposes SelfMAD, a self-supervised method whose key idea is to simulate general morphing attack artifacts so that classifiers learn generic and robust decision boundaries instead of overfitting to the artifacts of particular morphing methods. In cross-morph settings, SelfMAD reduces the equal error rate (EER) by more than 64% relative to the strongest unsupervised competitor and by more than 66% relative to the best-performing discriminative MAD model.
Link: https://arxiv.org/abs/2504.05504
Authors: Marija Ivanovska, Leon Todorov, Naser Damer, Deepak Kumar Jain, Peter Peer, Vitomir Štruc
Affiliations: Faculty of Electrical Engineering, University in Ljubljana, Slovenia; Faculty of Computer and Information Science, University in Ljubljana, Slovenia; Dalian University of Technology, China; Fraunhofer Institute for Computer Graphics Research, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)
Abstract:With the continuous advancement of generative models, face morphing attacks have become a significant challenge for existing face verification systems due to their potential use in identity fraud and other malicious activities. Contemporary Morphing Attack Detection (MAD) approaches frequently rely on supervised, discriminative models trained on examples of bona fide and morphed images. These models typically perform well with morphs generated with techniques seen during training, but often lead to sub-optimal performance when subjected to novel unseen morphing techniques. While unsupervised models have been shown to perform better in terms of generalizability, they typically result in higher error rates, as they struggle to effectively capture features of subtle artifacts. To address these shortcomings, we present SelfMAD, a novel self-supervised approach that simulates general morphing attack artifacts, allowing classifiers to learn generic and robust decision boundaries without overfitting to the specific artifacts induced by particular face morphing methods. Through extensive experiments on widely used datasets, we demonstrate that SelfMAD significantly outperforms current state-of-the-art MADs, reducing the detection error by more than 64% in terms of EER when compared to the strongest unsupervised competitor, and by more than 66%, when compared to the best performing discriminative MAD model, tested in cross-morph settings. The source code for SelfMAD is available at this https URL.
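The abstract reports its gains in terms of EER, the operating point at which the two error rates of a binary detector are equal. As a minimal sketch of how such a number can be computed from raw detector scores (the function name, the score convention "higher means more likely a morph", and the threshold sweep are assumptions for illustration, not the authors' evaluation code):

```python
def equal_error_rate(bona_fide_scores, morph_scores):
    """Return (eer, threshold), assuming a higher score means 'more likely a morph'."""
    thresholds = sorted(set(bona_fide_scores) | set(morph_scores))
    best = None
    for t in thresholds:
        # Missed attacks: morphs scored below the threshold.
        miss_rate = sum(s < t for s in morph_scores) / len(morph_scores)
        # False alarms: bona fide images scored at or above the threshold.
        false_alarm_rate = sum(s >= t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(miss_rate - false_alarm_rate)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (miss_rate + false_alarm_rate) / 2, t)
    return best[1], best[2]
```

On finite data the two rates rarely cross exactly, so this sketch returns the midpoint at the threshold where they are closest; evaluation toolkits often interpolate instead.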
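[CV-92] Few-shot Personalized Scanpath Prediction CVPR2025
[Quick read]: This paper addresses the high data requirements of personalized scanpath prediction, i.e., how to build a personalized scanpath prediction model for a new individual from only a few examples. Existing methods typically need large amounts of data to perform well. The proposed Few-Shot Personalized Scanpath Prediction (FS-PSP) task instead predicts scanpaths for an unseen subject from a small support set of that subject's own scanpath data.
The key to the solution is the Subject-Embedding Network (SE-Net), designed to capture a unique representation of each individual's scanpath patterns. SE-Net produces subject embeddings that separate different subjects while minimizing variability among scanpaths from the same person, which gives the model its adaptability. The personalized scanpath prediction model is conditioned on these embeddings to produce accurate, personalized outputs. Experiments on multiple eye-tracking datasets show strong performance with no fine-tuning at test time.
Link: https://arxiv.org/abs/2504.05499
Authors: Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Affiliations: Stony Brook University, USA; EPFL, Switzerland; The University of Adelaide, Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025, 20 pages, 10 figures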
Abstract:A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction (FS-PSP) task and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject’s scanpath behavior. The key to our method’s adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject’s scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: this https URL
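[CV-93] REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding CVPR
[Quick read]: This paper addresses a limitation of existing video understanding methods for untrimmed videos: compressing visual memory with similarity-based greedy strategies can overlook the contextual importance of spatio-temporal tokens. It proposes an efficient LLM adapter for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. The key innovation is to use scorer networks to selectively compress the visual memory bank and to filter spatial tokens by relevance, with a differentiable Top-K operator enabling end-to-end training. On three video-level understanding tasks (untrimmed video classification, video question answering, and video captioning) and four large-scale datasets, the method achieves competitive or superior results while reducing computational overhead by up to 34%.
Link: https://arxiv.org/abs/2504.05491
Authors: Sakib Reza, Xiyun Song, Heather Yu, Zongfang Lin, Mohsen Moghaddam, Octavia Camps
Affiliations: Northeastern University; Futurewei Technologies Inc; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPRW’25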
Abstract:Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
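To make the relevance-based filtering step concrete, the sketch below keeps the k tokens with the highest scores. It is a deliberately simplified illustration: the scores would come from a learned scorer network, and the paper uses a differentiable Top-K operator for training, whereas this hard selection is not differentiable. The function name and the list-based token representation are assumptions.

```python
def top_k_tokens(tokens, scores, k):
    """Keep the k tokens with the highest relevance score, preserving input order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank token indices by descending relevance score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors so the original token order is kept.
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]
```

Preserving the original order is a design choice made here for readability of the output; whether order matters downstream depends on how positions are encoded.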
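[CV-94] Secure Diagnostics: Adversarial Robustness Meets Clinical Interpretability
[Quick read]: This paper addresses the poor generalization of deep neural networks for medical image classification, which stems from violations of the independent and identically distributed (i.i.d.) assumption and from opaque decision-making. By evaluating model performance under adversarial attack and comparing several interpretability methods against fracture regions annotated by an orthopedic surgeon, the paper finds that robust models yield explanations better aligned with clinically meaningful areas, suggesting that robustness encourages anatomically relevant feature prioritization. The key to the approach is treating robustness and interpretability as complementary benchmarks that bridge the gap between benchmark performance and safe, actionable clinical deployment, thereby supporting human-AI collaboration and trust.
Link: https://arxiv.org/abs/2504.05483
Authors: Mohammad Hossein Najafi, Mohammad Morsali, Mohammadreza Pashanejad, Saman Soleimani Roudi, Mohammad Norouzi, Saeed Bagheri Shouraki
Affiliations: Sharif University of Technology; Sharif University of Technology; University of Tehran; Sharif University of Technology; Shahid Beheshti University of Medical Sciences; Sharif University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: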
Abstract:Deep neural networks for medical image classification often fail to generalize consistently in clinical practice due to violations of the i.i.d. assumption and opaque decision-making. This paper examines interpretability in deep neural networks fine-tuned for fracture detection by evaluating model performance against adversarial attack and comparing interpretability methods to fracture regions annotated by an orthopedic surgeon. Our findings prove that robust models yield explanations more aligned with clinically meaningful areas, indicating that robustness encourages anatomically relevant feature prioritization. We emphasize the value of interpretability for facilitating human-AI collaboration, in which models serve as assistants under a human-in-the-loop paradigm: clinically plausible explanations foster trust, enable error correction, and discourage reliance on AI for high-stakes decisions. This paper investigates robustness and interpretability as complementary benchmarks for bridging the gap between benchmark performance and safe, actionable clinical deployment.
[CV-95] Studying Image Diffusion Features for Zero-Shot Video Object Segmentation CVPR
[Quick read]: This paper studies Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. The key is to optimize feature extraction for ZS-VOS by identifying the most suitable time step and layer of a diffusion model; the paper finds that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets. It also highlights the importance of point correspondences for accurate segmentation and reports state-of-the-art ZS-VOS results, on par with models trained on expensive image segmentation datasets.
Link: https://arxiv.org/abs/2504.05468
Authors: Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
Affiliations: Technical University of Denmark; Pioneer Center for AI; LIX, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPRW2025
Abstract:This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we yield state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.
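[CV-96] REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
[Quick read]: This paper addresses the difficulty of modeling complex visual relations that change over time in Video-Question-Answering (VideoQA). Even for advanced Video Language Models (VLMs) this remains a challenge, partly because the visual content must be represented as a reasonably sized model input. The paper proposes RElation-based Video rEpresentAtion Learning (REVEAL). Its key idea, inspired by spatiotemporal scene graphs, is to encode video sequences as sets of relation triplets of the form (subject-predicate-object), and to use a Many-to-Many Noise Contrastive Estimation (MM-NCE) objective with a Q-Former architecture to align video-derived queries with text-based relation descriptions, yielding an efficient query-based video representation for downstream VideoQA. Experiments show competitive results, particularly on tasks requiring temporal reasoning and relation comprehension.
Link: https://arxiv.org/abs/2504.05463
Authors: Sofian Chaybouti, Walid Bousselham, Moritz Wolter, Hilde Kuehne
Affiliations: Goethe University Frankfurt; Tuebingen AI Center/University of Tuebingen; University of Bonn; MIT-IBM Watson AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures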
Abstract:Video-Question-Answering (VideoQA) comprises the capturing of complex visual relation changes over time, remaining a challenge even for advanced Video Language Models (VLM), i.a., because of the need to represent the visual content to a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding them into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets in the form of (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) together with a Q-Former architecture to align an unordered set of video-derived queries with corresponding text-based relation descriptions. At inference, the resulting Q-former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. It shows that the resulting query-based video representation is able to outperform global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
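[CV-97] Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images ICLR2025
[Quick read]: This paper addresses the diminished depth perception and potential distortions that arise when a complete 3D space for dynamic scene video is built from a single image, and aims to make the output more versatile. Existing methods rely on Layered Depth Images (LDIs), an implicit representation that separates a landscape image into discrete layers; this can reduce depth perception and cause distortions under camera movement. Because the 3D space is modeled implicitly, the output is also limited to videos in the 2D domain, which reduces versatility.
The key solution is an explicit-representation framework that models 4D Gaussians from a single image to represent the complete 3D space of a dynamic scene video. The framework optimizes 3D Gaussians by generating multi-view images from the single image, then creates 3D motion to optimize the 4D Gaussians. Its central component is consistent 3D motion estimation, which estimates motion common to the multi-view images so that motion in 3D space is closer to actual motion. According to the authors, this is the first attempt to consider animation while representing a complete 3D space from a single landscape image. Experiments support the model's ability to provide realistic immersion across a variety of landscape images.
Link: https://arxiv.org/abs/2504.05458
Authors: In-Hwan Jin, Haesoo Choo, Seong-Hun Jeong, Heemoon Park, Junghwan Kim, Oh-joon Kwon, Kyeongbo Kong
Affiliations: Pusan National University; Pukyong National University; Busan Munhwa Broadcasting Corporation; Korea University; DM Studio
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025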
Abstract:To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single image animation with 3D photography. These methods use pseudo 3D space, implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, as landscapes typically consist of continuous elements, including fluids, the representation of a 3D space separates a landscape image into discrete layers, and it can lead to diminished depth perception and potential distortions depending on camera movement. Furthermore, due to its implicit modeling of 3D space, the output may be limited to videos in the 2D domain, potentially reducing their versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework is focused on optimizing 3D Gaussians by generating multi-view images from a single image and creating 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates common motion among multi-view images to bring the motion in 3D space closer to actual motions. As far as we know, this is the first attempt that considers animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results are available at this https URL.
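[CV-98] Taxonomy-Aware Evaluation of Vision-Language Models CVPR2025
[Quick read]: This paper addresses two problems: how to map the unconstrained text generated by a vision-language model (VLM) onto a given label space, and how to design a classification measure that gives partial credit to predictions that are less specific but not incorrect. The key solution is a framework based on hierarchical precision and recall that scores VLM text predictions against a predefined taxonomy. With it, the authors compute hierarchical similarity between generated text and ground-truth labels, and show that existing text similarity measures capture taxonomic similarity poorly, which motivates hierarchical evaluation.
Link: https://arxiv.org/abs/2504.05457
Authors: Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, Stella Frank
Affiliations: University of Copenhagen; ETH Zürich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025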
Abstract:When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer ‘I see a conifer,’ rather than the specific label ‘norway spruce’. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., ‘conifer’). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers (‘norway spruce’ being a type of ‘conifer’). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
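The conifer example from the abstract can be worked through with one common formulation of hierarchical precision and recall, in which each label is expanded to the set containing itself and all of its ancestors. The sketch below is illustrative only: the toy taxonomy, the function names, and the choice to include the root in the ancestor sets are assumptions, and the paper's exact definitions may differ.

```python
# Hypothetical toy taxonomy, stored as child -> parent (None marks the root).
PARENT = {
    "plant": None,
    "conifer": "plant",
    "spruce": "conifer",
    "norway spruce": "spruce",
    "pine": "conifer",
}

def ancestors(label):
    """Return the label together with all of its ancestors up to the root."""
    path = set()
    while label is not None:
        path.add(label)
        label = PARENT[label]
    return path

def hierarchical_pr(predicted, truth):
    """Hierarchical precision and recall from overlapping ancestor sets."""
    pred_set, true_set = ancestors(predicted), ancestors(truth)
    overlap = len(pred_set & true_set)
    return overlap / len(pred_set), overlap / len(true_set)
```

Under these assumptions, predicting "conifer" for a "norway spruce" image has full precision (nothing predicted is wrong) but reduced recall (the answer is not specific enough), which is the partial-credit behavior the abstract asks for.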
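[CV-99] Generative Adversarial Networks with Limited Data: A Survey and Benchmarking
[Quick read]: This paper surveys Generative Adversarial Networks (GANs), their variants, and their applications in vision tasks, focusing on the rapid performance drop when training data is limited. It analyzes state-of-the-art GANs in the limited-data regime through designed experiments and reviews methods that tackle the problem from different perspectives. The central theme is techniques and strategies that make effective use of limited data to improve GAN performance. The paper closes with open challenges and future research trends.
Link: https://arxiv.org/abs/2504.05456
Authors: Omar De Mitri, Ruyu Wang, Marco F. Huber
Affiliations: Fraunhofer IPA; unknown; unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: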
Abstract:Generative Adversarial Networks (GANs) have shown impressive results in various image synthesis tasks. Numerous studies have demonstrated that GANs are more powerful in feature and expression learning compared to other generative models and that their latent space encodes rich semantic information. However, the strong performance of GANs relies heavily on access to large-scale training data and deteriorates rapidly when the amount of data is limited. This paper aims to provide an overview of GANs, their variants, and their applications in various vision tasks, focusing on addressing the limited data issue. We analyze state-of-the-art GANs in the limited-data regime with designed experiments, and present various methods that attempt to tackle this problem from different perspectives. Finally, we further elaborate on remaining challenges and trends for future research.
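[CV-100] Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation
[Quick read]: This paper addresses learning view-invariant representations from in-the-wild videos that exhibit extreme viewpoint differences and share very little visual content. Traditional methods depend on controlled multi-view settings and struggle here. The key solution first defines a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level; it then uses those rankings to formulate a knowledge distillation objective with a novel curriculum learning procedure that pairs progressively more challenging views, allowing smooth adaptation to extreme viewpoint differences.
Link: https://arxiv.org/abs/2504.05451
Authors: Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman
Affiliations: UT Austin; FAIR, Meta; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: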
Abstract:Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks - particularly on views that exhibit severe occlusion.
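One way to picture a curriculum that "pairs incrementally more challenging views over time" is a schedule that starts with the least occluded view pairs and widens the pool as training progresses. The sketch below is a generic curriculum schedule written under that assumption; the function name, the linear pacing, and the `start_fraction` parameter are hypothetical and are not taken from the paper.

```python
def curriculum_pairs(pairs, epoch, total_epochs, start_fraction=0.25):
    """Select view pairs for an epoch, easiest first.

    pairs: list of (pair_id, occlusion_score); a lower score means an easier pair.
    """
    ordered = sorted(pairs, key=lambda p: p[1])
    # Linear pacing from start_fraction of the data up to all of it.
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    fraction = start_fraction + (1.0 - start_fraction) * progress
    count = max(1, round(fraction * len(ordered)))
    return ordered[:count]
```

With four pairs and four epochs, the first epoch would train on the single easiest pair and the last epoch on all four.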
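[CV-101] Biomechanical Constraints Assimilation in Deep-Learning Image Registration: Application to sliding and locally rigid deformations
[Quick read]: This paper addresses the lack of local structural awareness in conventional medical image registration when handling deformations of biological structures. Standard methods usually apply uniform constraints (full-domain regularization), ignoring the spatially inhomogeneous deformation properties and biomechanics of soft and hard tissues, and perform poorly in low-contrast structures. The paper proposes a learning-based registration method whose key idea is to use regularization losses inspired by solid mechanics so that the inferred deformation properties adapt locally to trained biomechanical characteristics. Specifically, local rigid displacements, shearing motions, or pseudo-elastic deformations are enforced during training, and these mechanical properties are shown to generalize well when inferring deformations between new image pairs. The networks infer tissue-specific deformation patterns directly from input images, giving mechanically plausible deformations that preserve rigidity in hard tissues and allow controlled sliding where tissues naturally separate, which captures physiological motion more faithfully.
Link: https://arxiv.org/abs/2504.05444
Authors: Ziad Kheil, Soleakhena Ken, Laurent Risser
Affiliations: Centre de Recherches en Cancérologie de Toulouse, INSERM UMR1037; Institut de Mathématiques de Toulouse (UMR 5219), CNRS, Université de Toulouse; Institut Universitaire du Cancer – Oncopole Claudius Régaud
Subjects: Computer Vision and Pattern Recognition (cs.CV)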
Comments: MSC classes: 2008 (Primary) 68U10, 68T99, 62P10, 74L15 (Secondary)
Abstract:Regularization strategies in medical image registration often take a one-size-fits-all approach by imposing uniform constraints across the entire image domain. Yet biological structures are anything but regular. Lacking structural awareness, these strategies may fail to consider a panoply of spatially inhomogeneous deformation properties, which would faithfully account for the biomechanics of soft and hard tissues, especially in poorly contrasted structures. To bridge this gap, we propose a learning-based image registration approach in which the inferred deformation properties can locally adapt themselves to trained biomechanical characteristics. Specifically, we first enforce in the training process local rigid displacements, shearing motions or pseudo-elastic deformations using regularization losses inspired from the field of solid-mechanics. We then show on synthetic and real 3D thoracic and abdominal images that these mechanical properties of different nature are well generalized when inferring the deformations between new image pairs. Our approach enables neural-networks to infer tissue-specific deformation patterns directly from input images, ensuring mechanically plausible motion. These networks preserve rigidity within hard tissues while allowing controlled sliding in regions where tissues naturally separate, more faithfully capturing physiological motion. The code is publicly available at this https URL.
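As a point of reference for what a solid-mechanics-inspired regularization loss can look like, one widely used local-rigidity penalty asks the Jacobian of the deformation to be orthogonal inside a region of interest. The expression below is a hedged illustration, not the loss used in the paper: the deformation \(\varphi(x) = x + u(x)\), the displacement field \(u\), and the hard-tissue mask \(m(x) \in [0, 1]\) are symbols introduced here as assumptions.

```latex
% Illustrative local-rigidity penalty (assumed form, not taken from the paper).
% \varphi(x) = x + u(x): deformation; m(x): hypothetical hard-tissue mask.
\[
  \mathcal{L}_{\mathrm{rigid}}
  = \int_{\Omega} m(x)\,
    \bigl\| \nabla\varphi(x)^{\top}\,\nabla\varphi(x) - I \bigr\|_{F}^{2}
    \,\mathrm{d}x
\]
```

The integrand vanishes wherever the Jacobian \(\nabla\varphi\) is orthogonal, so the penalty is zero for locally rigid motion inside the mask. Orthogonality alone also admits reflections, so practical variants often add a term that keeps \(\det \nabla\varphi\) close to 1.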
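[CV-102] EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations
[Quick read]: This paper addresses the growing difficulty of predicting the future evolution of traffic scenes over long horizons, which is caused by the multi-modal nature of agent motion. State-of-the-art prediction models focus on the most likely future and neglect the distribution of other plausible motion alternatives, which matters equally for the safe operation of autonomous vehicles. The paper proposes EP-Diffuser, a parameter-efficient diffusion-based generative model. Conditioned on road layout and agent history, it acts as a predictor that generates diverse, plausible scene continuations and so captures the distribution of possible traffic scene evolutions.
Link: https://arxiv.org/abs/2504.05422
Authors: Yue Yao, Mohamed-Khalil Bouzidi, Daniel Goehring, Joerg Reichardt
Affiliations: Continental Automotive Technologies GmbH; Dahlem Center for Machine Learning and Robotics, Freie Universitaet Berlin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: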
Abstract:As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints can be found here: this https URL.
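[CV-103] Time-adaptive Video Frame Interpolation based on Residual Diffusion
[Quick read]: This paper addresses video frame interpolation (VFI) for traditional hand-made animation, where temporal variations are much larger than in natural videos. It makes three main contributions. First, it handles the interpolation time explicitly and re-estimates it during training to cope with those variations. Second, it adapts and generalizes ResShift, a diffusion scheme from the super-resolution community, to VFI, so that estimates need only about 10 diffusion steps. Third, it exploits the stochastic nature of the diffusion process to provide pixel-wise uncertainty estimates for the interpolated frame, which helps anticipate where the model may be wrong. With these contributions, the model outperforms state-of-the-art methods on animation videos.
Link: https://arxiv.org/abs/2504.05402
Authors: Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet
Affiliations: Centro de Investigación en Matemáticas (CIMAT); University of Guanajuato (Universidad de Guanajuato)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages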
Abstract:In this work, we propose a new diffusion-based method for video frame interpolation (VFI), in the context of traditional hand-made animation. We introduce three main contributions: The first is that we explicitly handle the interpolation time in our model, which we also re-estimate during the training process, to cope with the particularly large variations observed in the animation domain, compared to natural videos; The second is that we adapt and generalize a diffusion scheme called ResShift recently proposed in the super-resolution community to VFI, which allows us to perform a very low number of diffusion steps (in the order of 10) to produce our estimates; The third is that we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos.
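The third contribution relies on the fact that a stochastic sampler returns a slightly different frame on each run. A simple way to turn several such samples into a pixel-wise uncertainty map is to take the per-pixel standard deviation, as in the sketch below. The abstract does not say which statistic the authors use, so the standard deviation, the function name, and the nested-list frame format are assumptions for illustration.

```python
from statistics import mean, pstdev

def pixelwise_mean_and_std(samples):
    """Combine several sampled frames into a mean frame and an uncertainty map.

    samples: list of frames, each a list of rows of pixel intensities.
    """
    height, width = len(samples[0]), len(samples[0][0])
    mean_frame = [
        [mean([s[y][x] for s in samples]) for x in range(width)]
        for y in range(height)
    ]
    # Population standard deviation across samples, computed per pixel.
    std_frame = [
        [pstdev([s[y][x] for s in samples]) for x in range(width)]
        for y in range(height)
    ]
    return mean_frame, std_frame
```

Pixels where the samples disagree get a high value in `std_frame`, flagging regions where the interpolated frame is less trustworthy.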
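[CV-104] GARF: Learning Generalizable 3D Reassembly for Real-World Fractures
[Quick read]: This paper addresses the limited generalization of 3D reassembly to the real world: models trained on large-scale synthetic data transfer poorly to real objects with complex fracture patterns. The key solution is GARF, a generalizable 3D reassembly framework that uses fracture-aware pretraining to learn fracture features from individual fragments and flow matching for precise 6-DoF alignment. At inference time, a one-step preassembly improves robustness to unseen objects and varying numbers of fractures. The authors also curate Fractura, a dataset built with archaeologists, paleoanthropologists, and ornithologists, and use it to evaluate the method on a range of real-world fracture types.
Link: https://arxiv.org/abs/2504.05400
Authors: Sihang Li, Zeyu Jiang, Grace Chen, Chenyang Xu, Siqi Tan, Xue Wang, Irving Fang, Kristof Zyskowski, Shannon P. McPherron, Radu Iovita, Chen Feng, Jing Zhang
Affiliations: New York University; Yale University; Max Planck Institute for Evolutionary Anthropology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures. Project Page this https URL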
Abstract:3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate Fractura, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types.
[CV-105] A Nature-Inspired Colony of Artificial Intelligence System with Fast Detailed and Organized Learner Agents for Enhancing Diversity and Quality
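[Quick read]: This paper asks how to build a multi-agent system (MAS) based on convolutional neural networks (CNNs) that acts as a single system and performs multiple tasks (such as prediction or classification) in an environment. The key is a role-based approach with three kinds of AI agents (fast learners, detailed learners, and organized learners), combined with the crossover and mutation mechanisms of genetic algorithms to increase the diversity and quality of the system. The three learner roles are mapped one-to-one to the pretrained VGG16, VGG19, and ResNet50 models. Agents evolve through "intra-marriage" and "inter-marriage" of AI, sharing learned knowledge to produce diversified child-AI agents, which yields a colony of multi-model and mixture-model agents. In simulations the child agents reach F1-scores between 82% and 95%.
Link: https://arxiv.org/abs/2504.05365
Authors: Shan Suthaharan
Affiliations: Department of Computer Science, UNC Greensboro
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 12 pages, 8 figures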
Abstract:The concepts of convolutional neural networks (CNNs) and multi-agent systems are two important areas of research in artificial intelligence (AI). In this paper, we present an approach that builds a CNN-based colony of AI agents to serve as a single system and perform multiple tasks (e.g., predictions or classifications) in an environment. The proposed system impersonates the natural environment of a biological system, like an ant colony or a human colony. The proposed colony of AI that is defined as a role-based system uniquely contributes to accomplish tasks in an environment by incorporating AI agents that are fast learners, detailed learners, and organized learners. These learners can enhance their localized learning and their collective decisions as a single system of colony of AI agents. This approach also enhances the diversity and quality of the colony of AI with the help of Genetic Algorithms and their crossover and mutation mechanisms. The evolution of fast, detailed, and organized learners in the colony of AI is achieved by introducing a unique one-to-one mapping between these learners and the pretrained VGG16, VGG19, and ResNet50 models, respectively. This role-based approach creates two parent-AI agents using the AI models through the processes, called the intra- and inter-marriage of AI, so that they can share their learned knowledge (weights and biases) based on a probabilistic rule and produce diversified child-AI agents to perform new tasks. This process will form a colony of AI that consists of families of multi-model and mixture-model AI agents to improve diversity and quality. Simulations show that the colony of AI, built using the VGG16, VGG19, and ResNet50 models, can provide a single system that generates child-AI agents of excellent predictive performance, ranging between 82% and 95% of F1-scores, to make diversified collective and quality decisions on a task.
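The abstract says parent agents share weights and biases "based on a probabilistic rule" but does not state the rule. As a minimal sketch only, assuming a uniform crossover with optional Gaussian mutation over two flattened weight lists of equal length (the function name, parameters, and the rule itself are illustrative assumptions, not the paper's procedure):

```python
import random

def crossover(weights_a, weights_b, p_from_a=0.5,
              mutation_rate=0.0, mutation_scale=0.01, seed=0):
    """Hypothetical uniform crossover between two parent weight lists.

    Each child weight is copied from parent A with probability p_from_a,
    otherwise from parent B; a small Gaussian perturbation is then added
    with probability mutation_rate.
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("parents must have the same number of weights")
    rng = random.Random(seed)
    child = []
    for wa, wb in zip(weights_a, weights_b):
        w = wa if rng.random() < p_from_a else wb   # crossover step
        if rng.random() < mutation_rate:            # mutation step
            w += rng.gauss(0.0, mutation_scale)
        child.append(w)
    return child

parent_a = [0.0] * 8
parent_b = [1.0] * 8
child = crossover(parent_a, parent_b)
```

Crossing parents with different architectures (for example VGG16 with ResNet50) would need a layer-matching scheme that this sketch does not attempt.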
[CV-106] MASS: MoErging through Adaptive Subspace Selection
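[Quick read]: Existing model merging methods are lightweight but fall short of the accuracy of separately fine-tuned endpoints. The paper proposes MASS (MoErging through Adaptive Subspace Selection). Building on a low-rank decomposition of per-task updates, it stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination of subspaces) best explains an input's intermediate features and activates the corresponding task-specific block. The procedure needs no additional training and adds only a two-pass inference overhead and roughly 2× storage relative to a single pretrained model. It reaches near state-of-the-art performance across tasks and recovers up to about 98% of the average accuracy of the individually fine-tuned models, making it a practical, low-cost alternative to ensembling.
Link: https://arxiv.org/abs/2504.05342
Authors: Donato Crisostomi,Alessandro Zirilli,Antonio Andrea Gargiulo,Maria Sofia Bucarelli,Simone Scardapane,Fabrizio Silvestri,Iacopo Masi,Emanuele Rodolà
Affiliations: Sapienza University of Rome
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: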
Abstract:Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input’s intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2× storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
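The low-rank storage and routing steps can be written compactly. The following is only a sketch under assumptions: the symbols $W_0$ (pretrained weights), $W_t$ (weights fine-tuned on task $t$), $k$ (retained rank), and $z$ (an intermediate feature) are introduced here for illustration, and the projection-residual rule is one possible reading of "best explains", not a rule confirmed by the abstract.

```latex
\Delta_t = W_t - W_0 = U_t \Sigma_t V_t^{\top}
\;\approx\; U_{t,k}\, \Sigma_{t,k}\, V_{t,k}^{\top}
= \sum_{i=1}^{k} \sigma_{t,i}\, u_{t,i}\, v_{t,i}^{\top},
\qquad
t^{\star}(z) = \arg\min_{t} \bigl\lVert z - V_{t,k} V_{t,k}^{\top} z \bigr\rVert_2
```

Under this reading, the first pass would produce $z$ with the merged weights, and the second pass would rerun the layer with the block selected by $t^{\star}(z)$ activated.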
[CV-107] Scale Up Composed Image Retrieval Learning via Modification Text Generation
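[Quick read]: Composed Image Retrieval (CIR), where the query combines a reference image with modification text, is held back by limited training data and laborious triplet annotation. The paper proposes synthesizing training triplets to enlarge the training resources for CIR. The key is to train a modification text generator with large-scale multimodal models and to scale up CIR learning in both pretraining and fine-tuning. During pretraining, the generator directly creates Modification Text-oriented Synthetic Triplets (MTST) conditioned on image pairs. During fine-tuning, reverse modification text is first synthesized to connect the target image back to the reference image; a two-hop alignment strategy then incrementally closes the semantic gap between the multimodal pair and the target image. Concretely, an implicit prototype is learned from the original triplet and its reversed version in a cycle manner, and the prototype feature is combined with the modification text for accurate alignment with the target image. Experiments validate the synthetic triplets and show competitive recall on the CIRR and FashionIQ benchmarks.
Link: https://arxiv.org/abs/2504.05316
Authors: Yinan Zhou,Yaxiong Wang,Haokun Lin,Chen Ma,Li Zhu,Zhedong Zheng
Affiliations: School of Electronics and Information Engineering, Xi’an Jiaotong University, China; School of Electronics and Information Engineering, Hefei University of Technology, China; School of Artificial Intelligence, University of the Chinese Academy of Sciences, China; Department of Computer Science, City University of Hong Kong, China; Faculty of Science and Technology, and Institute of Collaborative Innovation, University of Macau, China
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures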
Abstract:Composed Image Retrieval (CIR) aims to search an image of interest using a combination of a reference image and modification text as the query. Despite recent advancements, this task remains challenging due to limited training data and laborious triplet annotation processes. To address this issue, this paper proposes to synthesize the training triplets to augment the training resource for the CIR problem. Specifically, we commence by training a modification text generator exploiting large-scale multimodal models and scale up the CIR learning throughout both the pretraining and fine-tuning stages. During pretraining, we leverage the trained generator to directly create Modification Text-oriented Synthetic Triplets(MTST) conditioned on pairs of images. For fine-tuning, we first synthesize reverse modification text to connect the target image back to the reference image. Subsequently, we devise a two-hop alignment strategy to incrementally close the semantic gap between the multimodal pair and the target image. We initially learn an implicit prototype utilizing both the original triplet and its reversed version in a cycle manner, followed by combining the implicit prototype feature with the modification text to facilitate accurate alignment with the target image. Extensive experiments validate the efficacy of the generated triplets and confirm that our proposed methodology attains competitive recall on both the CIRR and FashionIQ benchmarks.
[CV-108] HRMedSeg: Unlocking High-resolution Medical Image segmentation via Memory-efficient Attention Modeling
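[Quick read]: Existing transformer-based encoder-decoder frameworks consume large amounts of memory when predicting large segmentation masks for high-resolution medical images, which limits their use in real-world scenarios. The paper proposes HRMedSeg, a memory-efficient framework for high-resolution medical image segmentation. It has two key parts: a lightweight gated vision transformer (LGViT) as the image encoder, which models long-range dependencies with linear complexity, and an efficient cross-multiscale decoder (ECM-Decoder), which generates high-resolution segmentation masks. Feature distillation during pretraining further unlocks the model's potential. Experiments show that HRMedSeg outperforms state-of-the-art methods on diverse high-resolution medical image segmentation tasks while using only 0.59GB of GPU memory per batch during fine-tuning.
Link: https://arxiv.org/abs/2504.06205
Authors: Qing Xu,Zhenye Lou,Chenxin Li,Xiangjian He,Rong Qu,Tesema Fiseha Berhanu,Yi Wang,Wenting Duan,Zhen Chen
Affiliations: UNNC; Sichuan University; CUHK; University of Nottingham; Dalian University of Technology; University of Lincoln; HKISI, CAS
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review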
Abstract:High-resolution segmentation is critical for precise disease diagnosis by extracting micro-imaging information from medical images. Existing transformer-based encoder-decoder frameworks have demonstrated remarkable versatility and zero-shot performance in medical segmentation. While beneficial, they usually require huge memory costs when handling large-size segmentation mask predictions, which are expensive to apply to real-world scenarios. To address this limitation, we propose a memory-efficient framework for high-resolution medical image segmentation, called HRMedSeg. Specifically, we first devise a lightweight gated vision transformer (LGViT) as our image encoder to model long-range dependencies with linear complexity. Then, we design an efficient cross-multiscale decoder (ECM-Decoder) to generate high-resolution segmentation masks. Moreover, we utilize feature distillation during pretraining to unleash the potential of our proposed model. Extensive experiments reveal that HRMedSeg outperforms state-of-the-arts in diverse high-resolution medical image segmentation tasks. In particular, HRMedSeg uses only 0.59GB GPU memory per batch during fine-tuning, demonstrating low training costs. Besides, when HRMedSeg meets the Segment Anything Model (SAM), our HRMedSegSAM takes 0.61% parameters of SAM-H. The code is available at this https URL.
[CV-109] Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion
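[Quick read]: The paper targets two problems in medical image segmentation: feature extraction degraded by variations in morphology and staining, and, when data samples are limited, the difficulty end-to-end methods have in transferring multiscale features effectively between encoders and decoders. It proposes a nested UNet architecture that captures local and global context through Multiscale Feature Fusion and Attention Mechanisms. The design improves the integration of encoder features, highlights key channels and regions, and restores spatial details to improve segmentation. Experiments on four datasets show that the method surpasses current state-of-the-art approaches.
Link: https://arxiv.org/abs/2504.06158
Authors: Saad Wazir,Daeyoung Kim
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the Proceedings of the 2024 International Conference on Medical Imaging and Computer-Aided Diagnosis (MICAD 2024), Lecture Notes in Electrical Engineering (LNEE), Volume 1372, Springer Nature, Singapore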
Abstract:Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: this https URL
[CV-110] Under-Sampled High-Dimensional Data Recovery via Symbiotic Multi-Prior Tensor Reconstruction
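[Quick read]: The paper addresses tensor reconstruction of high-dimensional data with missing entries when the sampling rate is extremely low. It proposes a reconstruction method that integrates multiple priors: learnable tensor decomposition to enforce a low-rank constraint on the reconstructed data, a pre-trained convolutional neural network for smoothing and denoising, and block-matching and 3D filtering (BM3D) regularization to enhance non-local similarity. An alternating direction method of multipliers (ADMM) algorithm decomposes the optimization problem into three subproblems for efficient solution. Experiments on color images, hyperspectral images, and grayscale videos show that the method outperforms state-of-the-art methods in extreme cases.
Link: https://arxiv.org/abs/2504.05992
Authors: Jie Yang,Chang Su,Yuhan Zhang,Jianjun Zhu,Jianli Wang
Affiliations: Jiangsu Key Laboratory for Design and Manufacture of Micro-Nano Biomedical Instruments, Department of Mechanical Engineering, Southeast University, Nanjing 210096, China; College of Mechanical and Transportation Engineering, China University of Petroleum, Beijing 102249, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: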
Abstract:The advancement of sensing technology has driven the widespread application of high-dimensional data. However, issues such as missing entries during acquisition and transmission negatively impact the accuracy of subsequent tasks. Tensor reconstruction aims to recover the underlying complete data from under-sampled observed data by exploring prior information in high-dimensional data. However, due to insufficient exploration, reconstruction methods still face challenges when sampling rate is extremely low. This work proposes a tensor reconstruction method integrating multiple priors to comprehensively exploit the inherent structure of the data. Specifically, the method combines learnable tensor decomposition to enforce low-rank constraints of the reconstructed data, a pre-trained convolutional neural network for smoothing and denoising, and block-matching and 3D filtering regularization to enhance the non-local similarity in the reconstructed data. An alternating direction method of the multipliers algorithm is designed to decompose the resulting optimization problem into three subproblems for efficient resolution. Extensive experiments on color images, hyperspectral images, and grayscale videos datasets demonstrate the superiority of our method in extreme cases as compared with state-of-the-art methods.
[CV-111] AI analysis of medical images at scale as a health disparities probe: a feasibility demonstration using chest radiographs
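[Quick read]: The paper explores using medical images as a data source for evaluating phenotypes related to health disparities, in order to strengthen research on potential associations between social determinants of health (SDOH) and health disparities. The key is a pipeline that automatically extracts quantitative measures from medical images and uses them as inputs to health disparities index calculations. An established deep learning model estimates the likelihood of severe disease within the lung parenchyma; the measurements from each image type are merged into a single numerical image-based phenotype per patient; patients are separated into phenogroups by unsupervised clustering; and a health rate defined for each phenogroup is used to compute four imaging-derived health disparities indices (iHDIs). The results support medical images as a novel data source for probing health disparities.
Link: https://arxiv.org/abs/2504.05990
Authors: Heather M. Whitney,Hui Li,Karen Drukker,Elbert Huang,Maryellen L. Giger
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 4 figures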
Abstract:Health disparities (differences in non-genetic conditions that influence health) can be associated with differences in burden of disease by groups within a population. Social determinants of health (SDOH) are domains such as health care access, dietary access, and economics frequently studied for potential association with health disparities. Evaluating SDOH-related phenotypes using routine medical images as data sources may enhance health disparities research. We developed a pipeline for using quantitative measures automatically extracted from medical images as inputs into health disparities index calculations. Our study focused on the use case of two SDOH demographic correlates (sex and race) and data extracted from chest radiographs of 1,571 unique patients. The likelihood of severe disease within the lung parenchyma from each image type, measured using an established deep learning model, was merged into a single numerical image-based phenotype for each patient. Patients were then separated into phenogroups by unsupervised clustering of the image-based phenotypes. The health rate for each phenogroup was defined as the median image-based phenotype for each SDOH used as inputs to four imaging-derived health disparities indices (iHDIs): one absolute measure (between-group variance) and three relative measures (index of disparity, Theil index, and mean log deviation). The iHDI measures demonstrated feasible values for each SDOH demographic correlate, showing potential for medical images to serve as a novel probe for health disparities. Large-scale AI analysis of medical images can serve as a probe for a novel data source for health disparities research.
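The four index families named in the abstract have standard textbook forms. The sketch below computes them for a list of group health rates and population shares. It assumes the population-weighted mean rate as the reference for the index of disparity (other references, such as the best group rate, are also used), and the function and variable names are hypothetical; the paper's exact definitions may differ.

```python
import math

def disparity_indices(rates, shares):
    """Summary disparity indices for per-group rates.

    rates:  positive health rate of each group
    shares: population share of each group (should sum to 1)
    """
    mu = sum(p * y for p, y in zip(shares, rates))  # weighted mean rate
    # Absolute measure: between-group variance
    bgv = sum(p * (y - mu) ** 2 for p, y in zip(shares, rates))
    # Relative measures
    idisp = 100.0 * sum(abs(y - mu) for y in rates) / len(rates) / mu
    theil = sum(p * (y / mu) * math.log(y / mu) for p, y in zip(shares, rates))
    mld = sum(p * math.log(mu / y) for p, y in zip(shares, rates))
    return {"bgv": bgv, "idisp": idisp, "theil": theil, "mld": mld}

# Two equally sized groups with illustrative (made-up) rates
example = disparity_indices([0.2, 0.4], [0.5, 0.5])
```

All four indices are zero when every group has the same rate and grow as the rates diverge, which is what makes them usable as a probe.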
[CV-112] AVP-AP: Self-supervised Automatic View Positioning in 3D cardiac CT via Atlas Prompting
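[Quick read]: The paper addresses automatic view positioning in cardiac CT, in particular positioning semantic 2D slices of arbitrary orientation within the varying coordinate space of an arbitrary 3D volume. Existing methods rely on time-consuming manual annotation to train view-specific models and can predict only a fixed set of planes, which falls short of real clinical needs.
The key is AVP-AP, a new framework that is the first to use Atlas Prompting for self-supervised automatic view positioning in 3D CT volumes. First, an atlas prompting method generates a 3D canonical atlas and trains a network, in a self-supervised manner, to map slices to their positions in atlas space. Then, guided by the atlas prompts that correspond to the query images in a reference CT, a rigid transformation between the 3D atlas and the target CT volume gives coarse slice positions in the target volume, which shrinks the search space. Finally, the coarse positions are refined by maximizing the similarity between predicted slices and query images in the feature space of a given foundation model. According to the abstract, the framework outperforms other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieves 9% SSIM in the two-chamber view compared to four radiologists; experiments on a public dataset support its generalizability.
Link: https://arxiv.org/abs/2504.05966
Authors: Xiaolin Fan,Yan Wang,Yingying Zhang,Mingkun Bao,Bosen Jia,Dong Lu,Yifan Gu,Jian Cheng,Haogang Zhu
Affiliations: School of Instrumentation and Optoelectronic Engineering, Beihang University; School of Computer Science and Engineering, Beihang University; School of Biological Sciences, Victoria University of Wellington; Hangzhou International Innovation Institute, Beihang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures, published to TMI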
Abstract:Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. However, in real clinical scenarios, the challenge of positioning semantic 2D slices with any orientation into varying coordinate space in arbitrary 3D volume remains unsolved. We thus introduce a novel framework, AVP-AP, the first to use Atlas Prompting for self-supervised Automatic View Positioning in the 3D CT volume. Specifically, this paper first proposes an atlas prompting method, which generates a 3D canonical atlas and trains a network to map slices into their corresponding positions in the atlas space via a self-supervised manner. Then, guided by atlas prompts corresponding to the given query images in a reference CT, we identify the coarse positions of slices in the target CT volume using rigid transformation between the 3D atlas and target CT volume, effectively reducing the search space. Finally, we refine the coarse positions by maximizing the similarity between the predicted slices and the query images in the feature space of a given foundation model. Our framework is flexible and efficient compared to other methods, outperforming other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieving 9% SSIM in two-chamber view compared to four radiologists. Meanwhile, experiments on a public dataset validate our framework’s generalizability.
[CV-113] Diabetic Retinopathy Detection Based on Convolutional Neural Networks with SMOTE and CLAHE Techniques Applied to Fundus Images
[Quick read]: The paper evaluates the accuracy of artificial intelligence (AI) in diagnosing diabetic retinopathy (DR). The approach applies the Synthetic Minority Over-sampling Technique (SMOTE) together with a convolutional neural network (CNN) to fundus images from the public APTOS 2019 Blindness Detection dataset to identify DR and its severity stages. Binary classification (normal vs. DR) reaches 99.55% accuracy, and multiclass classification (No DR, Mild, Moderate, Severe, Proliferative DR) reaches 95.26% accuracy. The authors conclude that the method has significant potential to improve DR diagnosis accuracy compared with traditional human analysis.
Link: https://arxiv.org/abs/2504.05696
Authors: Sidhiq Mardianta,Affandy,Catur Supriyanto,Adi Wijaya
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 6 pages, 6 figures, 2 tables
Abstract:Diabetic retinopathy (DR) is one of the major complications in diabetic patients’ eyes, potentially leading to permanent blindness if not detected timely. This study aims to evaluate the accuracy of artificial intelligence (AI) in diagnosing DR. The method employed is the Synthetic Minority Over-sampling Technique (SMOTE) algorithm, applied to identify DR and its severity stages from fundus images using the public dataset “APTOS 2019 Blindness Detection.” Literature was reviewed via ScienceDirect, ResearchGate, Google Scholar, and IEEE Xplore. Classification results using Convolutional Neural Network (CNN) showed the best performance for the binary classes normal (0) and DR (1) with an accuracy of 99.55%, precision of 99.54%, recall of 99.54%, and F1-score of 99.54%. For the multiclass classification No_DR (0), Mild (1), Moderate (2), Severe (3), Proliferate_DR (4), the accuracy was 95.26%, precision 95.26%, recall 95.17%, and F1-score 95.23%. Evaluation using the confusion matrix yielded results of 99.68% for binary classification and 96.65% for multiclass. This study highlights the significant potential in enhancing the accuracy of DR diagnosis compared to traditional human analysis
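SMOTE balances classes by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows that core step on small feature vectors in pure Python. It is an illustration under assumptions: the abstract does not say whether the authors oversample raw pixels or extracted features, and the function name and parameters are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
```

In practice, oversampling is normally applied to the training split only, so that synthetic points derived from test samples cannot inflate the reported metrics.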
[CV-114] POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction
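[Quick read]: In dynamic-scene 3D reconstruction, geometry estimation and matching are handled by separate modules, which leads to ambiguous matching in dynamic regions and to interference from camera and object motion. The paper proposes POMATO, a unified framework that marries pointmap matching with temporal motion modeling. The method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across views to 3D pointmaps in a unified coordinate system. A temporal motion module then ensures scale consistency across frames and improves tasks that need both precise geometry and reliable matching, such as video depth estimation, 3D point tracking, and pose estimation.
Link: https://arxiv.org/abs/2504.05692
Authors: Songyan Zhang,Yongtao Ge,Jinyuan Tian,Guangkai Xu,Hao Chen,Chen Lv,Chunhua Shen
Affiliations: Nanyang Technological University; Zhejiang University; The University of Adelaide
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: code: this https URL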
Abstract:3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at this https URL.
[CV-115] CTI-Unet: Cascaded Threshold Integration for Improved U-Net Segmentation of Pathology Images
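[Quick read]: Conventional segmentation models for chronic kidney disease (CKD) pathology images depend on delicate tuning of a single threshold, which limits efficiency and accuracy. The paper proposes the Cascaded Threshold-Integrated U-Net (CTI-Unet). By sequentially integrating segmentation outputs obtained at multiple thresholds, it balances noise suppression against the preservation of fine structural detail. On the KPIs2024 dataset, CTI-Unet outperforms state-of-the-art architectures such as nnU-Net, Swin-Unet, and CE-Net, offering a more robust and flexible framework for kidney pathology image segmentation.
Link: https://arxiv.org/abs/2504.05640
Authors: Mingyang Zhu,Yuqiu Liang,Jiacheng Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: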
Abstract:Chronic kidney disease (CKD) is a growing global health concern, necessitating precise and efficient image analysis to aid diagnosis and treatment planning. Automated segmentation of kidney pathology images plays a central role in facilitating clinical workflows, yet conventional segmentation models often require delicate threshold tuning. This paper proposes a novel Cascaded Threshold-Integrated U-Net (CTI-Unet) to overcome the limitations of single-threshold segmentation. By sequentially integrating multiple thresholded outputs, our approach can reconcile noise suppression with the preservation of finer structural details. Experiments on the challenging KPIs2024 dataset demonstrate that CTI-Unet outperforms state-of-the-art architectures such as nnU-Net, Swin-Unet, and CE-Net, offering a robust and flexible framework for kidney pathology image segmentation.
[CV-116] A Multi-Modal AI System for Screening Mammography: Integrating 2D and 3D Imaging to Improve Breast Cancer Detection in a Prospective Clinical Study
[Quick read]: Although digital breast tomosynthesis (DBT) improves diagnostic performance, false-positive recalls remain a concern in breast cancer screening. The paper develops a multi-modal AI system that integrates full-field digital mammography (FFDM), synthetic mammography, and DBT to provide breast-level predictions and bounding-box localization of suspicious findings. Trained on approximately 500,000 exams (over 750,000 in an improved version), the system showed the capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity. The results point to the value of using all available imaging modalities and of scaling up the training set.
Link: https://arxiv.org/abs/2504.05636
Authors: Jungkyu Park,Jan Witowski,Yanqi Xu,Hari Trivedi,Judy Gichoya,Beatrice Brown-Mulry,Malte Westerhoff,Linda Moy,Laura Heacock,Alana Lewin,Krzysztof J. Geras
Affiliations: Vilcek Institute of Graduate Biomedical Sciences, NYU Grossman School of Medicine, New York, NY, USA; Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA; Center for Data Science, New York University, New York, NY, USA; HITI Lab, Emory University, Atlanta, GA, USA; Visage Imaging, Inc., San Diego, CA, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Although digital breast tomosynthesis (DBT) improves diagnostic performance over full-field digital mammography (FFDM), false-positive recalls remain a concern in breast cancer screening. We developed a multi-modal artificial intelligence system integrating FFDM, synthetic mammography, and DBT to provide breast-level predictions and bounding-box localizations of suspicious findings. Our AI system, trained on approximately 500,000 mammography exams, achieved 0.945 AUROC on an internal test set. It demonstrated capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity, underscoring its potential to improve clinical workflows. External validation confirmed strong generalizability, reducing the gap to a perfect AUROC by 35.31%-69.14% relative to strong baselines. In prospective deployment across 18 sites, the system reduced recall rates for low-risk cases. An improved version, trained on over 750,000 exams with additional labels, further reduced the gap by 18.86%-56.62% across large external datasets. Overall, these results underscore the importance of utilizing all available imaging modalities, demonstrate the potential for clinical impact, and indicate feasibility of further reduction of the test error with increased training set when using large-capacity neural networks.
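The phrase "reducing the gap to a perfect AUROC" can be read as a relative reduction of the error 1 − AUROC. The sketch below shows that arithmetic under this reading, which is an assumption about the authors' metric; the baseline value 0.90 is a made-up illustration, not a number from the paper.

```python
def gap_reduction(auroc_baseline, auroc_new):
    """Percent reduction of the gap to a perfect AUROC of 1.0."""
    gap_base = 1.0 - auroc_baseline
    gap_new = 1.0 - auroc_new
    return 100.0 * (gap_base - gap_new) / gap_base

# Hypothetical baseline of 0.90 against the reported internal AUROC of 0.945:
# the gap shrinks from 0.10 to 0.055, a reduction of about 45%.
reduction = gap_reduction(0.90, 0.945)
```

Expressed this way, the same absolute AUROC gain counts for more when the baseline is already close to 1.0.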
[CV-117] Class Imbalance Correction for Improved Universal Lesion Detection and Tagging in CT MICCAI
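[Quick read]: The DeepLesion dataset has missing measurements and lesion tags and a severe class imbalance, which may limit its usefulness for universal lesion detection. The paper trains a VFNet model and tests three data-balancing strategies: 1) balancing by body part label; 2) balancing by number of lesions per patient; 3) balancing by lesion size. Balancing by body part label increased lesion detection sensitivity in classes with little data (bone, kidney, soft tissue, and pelvis). Balancing by lesion size also improved recall for all classes. The paper additionally proposes a structured reporting guideline for a "Lesions" subsection of a radiology report.
Link: https://arxiv.org/abs/2504.05591
Authors: Peter D. Erickson,Tejas Sudharshan Mathai,Ronald M. Summers
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at MICCAI MILLAND Workshop 2022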
Abstract:Radiologists routinely detect and size lesions in CT to stage cancer and assess tumor burden. To potentially aid their efforts, multiple lesion detection algorithms have been developed with a large public dataset called DeepLesion (32,735 lesions, 32,120 CT slices, 10,594 studies, 4,427 patients, 8 body part labels). However, this dataset contains missing measurements and lesion tags, and exhibits a severe imbalance in the number of lesions per label category. In this work, we utilize a limited subset of DeepLesion (6%, 1331 lesions, 1309 slices) containing lesion annotations and body part label tags to train a VFNet model to detect lesions and tag them. We address the class imbalance by conducting three experiments: 1) Balancing data by the body part labels, 2) Balancing data by the number of lesions per patient, and 3) Balancing data by the lesion size. In contrast to a randomly sampled (unbalanced) data subset, our results indicated that balancing the body part labels always increased sensitivity for lesions = 1cm for classes with low data quantities (Bone: 80% vs. 46%, Kidney: 77% vs. 61%, Soft Tissue: 70% vs. 60%, Pelvis: 83% vs. 76%). Similar trends were seen for three other models tested (FasterRCNN, RetinaNet, FoveaBox). Balancing data by lesion size also helped the VFNet model improve recalls for all classes in contrast to an unbalanced dataset. We also provide a structured reporting guideline for a "Lesions" subsection to be entered into the "Findings" section of a radiology report. To our knowledge, we are the first to report the class imbalance in DeepLesion, and have taken data-driven steps to address it in the context of joint lesion detection and tagging.
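Balancing a training subset by label can be sketched as drawing the same number of samples from every class. The abstract does not say whether the authors undersample, oversample, or mix both, so the rule below (sample without replacement when a class is large enough, with replacement otherwise) and all names are assumptions for illustration.

```python
import random
from collections import defaultdict

def balance_by_label(samples, label_of, per_class=None, seed=0):
    """Return a subset with the same number of samples for every label.

    per_class defaults to the size of the smallest class. Larger classes
    are subsampled; smaller ones are oversampled with replacement.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[label_of(s)].append(s)
    n = per_class or min(len(g) for g in groups.values())
    balanced = []
    for label in sorted(groups):
        g = groups[label]
        if len(g) >= n:
            balanced.extend(rng.sample(g, n))      # undersample
        else:
            balanced.extend(rng.choices(g, k=n))   # oversample
    rng.shuffle(balanced)
    return balanced

# Toy lesion records: (body part label, lesion id)
lesions = [("bone", i) for i in range(3)] + [("lung", i) for i in range(10)]
subset = balance_by_label(lesions, label_of=lambda s: s[0])
```

The same helper covers the other two experiments by changing `label_of`, for example to a lesion-size bin or a lesions-per-patient bucket.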
[CV-118] A Novel Approach to Linking Histology Images with DNA Methylation
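[Quick read]: Conventional DNA methylation assays are costly and slow, which limits their availability in routine clinical practice. Exploring the potential relationship between whole slide images (WSIs) and DNA methylation patterns, the paper proposes an end-to-end, graph neural network based weakly supervised learning framework that predicts the methylation state of gene groups exhibiting coherent patterns across samples. Validated on multiple TCGA cohorts, the approach improves AUROC over state-of-the-art methods by more than 20%. Gene set enrichment analyses and spatially enriched heatmaps further link histological patterns to gene group methylation states.
Link: https://arxiv.org/abs/2504.05403
Authors: Manahil Raza,Muhammad Dawood,Talha Qaiser,Nasir M. Rajpoot
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: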
Abstract:DNA methylation is an epigenetic mechanism that regulates gene expression by adding methyl groups to DNA. Abnormal methylation patterns can disrupt gene expression and have been linked to cancer development. To quantify DNA methylation, specialized assays are typically used. However, these assays are often costly and have lengthy processing times, which limits their widespread availability in routine clinical practice. In contrast, whole slide images (WSIs) for the majority of cancer patients can be more readily available. As such, given the ready availability of WSIs, there is a compelling need to explore the potential relationship between WSIs and DNA methylation patterns. To address this, we propose an end-to-end graph neural network based weakly supervised learning framework to predict the methylation state of gene groups exhibiting coherent patterns across samples. Using data from three cohorts from The Cancer Genome Atlas (TCGA) - TCGA-LGG (Brain Lower Grade Glioma), TCGA-GBM (Glioblastoma Multiforme) (n=729) and TCGA-KIRC (Kidney Renal Clear Cell Carcinoma) (n=511) - we demonstrate that the proposed approach achieves significantly higher AUROC scores than the state-of-the-art (SOTA) methods, by more than 20%. We conduct gene set enrichment analyses on the gene groups and show that majority of the gene groups are significantly enriched in important hallmarks and pathways. We also generate spatially enriched heatmaps to further investigate links between histological patterns and DNA methylation states. To the best of our knowledge, this is the first study that explores association of spatially resolved histological patterns with gene group methylation states across multiple cancer types using weakly supervised deep learning.
Artificial Intelligence
[AI-0] GOLLuM: Gaussian Process Optimized LLMs – Reframing LLM Finetuning through Bayesian Optimization
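[Quick Read]: This paper addresses the challenge of applying large language models (LLMs) to optimization under uncertainty, in particular how to exploit the complex relationships encoded in their latent spaces. It proposes an architecture that reframes LLM finetuning as Gaussian process (GP) marginal likelihood optimization via deep kernel methods. The key is an LLM-based deep kernel optimized jointly with the GP so as to keep the strengths of both: the LLM provides a rich, flexible input space for Bayesian optimization, while the GP models that space with predictive uncertainty for more efficient sampling. On Buchwald-Hartwig reaction optimization the method nearly doubles the discovery rate of high-performing reactions relative to static LLM embeddings, and evaluation across 19 benchmarks shows robust, general, and consistent improvements.
Link: https://arxiv.org/abs/2504.06265
Authors: Bojana Ranković,Philippe Schwaller
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: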
Abstract:Large Language Models (LLMs) can encode complex relationships in their latent spaces, yet harnessing them for optimization under uncertainty remains challenging. We address this gap with a novel architecture that reframes LLM finetuning as Gaussian process (GP) marginal likelihood optimization via deep kernel methods. We introduce LLM-based deep kernels, jointly optimized with GPs to preserve the benefits of both - LLMs to provide a rich and flexible input space for Bayesian optimization and - GPs to model this space with predictive uncertainty for more efficient sampling. Applied to Buchwald-Hartwig reaction optimization, our method nearly doubles the discovery rate of high-performing reactions compared to static LLM embeddings (from 24% to 43% coverage of the top 5% reactions in just 50 optimization iterations). We also observe a 14% improvement over domain-specific representations without requiring specialized features. Extensive empirical evaluation across 19 benchmarks - ranging from general chemistry to reaction and molecular property optimization - demonstrates our method’s robustness, generality, and consistent improvements across: (1) tasks, (2) LLM architectures (encoder, decoder, encoder-decoder), (3) pretraining domains (chemistry-related or general-purpose) and (4) hyperparameter settings (tuned once on a single dataset). Finally, we explain these improvements: joint LLM-GP optimization through marginal likelihood implicitly performs contrastive learning, aligning representations to produce (1) better-structured embedding spaces, (2) improved uncertainty calibration, and (3) more efficient sampling - without requiring any external loss. This work provides both practical advances in sample-efficient optimization and insights into what makes effective Bayesian optimization.
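The objective named in this abstract, the GP marginal likelihood, can be illustrated with a small sketch. It assumes an RBF kernel over fixed feature vectors that stand in for LLM embeddings; the function names, the kernel choice, and the noise value are illustrative assumptions, and the joint training of the embedding network described in the paper is not shown.

```python
import math

def rbf_kernel(x, y, lengthscale=1.0):
    """RBF kernel between two feature vectors (stand-ins for LLM embeddings)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def log_marginal_likelihood(features, targets, lengthscale=1.0, noise=1e-2):
    """log p(y | X) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2*pi), zero-mean GP."""
    n = len(targets)
    k = [[rbf_kernel(features[i], features[j], lengthscale)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(k)
    # Forward substitution solves L z = y, so z^T z equals y^T K^-1 y.
    z = []
    for i in range(n):
        z.append((targets[i] - sum(low[i][j] * z[j] for j in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)
```

In this toy setting, features that place inputs with similar targets close together score a higher marginal likelihood than features that mix them, which is one way to read the abstract's remark that the objective implicitly aligns representations.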
[AI-1] Decentralized Federated Domain Generalization with Style Sharing: A Formal Modeling and Convergence Analysis
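[Quick Read]: This paper targets two gaps: (1) the lack of formal mathematical analysis of domain generalization (DG) objectives and training processes, and (2) the restriction of DG research in federated learning (FL) to the conventional star topology. For the second gap, it proposes Decentralized Federated Domain Generalization with Style Sharing (StyleDDG), a fully decentralized algorithm that lets devices in a peer-to-peer network achieve DG by sharing style information inferred from their datasets. For the first gap, it gives the first systematic mathematical analysis of style-based DG training optimization, casts existing centralized DG algorithms within that framework, and uses their formalisms to model StyleDDG, yielding analytical conditions under which StyleDDG attains a sub-linear convergence rate. The key is the style-sharing mechanism for decentralized DG, backed by this convergence analysis.
Link: https://arxiv.org/abs/2504.06235
Authors: Shahryar Zehtabi,Dong-Jun Han,Seyyedali Hosseinalipour,Christopher G. Brinton
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: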
Abstract:Much of the federated learning (FL) literature focuses on settings where local dataset statistics remain the same between training and testing time. Recent advances in domain generalization (DG) aim to use data from source (training) domains to train a model that generalizes well to data from unseen target (testing) domains. In this paper, we are motivated by two major gaps in existing work on FL and DG: (1) the lack of formal mathematical analysis of DG objectives and training processes; and (2) DG research in FL being limited to the conventional star-topology architecture. Addressing the second gap, we develop Decentralized Federated Domain Generalization with Style Sharing (StyleDDG), a fully decentralized DG algorithm designed to allow devices in a peer-to-peer network to achieve DG based on sharing style information inferred from their datasets. Additionally, we fill the first gap by providing the first systematic approach to mathematically analyzing style-based DG training optimization. We cast existing centralized DG algorithms within our framework, and employ their formalisms to model StyleDDG. Based on this, we obtain analytical conditions under which a sub-linear convergence rate of StyleDDG can be obtained. Through experiments on two popular DG datasets, we demonstrate that StyleDDG can obtain significant improvements in accuracy across target domains with minimal added communication overhead compared to decentralized gradient methods that do not employ style sharing.
[AI-2] An experimental survey and Perspective View on Meta-Learning for Automated Algorithms Selection and Parametrization
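[Quick Read]: This paper addresses the **Algorithms Selection and Parametrization (ASP)** problem, which appears in many different meta-learning setups. Its main contribution is to fill the gap in systematic evaluation and comparison by providing a comprehensive survey and a benchmark knowledge base. Based on a generic framework, it systematically discusses the phases of classifier selection, proposes a benchmark knowledge base of 4 million previously trained models, and compares the prominent methods extensively. The evaluation covers 8 classification algorithms and 400 benchmark datasets and quantitatively assesses existing methods, highlighting their strengths and limitations.
Link: https://arxiv.org/abs/2504.06207
Authors: Moncef Garouani
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: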
Abstract:Considerable progress has been made in the recent literature to tackle the Algorithms Selection and Parametrization (ASP) problem, which is diversified in multiple meta-learning setups. Yet there is a lack of surveys and comparative evaluations that critically analyze, summarize and assess the performance of existing methods. In this paper, we provide an overview of the state of the art in this continuously evolving field. The survey sheds light on the motivational reasons for pursuing classifiers selection through meta-learning. In this regard, Automated Machine Learning (AutoML) is usually treated as an ASP problem under the umbrella of the democratization of machine learning. Accordingly, AutoML makes machine learning techniques accessible to domain scientists who are interested in applying advanced analytics but lack the required expertise. It can ease the task of manually selecting ML algorithms and tuning related hyperparameters. We comprehensively discuss the different phases of classifiers selection based on a generic framework that is formed as an outcome of reviewing prior works. Subsequently, we propose a benchmark knowledge base of 4 million previously learned models and present extensive comparative evaluations of the prominent methods for classifiers selection based on 8 classification algorithms and 400 benchmark datasets. The comparative study quantitatively assesses the performance of algorithms selection methods while emphasizing the strengths and limitations of existing studies.
[AI-3] Heuristic Methods are Good Teachers to Distill MLPs for Graph Link Prediction
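[Quick Read]: This paper addresses the high inference cost that graph neural networks (GNNs) incur through their dependence on graph structure, and studies how knowledge distillation can transfer GNN knowledge into multi-layer perceptrons (MLPs) for efficient link prediction. Its starting observation is that existing distillation methods use only standard GNNs as teachers and overlook alternatives such as specialized link-prediction models (GNN4LP) and heuristic methods (e.g., common neighbors). The study finds that stronger teachers do not always produce stronger students, and that heuristic methods can bring MLPs to near-GNN performance at much lower training cost. Building on this, it proposes Ensemble Heuristic-Distilled MLPs (EHDM), which integrates complementary signals through a gating mechanism while eliminating graph dependency. Across ten datasets, EHDM improves on previous GNN-to-MLP approaches by 7.93% on average with 1.95 to 3.32 times less training time.
Link: https://arxiv.org/abs/2504.06193
Authors: Zongyue Qin,Shichang Zhang,Mingxuan Ju,Tong Zhao,Neil Shah,Yizhou Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: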
Abstract:Link prediction is a crucial graph-learning task with applications including citation prediction and product recommendation. Distilling Graph Neural Network (GNN) teachers into Multi-Layer Perceptron (MLP) students has emerged as an effective approach to achieve strong performance while reducing computational cost by removing graph dependency. However, existing distillation methods only use standard GNNs and overlook alternative teachers such as specialized models for link prediction (GNN4LP) and heuristic methods (e.g., common neighbors). This paper first explores the impact of different teachers in GNN-to-MLP distillation. Surprisingly, we find that stronger teachers do not always produce stronger students: MLPs distilled from GNN4LP can underperform those distilled from simpler GNNs, while weaker heuristic methods can teach MLPs to near-GNN performance with drastically reduced training costs. Building on these insights, we propose Ensemble Heuristic-Distilled MLPs (EHDM), which eliminates graph dependencies while effectively integrating complementary signals via a gating mechanism. Experiments on ten datasets show an average 7.93% improvement over previous GNN-to-MLP approaches with 1.95-3.32 times less training time, indicating EHDM is an efficient and effective link prediction method.
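The common-neighbors heuristic that the abstract cites as a possible teacher is simple enough to sketch. The helper names and the toy edge list below are hypothetical; how the scores are turned into distillation targets for the MLP student is left out because the abstract does not specify it.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def common_neighbors(adj, u, v):
    """Heuristic link score: the number of neighbors shared by u and v."""
    return len(adj.get(u, set()) & adj.get(v, set()))

# Toy graph (hypothetical data): score candidate pairs with the heuristic teacher.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = build_adjacency(edges)
teacher_scores = {pair: common_neighbors(adj, *pair) for pair in [(0, 3), (0, 1)]}
```

The graph is only consulted when the teacher scores are computed. A student that has learned to imitate such scores from node features would no longer need neighbor lookups at inference time, which is the graph-free property the abstract emphasizes.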
[AI-4] A Self-Supervised Framework for Space Object Behaviour Characterisation
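[Quick Read]: This paper develops a Foundation Model for space object behavioural analysis, motivated by the space-safety challenges of rapidly growing orbital populations. No foundation model has yet been built for this task, while automated methods for characterising space object behaviour are crucial for space safety.
The proposed solution is a Space Safety and Sustainability Foundation Model that analyses space object behaviour from light curves (LCs). It uses a Perceiver-Variational Autoencoder (VAE) architecture, pre-trained with self-supervised reconstruction and masked reconstruction on 227,000 light curves from the MMT-9 observatory. The pre-trained model supports anomaly detection, motion prediction, and synthetic light curve generation, and is fine-tuned using two independent light curve simulators (CASSANDRA and GRIAL). After fine-tuning, it reaches 88% and 82% accuracy, with ROC AUC scores of 0.90 and 0.95, on anomaly detection and motion mode prediction respectively. Self-supervised pre-training thus yields representations that serve all three tasks at once, supporting automated monitoring and simulation for space safety and sustainability.
Link: https://arxiv.org/abs/2504.06176
Authors: Ian Groves,Andrew Campbell,James Fernandes,Diego Rodriguez,Paul Murray,Massimiliano Vasile,Victoria Nockles
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Space Physics (physics.space-ph)
Comments: 15 pages, 10 figures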
Abstract:Foundation Models, pre-trained on large unlabelled datasets before task-specific fine-tuning, are increasingly being applied to specialised domains. Recent examples include ClimaX for climate and Clay for satellite Earth observation, but a Foundation Model for Space Object Behavioural Analysis has not yet been developed. As orbital populations grow, automated methods for characterising space object behaviour are crucial for space safety. We present a Space Safety and Sustainability Foundation Model focusing on space object behavioural analysis using light curves (LCs). We implemented a Perceiver-Variational Autoencoder (VAE) architecture, pre-trained with self-supervised reconstruction and masked reconstruction on 227,000 LCs from the MMT-9 observatory. The VAE enables anomaly detection, motion prediction, and LC generation. We fine-tuned the model for anomaly detection and motion prediction using two independent LC simulators (CASSANDRA and GRIAL respectively), using CAD models of boxwing, Sentinel-3, SMOS, and Starlink platforms. Our pre-trained model achieved a reconstruction error of 0.01%, identifying potentially anomalous light curves through reconstruction difficulty. After fine-tuning, the model scored 88% and 82% accuracy, with ROC AUC scores of 0.90 and 0.95, in anomaly detection and motion mode prediction (sun-pointing, spin, etc.) respectively. Analysis of high-confidence anomaly predictions on real data revealed distinct patterns including characteristic object profiles and satellite glinting. Here, we demonstrate how self-supervised learning can simultaneously enable anomaly detection, motion prediction, and synthetic data generation from rich representations learned in pre-training. Our work therefore supports space safety and sustainability through automated monitoring and simulation capabilities.
[AI-5] Multi-Modality Sensing in mmWave Beamforming for Connected Vehicles Using Deep Learning
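[Quick Read]: This paper addresses the high computational and communication overhead that traditional standard-defined beam selection incurs when performing accurate beam alignment for efficient link configuration in millimeter-wave (mmWave) communications, which limits its use in highly dynamic vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) scenarios. It proposes a deep learning solution that uses multi-modality sensing data to predict the optimal beams with sufficient mmWave received power, so that the best line-of-sight (LoS) V2I and V2V links can be ensured proactively. The key is to replace beam sweeping based on channel state information and exhaustive search with out-of-band sensing data from sensor devices, which cuts the beam-sweeping search space and time overhead substantially while achieving up to 98.19% accuracy when predicting the top-13 beams.
Link: https://arxiv.org/abs/2504.06173
Authors: Muhammad Baqer Mollah,Honggang Wang,Mohammad Ataul Karim,Hua Fang
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 15 Pages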
Abstract:Beamforming techniques are considered essential to compensate for severe path losses in millimeter-wave (mmWave) communications. In particular, these techniques adopt large antenna arrays and form narrow beams to obtain satisfactory received power. However, performing accurate beam alignment over narrow beams for efficient link configuration with traditional standard-defined beam selection approaches, which mainly rely on channel state information and beam sweeping through exhaustive searching, imposes computational and communication overheads. These overheads limit their potential use in vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communications involving highly dynamic scenarios. In comparison, utilizing out-of-band contextual information, such as sensing data obtained from sensor devices, provides a better alternative to reduce overheads. This paper presents a deep learning-based solution that uses multi-modality sensing data to predict the optimal beams having sufficient mmWave received power so that the best V2I and V2V line-of-sight links can be ensured proactively. The proposed solution has been tested on real-world measured mmWave sensing and communication data, and the results show that it can achieve up to 98.19% accuracy when predicting the top-13 beams. Compared to the existing beam sweeping approach, the beam sweeping search space and time overheads are reduced by roughly 79.67% and 91.89%, respectively, which confirms it as a promising solution for beamforming in mmWave-enabled communications.
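The "top-13 beams" figure in this abstract is a top-k accuracy: a prediction counts as correct when the true best beam is among the k highest-scoring candidates, so only those k beams need to be swept. A minimal sketch of that metric follows; the function name and the toy scores are hypothetical, and the paper's model and codebook are not reproduced.

```python
def top_k_accuracy(scores, true_beams, k):
    """Fraction of samples whose true beam index is among the k best-scored beams.

    scores: one list of per-beam scores per sample (hypothetical model outputs).
    true_beams: the index of the measured best beam for each sample.
    """
    hits = 0
    for row, truth in zip(scores, true_beams):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_beams)

# Toy example with a 3-beam codebook (hypothetical numbers).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
true_beams = [1, 1, 0]
```

Raising k trades a larger residual sweep for a higher chance that the true beam is included, which is the trade-off behind reporting accuracy at a fixed k.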
[AI-6] Real-Time Pitch/F0 Detection Using Spectrogram Images and Convolutional Neural Networks
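[Quick Read]: This paper addresses fundamental frequency (F0) detection and proposes an approach based on convolutional neural networks (CNNs) and image processing that estimates pitch directly from spectrogram images. The key is to convert the acoustic signal into a spectrogram image and extract features from it with a CNN and image-processing techniques to detect the pitch contour. In total, 92% of predicted pitch contours show strong or moderate correlation with the true contours, and the detection rate improves by about 5% over other state-of-the-art CNN methods across various signal-to-noise ratio conditions.
Link: https://arxiv.org/abs/2504.06165
Authors: Xufang Zhao,Omer Tsimhoni
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: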
Abstract:This paper presents a novel approach to detect F0 through Convolutional Neural Networks and image processing techniques to directly estimate pitch from spectrogram images. Our new approach demonstrates a very good detection accuracy; a total of 92% of predicted pitch contours have strong or moderate correlations to the true pitch contours. Furthermore, the experimental comparison between our new approach and other state-of-the-art CNN methods reveals that our approach can enhance the detection rate by approximately 5% across various Signal-to-Noise Ratio conditions.
[AI-7] ARLO: A Tailorable Approach for Transforming Natural Language Software Requirements into Architecture using LLMs
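[Quick Read]: This paper addresses the challenges caused by natural language (NL) software requirements that are verbose, ambiguous, and inconsistent, which complicate selecting a system architecture and assessing architectural alternatives. Mapping NL requirements to architecture by human expertise is time-consuming and error-prone. The paper proposes ARLO, an automated approach that combines a set of NL requirements, an existing standard specifying architecturally relevant software quality attributes, and a readily available Large Language Model (LLM). ARLO identifies the architecturally relevant subset of the requirements and maps it to a tailorable matrix of architectural choices, then applies integer linear programming (ILP) to that matrix to determine the optimal architecture for the current requirements. ARLO can also trace the selected architectural choices back to the requirements and isolate requirements that exert a particular influence on the architecture, which supports the selection, comparative assessment, and exploration of alternatives.
Link: https://arxiv.org/abs/2504.06143
Authors: Tooraj Helmi
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: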
Abstract:Software requirements expressed in natural language (NL) frequently suffer from verbosity, ambiguity, and inconsistency. This creates a range of challenges, including selecting an appropriate architecture for a system and assessing different architectural alternatives. Relying on human expertise to accomplish the task of mapping NL requirements to architecture is time-consuming and error-prone. This paper proposes ARLO, an approach that automates this task by leveraging (1) a set of NL requirements for a system, (2) an existing standard that specifies architecturally relevant software quality attributes, and (3) a readily available Large Language Model (LLM). Specifically, ARLO determines the subset of NL requirements for a given system that is architecturally relevant and maps that subset to a tailorable matrix of architectural choices. ARLO applies integer linear programming on the architectural-choice matrix to determine the optimal architecture for the current requirements. We demonstrate ARLO’s efficacy using a set of real-world examples. We highlight ARLO’s ability (1) to trace the selected architectural choices to the requirements and (2) to isolate NL requirements that exert a particular influence on a system’s architecture. This allows the identification, comparative assessment, and exploration of alternative architectural choices based on the requirements and constraints expressed therein.
[AI-8] A Multimedia Analytics Model for the Foundation Model Era
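[Quick Read]: This paper addresses the failure of existing conceptual models to capture the complexity introduced by Foundation Models and agentic Artificial Intelligence, particularly in multimedia analytics. It proposes a comprehensive multimedia analytics model designed for the foundation model era. The model builds on established frameworks from visual analytics, multimedia analytics, knowledge generation, analytic task definition, mixed-initiative guidance, and human-in-the-loop reinforcement learning, and emphasizes integrated human-AI teaming from both technical and conceptual perspectives. At its core is a seamless yet explicitly separable interaction channel between expert users and semi-autonomous analytical processes, which keeps user intent and AI behavior continuously aligned. Addressing practical challenges in sensitive domains such as intelligence analysis and investigative journalism, the model is shown through case studies to support deeper understanding and targeted improvement of multimedia analytics solutions, and it sets a direction for system design, comparison, and future research.
Link: https://arxiv.org/abs/2504.06138
Authors: Marcel Worring,Jan Zahálka,Stef van den Elzen,Maximilian Fischer,Daniel Keim
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: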
Abstract:The rapid advances in Foundation Models and agentic Artificial Intelligence are transforming multimedia analytics by enabling richer, more sophisticated interactions between humans and analytical systems. Existing conceptual models for visual and multimedia analytics, however, do not adequately capture the complexity introduced by these powerful AI paradigms. To bridge this gap, we propose a comprehensive multimedia analytics model specifically designed for the foundation model era. Building upon established frameworks from visual analytics, multimedia analytics, knowledge generation, analytic task definition, mixed-initiative guidance, and human-in-the-loop reinforcement learning, our model emphasizes integrated human-AI teaming based on visual analytics agents from both technical and conceptual perspectives. Central to the model is a seamless, yet explicitly separable, interaction channel between expert users and semi-autonomous analytical processes, ensuring continuous alignment between user intent and AI behavior. The model addresses practical challenges in sensitive domains such as intelligence analysis, investigative journalism, and other fields handling complex, high-stakes data. We illustrate through detailed case studies how our model facilitates deeper understanding and targeted improvement of multimedia analytics solutions. By explicitly capturing how expert users can optimally interact with and guide AI-powered multimedia analytics systems, our conceptual framework sets a clear direction for system design, comparison, and future research.
[AI-9] Decentralizing AI Memory: SHIMI a Semantic Hierarchical Memory Index for Scalable Agent Reasoning
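[Quick Read]: This paper addresses the limitations of Retrieval-Augmented Generation (RAG) and vector-based search in abstraction, scalability, and semantic precision, especially in decentralized environments. It proposes SHIMI (Semantic Hierarchical Memory Index), a unified architecture that models knowledge as a dynamically structured hierarchy of concepts so that agents retrieve information by meaning rather than surface similarity. SHIMI organizes memory into layered semantic nodes and supports top-down traversal from abstract intent to specific entities, giving more precise and explainable retrieval. SHIMI is natively designed for decentralized ecosystems in which agents maintain local memory trees and synchronize them asynchronously across the network. To that end, the paper introduces a lightweight sync protocol that uses Merkle-DAG summaries, Bloom filters, and CRDT-style (conflict-free replicated data type) conflict resolution to achieve partial synchronization with minimal overhead. Benchmark experiments and decentralized agent collaboration use cases demonstrate SHIMI's advantages in retrieval accuracy, semantic fidelity, and scalability.
Link: https://arxiv.org/abs/2504.06135
Authors: Tooraj Helmi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: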
Abstract:Retrieval-Augmented Generation (RAG) and vector-based search have become foundational tools for memory in AI systems, yet they struggle with abstraction, scalability, and semantic precision - especially in decentralized environments. We present SHIMI (Semantic Hierarchical Memory Index), a unified architecture that models knowledge as a dynamically structured hierarchy of concepts, enabling agents to retrieve information based on meaning rather than surface similarity. SHIMI organizes memory into layered semantic nodes and supports top-down traversal from abstract intent to specific entities, offering more precise and explainable retrieval. Critically, SHIMI is natively designed for decentralized ecosystems, where agents maintain local memory trees and synchronize them asynchronously across networks. We introduce a lightweight sync protocol that leverages Merkle-DAG summaries, Bloom filters, and CRDT-style conflict resolution to enable partial synchronization with minimal overhead. Through benchmark experiments and use cases involving decentralized agent collaboration, we demonstrate SHIMI’s advantages in retrieval accuracy, semantic fidelity, and scalability - positioning it as a core infrastructure layer for decentralized cognitive systems.
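The abstract names Bloom filters as one ingredient of the sync protocol but does not say how they are used. The sketch below shows one plausible role under stated assumptions: a peer advertises a compact filter over the node IDs it already holds, and the local agent sends only nodes the filter does not contain. The class, the sizes, and the node-ID strings are hypothetical and are not SHIMI's actual protocol.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an integer used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def missing_from_peer(local_ids, peer_filter):
    """Node IDs the peer definitely lacks, i.e. candidates for partial sync."""
    return [node_id for node_id in local_ids if node_id not in peer_filter]
```

Because a Bloom filter can report false positives, a node the peer lacks is occasionally skipped. A real protocol would need a second check to catch those cases, and the Merkle-DAG summaries mentioned in the abstract could plausibly serve that purpose.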
[AI-10] Leanabell-Prover: Posttraining Scaling in Formal Reasoning
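[Quick Read]: This paper addresses the underexplored potential of formal reasoning with large language models (LLMs) in automated theorem proving (ATP). Although natural-language reasoning models such as OpenAI O1/O3 and DeepSeek R1 have achieved breakthroughs, ATP has not yet benefited from comparable posttraining scaling. To close the gap, the paper designs a continual training scheme on a hybrid dataset that combines numerous statement-proof pairs with additional data intended to emulate human reasoning and hypothesis refinement. It then applies reinforcement learning using the outcome reward returned by the Lean 4 compiler, improving existing formal provers (DeepSeek-Prover-v1.5 and Goedel-Prover) on whole-proof generation. For example, it achieves a 59.8% pass rate (pass@32) on MiniF2F. The work is ongoing, and the authors plan to update their findings and release data and training details.
Link: https://arxiv.org/abs/2504.06122
Authors: Jingyuan Zhang,Qi Wang,Xingguang Ji,Yahui Liu,Yang Yue,Fuzheng Zhang,Di Zhang,Guorui Zhou,Kun Gai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures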
Abstract:Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 code. However, ATP has not yet been revolutionized by the recent posttraining scaling demonstrated by OpenAI O1/O3 and DeepSeek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continually train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of the outcome reward returned by the Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an ongoing project and we will progressively update our findings and release our data and training details.
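The pass@32 figure is a pass@k metric: the probability that at least one of k sampled proofs is accepted. The abstract does not say how it was computed. The sketch below uses the commonly used unbiased estimator, which assumes n proofs are sampled per problem and c of them verify; the function names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: proofs sampled for the problem, c: proofs that verified, k: attempt budget.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly k proofs are sampled per problem (n equal to k), the estimate reduces to the fraction of problems with at least one verified proof.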
[AI-11] Uncertainty-Aware Hybrid Machine Learning in Virtual Sensors for Vehicle Sideslip Angle Estimation
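[Quick Read]: This paper addresses the cost constraints of onboard vehicle sensor systems, which limit the number and precision of measurable states. Measuring critical quantities such as the Vehicle Sideslip Angle (VSA) with current optical sensors, for instance, poses significant commercial challenges. It proposes the Uncertainty-Aware Hybrid Learning (UAHL) architecture to improve vehicle state estimation for active safety. UAHL combines a machine learning model with vehicle motion models to estimate VSA directly from onboard sensor data. Its key element is uncertainty quantification for the individual model estimates together with hybrid fusion, which dynamically weights the uncertainty-aware predictions from the machine learning and vehicle motion models to produce accurate and reliable hybrid VSA estimates. The paper also presents a new dataset, the Real-world Vehicle State Estimation Dataset (ReV-StED), to support the experiments. Results show superior VSA estimation performance, indicating UAHL's promise for virtual sensors and active safety in autonomous vehicles.
Link: https://arxiv.org/abs/2504.06105
Authors: Abinav Kalyanasundaram,Karthikeyan Chandra Sekaran,Philipp Stauber,Michael Lange,Wolfgang Utschick,Michael Botsch
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 2025 IEEE Intelligent Vehicles Symposium (IV)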
Abstract:Precise vehicle state estimation is crucial for safe and reliable autonomous driving. The number of measurable states and their precision offered by the onboard vehicle sensor system are often constrained by cost. For instance, measuring critical quantities such as the Vehicle Sideslip Angle (VSA) poses significant commercial challenges using current optical sensors. This paper addresses these limitations by focusing on the development of high-performance virtual sensors to enhance vehicle state estimation for active safety. The proposed Uncertainty-Aware Hybrid Learning (UAHL) architecture integrates a machine learning model with vehicle motion models to estimate VSA directly from onboard sensor data. A key aspect of the UAHL architecture is its focus on uncertainty quantification for individual model estimates and hybrid fusion. These mechanisms enable the dynamic weighting of uncertainty-aware predictions from machine learning and vehicle motion models to produce accurate and reliable hybrid VSA estimates. This work also presents a novel dataset named Real-world Vehicle State Estimation Dataset (ReV-StED), comprising synchronized measurements from advanced vehicle dynamic sensors. The experimental results demonstrate the superior performance of the proposed method for VSA estimation, highlighting UAHL as a promising architecture for advancing virtual sensors and enhancing active safety in autonomous vehicles.
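The abstract describes dynamically weighting uncertainty-aware predictions from the machine learning and vehicle motion models but does not give the fusion rule. One standard choice, shown here purely as an assumption, is inverse-variance weighting of two Gaussian estimates; the function name and the numbers in the usage lines are hypothetical.

```python
def fuse_estimates(mean_ml, var_ml, mean_model, var_model):
    """Inverse-variance fusion of two estimates of the same quantity.

    Each input is a predicted value with its predictive variance (> 0).
    The more certain estimate receives the larger weight.
    """
    w_ml = 1.0 / var_ml
    w_model = 1.0 / var_model
    fused_var = 1.0 / (w_ml + w_model)
    fused_mean = fused_var * (w_ml * mean_ml + w_model * mean_model)
    return fused_mean, fused_var

# Hypothetical sideslip-angle estimates in degrees: the ML estimate is more certain.
fused_mean, fused_var = fuse_estimates(2.0, 1.0, 4.0, 3.0)
```

Because the variances can change at every time step, the weights adapt automatically, which matches the "dynamic weighting" idea in spirit even if the paper's actual mechanism differs.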
[AI-12] Real-Time LaCAM
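[Quick Read]: This paper addresses the tension between real-time planning and completeness guarantees in Multi-Agent Path Finding (MAPF). Most MAPF methods with completeness guarantees must plan full-horizon paths, which can take too long to be practical. The paper proposes a real-time MAPF method with provable completeness. The key is to use the LaCAM algorithm (Okumura 2023) incrementally, so that the system can plan iteratively under a cutoff time of milliseconds while keeping the same success rate as full-horizon LaCAM. The method can also be combined with a single-step learned MAPF policy, and it offers a general mechanism for achieving completeness in future real-time MAPF algorithms.
Link: https://arxiv.org/abs/2504.06091
Authors: Runzhe Liang,Rishi Veerapaneni,Daniel Harabor,Jiaoyang Li,Maxim Likhachev
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: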
Abstract:The vast majority of Multi-Agent Path Finding (MAPF) methods with completeness guarantees require planning full horizon paths. However, planning full horizon paths can take too long and be impractical in real-world applications. Instead, real-time planning and execution, which only allows the planner a finite amount of time before executing and replanning, is more practical for real world multi-agent systems. Several methods utilize real-time planning schemes but none are provably complete, which leads to livelock or deadlock. Our main contribution is to show the first Real-Time MAPF method with provable completeness guarantees. We do this by leveraging LaCAM (Okumura 2023) in an incremental fashion. Our results show how we can iteratively plan for congested environments with a cutoff time of milliseconds while still maintaining the same success rate as full horizon LaCAM. We also show how it can be used with a single-step learned MAPF policy. The proposed Real-Time LaCAM also provides us with a general mechanism for using iterative constraints for completeness in future real-time MAPF algorithms.
[AI-13] Information-Theoretic Reward Decomposition for Generalizable RLHF
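[Quick Read]: This paper addresses the poor generalization of existing reward models in Reinforcement Learning from Human Feedback (RLHF). Existing reward models are trained by increasing the reward gap between chosen and rejected responses while overlooking the prompts the responses are conditioned on, so they generalize poorly when evaluated on prompt-response pairs outside the data distribution. The key idea is to decompose the reward value into two independent components: a prompt-free reward, determined only by the response, and a prompt-related reward, which derives from both the prompt and the response. Both components are extracted from an information-theoretic perspective without extra models. On this basis, the paper proposes a new reward learning algorithm that prioritizes data samples according to their prompt-free reward values. Experiments show that the decomposition characterizes the two parts of the reward model and improves both its alignment performance and its generalization.
Link: https://arxiv.org/abs/2504.06020
Authors: Liyuan Mao,Haoran Xu,Amy Zhang,Weinan Zhang,Chenjia Bai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work done during internships at Institute of Artificial Intelligence (TeleAI), China Telecom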
Abstract:A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
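The baseline that this abstract criticizes, training a reward model by increasing the reward gap between chosen and rejected responses, is usually written as a Bradley-Terry pairwise loss. The sketch below shows that standard loss only; it is an assumption that the paper's baseline takes exactly this form, and the proposed decomposition is not reproduced.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-gap) in a numerically stable form.
    """
    gap = reward_chosen - reward_rejected
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))
```

The loss depends only on the gap between the two rewards, so any component of the reward that ignores the prompt can lower it just as well as a prompt-aware component. That is one way to see the concern the abstract raises.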
[AI-14] Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?
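[Quick Read]: This paper addresses the high computational cost and low efficiency of hyperparameter optimization for neural networks. Traditional methods such as Optuna rely on exhaustive trials, which are time-consuming and resource-intensive. The paper instead uses a Code Llama large language model (LLM), fine-tuned parameter-efficiently with LoRA, to generate accurate and efficient hyperparameter recommendations for diverse neural network architectures. The key is that the LoRA-fine-tuned LLM produces hyperparameters in a single inference step, which significantly reduces computational overhead while achieving Root Mean Square Error (RMSE) results competitive with or better than state-of-the-art methods such as Tree-structured Parzen Estimators (TPE). This makes the approach well suited to resource-constrained settings such as edge devices and mobile applications.
Link: https://arxiv.org/abs/2504.06006
Authors: Roman Kochnev,Arash Torabi Goodarzi,Zofia Antonina Bentyn,Dmitry Ignatov,Radu Timofte
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: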
Abstract:Optimal hyperparameter selection is critical for maximizing neural network performance, especially as models grow in complexity. This work investigates the viability of using large language models (LLMs) for hyperparameter optimization by employing a fine-tuned version of Code Llama. Through parameter-efficient fine-tuning using LoRA, we adapt the LLM to generate accurate and efficient hyperparameter recommendations tailored to diverse neural network architectures. Unlike traditional methods such as Optuna, which rely on exhaustive trials, the proposed approach achieves competitive or superior results in terms of Root Mean Square Error (RMSE) while significantly reducing computational overhead. Our approach highlights that LLM-based optimization not only matches state-of-the-art methods like Tree-structured Parzen Estimators but also accelerates the tuning process. This positions LLMs as a promising alternative to conventional optimization techniques, particularly for rapid experimentation. Furthermore, the ability to generate hyperparameters in a single inference step makes this method particularly well-suited for resource-constrained environments such as edge devices and mobile applications, where computational efficiency is paramount. The results confirm that LLMs, beyond their efficiency, offer substantial time savings and comparable stability, underscoring their value in advancing machine learning workflows. All generated hyperparameters are included in the LEMUR Neural Network (NN) Dataset, which is publicly available and serves as an open-source benchmark for hyperparameter optimization research.
[AI-15] Representing Normative Regulations in OWL DL for Automated Compliance Checking Supported by Text Annotation
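[Quick Read]: This paper addresses manual compliance checking, which is slow, depends on highly skilled experts, and is error-prone, and explores automating it through ontology-based reasoning. The key is an annotation schema and an algorithm that transforms text annotations into machine-interpretable OWL DL code, enabling automated compliance checking through OWL reasoning.
Link: https://arxiv.org/abs/2504.05951
Authors: Ildar Baimuratov,Denis Turygin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: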
Abstract:Compliance checking is the process of determining whether a regulated entity adheres to the regulations that apply to it. Currently, compliance checking is predominantly manual, requiring significant time and highly skilled experts, while still being prone to errors caused by the human factor. Various approaches have been explored to automate compliance checking; however, representing regulations in the OWL DL language, which enables compliance checking through OWL reasoning, has not been adopted. In this work, we propose an annotation schema and an algorithm that transforms text annotations into machine-interpretable OWL DL code. The proposed approach is validated through a proof-of-concept implementation applied to examples from the building construction domain.
[AI-16] AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems
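Quick read: This paper targets a challenge in improving decision-making for Autonomous Intelligent Vehicles (AIVs): getting machine learning models to capture the critical Regions of Interest (RoIs) in a scene so that their scene understanding approaches human perception and reasoning. This remains a significant open problem. To address it, the paper proposes a framework named Human Attention-based Explainable Guidance for Intelligent Vehicle Systems (AEGIS). The key idea is to use human attention, converted from eye-tracking data, to guide Reinforcement Learning (RL) models toward decision-relevant RoIs. Using 1.2 million frames collected from 20 participants across six scenarios, AEGIS pre-trains a model that predicts human attention patterns, which then serves as the guidance signal.
Link: https://arxiv.org/abs/2504.05950
Authors: Zhuoli Zhuang, Cheng-You Lu, Yu-Cheng Fred Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: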
Abstract:Improving decision-making capabilities in Autonomous Intelligent Vehicles (AIVs) has been a heated topic in recent years. Despite advancements, training machines to capture regions of interest for comprehensive scene understanding, like human perception and reasoning, remains a significant challenge. This study introduces a novel framework, Human Attention-based Explainable Guidance for Intelligent Vehicle Systems (AEGIS). AEGIS utilizes human attention, converted from eye-tracking, to guide reinforcement learning (RL) models to identify critical regions of interest for decision-making. AEGIS uses a pre-trained human attention model to guide RL models to identify critical regions of interest for decision-making. By collecting 1.2 million frames from 20 participants across six scenarios, AEGIS pre-trains a model to predict human attention patterns.
[AI-17] Uncovering Fairness through Data Complexity as an Early Indicator
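Quick read: This paper addresses a gap in fairness research for machine learning applications: how disparities in classification complexity between privileged and unprivileged groups affect the fairness of solutions, and whether such disparities can serve as a preliminary indicator of potential unfairness. The key is to evaluate, on synthetic datasets, how differences in various complexity metrics correlate with group fairness metrics, and to apply association rule mining to identify patterns linking between-group complexity differences to fairness outcomes, yielding data-centric indicators that guide bias mitigation. The findings are validated on real-world problems, showing that quantifying group-wise classification complexity can reveal early indicators of potential fairness challenges. This helps practitioners address bias in classification tasks proactively.
Link: https://arxiv.org/abs/2504.05923
Authors: Juliett Suárez Ferreira, Marija Slavkovik, Jorge Casillas
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: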
Abstract:Fairness constitutes a concern within machine learning (ML) applications. Currently, there is no study on how disparities in classification complexity between privileged and unprivileged groups could influence the fairness of solutions, which serves as a preliminary indicator of potential unfairness. In this work, we investigate this gap, specifically, we focus on synthetic datasets designed to capture a variety of biases ranging from historical bias to measurement and representational bias to evaluate how various complexity metrics differences correlate with group fairness metrics. We then apply association rule mining to identify patterns that link disproportionate complexity differences between groups with fairness-related outcomes, offering data-centric indicators to guide bias mitigation. Our findings are also validated by their application in real-world problems, providing evidence that quantifying group-wise classification complexity can uncover early indicators of potential fairness challenges. This investigation helps practitioners to proactively address bias in classification tasks.
[AI-18] Systematic Parameter Decision in Approximate Model Counting
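Quick read: This paper addresses how to determine the internal parameters of the hashing-based approximate model counting algorithm ApproxMC so that the algorithm remains Probably Approximately Correct (PAC) while running as efficiently as possible. Existing approaches rely on heuristics; this paper instead casts the task as an optimization problem obtained by generalizing the correctness proof of ApproxMC to arbitrary parameter values. The key is separating the concerns of soundness and optimality, which avoids repetitive case-by-case argumentation and gives a clear framework for parameter optimization. After reduction, the optimization problem takes a simple form that a basic search algorithm can solve, and it also shows how parameter values affect performance. Experiments show that the optimized parameters speed up the latest ApproxMC by a factor of 1.6 to 2.4, depending on the error tolerance.
Link: https://arxiv.org/abs/2504.05874
Authors: Jinping Lei, Toru Takisaka, Junqiang Peng, Mingyu Xiao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: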
Abstract:This paper proposes a novel approach to determining the internal parameters of the hashing-based approximate model counting algorithm ApproxMC. In this problem, the chosen parameter values must ensure that ApproxMC is Probably Approximately Correct (PAC), while also making it as efficient as possible. The existing approach to this problem relies on heuristics; in this paper, we solve this problem by formulating it as an optimization problem that arises from generalizing ApproxMC's correctness proof to arbitrary parameter values. Our approach separates the concerns of algorithm soundness and optimality, allowing us to address the former without the need for repetitive case-by-case argumentation, while establishing a clear framework for the latter. Furthermore, after reduction, the resulting optimization problem takes on an exceptionally simple form, enabling the use of a basic search algorithm and providing insight into how parameter values affect algorithm performance. Experimental results demonstrate that our optimized parameters improve the runtime performance of the latest ApproxMC by a factor of 1.6 to 2.4, depending on the error tolerance.
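For readers unfamiliar with the PAC requirement mentioned in the abstract, it is usually stated as the following guarantee for approximate model counters. The symbols are standard notation assumed here, not taken from the abstract: F is the input formula, sol(F) its set of solutions, c the count returned by the algorithm, ε the error tolerance, and δ the allowed failure probability.

```latex
\Pr\!\left[\frac{|\mathrm{sol}(F)|}{1+\varepsilon} \;\le\; c \;\le\; (1+\varepsilon)\,|\mathrm{sol}(F)|\right] \;\ge\; 1-\delta
```

Any choice of internal parameters has to keep this inequality valid; the speed-up reported above comes from searching among such valid choices.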
[AI-19] Agent Guide: A Simple Agent Behavioral Watermarking Framework
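Quick read: This paper addresses the traceability and accountability problems raised by deploying intelligent agents in digital ecosystems such as social media platforms, particularly in cybersecurity and digital content protection. Traditional large language model (LLM) watermarking, which operates on tokens, transfers poorly to agents because behaviors are hard to tokenize and information is lost when behaviors are translated into actions. The paper proposes Agent Guide, a behavioral watermarking framework that embeds watermarks by steering the agent's high-level decisions (behaviors) through probability biases while keeping specific executions (actions) natural. The key is decoupling agent activity into two levels, behavior (e.g., choosing to bookmark) and action (e.g., bookmarking with specific tags), and applying watermark-guided biases to the behavior probability distribution. A z-statistic-based analysis detects the watermark, ensuring reliable extraction over multiple rounds. Experiments in a social media scenario with diverse agent profiles show effective watermark detection with a low false positive rate. The framework offers a practical and robust approach to agent watermarking, with applications in identifying malicious agents and protecting proprietary agent systems.
Link: https://arxiv.org/abs/2504.05871
Authors: Kaibo Huang, Zhongliang Yang, Linna Zhou
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: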
Abstract:The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns about traceability and accountability, particularly in cybersecurity and digital content protection. Traditional large language model (LLM) watermarking techniques, which rely on token-level manipulations, are ill-suited for agents due to the challenges of behavior tokenization and information loss during behavior-to-action translation. To address these issues, we propose Agent Guide, a novel behavioral watermarking framework that embeds watermarks by guiding the agent’s high-level decisions (behavior) through probability biases, while preserving the naturalness of specific executions (action). Our approach decouples agent behavior into two levels, behavior (e.g., choosing to bookmark) and action (e.g., bookmarking with specific tags), and applies watermark-guided biases to the behavior probability distribution. We employ a z-statistic-based statistical analysis to detect the watermark, ensuring reliable extraction over multiple rounds. Experiments in a social media scenario with diverse agent profiles demonstrate that Agent Guide achieves effective watermark detection with a low false positive rate. Our framework provides a practical and robust solution for agent watermarking, with applications in identifying malicious agents and protecting proprietary agent systems.
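The bias-then-detect idea can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the behavior list, the keyed hash that picks a "favored" half of the behaviors each round, the additive bias, and the one-proportion z-test against a null hit rate of 0.5.

```python
import hashlib
import math
import random

# Hypothetical behavior set and favored fraction (assumptions, not from the paper).
BEHAVIORS = ["like", "bookmark", "share", "comment", "skip", "follow"]
GAMMA = 0.5


def favored_set(key: str, round_idx: int) -> set:
    """Derive a keyed, per-round subset of favored behaviors."""
    digest = hashlib.sha256(f"{key}:{round_idx}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(BEHAVIORS, int(len(BEHAVIORS) * GAMMA)))


def choose_behavior(probs: dict, key: str, round_idx: int,
                    bias: float, rng: random.Random) -> str:
    """Sample a behavior after adding `bias` to the favored behaviors' weights."""
    favored = favored_set(key, round_idx)
    weights = [probs[b] + (bias if b in favored else 0.0) for b in BEHAVIORS]
    return rng.choices(BEHAVIORS, weights=weights, k=1)[0]


def z_statistic(history: list, key: str) -> float:
    """One-proportion z-score of favored-behavior hits against the null rate GAMMA."""
    n = len(history)
    hits = sum(1 for t, b in enumerate(history) if b in favored_set(key, t))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


uniform = {b: 1 / len(BEHAVIORS) for b in BEHAVIORS}
rng = random.Random(0)
marked = [choose_behavior(uniform, "secret", t, bias=0.3, rng=rng) for t in range(400)]
plain = [choose_behavior(uniform, "secret", t, bias=0.0, rng=rng) for t in range(400)]
print("watermarked z:", round(z_statistic(marked, "secret"), 2))
print("unmarked z:", round(z_statistic(plain, "secret"), 2))
```

In this toy setup the watermarked history should yield a z-score far above a typical detection threshold such as 4, while the unbiased history should stay near zero. Only the high-level behavior choice is biased; how the chosen behavior is carried out (the action) is left untouched, which mirrors the behavior/action split described in the abstract.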
[AI-20] Towards an AI-Driven Video-Based American Sign Language Dictionary: Exploring Design and Usage Experience with Learners
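Quick read: This paper addresses the difficulty American Sign Language (ASL) learners face when looking up unfamiliar signs: unlike spoken languages, they cannot type a text query to find a sign they do not know. The paper uses state-of-the-art isolated sign recognition to build an automated video-based sign dictionary. The key is combining design recommendations from prior human-computer interaction (HCI) research with advanced sign-recognition algorithms, so that a user submits a video of a sign and receives a list of the closest matching signs, helping learners complete sign comprehension tasks more efficiently.
Link: https://arxiv.org/abs/2504.05857
Authors: Saad Hassan, Matyas Bohacek, Chaelin Kim, Denise Crochet
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: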
Abstract:Searching for unfamiliar American Sign Language (ASL) signs is challenging for learners because, unlike spoken languages, they cannot type a text-based query to look up an unfamiliar sign. Advances in isolated sign recognition have enabled the creation of video-based dictionaries, allowing users to submit a video and receive a list of the closest matching signs. Previous HCI research using Wizard-of-Oz prototypes has explored interface designs for ASL dictionaries. Building on these studies, we incorporate their design recommendations and leverage state-of-the-art sign-recognition technology to develop an automated video-based dictionary. We also present findings from an observational study with twelve novice ASL learners who used this dictionary during video-comprehension and question-answering tasks. Our results address human-AI interaction challenges not covered in previous WoZ research, including recording and resubmitting signs, unpredictable outputs, system latency, and privacy concerns. These insights offer guidance for designing and deploying video-based ASL dictionary systems.
[AI-21] Physics-aware generative models for turbulent fluid flows through energy-consistent stochastic interpolants
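Quick read: This paper addresses turbulence simulation in fluid dynamics, where classical numerical solvers are computationally expensive. The key is a new stochastic generative model based on stochastic interpolants that supports probabilistic forecasting while incorporating physical constraints such as energy stability and divergence-freeness. Unlike conventional stochastic generative models, which usually ignore the underlying physical laws, the method embeds energy consistency by making the parameters of the stochastic interpolant learnable coefficients. On Kolmogorov flow, a benchmark turbulence problem, it shows better accuracy and stability than state-of-the-art autoregressive conditional diffusion models (ACDMs) and PDE-Refiner, and it remains stable over significantly longer roll-outs than standard stochastic interpolants. This indicates the potential of physics-aware generative models to accelerate and improve turbulence simulation while preserving fundamental conservation properties.
Link: https://arxiv.org/abs/2504.05852
Authors: Nikolaj T. Mücke, Benjamin Sanderse
Affiliations: unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: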
Abstract:Generative models have demonstrated remarkable success in domains such as text, image, and video synthesis. In this work, we explore the application of generative models to fluid dynamics, specifically for turbulence simulation, where classical numerical solvers are computationally expensive. We propose a novel stochastic generative model based on stochastic interpolants, which enables probabilistic forecasting while incorporating physical constraints such as energy stability and divergence-freeness. Unlike conventional stochastic generative models, which are often agnostic to underlying physical laws, our approach embeds energy consistency by making the parameters of the stochastic interpolant learnable coefficients. We evaluate our method on a benchmark turbulence problem - Kolmogorov flow - demonstrating superior accuracy and stability over state-of-the-art alternatives such as autoregressive conditional diffusion models (ACDMs) and PDE-Refiner. Furthermore, we achieve stable results for significantly longer roll-outs than standard stochastic interpolants. Our results highlight the potential of physics-aware generative models in accelerating and enhancing turbulence simulations while preserving fundamental conservation properties.
[AI-22] PathGPT: Leveraging Large Language Models for Personalized Route Generation
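Quick read: This paper addresses Personalized Route Recommendation (PRR) while overcoming a limitation of existing machine-learning methods, which must be retrained, or deployed as multiple models, to handle new scenarios. The key innovation is using the natural language understanding of Large Language Models (LLMs) to build a single model that adapts to new scenarios seamlessly. By combining the broad knowledge LLMs acquire during training with external hand-crafted context information (similar to RAG systems), the model generates personalized paths from user requirements without additional training for new scenarios. Experiments show that the approach considerably improves LLM performance on the PRR task.
Link: https://arxiv.org/abs/2504.05846
Authors: Steeve Cuthbert Marcelyn, Yucen Gao, Yuzhe Zhang, Xiaofeng Gao, Guihai Chen
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: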
Abstract:The proliferation of GPS-enabled devices has led to the accumulation of a substantial corpus of historical trajectory data. By leveraging these data for training machine learning models, researchers have devised novel data-driven methodologies that address the personalized route recommendation (PRR) problem. In contrast to conventional algorithms such as Dijkstra's shortest path algorithm, these novel algorithms possess the capacity to discern and learn patterns within the data, thereby facilitating the generation of more personalized paths. However, once these models have been trained, their application is constrained to the generation of routes that align with their training patterns. This limitation renders them less adaptable to novel scenarios, and the deployment of multiple machine learning models might be necessary to address new possible scenarios, which can be costly as each model must be trained separately. Inspired by recent advances in the field of Large Language Models (LLMs), we leveraged their natural language understanding capabilities to develop a unified model to solve the PRR problem while being seamlessly adaptable to new scenarios without additional training. To accomplish this, we combined the extensive knowledge LLMs acquired during training with further access to external hand-crafted context information, similar to RAG (Retrieval-Augmented Generation) systems, to enhance their ability to generate paths according to user-defined requirements. Extensive experiments on different datasets show a considerable uplift in LLM performance on the PRR problem.
[AI-23] Momentum Boosted Episodic Memory for Improving Learning in Long-Tailed RL Environments
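Quick read: This paper addresses the performance limits of traditional reinforcement learning algorithms on non-uniformly distributed data, in particular Zipfian distributions. In real applications such as autonomous driving, or in nature, data are typically Zipfian: a few experiences occur frequently and most occur rarely. Traditional methods struggle to assign credit efficiently to these rare but important trajectories, which hurts learning. Drawing on complementary learning systems theory, the paper proposes an architecture whose key component is an episodic memory buffer with a prioritised memory module, which discovers important, rare long-tail trajectories in an unsupervised manner and keeps them longer. The architecture then reinstates experiences from memory with weighted importance to form the trajectory to execute, enabling sample-efficient credit assignment. Its modular design can be incorporated into any RL architecture; it outperforms traditional methods on multiple Zipfian tasks, beats IMPALA by a significant margin on all three evaluation metrics (Zipfian, Uniform, and Rare Accuracy), and improves performance on most Atari environments.
Link: https://arxiv.org/abs/2504.05840
Authors: Dolton Fernandes, Pramod Kaushik, Harsh Shukla, Bapi Raju Surampudi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: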
Abstract:Traditional Reinforcement Learning (RL) algorithms assume the distribution of the data to be uniform or mostly uniform. However, this is not the case with most real-world applications like autonomous driving or in nature where animals roam. Some experiences are encountered frequently, and most of the remaining experiences occur rarely; the resulting distribution is called Zipfian. Taking inspiration from the theory of complementary learning systems, an architecture for learning from Zipfian distributions is proposed where important long tail trajectories are discovered in an unsupervised manner. The proposal comprises an episodic memory buffer containing a prioritised memory module to ensure important rare trajectories are kept longer to address the Zipfian problem, which needs credit assignment to happen in a sample efficient manner. The experiences are then reinstated from episodic memory and given weighted importance forming the trajectory to be executed. Notably, the proposed architecture is modular, can be incorporated in any RL architecture and yields improved performance in multiple Zipfian tasks over traditional architectures. Our method outperforms IMPALA by a significant margin on all three tasks and all three evaluation metrics (Zipfian, Uniform, and Rare Accuracy) and also gives improvements on most Atari environments that are considered challenging.
[AI-24] Meta-Continual Learning of Neural Fields
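Quick read: This paper tackles two main challenges in Meta-Continual Learning (MCL) of Neural Fields (NF): catastrophic forgetting and slow convergence. It proposes a new method that combines a modular architecture with optimization-based meta-learning. The key is a Fisher Information Maximization (FIM) loss for neural radiance fields (NeRF), which increases information gain at the sample level to improve generalization and comes with a convergence guarantee and a generalization bound. Extensive evaluation on six diverse datasets demonstrates the method's advantage in image, audio, and video reconstruction and in view synthesis; notably, it achieves rapid adaptation for city-scale NeRF rendering with reduced parameter requirements.
Link: https://arxiv.org/abs/2504.05806
Authors: Seungyoon Woo, Junhyeog Yun, Gunhee Kim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: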
Abstract:Neural Fields (NF) have gained prominence as a versatile framework for complex data representation. This work unveils a new problem setting termed Meta-Continual Learning of Neural Fields (MCL-NF) and introduces a novel strategy that employs a modular architecture combined with optimization-based meta-learning. Focused on overcoming the limitations of existing methods for continual learning of neural fields, such as catastrophic forgetting and slow convergence, our strategy achieves high-quality reconstruction with significantly improved learning speed. We further introduce Fisher Information Maximization loss for neural radiance fields (FIM-NeRF), which maximizes information gains at the sample level to enhance learning generalization, with proved convergence guarantee and generalization bound. We perform extensive evaluations across image, audio, video reconstruction, and view synthesis tasks on six diverse datasets, demonstrating our method’s superiority in reconstruction quality and speed over existing MCL and CL-NF approaches. Notably, our approach attains rapid adaptation of neural fields for city-scale NeRF rendering with reduced parameter requirement.
[AI-25] From Superficial to Deep: Integrating External Knowledge for Follow-up Question Generation Using Knowledge Graph and LLM
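Quick read: This paper addresses the problem that follow-up questions generated by existing methods are limited to shallow context, are uninspiring, and fall well short of human questioning. It proposes a three-stage, external-knowledge-enhanced follow-up question generation method. The key is to identify contextual topics, construct a Knowledge Graph (KG) online, and combine these with a large language model to generate the final question, while introducing common-sense knowledge and a knowledge fusion operation to produce informative, exploratory follow-up questions. Experiments show that the generated questions are more informative than those of baseline models and closer to human questioning levels, while maintaining contextual relevance.
Link: https://arxiv.org/abs/2504.05801
Authors: Jianyu Liu, Yi Huang, Sheng Bi, Junlan Feng, Guilin Qi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Proceedings of the 31st International Conference on Computational Linguistics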
Abstract:In a conversational system, dynamically generating follow-up questions based on context can help users explore information and provide a better user experience. Humans are usually able to ask questions that involve some general life knowledge and demonstrate higher order cognitive skills. However, the questions generated by existing methods are often limited to shallow contextual questions that are uninspiring and have a large gap to the human level. In this paper, we propose a three-stage external knowledge-enhanced follow-up question generation method, which generates questions by identifying contextual topics, constructing a knowledge graph (KG) online, and finally combining these with a large language model to generate the final question. The model generates information-rich and exploratory follow-up questions by introducing external common sense knowledge and performing a knowledge fusion operation. Experiments show that compared to baseline models, our method generates questions that are more informative and closer to human questioning levels while maintaining contextual relevance.
[AI-26] Temporal Dynamic Embedding for Irregularly Sampled Time Series
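Quick read: This paper addresses the sparsity and irregularity of time series in clinical data caused by irregular sampling, which makes the data hard for conventional neural network models to handle. It proposes Temporal Dynamic Embedding (TDE). The key is treating each time series variable as an embedding vector that evolves over time rather than as a conventional fixed structured representation, which avoids the critical missing-data problem. At each time step TDE selects and aggregates only the subset of observed variables, so the patient's current status is represented from current observations. On three clinical datasets (PhysioNet 2012, MIMIC-III, and PhysioNet 2019), TDE performs competitively with or better than imputation-based baselines and several recent state-of-the-art methods, with shorter training time.
Link: https://arxiv.org/abs/2504.05768
Authors: Mincheol Kim, Soo-Yong Shin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: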
Abstract:In several practical applications, particularly healthcare, clinical data of each patient is individually recorded in a database at irregular intervals as required. This causes a sparse and irregularly sampled time series, which makes it difficult to handle as a structured representation of the prerequisites of neural network models. We therefore propose temporal dynamic embedding (TDE), which enables neural network models to receive data that change the number of variables over time. TDE regards each time series variable as an embedding vector evolving over time, instead of a conventional fixed structured representation, which causes a critical missing problem. For each time step, TDE allows for the selective adoption and aggregation of only observed variable subsets and represents the current status of patient based on current observations. The experiment was conducted on three clinical datasets: PhysioNet 2012, MIMIC-III, and PhysioNet 2019. The TDE model performed competitively or better than the imputation-based baseline and several recent state-of-the-art methods with reduced training runtime.
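The "aggregate only what was observed" step can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the paper's model: the embedding table is fixed and hand-written (in TDE the embeddings are learned and evolve over time), the variable names are hypothetical, and mean pooling of value-scaled embeddings stands in for whatever aggregation the authors use.

```python
from typing import Dict, List, Optional

# Hypothetical 3-dimensional embeddings for three clinical variables (assumption).
EMBEDDINGS: Dict[str, List[float]] = {
    "heart_rate": [1.0, 0.0, 0.0],
    "lactate": [0.0, 1.0, 0.0],
    "glucose": [0.0, 0.0, 1.0],
}


def aggregate_step(observations: Dict[str, Optional[float]]) -> List[float]:
    """Mean-pool value-scaled embeddings of the variables observed at one time step."""
    dim = len(next(iter(EMBEDDINGS.values())))
    observed = {k: v for k, v in observations.items() if v is not None}
    if not observed:
        return [0.0] * dim  # nothing was recorded at this step
    pooled = [0.0] * dim
    for name, value in observed.items():
        for i, weight in enumerate(EMBEDDINGS[name]):
            pooled[i] += value * weight
    return [x / len(observed) for x in pooled]


# Three time steps with different subsets of variables recorded (None = missing).
series = [
    {"heart_rate": 0.8, "lactate": None, "glucose": 0.2},
    {"heart_rate": None, "lactate": None, "glucose": None},
    {"heart_rate": 0.6, "lactate": 0.4, "glucose": None},
]
states = [aggregate_step(step) for step in series]
print(states)
```

Missing variables are skipped rather than imputed, so each step yields a fixed-size vector regardless of how many measurements were taken, which is the property the abstract highlights.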
[AI-27] Unraveling Human-AI Teaming: A Review and Outlook
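Quick read: This paper addresses two critical gaps in human-AI teaming: the difficulty of aligning AI agents with human values and objectives, and the underutilization of AI's capabilities as a genuine team member. It proposes a research outlook centered on four key aspects: formulation, coordination, maintenance, and training. The key is promoting effective teaming through shared mental models, trust-building, conflict resolution, and skill adaptation, and recognizing that approaches must be adjusted to different team compositions, goals, and complexities. The work provides a foundational agenda for future sustainable, high-performing human-AI teams.
Link: https://arxiv.org/abs/2504.05755
Authors: Bowen Lou, Tian Lu, Raghu Santanam, Yingjie Zhang
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: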
Abstract:Artificial Intelligence (AI) is advancing at an unprecedented pace, with clear potential to enhance decision-making and productivity. Yet, the collaborative decision-making process between humans and AI remains underdeveloped, often falling short of its transformative possibilities. This paper explores the evolution of AI agents from passive tools to active collaborators in human-AI teams, emphasizing their ability to learn, adapt, and operate autonomously in complex environments. This paradigm shift challenges traditional team dynamics, requiring new interaction protocols, delegation strategies, and responsibility distribution frameworks. Drawing on Team Situation Awareness (SA) theory, we identify two critical gaps in current human-AI teaming research: the difficulty of aligning AI agents with human values and objectives, and the underutilization of AI’s capabilities as genuine team members. Addressing these gaps, we propose a structured research outlook centered on four key aspects of human-AI teaming: formulation, coordination, maintenance, and training. Our framework highlights the importance of shared mental models, trust-building, conflict resolution, and skill adaptation for effective teaming. Furthermore, we discuss the unique challenges posed by varying team compositions, goals, and complexities. This paper provides a foundational agenda for future research and practical design of sustainable, high-performing human-AI teams.
[AI-28] AI-Driven Prognostics for State of Health Prediction in Li-ion Batteries: A Comprehensive Analysis with Validation
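Quick read: This paper addresses accuracy and robustness in State of Health (SoH) prediction for lithium-ion batteries. It evaluates several AI algorithms (FFNN, LSTM, and BiLSTM) across datasets (CALCE, NASA, UDDS) and scenarios (e.g., varying temperatures and driving conditions), analyzes the factors that drive SoH fluctuations, such as temperature and charge-discharge rates, and validates the findings through simulation. The key is the bidirectional long short-term memory network (BiLSTM), which reduces average root mean square error (RMSE) by 15% relative to LSTM, indicating stronger performance and robustness in real-world applications.
Link: https://arxiv.org/abs/2504.05728
Authors: Tianqi Ding, Dawei Xiang, Tianyao Sun, YiJiashum Qi, Zunduo Zhao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 12 figures, Accepted by 2025 6th International Conference on Electrical Technology and Automatic Control (ICETAC 2025)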
Abstract:This paper presents a comprehensive review of AI-driven prognostics for State of Health (SoH) prediction in lithium-ion batteries. We compare the effectiveness of various AI algorithms, including FFNN, LSTM, and BiLSTM, across multiple datasets (CALCE, NASA, UDDS) and scenarios (e.g., varying temperatures and driving conditions). Additionally, we analyze the factors influencing SoH fluctuations, such as temperature and charge-discharge rates, and validate our findings through simulations. The results demonstrate that BiLSTM achieves the highest accuracy, with an average RMSE reduction of 15% compared to LSTM, highlighting its robustness in real-world applications.
[AI-29] Automated Archival Descriptions with Federated Intelligence of LLMs
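Quick read: This paper addresses the tedious, error-prone work of manually creating metadata descriptions for archival materials, and explores the potential of agentic AI and large language models (LLMs) for a standardized archival description process. The key solution is an agentic AI-driven system that automatically generates high-quality metadata descriptions of archival materials, together with a federated optimization approach that unites the intelligence of multiple LLMs to construct optimal archival metadata. The paper also suggests methods to overcome the challenges of consistent metadata generation with LLMs. Experiments on a real-world archival dataset demonstrate the feasibility of the techniques and the advantage of federated optimization over single-model solutions in metadata quality and reliability.
Link: https://arxiv.org/abs/2504.05711
Authors: Jinghua Groppe, Andreas Marquet, Annabel Walz, Sven Groppe
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages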
Abstract:Enforcing archival standards requires specialized expertise, and manually creating metadata descriptions for archival materials is a tedious and error-prone task. This work aims at exploring the potential of agentic AI and large language models (LLMs) in addressing the challenges of implementing a standardized archival description process. To this end, we introduce an agentic AI-driven system for automated generation of high-quality metadata descriptions of archival materials. We develop a federated optimization approach that unites the intelligence of multiple LLMs to construct optimal archival metadata. We also suggest methods to overcome the challenges associated with using LLMs for consistent metadata generation. To evaluate the feasibility and effectiveness of our techniques, we conducted extensive experiments using a real-world dataset of archival materials, which covers a variety of document types and data formats. The evaluation results demonstrate the feasibility of our techniques and highlight the superior performance of the federated optimization approach compared to single-model solutions in metadata quality and reliability.
[AI-30] Architecture independent generalization bounds for overparametrized deep ReLU networks
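Quick read: This paper addresses generalization in overparametrized neural networks: it proves that, even when model capacity far exceeds the amount of data, the test error is independent of the level of overparametrization and of the Vapnik-Chervonenkis (VC) dimension. The key is a set of explicit bounds that depend only on the metric geometry of the test and training sets, the regularity properties of the activation function, and the operator norms of the weights and norms of the biases. For deep ReLU networks with a training sample size bounded by the input space dimension, the paper explicitly constructs zero loss minimizers without gradient descent and proves that the generalization error is independent of the network architecture.
Link: https://arxiv.org/abs/2504.05695
Authors: Thomas Chen, Chun-Kai Kevin Chien, Patricia Muñoz Ewald, Andrew G. Moore
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: AMS Latex, 12 pages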
Abstract:We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove that the generalization error is independent of the network architecture.
[AI-31] Large Language Models Enhanced Hyperbolic Space Recommender Systems
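Quick read: This paper addresses the difficulty existing recommender methods have, when working in Euclidean space, in capturing the rich hierarchical information in textual and semantic data, which is essential for capturing user preferences accurately. It proposes HyperLLM, a model-agnostic framework that extracts and integrates hierarchical information from both structural and semantic perspectives. Structurally, HyperLLM uses large language models to generate multi-level classification tags with hierarchical parent-child relationships, then jointly learns and aligns tag-item and user-item interactions through contrastive learning, giving the model clear hierarchical information. Semantically, it introduces a meta-optimized strategy that extracts hierarchical information from semantic embeddings and bridges the gap between the semantic and collaborative spaces for seamless integration. The key is combining the strengths of large language models and hyperbolic space, which significantly improves recommendation performance and training stability.
Link: https://arxiv.org/abs/2504.05694
Authors: Wentao Cheng, Zhida Qin, Zexue Wu, Pengzhan Zhou, Tianyu Huang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: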
Abstract:Large Language Models (LLMs) have attracted significant attention in recommender systems for their excellent world knowledge capabilities. However, existing methods that rely on Euclidean space struggle to capture the rich hierarchical information inherent in textual and semantic data, which is essential for capturing user preferences. The geometric properties of hyperbolic space offer a promising solution to address this issue. Nevertheless, integrating LLMs-based methods with hyperbolic space to effectively extract and incorporate diverse hierarchical information is non-trivial. To this end, we propose a model-agnostic framework, named HyperLLM, which extracts and integrates hierarchical information from both structural and semantic perspectives. Structurally, HyperLLM uses LLMs to generate multi-level classification tags with hierarchical parent-child relationships for each item. Then, tag-item and user-item interactions are jointly learned and aligned through contrastive learning, thereby providing the model with clear hierarchical information. Semantically, HyperLLM introduces a novel meta-optimized strategy to extract hierarchical information from semantic embeddings and bridge the gap between the semantic and collaborative spaces for seamless integration. Extensive experiments show that HyperLLM significantly outperforms recommender systems based on hyperbolic space and LLMs, achieving performance improvements of over 40%. Furthermore, HyperLLM not only improves recommender performance but also enhances training stability, highlighting the critical role of hierarchical information in recommender systems.
[AI-32] StayLTC: A Cost-Effective Multimodal Framework for Hospital Length of Stay Forecasting
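Quick read: This paper addresses inaccurate prediction of hospital Length of Stay (LOS), with the aim of improving healthcare services, resource management, and operating costs. It proposes the StayLTC framework, whose core is using the continuous-time recurrent dynamics of Liquid Time-Constant Networks (LTCs) together with multimodal inputs, namely structured Electronic Health Records (EHRs) data and clinical notes, for real-time LOS prediction. On the MIMIC-III dataset, LTCs significantly outperform most other time series models in accuracy, robustness, and resource efficiency, and perform comparably to time series large language models while requiring far less computation and memory, showing their potential for Natural Language Processing (NLP) tasks in healthcare.
Link: https://arxiv.org/abs/2504.05691
Authors: Sudeshna Jana, Manjira Sinha, Tirthankar Dasgupta
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures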
Abstract:Accurate prediction of Length of Stay (LOS) in hospitals is crucial for improving healthcare services, resource management, and cost efficiency. This paper presents StayLTC, a multimodal deep learning framework developed to forecast real-time hospital LOS using Liquid Time-Constant Networks (LTCs). LTCs, with their continuous-time recurrent dynamics, are evaluated against traditional models using structured data from Electronic Health Records (EHRs) and clinical notes. Our evaluation, conducted on the MIMIC-III dataset, demonstrated that LTCs significantly outperform most of the other time series models, offering enhanced accuracy, robustness, and efficiency in resource utilization. Additionally, LTCs demonstrate a comparable performance in LOS prediction compared to time series large language models, while requiring significantly less computational power and memory, underscoring their potential to advance Natural Language Processing (NLP) tasks in healthcare.
[AI-33] kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization ICASSP2025
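Quick read: This paper addresses the robustness of the kNN-VC framework in zero-shot Singing Voice Conversion (SVC). It proposes two methods. First, because WavLM, the core representation, lacks harmonic emphasis, the paper uses the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis and integrates the resulting waveform into the model, mitigating dull timbre and ringing artifacts. Second, it introduces a new distance metric that filters out unsuitable kNN candidates during inference and optimizes the summing weights of the candidates, improving concatenative smoothness, a key perceptual factor. The key is improving the core representation and enhancing smoothness to make SVC more robust. Experimental results validate the effectiveness of the modifications.
Link: https://arxiv.org/abs/2504.05686
Authors: Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 6 figures, 1 table, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025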
Abstract:Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC’s core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo and code links are given in the abstract on the arXiv page.
[AI-34] Lattice: Learning to Efficiently Compress the Memory
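[Quick read]: This paper addresses the quadratic computational complexity (O(n²)) of attention mechanisms in sequence learning. It introduces Lattice, a novel recurrent neural network (RNN) mechanism that exploits the inherent low-rank structure of K-V matrices to compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. The key to the solution is the orthogonal update: each memory slot is updated only with information orthogonal to its current state, so that only novel, non-redundant data is incorporated and interference with previously stored information is minimized. The paper also formulates the compression as an online optimization problem and derives a dynamic memory update rule from a single gradient descent step, which makes the memory update process interpretable. Experiments show that Lattice achieves the best perplexity among all baselines across diverse context lengths, with the improvement becoming more pronounced as the context length increases.
Link: https://arxiv.org/abs/2504.05646
Authors: Mahdi Karami,Vahab Mirrokni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: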
Abstract:Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state, so that only novel, non-redundant data is incorporated, which minimizes interference with previously stored information. The experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths, with performance improvement becoming more pronounced as the context length increases.
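The orthogonal update described in the abstract can be illustrated with a small projection example. This is only a sketch of the general idea (keep the part of an incoming vector that is orthogonal to a slot's current state), not the update rule derived in the paper; the function names and the fixed `step_size` are assumptions made for illustration, and the paper's state- and input-dependent gating is not modeled.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(slot, incoming):
    """Return the part of `incoming` that is orthogonal to `slot`."""
    norm_sq = dot(slot, slot)
    if norm_sq == 0.0:
        # An empty slot has no direction, so everything counts as novel.
        return list(incoming)
    scale = dot(incoming, slot) / norm_sq
    return [v - scale * s for v, s in zip(incoming, slot)]


def update_slot(slot, incoming, step_size=0.5):
    """Hypothetical update: add only the novel (orthogonal) part to the slot."""
    novel = orthogonal_component(slot, incoming)
    return [s + step_size * n for s, n in zip(slot, novel)]


slot = [1.0, 0.0]
incoming = [3.0, 4.0]
novel = orthogonal_component(slot, incoming)  # component along `slot` removed
new_slot = update_slot(slot, incoming)
```

Here the part of `incoming` that points along the slot is discarded, so the slot's existing content is left undisturbed and only the new direction is written.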
[AI-35] Continual Learning of Multiple Cognitive Functions with Brain-inspired Temporal Development Mechanism
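[Quick read]: This paper addresses the growth in network scale and the high energy consumption that artificial neural networks incur when continually learning multiple cognitive functions. Conventional methods typically rely on regularization, replay, or freezing strategies to mitigate catastrophic forgetting, but these often increase network complexity or degrade performance. The paper proposes a brain-inspired temporal development mechanism (TD-MCL) that mimics the progressive formation, reorganization, and pruning of connections from basic to advanced regions of the human brain, promoting knowledge transfer and suppressing redundancy across tasks. The key to the solution is the sequential evolution of long-range connections, which enables positive knowledge transfer, combined with feedback-guided local connection inhibition and pruning, which eliminates redundancy from previous tasks. This reduces network scale and energy consumption while preserving acquired knowledge and improving accuracy on new tasks.
Link: https://arxiv.org/abs/2504.05621
Authors: Bing Han,Feifei Zhao,Yinqian Sun,Wenxuan Pan,Yi Zeng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: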
Abstract:Cognitive functions in current artificial intelligence networks are tied to the exponential increase in network scale, whereas the human brain can continuously learn hundreds of cognitive functions with remarkably low energy consumption. This advantage is in part due to the brain's cross-regional temporal development mechanisms, where the progressive formation, reorganization, and pruning of connections from basic to advanced regions facilitate knowledge transfer and prevent network redundancy. Inspired by this, we propose the Continual Learning of Multiple Cognitive Functions with Brain-inspired Temporal Development Mechanism (TD-MCL), enabling cognitive enhancement from simple to complex in Perception-Motor-Interaction (PMI) multiple cognitive task scenarios. The TD-MCL model proposes the sequential evolution of long-range connections between different cognitive modules to promote positive knowledge transfer, while using feedback-guided local connection inhibition and pruning to effectively eliminate redundancies in previous tasks, reducing energy consumption while preserving acquired knowledge. Experiments show that the proposed method can achieve continual learning capabilities while reducing network scale, without introducing regularization, replay, or freezing strategies, and achieve superior accuracy on new tasks compared to direct learning. The proposed method shows that the brain’s developmental mechanisms offer a valuable reference for exploring biologically plausible, low-energy enhancements of general cognitive abilities.
[AI-36] FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
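[Quick read]: This paper addresses noisy labels in federated learning (FL), a major challenge caused by heterogeneous data distributions and communication constraints that can severely degrade model performance. It proposes a new method, FedEFC, built on two techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point; and (2) loss correction, which adjusts model updates to account for label noise. The authors develop a loss correction tailored to the particular challenges of FL, such as data heterogeneity and decentralized training, and provide a theoretical analysis showing that the FL objective under a noisy label distribution can be aligned with the clean label distribution. Extensive experiments validate the method, which outperforms existing FL techniques under heterogeneous data settings (for example, up to 41.64% relative performance improvement over the existing loss correction method).
Link: https://arxiv.org/abs/2504.05615
Authors: Seunghun Yu,Jin-Hyun Ahn,Joonhyuk Kang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures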
Abstract:Federated Learning (FL) is a powerful framework for privacy-preserving distributed learning. It enables multiple clients to collaboratively train a global model without sharing raw data. However, handling noisy labels in FL remains a major challenge due to heterogeneous data distributions and communication constraints, which can severely degrade model performance. To address this issue, we propose FedEFC, a novel method designed to tackle the impact of noisy labels in FL. FedEFC mitigates this issue through two key techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point, and (2) loss correction, which adjusts model updates to account for label noise. In particular, we develop an effective loss correction tailored to the unique challenges of FL, including data heterogeneity and decentralized training. Furthermore, we provide a theoretical analysis, leveraging the composite proper loss property, to demonstrate that the FL objective function under noisy label distributions can be aligned with the clean label distribution. Extensive experimental results validate the effectiveness of our approach, showing that it consistently outperforms existing FL techniques in mitigating the impact of noisy labels, particularly under heterogeneous data settings (e.g., achieving up to 41.64% relative performance improvement over the existing loss correction method).
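The loss correction idea can be sketched with the generic forward correction technique, in which the model's clean-class probabilities are pushed through a label-noise transition matrix before the cross-entropy is computed against the observed (possibly noisy) label. This is a minimal sketch under stated assumptions: the abstract does not describe FedEFC's enhanced variant or how the transition matrix is estimated, so the matrix values and function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_corrected_loss(logits, noisy_label, transition):
    """Cross-entropy against a noisy label after forward correction.

    transition[i][j] is the assumed probability that true class i
    is observed as label j (each row sums to 1).
    """
    clean_probs = softmax(logits)
    num_classes = len(clean_probs)
    # Predicted distribution over *observed* labels: q_j = sum_i p_i * T[i][j]
    noisy_probs = [
        sum(clean_probs[i] * transition[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]
    return -math.log(noisy_probs[noisy_label])


identity = [[1.0, 0.0], [0.0, 1.0]]   # no label noise assumed
flip_20 = [[0.8, 0.2], [0.2, 0.8]]    # hypothetical 20% symmetric label flips
logits = [2.0, 0.0]                   # model is fairly confident in class 0

plain = forward_corrected_loss(logits, 1, identity)
corrected = forward_corrected_loss(logits, 1, flip_20)
```

With the identity matrix the function reduces to ordinary cross-entropy. With the noisy matrix, a sample labeled 1 is penalized less when the model prefers class 0, because the label may simply have been flipped.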
[AI-37] Multi-fidelity Reinforcement Learning Control for Complex Dynamical Systems
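[Quick read]: This paper addresses the control of instabilities in complex dynamical systems in scientific and engineering applications. Control is a many-query task, and it becomes computationally costly when experimental data are sparse or high-fidelity simulation is expensive. Surrogate-model-based control can reduce the computational cost, but a fast learned model trained offline struggles to capture accurate pointwise dynamics when the dynamics are chaotic. To bridge this gap, the paper proposes a multi-fidelity reinforcement learning (MFRL) framework whose key idea is to use differentiable hybrid models, correcting a physics-based hybrid model with limited high-fidelity data. It also designs a spectrum-based reward function for reinforcement learning. On two complex dynamical systems from physics, the statistics of the MFRL control results match those computed from many-query evaluations of the high-fidelity environments, and the framework outperforms state-of-the-art baselines.
Link: https://arxiv.org/abs/2504.05588
Authors: Luning Sun,Xin-Yang Liu,Siyan Zhao,Aditya Grover,Jian-Xun Wang,Jayaraman J. Thiagarajan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: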
Abstract:Controlling instabilities in complex dynamical systems is challenging in scientific and engineering applications. Deep reinforcement learning (DRL) has shown promising results in a range of scientific applications. The many-query nature of control tasks requires multiple interactions with real environments of the underlying physics. However, experimental data are usually sparse, and complex dynamics are expensive to simulate. Alternatively, control based on surrogate models could mitigate the computational cost. However, it is very hard for a fast learning-based model trained offline to capture accurate pointwise dynamics when the dynamics are chaotic. To bridge this gap, the current work proposes a multi-fidelity reinforcement learning (MFRL) framework that leverages differentiable hybrid models for control tasks, where a physics-based hybrid model is corrected by limited high-fidelity data. We also propose a spectrum-based reward function for RL learning. The effectiveness of the proposed framework is demonstrated on two complex dynamical systems in physics. The statistics of the MFRL control results match those computed from many-query evaluations of the high-fidelity environments, and the framework outperforms other SOTA baselines.
[AI-38] Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
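[Quick read]: This paper addresses three questions about sparsely activated Mixture-of-Experts (SMoE) models in practice: (1) how to estimate expert importance from diverse perspectives and identify the least important subset of experts that can be dropped with minimal performance loss; (2) whether expert dropping should be one-shot or iterative, and how task-agnostic fine-tuning (which the authors term MoE Lottery Subnetworks) can serve as a correction mechanism that softens the impact of dropping on subnetwork capabilities; (3) which capabilities of the full SMoE are severely hurt by dropping, and how they can be recovered through external augmentation of instruction-following capabilities. To answer them, the paper proposes the MoE Experts Compression Suite (MC-Suite) for comprehensively estimating expert importance, introduces iterative pruning with re-estimation of the MC-Suite criterion, and explores task-agnostic fine-tuning as a correction mechanism. The key is to combine MC-Suite importance estimation, iterative pruning, and task-agnostic fine-tuning to minimize the performance impact of dropping, and to restore the main capability loss by augmenting instruction following.
Link: https://arxiv.org/abs/2504.05586
Authors: Ajay Jaiswal,Jianyu Wang,Yixiao Li,Pingzhi Li,Tianlong Chen,Zhangyang Wang,Chong Wang,Ruoming Pang,Xianzhi Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: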
Abstract:Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, vanilla SMoEs have issues such as expert redundancy and heavy memory requirements, making them inefficient and non-scalable, especially for resource-constrained scenarios. Expert-level sparsification of SMoEs involves pruning the least important experts to address these limitations. In this work, we aim to address three questions: (1) What is the best recipe to identify the least knowledgeable subset of experts that can be dropped with minimal impact on performance? (2) How should we perform expert dropping (one-shot or iterative), and what correction measures can we undertake to minimize its drastic impact on SMoE subnetwork capabilities? (3) What capabilities of full-SMoEs are severely impacted by the removal of the least dominant experts, and how can we recover them? Firstly, we propose MoE Experts Compression Suite (MC-Suite), which is a collection of some previously explored and multiple novel recipes to provide a comprehensive benchmark for estimating expert importance from diverse perspectives, as well as unveil numerous valuable insights for SMoE experts. Secondly, unlike prior works with a one-shot expert pruning approach, we explore the benefits of iterative pruning with the re-estimation of the MC-Suite criterion. Moreover, we introduce the benefits of task-agnostic fine-tuning as a correction mechanism during iterative expert dropping, which we term MoE Lottery Subnetworks. Lastly, we present an experimentally validated conjecture that, during expert dropping, SMoEs’ instruction-following capabilities are predominantly hurt, which can be restored to a robust level subject to external augmentation of instruction-following capabilities using k-shot examples and supervised fine-tuning.
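[AI-39] TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning
[Quick read]: This paper addresses inefficient learning in episodic reinforcement learning tasks caused by sparse reward signals and high-dimensional state spaces, as well as the difficulty of avoiding hidden “trap states”. The key is Time-Weighted Contrastive Reward Learning (TW-CRL), an Inverse Reinforcement Learning (IRL) framework that uses both successful and failed demonstrations and incorporates temporal information to learn a dense reward function. This reward function identifies critical states associated with success or failure, which helps the agent avoid trap states and encourages meaningful exploration beyond simple imitation of expert trajectories. Empirical evaluations show that TW-CRL surpasses state-of-the-art methods on navigation tasks and robotic manipulation benchmarks, with improved efficiency and robustness.
Link: https://arxiv.org/abs/2504.05585
Authors: Yuxuan Li,Ning Yang,Stephen Xia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: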
Abstract:Episodic tasks in Reinforcement Learning (RL) often pose challenges due to sparse reward signals and high-dimensional state spaces, which hinder efficient learning. Additionally, these tasks often feature hidden “trap states” – irreversible failures that prevent task completion but do not provide explicit negative rewards to guide agents away from repeated errors. To address these issues, we propose Time-Weighted Contrastive Reward Learning (TW-CRL), an Inverse Reinforcement Learning (IRL) framework that leverages both successful and failed demonstrations. By incorporating temporal information, TW-CRL learns a dense reward function that identifies critical states associated with success or failure. This approach not only enables agents to avoid trap states but also encourages meaningful exploration beyond simple imitation of expert trajectories. Empirical evaluations on navigation tasks and robotic manipulation benchmarks demonstrate that TW-CRL surpasses state-of-the-art methods, achieving improved efficiency and robustness.
[AI-40] MicroNN: An On-device Disk-resident Updatable Vector Database
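[Quick read]: This paper addresses scalable similarity search in low-resource, on-device environments, particularly for real-world workloads that contain updates and hybrid queries combining nearest neighbour search with structured attribute filters. State-of-the-art systems generally target large servers with abundant memory, static vector collections that cannot be updated, and nearest neighbour search in isolation, and so do not fit these requirements. The proposed solution is Micro Nearest Neighbour (MicroNN), an embedded nearest-neighbour vector search engine designed for low-resource environments. Its key is disk-efficient index structures and algorithms that support continuous inserts and deletes while using very little memory (~10 MB): on a million-scale vector benchmark it retrieves the top-100 nearest neighbours with 90% recall in under 7 ms.
Link: https://arxiv.org/abs/2504.05573
Authors: Jeffrey Pound,Floris Chabert,Arjun Bhushan,Ankur Goswami,Anil Pacaci,Shihabur Rahman Chowdhury
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: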
Abstract:Nearest neighbour search over dense vector collections has important applications in information retrieval, retrieval augmented generation (RAG), and content ranking. Performing efficient search over large vector collections is a well studied problem with many existing approaches and open source implementations. However, most state-of-the-art systems are generally targeted towards scenarios using large servers with an abundance of memory, static vector collections that are not updatable, and nearest neighbour search in isolation of other search criteria. We present Micro Nearest Neighbour (MicroNN), an embedded nearest-neighbour vector search engine designed for scalable similarity search in low-resource environments. MicroNN addresses the problem of on-device vector search for real-world workloads containing updates and hybrid search queries that combine nearest neighbour search with structured attribute filters. In this scenario, memory is highly constrained and disk-efficient index structures and algorithms are required, as well as support for continuous inserts and deletes. MicroNN is an embeddable library that can scale to large vector collections with minimal resources. MicroNN is used in production and powers a wide range of vector search use-cases on-device. MicroNN takes less than 7 ms to retrieve the top-100 nearest neighbours with 90% recall on publicly available million-scale vector benchmark while using ~10 MB of memory.
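To make the notion of a hybrid query concrete, the sketch below combines a structured attribute filter with a top-k nearest neighbour search. It is a brute-force, in-memory illustration only: it does not use MicroNN's API or its disk-resident index, and the record layout, the `predicate` parameter, and the function names are assumptions introduced here.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search(records, query, k, predicate):
    """Return ids of the k nearest records whose attributes pass `predicate`.

    records: iterable of (record_id, vector, attributes_dict) tuples.
    """
    candidates = (
        (euclidean(vec, query), rec_id)
        for rec_id, vec, attrs in records
        if predicate(attrs)  # structured filter applied before ranking
    )
    return [rec_id for _, rec_id in heapq.nsmallest(k, candidates)]


records = [
    ("a", [0.0, 0.0], {"lang": "en"}),
    ("b", [0.1, 0.0], {"lang": "de"}),
    ("c", [1.0, 1.0], {"lang": "en"}),
    ("d", [0.2, 0.1], {"lang": "en"}),
]
hits = hybrid_search(records, [0.0, 0.0], k=2,
                     predicate=lambda attrs: attrs["lang"] == "en")
```

Record `b` is the second-closest vector overall but is excluded by the filter, so the result contains `a` and `d`. A real engine would avoid scanning every record; the scan here only shows the query semantics.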
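[AI-41] SciSciGPT: Advancing Human-AI Collaboration in the Science of Science
[Quick read]: This paper addresses the tension between the research opportunities created by the growing availability of large-scale datasets and the analytical challenges they pose, and explores how research tools powered by large language models (LLMs) can make human-AI collaboration more efficient. The key contribution is SciSciGPT, an open-source prototype AI collaborator that uses the science of science as a testbed. SciSciGPT automates complex workflows, supports diverse analytical approaches, accelerates research prototyping and iteration, and facilitates reproducibility, and case studies show its potential to streamline a wide range of empirical and analytical tasks. The paper further proposes an LLM Agent capability maturity model to guide the improvement and extension of such frameworks. The core idea is to use LLMs to raise research productivity while balancing human and AI contributions and addressing the resulting challenges of transparency and ethical use.
Link: https://arxiv.org/abs/2504.05559
Authors: Erzhuo Shao,Yifang Wang,Yifan Qian,Zhenyu Pan,Han Liu,Dashun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: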
Abstract:The increasing availability of large-scale datasets has fueled rapid progress across many scientific fields, creating unprecedented opportunities for research and discovery while posing significant analytical challenges. Recent advances in large language models (LLMs) and AI agents have opened new possibilities for human-AI collaboration, offering powerful tools to navigate this complex research landscape. In this paper, we introduce SciSciGPT, an open-source, prototype AI collaborator that uses the science of science as a testbed to explore the potential of LLM-powered research tools. SciSciGPT automates complex workflows, supports diverse analytical approaches, accelerates research prototyping and iteration, and facilitates reproducibility. Through case studies, we demonstrate its ability to streamline a wide range of empirical and analytical research tasks while highlighting its broader potential to advance research. We further propose an LLM Agent capability maturity model for human-AI collaboration, envisioning a roadmap to further improve and expand upon frameworks like SciSciGPT. As AI capabilities continue to evolve, frameworks like SciSciGPT may play increasingly pivotal roles in scientific research and discovery, unlocking further opportunities. At the same time, these new advances also raise critical challenges, from ensuring transparency and ethical use to balancing human and AI contributions. Addressing these issues may shape the future of scientific inquiry and inform how we train the next generation of scientists to thrive in an increasingly AI-integrated research ecosystem.
[AI-42] Path Database Guidance for Motion Planning
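[Quick read]: This paper addresses the use of prior experience in robot motion planning and proposes a new method, Path Database Guidance (PDG). Prior methods generally exploit a path database by directly pasting the queried path or by using it to bias a sampling distribution. PDG innovates in two ways. First, it uses the database to compute a heuristic that determines which nodes of the search tree to expand, rather than using the queried path directly; this makes PDG easier to compose with other search algorithms and lets it dynamically interleave exploration with exploitation. Second, unlike methods that treat the database as a fixed prior, PDG's database (and therefore its heuristic) is updated as the search proceeds through the implicitly defined robot configuration space. Experiments in a variety of simulated environments demonstrate the effectiveness of PDG.
Link: https://arxiv.org/abs/2504.05550
Authors: Amnon Attali,Praval Telagi,Marco Morales,Nancy M. Amato
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: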
Abstract:One approach to using prior experience in robot motion planning is to store solutions to previously seen problems in a database of paths. Methods that use such databases are characterized by how they query for a path and how they use queries given a new problem. In this work we present a new method, Path Database Guidance (PDG), which innovates on existing work in two ways. First, we use the database to compute a heuristic for determining which nodes of a search tree to expand, in contrast to prior work which generally pastes the (possibly transformed) queried path or uses it to bias a sampling distribution. We demonstrate that this makes our method more easily composable with other search methods by dynamically interleaving exploration according to a baseline algorithm with exploitation of the database guidance. Second, in contrast to other methods that treat the database as a single fixed prior, our database (and thus our queried heuristic) updates as we search the implicitly defined robot configuration space. We experimentally demonstrate the effectiveness of PDG in a variety of explicitly defined environment distributions in simulation.
[AI-43] FORCE: Feature-Oriented Representation with Clustering and Explanation
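[Quick read]: This paper addresses how deep learning models can use latent, unobserved structure to uncover hidden patterns in data and improve predictive accuracy. Most existing approaches capture latent structure by clustering the original features, but a sufficiently complex model can often derive the same information implicitly, so the practical benefit is limited. The key to the solution is FORCE, a supervised deep learning framework based on SHAP (Shapley Additive exPlanations) that uses SHAP values in two stages: first, clustering SHAP values yields an additional latent feature that guides model training; second, the latent information is used to initialize an attention mechanism within the network architecture. This gives the neural network an indication of how unobserved values modify feature importance, strengthening latent pattern learning and overall discriminative capability. On three real-world datasets, FORCE substantially improves overall performance over networks without the latent feature and attention mechanism (for example, the F1 score for the presence of heart disease rises from 0.72 to 0.80).
Link: https://arxiv.org/abs/2504.05530
Authors: Rishav Mukherjee,Jeffrey Ahearn Thompson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 12 pages, 3 figures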
Abstract:Learning about underlying patterns in data using latent unobserved structures to improve the accuracy of predictive models has become an active avenue of deep learning research. Most approaches cluster the original features to capture certain latent structures. However, the information gained in the process can often be implicitly derived by sufficiently complex models. Thus, such approaches often provide minimal benefits. We propose a SHAP (Shapley Additive exPlanations) based supervised deep learning framework FORCE which relies on two-stage usage of SHAP values in the neural network architecture, (i) an additional latent feature to guide model training, based on clustering SHAP values, and (ii) initiating an attention mechanism within the architecture using latent information. This approach gives a neural network an indication about the effect of unobserved values that modify feature importance for an observation. The proposed framework is evaluated on three real life datasets. Our results demonstrate that FORCE led to dramatic improvements in overall performance as compared to networks that did not incorporate the latent feature and attention framework (e.g., F1 score for presence of heart disease 0.80 vs 0.72). Using cluster assignments and attention based on SHAP values guides deep learning, enhancing latent pattern learning and overall discriminative capability.
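[AI-44] Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search
[Quick read]: This paper addresses the inability of traditional static benchmarks to capture the depth and breadth of large language model (LLM) capabilities, and the limitations of existing dynamic approaches, which either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. It proposes Prism, a flexible, dynamic benchmarking framework for comprehensive LLM assessment. Prism has three core components: (1) a tree-based state representation that models evaluation as a Markov Decision Process; (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios; and (3) a multi-agent evaluation pipeline that assesses diverse capabilities simultaneously. For robustness, Prism also combines structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Experiments show that Prism evolves with model advancements and offers deeper insight into model limitations.
Link: https://arxiv.org/abs/2504.05500
Authors: Vahid Majdinasab,Amin Nikanjam,Foutse Khomh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: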
Abstract:The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state-of-the-art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism’s effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.
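For readers unfamiliar with Monte Carlo Tree Search, the selection step is commonly driven by the UCT rule, which trades off a child's average value against how rarely it has been visited. The sketch below shows standard UCT only; the abstract does not specify how Prism adapts the algorithm, so the node names, the meaning of "value" (assumed here to score how informative a scenario is), and the exploration constant are all hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    mean_value = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus


def select_child(children, parent_visits, c=1.4):
    """Pick the child name with the highest UCT score.

    children: dict mapping name -> (total_value, visits).
    """
    return max(
        children,
        key=lambda name: uct_score(*children[name], parent_visits, c),
    )


# Hypothetical difficulty branches with accumulated (value, visit) statistics.
children = {"easy": (9.0, 10), "medium": (3.0, 4)}
explore_choice = select_child(children, parent_visits=14)         # bonus favors "medium"
exploit_choice = select_child(children, parent_visits=14, c=0.0)  # pure mean favors "easy"
```

With the exploration bonus enabled, the less-visited `medium` branch wins even though `easy` has the higher mean; setting `c=0.0` removes the bonus and selection falls back to the best average.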
[AI-45] GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction
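[Quick read]: This paper addresses the fact that explainability methods in biomedical research do not exploit strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. Existing methods such as attention, gradient analysis, and Shapley values provide some explanation but cannot incorporate domain-specific prior knowledge. To overcome this limitation, the paper proposes GraphPINE, a graph neural network (GNN) architecture that initializes node importance from domain knowledge and optimizes it during training, improving the interpretability of drug response prediction. Its key innovation is an importance propagation layer that unifies updates of the feature matrix and node importance with GNN-based graph propagation of feature values, in an LSTM-like sequential format. This initialization and updating mechanism enables informed feature learning and improved graph representation. Built on a gene-gene graph, with initial importance derived from a drug-target interaction graph weighted by literature counts, GraphPINE achieves a PR-AUC of 0.894 and a ROC-AUC of 0.796 for drug response prediction.
Link: https://arxiv.org/abs/2504.05454
Authors: Yoshitaka Inoue,Tianfan Fu,Augustin Luna
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Comments: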
Abstract:Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradient, and Shapley value. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that unifies 1) updates for feature matrix and node importance and 2) uses GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at this https URL. 
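[AI-46] A Behavior-Based Knowledge Representation Improves Prediction of Players' Moves in Chess by 25%
[Quick read]: This paper addresses the challenge of predicting the next move of non-elite chess players (intermediate level and below) during the opening phase. Although powerful engines such as Deep Blue, AlphaZero, and Stockfish can outplay top grandmasters, they still struggle to predict the moves of ordinary players. The difficulty stems mainly from the unpredictability of human behavior and the sheer number of possible outcomes from a position. The key to the solution is combining domain expert knowledge with machine learning: feature engineering grounded in domain expertise is used to uncover patterns in the opening moves of intermediate-level players. The approach offers a promising framework for anticipating human behavior and advances both AI and human-computer interaction.
Link: https://arxiv.org/abs/2504.05425
Authors: Benny Skidanov,Daniel Erbesfeld,Gera Weiss,Achiya Elyasaf
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 tables, 2 figures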
Abstract:Predicting player behavior in strategic games, especially complex ones like chess, presents a significant challenge. The difficulty arises from several factors. First, the sheer number of potential outcomes stemming from even a single position, starting from the initial setup, makes forecasting a player’s next move incredibly complex. Second, and perhaps even more challenging, is the inherent unpredictability of human behavior. Unlike the optimized play of engines, humans introduce a layer of variability due to differing playing styles and decision-making processes. Each player approaches the game with a unique blend of strategic thinking, tactical awareness, and psychological tendencies, leading to diverse and often unexpected actions. This stylistic variation, combined with the capacity for creativity and even irrational moves, makes predicting human play difficult. Chess, a longstanding benchmark of artificial intelligence research, has seen significant advancements in tools and automation. Engines like Deep Blue, AlphaZero, and Stockfish can defeat even the most skilled human players. However, despite their exceptional ability to outplay top-level grandmasters, predicting the moves of non-grandmaster players, who comprise most of the global chess community, remains complicated for these engines. This paper proposes a novel approach combining expert knowledge with machine learning techniques to predict human players’ next moves. By applying feature engineering grounded in domain expertise, we seek to uncover the patterns in the moves of intermediate-level chess players, particularly during the opening phase of the game. Our methodology offers a promising framework for anticipating human behavior, advancing both the fields of AI and human-computer interaction.
[AI-47] Safe Automated Refactoring for Efficient Migration of Imperative Deep Learning Programs to Graph Execution
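[Quick read]: This paper addresses the efficiency of Deep Learning (DL) systems on ever-growing datasets. Traditional symbolic, graph-based DL code is error-prone, non-intuitive, and difficult to debug, while the more natural imperative DL frameworks improve the development experience through eager execution at the expense of run-time performance. Hybrid execution tries to combine the strengths of both, but using it effectively requires subtle care to ensure that code can be executed as graphs safely, accurately, and efficiently. The key is an automated refactoring approach, based on a novel imperative tensor analysis, that automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution while preserving semantics. The approach is implemented as a PyDev Eclipse plug-in that integrates the WALA Ariadne analysis framework and is evaluated on 19 Python projects (132.05 KLOC in total): 326 of 766 candidate functions (42.56%) were refactorable, and performance tests showed an average speedup of 2.16, indicating that the approach helps optimize imperative DL code to its full potential.
Link: https://arxiv.org/abs/2504.05424
Authors: Raffi Khatchadourian,Tatiana Castro Vélez,Mehdi Bagherzadeh,Nan Jia,Anita Raja
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: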
Abstract:Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code – supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution. We present an automated refactoring approach that assists developers in specifying whether their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs while preserving semantics. The approach, based on a novel imperative tensor analysis, automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution. The approach is implemented as a PyDev Eclipse IDE plug-in that integrates the WALA Ariadne analysis framework and evaluated on 19 Python projects consisting of 132.05 KLOC. We found that 326 of 766 candidate functions (42.56%) were refactorable, and an average speedup of 2.16 on performance tests was observed. The results indicate that the approach is useful in optimizing imperative DL code to its full potential.
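[AI-48] SoK: Frontier AI's Impact on the Cybersecurity Landscape
[Quick read]: This paper addresses how to comprehensively understand the impact of frontier AI on cybersecurity and its inherent risks, and how to provide concrete recommendations for its safe and secure use. The key is a systematic framework for risk assessment and mitigation: the authors define and categorize the marginal risks of frontier AI in cybersecurity, analyze its current and future impacts both qualitatively and quantitatively, and discuss why frontier AI is likely to benefit attackers more than defenders in the short term. Based on these findings, the paper gives security recommendations, including constructing fine-grained benchmarks, designing AI agents for defense, building security mechanisms and provable defenses for hybrid systems, enhancing pre-deployment security testing and transparency, and strengthening defenses for users.
Link: https://arxiv.org/abs/2504.05408
Authors: Wenbo Guo,Yujin Potter,Tianneng Shi,Zhun Wang,Andy Zhang,Dawn Song
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: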
Abstract:As frontier AI advances rapidly, understanding its impact on cybersecurity and inherent risks is essential to ensuring safe AI evolution (e.g., guiding risk mitigation and informing policymakers). While some studies review AI applications in cybersecurity, none of them comprehensively discuss AI’s future impacts or provide concrete recommendations for navigating its safe and secure usage. This paper presents an in-depth analysis of frontier AI’s impact on cybersecurity and establishes a systematic framework for risk assessment and mitigation. To this end, we first define and categorize the marginal risks of frontier AI in cybersecurity and then systemically analyze the current and future impacts of frontier AI in cybersecurity, qualitatively and quantitatively. We also discuss why frontier AI likely benefits attackers more than defenders in the short term from equivalence classes, asymmetry, and economic impact. Next, we explore frontier AI’s impact on future software system development, including enabling complex hybrid systems while introducing new risks. Based on our findings, we provide security recommendations, including constructing fine-grained benchmarks for risk assessment, designing AI agents for defenses, building security mechanisms and provable defenses for hybrid systems, enhancing pre-deployment security testing and transparency, and strengthening defenses for users. Finally, we present long-term research questions essential for understanding AI’s future impacts and unleashing its defensive capabilities.
[AI-49] TRATSS: Transformer-Based Task Scheduling System for Autonomous Vehicles
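[Quick read]: This paper tackles complex optimization problems in single-agent scheduling on graph-based environments, particularly NP-hard ones, with the aim of optimal resource allocation and maximal productivity. It proposes a framework called Transformer-Based Task Scheduling System (TRATSS), which combines recent advances in reinforcement learning with the transformer architecture and uses the transformer's self-attention mechanism to capture complex task dependencies. This design lets TRATSS adapt dynamically to changing task requirements and resource availability and output optimized scheduling decisions, improving resource utilization and task completion efficiency. Experimental evaluation shows that TRATSS provides high-quality solutions to scheduling problems involving multiple action profiles.
Link: https://arxiv.org/abs/2504.05407
Authors: Yazan Youssef,Paulo Ricardo Marques de Araujo,Aboelmagd Noureldin,Sidney Givigi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages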
Abstract:Efficient scheduling remains a critical challenge in various domains, requiring solutions to complex NP-hard optimization problems to achieve optimal resource allocation and maximize productivity. In this paper, we introduce a framework called Transformer-Based Task Scheduling System (TRATSS), designed to address the intricacies of single agent scheduling in graph-based environments. By integrating the latest advancements in reinforcement learning and transformer architecture, TRATSS provides a novel system that outputs optimized task scheduling decisions while dynamically adapting to evolving task requirements and resource availability. Leveraging the self-attention mechanism in transformers, TRATSS effectively captures complex task dependencies, thereby providing solutions with enhanced resource utilization and task completion efficiency. Experimental evaluations on benchmark datasets demonstrate TRATSS’s effectiveness in providing high-quality solutions to scheduling problems that involve multiple action profiles.
[AI-50] The Role of Environment Access in Agnostic Reinforcement Learning
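[Quick read]: This paper studies how to achieve sample-efficient reinforcement learning (RL) in environments with large state spaces. Specifically, it considers the weakest form of function approximation, agnostic policy learning, in which the learner seeks the best policy within a given policy class $\Pi$ without any guarantee that $\Pi$ contains an optimal policy for the task. The paper asks whether stronger forms of access to the environment can overcome the statistical intractability of this agnostic setting.
The key to the solution is a new algorithm that constructs a policy emulator, a tabular Markov decision process (MDP) with a small state space, to approximate the value functions of all policies $\pi \in \Pi$. What sets the method apart is that it approximates these value functions without any explicit value function class, which makes agnostic policy learning statistically tractable under specific conditions: Block MDPs with access to both a local simulator and the $\mu$-reset model.
Link: https://arxiv.org/abs/2504.05405
Authors: Akshay Krishnamurthy,Gene Li,Ayush Sekhari
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: comments welcome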
Abstract:We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that: 1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access. 2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness. 3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies $\pi \in \Pi$. These values are approximated without any explicit value function class.
[AI-51] Interactive Explanations for Reinforcement-Learning Agents
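[Quick read]: This paper addresses the limitation that existing explainable reinforcement learning (XRL) methods produce static explanations, which typically reflect only the developers' intuition about what should be explained and how, and do not account for dynamic interaction between the user and the agent. It argues that meaningful explanations should be structured as a dialog between explainer and explainee, giving the user an active role in understanding the agent's behavior. The key contribution is ASQ-IT, a system in which users issue queries describing temporal properties of behaviors of interest and receive video clips of the agent acting in its environment that match those queries. At its core, user queries map to a fragment of Linear Temporal Logic over finite traces (LTLf), and query processing relies on an algorithm based on automata theory. User studies show that end users can understand and formulate queries in ASQ-IT and that it helps them identify faulty agent behaviors.
Link: https://arxiv.org/abs/2504.05393
Authors: Yotam Amitai,Ofra Amir,Guy Avni
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: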
Abstract:As reinforcement learning methods increasingly amass accomplishments, the need for comprehending their solutions becomes more crucial. Most explainable reinforcement learning (XRL) methods generate a static explanation depicting their developers’ intuition of what should be explained and how. In contrast, literature from the social sciences proposes that meaningful explanations are structured as a dialog between the explainer and the explainee, suggesting a more active role for the user and her communication with the agent. In this paper, we present ASQ-IT – an interactive explanation system that presents video clips of the agent acting in its environment based on queries given by the user that describe temporal properties of behaviors of interest. Our approach is based on formal methods: queries in ASQ-IT’s user interface map to a fragment of Linear Temporal Logic over finite traces (LTLf), which we developed, and our algorithm for query processing is based on automata theory. User studies show that end-users can understand and formulate queries in ASQ-IT and that using ASQ-IT assists users in identifying faulty agent behaviors.
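To make the idea of querying agent traces with temporal properties concrete, here is a minimal Python sketch of LTLf-style evaluation over finite traces. It is an illustration only: it uses a direct recursive evaluator rather than the automata-based algorithm the paper describes, and the tuple encoding of formulas, the proposition names, and the `matching_clips` helper are all assumptions made for this example, not part of ASQ-IT.

```python
# Assumption: a trace is a list of sets of atomic propositions, one set per step.
# Assumption: formulas are nested tuples such as ("eventually", ("atom", "left")).

def holds(formula, trace, i=0):
    """Return True if `formula` holds at position i of the finite trace."""
    op = formula[0]
    if op == "atom":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "next":
        # Strong next: a successor position must exist in the finite trace.
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "eventually":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "always":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "until":
        # formula[2] holds at some j >= i, and formula[1] holds at every step before j.
        return any(
            holds(formula[2], trace, j)
            and all(holds(formula[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


def matching_clips(query, traces):
    """Hypothetical helper: indices of the traces that satisfy the query."""
    return [idx for idx, trace in enumerate(traces) if trace and holds(query, trace)]


# Query: at some point the agent is in the left lane and later reaches the right lane.
query = ("eventually",
         ("and", ("atom", "left"),
                 ("next", ("eventually", ("atom", "right")))))
traces = [
    [{"left"}, {"left"}, {"right"}],   # satisfies the query
    [{"right"}, {"left"}],             # never reaches "right" after "left"
]
print(matching_clips(query, traces))
```

A recursive evaluator like this is easy to read but re-evaluates subformulas many times; compiling the query to an automaton, as the abstract mentions, is the usual way to avoid that cost.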
[AI-52] EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design
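[Quick read]: This paper addresses the problem that, in smart education, a single large language model (LLM) cannot effectively manage the whole process of automatically generating and optimizing instructional plans. The specific challenges are Customized Generation of content for students with different ability levels and Intelligent Optimization through iteration on feedback from learning outcomes. The proposed solution is EduPlanner, an LLM-based multi-agent system made up of an evaluator agent, an optimizer agent, and a question analyst agent that work in adversarial collaboration to generate personalized designs for curriculum and learning activities. It also introduces CIDDP, a five-dimensional evaluation module that assesses the quality of mathematics lesson plans along Clarity, Integrity, Depth, Practicality, and Pertinence and supports the optimization process. Experiments on the GSM8K and Algebra datasets show that EduPlanner excels at evaluating and optimizing instructional design, and ablation studies confirm the importance of each component.
Link: https://arxiv.org/abs/2504.05370
Authors: Xueqiao Zhang,Chao Zhang,Jianwen Sun,Jun Xiao,Yi Yang,Yawei Luo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: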
Abstract:Large Language Models (LLMs) have significantly advanced smart education in the Artificial General Intelligence (AGI) era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: (1) Customized Generation: generating niche-targeted teaching content based on students’ varying learning abilities and states, and (2) Intelligent Optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multi-agent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students’ knowledge levels and learning abilities. Additionally, we introduce the CIDDP, an LLM-based five-dimensional evaluation module encompassing Clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework. Our code is publicly available at this https URL
[AI-53] Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation
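[Quick read]: This paper addresses how to make effective use of the positional encoding (PE) module to improve Transformer-based generative models for music generation. Its core contribution is a unified framework based on kernel methods for analyzing PEs built on Random Fourier Features (RFF) and PEs built on rotation matrices, from which it develops a new PE method, RoPEPool, capable of extracting causal relationships from temporal sequences. The framework studies seemingly disparate PEs jointly by considering the content-context interactions they induce. Experiments on a symbolic music generation task, melody harmonization, show that RoPEPool combined with highly informative structural priors outperforms all compared methods.
Link: https://arxiv.org/abs/2504.05364
Authors: Manvi Agarwal,Changhong Wang(LTCI),Gael Richard(S2A, IDS)
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: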
Abstract:While music remains a challenging domain for generative models like Transformers, a two-pronged approach has recently proved successful: inserting musically-relevant structural information into the positional encoding (PE) module and using kernel approximation techniques based on Random Fourier Features (RFF) to lower the computational cost from quadratic to linear. Yet, it is not clear how such RFF-based efficient PEs compare with those based on rotation matrices, such as Rotary Positional Encoding (RoPE). In this paper, we present a unified framework based on kernel methods to analyze both families of efficient PEs. We use this framework to develop a novel PE method called RoPEPool, capable of extracting causal relationships from temporal sequences. Using RFF-based PEs and rotation-based PEs, we demonstrate how seemingly disparate PEs can be jointly studied by considering the content-context interactions they induce. For empirical validation, we use a symbolic music generation task, namely, melody harmonization. We show that RoPEPool, combined with highly-informative structural priors, outperforms all methods.
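As background for the rotation-based family of PEs mentioned above, the following Python sketch shows the basic property of Rotary Positional Encoding on a single 2-D feature pair: rotating query and key by angles proportional to their positions makes the attention score depend only on the relative offset. This is a generic RoPE illustration, not RoPEPool; the frequency `theta` and the function names are assumptions chosen for the example.

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta (theta is an assumed frequency)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, q_pos, k_pos, theta=1.0):
    """Dot product between a rotated query and a rotated key."""
    rq = rotate(q, q_pos, theta)
    rk = rotate(k, k_pos, theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = (1.0, 2.0), (0.5, -1.0)
# Positions (3, 1) and (10, 8) share the same offset of 2, so the scores agree
# up to floating-point error.
print(rope_score(q, k, 3, 1), rope_score(q, k, 10, 8))
```

Real implementations apply this rotation to many feature pairs, each with its own frequency; a single pair is enough to see the relative-position behavior.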
[AI-54] Debate-Feedback: A Multi-Agent Framework for Efficient Legal Judgment Prediction
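[Quick read]: This paper addresses two shortcomings of traditional approaches to legal analysis and prediction (LegalAI): their dependence on large historical datasets and their failure to make full use of modern large language models (LLMs). It proposes a novel legal judgment prediction model based on the Debate-Feedback architecture, whose key idea is to combine LLM multi-agent debate with reliability evaluation models. By simulating the debate phase of real courtroom trials, the model reasons efficiently, greatly reduces the need for large-scale training data, and offers a lightweight yet robust solution.
Link: https://arxiv.org/abs/2504.05358
Authors: Xi Chen,Mao Mao,Shuo Li,Haotian Shangguan
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: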
Abstract:The use of AI in legal analysis and prediction (LegalAI) has gained widespread attention, with past research focusing on retrieval-based methods and fine-tuning large models. However, these approaches often require large datasets and underutilize the capabilities of modern large language models (LLMs). In this paper, inspired by the debate phase of real courtroom trials, we propose a novel legal judgment prediction model based on the Debate-Feedback architecture, which integrates LLM multi-agent debate and reliability evaluation models. Unlike traditional methods, our model achieves significant improvements in efficiency by minimizing the need for large historical datasets, thus offering a lightweight yet robust solution. Comparative experiments show that it outperforms several general-purpose and domain-specific legal models, offering a dynamic reasoning process and a promising direction for future LegalAI research.
[AI-55] Find A Winning Sign: Sign Is All We Need to Win the Lottery ICLR2025
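[Quick read]: This paper addresses how to find, via Iterative Pruning (IP), a sparse subnetwork that generalizes comparably to its over-parameterized counterpart (the "winning ticket" of the Lottery Ticket Hypothesis) without relying on ad-hoc initialization or small-scale architectures. Existing IP methods either struggle to generalize beyond ad-hoc initialization and small-scale architectures or datasets, or sidestep these challenges by applying their mask to trained rather than initial weights. The key finding is that the parameter sign configuration plays a crucial role in conveying information useful for generalization to any randomly initialized network: a sparse network retains its basin of attraction if its parameter signs and normalization layer parameters are preserved. The paper then reduces the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network and its counterpart with initialized normalization parameters, so that any randomly initialized network that inherits the sparsity and sign information can be optimized to exhibit low error barriers and potentially reach performance comparable to the original.
Link: https://arxiv.org/abs/2504.05357
Authors: Junghun Oh,Sungyong Baik,Kyoung Mu Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR2025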
Abstract:The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at this https URL
[AI-56] DyTTP: Trajectory Prediction with Normalization-Free Transformers
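[Quick read]: This paper addresses the high computational overhead and training instability of Transformer-based trajectory prediction in autonomous driving systems. It proposes a two-fold approach. First, it integrates DynamicTanh (DyT), a recent method for improving transformers, in place of traditional layer normalization to simplify the network and improve inference stability; this is the first application of DyT to trajectory prediction. Second, it adopts a snapshot ensemble strategy: cyclical learning rate scheduling captures multiple model snapshots during a single training run, and these snapshots are aggregated by simple averaging at inference time, so the model benefits from diverse hypotheses without substantial extra computational cost. Extensive experiments on Argoverse datasets show that the combined approach significantly improves prediction accuracy, inference speed, and robustness across diverse driving scenarios.
Link: https://arxiv.org/abs/2504.05356
Authors: Yunxiang Liu,Hongkuo Niu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: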
Abstract:Accurate trajectory prediction is a cornerstone for the safe operation of autonomous driving systems, where understanding the dynamic behavior of surrounding agents is crucial. Transformer-based architectures have demonstrated significant promise in capturing complex spatio-temporal dependencies. However, their reliance on normalization layers can lead to computational overhead and training instabilities. In this work, we present a two-fold approach to address these challenges. First, we integrate DynamicTanh (DyT), a recent method for improving transformers, into the backbone, replacing traditional layer normalization. This modification simplifies the network architecture and improves inference stability. This is the first work to apply DyT to the trajectory prediction task. Complementing this, we employ a snapshot ensemble strategy to further boost trajectory prediction performance. Using cyclical learning rate scheduling, multiple model snapshots are captured during a single training run. These snapshots are then aggregated via simple averaging at inference time, allowing the model to benefit from diverse hypotheses without incurring substantial additional computational cost. Extensive experiments on Argoverse datasets demonstrate that our combined approach significantly improves prediction accuracy, inference speed and robustness in diverse driving scenarios. This work underscores the potential of normalization-free transformer designs augmented with lightweight ensemble techniques in advancing trajectory forecasting for autonomous vehicles.
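The two ingredients named in this abstract can be sketched in a few lines of plain Python. The sketch assumes the commonly cited DyT form, an element-wise `gamma * tanh(alpha * x) + beta` with a scalar `alpha` and per-channel `gamma` and `beta`, and it assumes a cosine-shaped cyclical schedule for the snapshots; the abstract itself does not state the exact schedule or the authors' hyperparameters, and all function names here are hypothetical.

```python
import math

def dyt(x, alpha, gamma, beta):
    """DynamicTanh over one feature vector: gamma * tanh(alpha * x) + beta.

    Unlike layer normalization, no mean or variance is computed over x.
    """
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]

def cyclic_cosine_lr(step, cycle_len, lr_max):
    """Assumed schedule: cosine decay from lr_max that restarts every cycle_len steps."""
    pos = (step % cycle_len) / cycle_len
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * pos))

def average_snapshots(predictions):
    """Average the (x, y) trajectories predicted by several model snapshots."""
    n = len(predictions)
    horizon = len(predictions[0])
    return [
        (sum(p[t][0] for p in predictions) / n,
         sum(p[t][1] for p in predictions) / n)
        for t in range(horizon)
    ]

features = dyt([0.0, 2.0, -2.0], alpha=0.5, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
snapshot_a = [(0.0, 0.0), (2.0, 2.0)]   # trajectory from one snapshot
snapshot_b = [(2.0, 4.0), (4.0, 6.0)]   # trajectory from another snapshot
ensemble = average_snapshots([snapshot_a, snapshot_b])
print(features, ensemble)
```

In a snapshot ensemble, a model checkpoint would typically be saved at the end of each learning-rate cycle, just before the rate jumps back to `lr_max`.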
[AI-57] Achieving binary weight and activation for LLMs using Post-Training Quantization
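[Quick read]: This paper addresses the significant performance degradation that existing quantization techniques suffer when large language models (LLMs) are quantized below 4-bit precision, down to 1-bit weights and activations (W1A1). Its key contribution is a post-training quantization framework with a W(1+1)A(1*4) configuration: weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit while the number of channels is increased fourfold to preserve expressiveness. For weights, the paper proposes Hessian-aware fine-grained grouping together with an EM-based quantization scheme. For activations, it equivalently decomposes INT4-quantized activations into a 4 * INT1 format and simultaneously smooths the scaling factors to reduce quantization error. The method surpasses state-of-the-art (SOTA) LLM quantization baselines under the W2A4 configuration across multiple tasks and moves toward fully binarized models.
Link: https://arxiv.org/abs/2504.05352
Authors: Siqing Song,Chuang Wang,Ruiqi Wang,Yi Yang,Xuyao Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: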
Abstract:Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 * INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models.
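The statement that INT4 activations can be decomposed equivalently into a 4 * INT1 format follows from writing each 4-bit value in binary. The Python sketch below checks that identity on a toy dot product. It assumes unsigned INT4 values in [0, 15] and folds the power-of-two factors into replicated weights purely for illustration; the paper's handling of scaling factors and its smoothing step are not reproduced, and the function names are hypothetical.

```python
def decompose_int4(activations):
    """Split each unsigned INT4 value into 4 binary channels (least significant bit first)."""
    bits = []
    for v in activations:
        if not 0 <= v <= 15:
            raise ValueError("expected unsigned INT4 values in [0, 15]")
        bits.extend((v >> k) & 1 for k in range(4))
    return bits

def expand_weights(weights):
    """Replicate each weight 4 times, scaled by 1, 2, 4, 8 to match the bit planes."""
    expanded = []
    for w in weights:
        expanded.extend(w * (1 << k) for k in range(4))
    return expanded

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x = [5, 12, 0, 15]      # INT4 activations
w = [1, -1, 1, -1]      # binary (+1 / -1) weights
x_bits = decompose_int4(x)
w_wide = expand_weights(w)
# Both products are -22: the channel count grows 4x, but every activation is 0 or 1.
print(dot(w, x), dot(w_wide, x_bits))
```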
[AI-58] Divergent Paths: Separating Homophilic and Heterophilic Learning for Enhanced Graph-level Representations
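[Quick read]: This paper addresses the poor performance of Graph Neural Networks (GNNs) on heterophilic graphs, particularly their limitations on graph-level tasks. Prior work on separate learning strategies for homophilic and heterophilic components has focused mainly on node-level tasks, with little attention to graph-level tasks. The key idea is to learn the intra-category and inter-category components of a graph separately. The paper designs two modules: IntraNet extracts intra-category information using sophisticated graph preprocessing steps and a category-based graph readout function, while InterNet uses a high-pass filter to amplify node disparities and sharpen the recognition of detail in high-frequency components. The resulting model, DivGNN, combines the two modules with a gated mechanism and substantially improves graph-level classification, surpassing traditional GNN baselines.
Link: https://arxiv.org/abs/2504.05344
Authors: Han Lei,Jiaxing Xu,Xia Dong,Yiping Ke
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures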
Abstract:Graph Convolutional Networks (GCNs) are predominantly tailored for graphs displaying homophily, where similar nodes connect, but often fail on heterophilic graphs. The strategy of adopting distinct approaches to learn from homophilic and heterophilic components in node-level tasks has been widely discussed and proven effective both theoretically and experimentally. However, in graph-level tasks, research on this topic remains notably scarce. Addressing this gap, our research analyzes graphs in which node category IDs are available, distinguishing intra-category and inter-category components as embodiments of homophily and heterophily, respectively. We find that while GCNs excel at extracting information within categories, they frequently capture noise from inter-category components. Consequently, it is crucial to employ distinct learning strategies for intra- and inter-category elements. To alleviate this problem, we separately learn the intra- and inter-category parts by a combination of an intra-category convolution (IntraNet) and an inter-category high-pass graph convolution (InterNet). Our IntraNet is supported by sophisticated graph preprocessing steps and a novel category-based graph readout function. For the InterNet, we utilize a high-pass filter to amplify the node disparities, enhancing the recognition of details in the high-frequency components. The proposed approach, DivGNN, combines the IntraNet and InterNet with a gated mechanism and substantially improves classification performance on graph-level tasks, surpassing traditional GNN baselines in effectiveness.
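The split between intra-category and inter-category components, and the difference between a low-pass and a high-pass graph filter, can be illustrated with a small Python sketch on scalar node features. This is not the DivGNN architecture: the mean-aggregation low-pass filter, the "feature minus neighbor mean" high-pass filter, and every function name are assumptions chosen to show the general idea only.

```python
def split_edges(edges, category):
    """Partition undirected edges by whether both endpoints share a category ID."""
    intra = [(u, v) for u, v in edges if category[u] == category[v]]
    inter = [(u, v) for u, v in edges if category[u] != category[v]]
    return intra, inter

def neighbor_mean(x, edges):
    """Mean neighbor feature per node; isolated nodes keep their own feature."""
    sums = [0.0] * len(x)
    degree = [0] * len(x)
    for u, v in edges:
        sums[u] += x[v]
        sums[v] += x[u]
        degree[u] += 1
        degree[v] += 1
    return [sums[i] / degree[i] if degree[i] else x[i] for i in range(len(x))]

def low_pass(x, edges):
    """Smooths features along edges (assumed stand-in for a homophilic convolution)."""
    return neighbor_mean(x, edges)

def high_pass(x, edges):
    """Keeps the difference from the neighbor mean, which amplifies node disparities."""
    return [xi - mi for xi, mi in zip(x, neighbor_mean(x, edges))]

category = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3)]
x = [1.0, 3.0, 10.0, 12.0]
intra, inter = split_edges(edges, category)
print(low_pass(x, intra), high_pass(x, inter))
```

On this toy graph, the high-pass output is nonzero only at the two nodes joined by the single inter-category edge, which is where the feature values differ most.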
[AI-59] AROMA: Autonomous Rank-one Matrix Adaptation
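[Quick read]: This paper addresses the suboptimal performance caused by static rank allocation in parameter-efficient fine-tuning of large language models. Low-Rank Adaptation (LoRA) fine-tunes efficiently through low-rank updates, but its static rank allocation may not exploit the model's full potential; Adaptive Low-Rank Adaptation (AdaLoRA) introduces dynamic rank adjustment but remains sensitive to the initial and target rank configurations. The paper proposes AROMA, a framework whose key idea is a dual-loop architecture for rank growth rather than the usual rank reduction. The inner loop extracts information from each rank-one subspace, and the outer loop determines the number of rank-one subspaces, that is, the optimal rank. AROMA builds layer-specific updates iteratively from rank-one components with very few trainable parameters that gradually diminish to zero, and it resets optimizer states to keep the subspaces independent. On natural language understanding and commonsense reasoning tasks, AROMA outperforms LoRA and AdaLoRA while using significantly fewer parameters, offering new insights into adaptive parameter-efficient fine-tuning.
Link: https://arxiv.org/abs/2504.05343
Authors: Hao Nan Sheng,Zhi-yong Wang,Mingrui Yang,Hing Cheung So
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: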
Abstract:As large language models continue to grow in size, parameter-efficient fine-tuning has become increasingly crucial. While low-rank adaptation (LoRA) offers a solution through low-rank updates, its static rank allocation may yield suboptimal results. Adaptive low-rank adaptation (AdaLoRA) improves this with dynamic allocation but remains sensitive to initial and target rank configurations. We introduce AROMA, a framework that automatically constructs layer-specific updates by iteratively building up rank-one components with very few trainable parameters that gradually diminish to zero. Unlike existing methods that employ rank reduction mechanisms, AROMA introduces a dual-loop architecture for rank growth. The inner loop extracts information from each rank-one subspace, while the outer loop determines the number of rank-one subspaces, i.e., the optimal rank. We reset optimizer states to maintain subspace independence. AROMA significantly reduces parameters compared to LoRA and AdaLoRA while achieving superior performance on natural language understanding and commonsense reasoning tasks, offering new insights into adaptive parameter-efficient fine-tuning. The code is available at AROMA (this https URL).
[AI-60] Three-Factor Learning in Spiking Neural Networks: An Overview of Methods and Trends from a Machine Learning Perspective
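[Quick read]: This paper surveys the use of three-factor learning rules in Spiking Neural Networks (SNNs) and their potential for artificial intelligence, with the aim of overcoming the limitations of traditional Hebbian learning and Spike-Timing-Dependent Plasticity (STDP) in credit assignment and biological plausibility. The central idea is that incorporating neuromodulatory signals improves learning efficiency and adaptation. From a machine learning perspective, the paper discusses theoretical foundations, algorithmic implementations, and connections to reinforcement learning and neuromorphic computing. It also covers interdisciplinary applications in robotics, cognitive modeling, and AI systems, and proposes research directions for narrowing the gap between neuroscience and artificial intelligence.
Link: https://arxiv.org/abs/2504.05341
Authors: Szymon Mazurek,Jakub Caputa,Jan K. Argasiński,Maciej Wielgosz
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Pre-print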
Abstract:Three-factor learning rules in Spiking Neural Networks (SNNs) have emerged as a crucial extension to traditional Hebbian learning and Spike-Timing-Dependent Plasticity (STDP), incorporating neuromodulatory signals to improve adaptation and learning efficiency. These mechanisms enhance biological plausibility and facilitate improved credit assignment in artificial neural systems. This paper examines the topic from a machine learning perspective, providing an overview of recent advances in three-factor learning and discussing theoretical foundations, algorithmic implementations, and their relevance to reinforcement learning and neuromorphic computing. In addition, we explore interdisciplinary approaches, scalability challenges, and potential applications in robotics, cognitive modeling, and AI systems. Finally, we highlight key research gaps and propose future directions for bridging the gap between neuroscience and artificial intelligence.
[AI-61] Improving Early Prediction of Type 2 Diabetes Mellitus with ECG-DiaNet: A Multimodal Neural Network Leveraging Electrocardiogram and Clinical Risk Factors
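[Quick read]: This paper addresses the global health challenge of early and accurate risk prediction for Type 2 Diabetes Mellitus (T2DM). It proposes ECG-DiaNet, a multimodal deep learning model that integrates electrocardiogram (ECG) features with clinical risk factors (CRFs) to improve prediction of T2DM onset. The key idea is to combine non-invasive ECG signals with conventional clinical data so that high-risk individuals are identified more accurately. The model achieves a higher Area Under the Receiver Operating Characteristic Curve (AUROC) and better reclassification metrics, namely Net Reclassification Improvement (NRI) and Integrated Discrimination Improvement (IDI), which translates into better risk stratification and positive predictive value (PPV). By combining cardiac electrophysiology with systemic risk assessment, the approach reflects the multifactorial nature of T2DM and supports precision prevention.
Link: https://arxiv.org/abs/2504.05338
Authors: Farida Mohsen,Zubair Shah
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: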
Abstract:Type 2 Diabetes Mellitus (T2DM) remains a global health challenge, underscoring the need for early and accurate risk prediction. This study presents ECG-DiaNet, a multimodal deep learning model that integrates electrocardiogram (ECG) features with clinical risk factors (CRFs) to enhance T2DM onset prediction. Using data from Qatar Biobank (QBB), we trained and validated models on a development cohort (n=2043) and evaluated performance on a longitudinal test set (n=395) with five-year follow-up. ECG-DiaNet outperformed unimodal ECG-only and CRF-only models, achieving a higher AUROC (0.845 vs 0.8217) than the CRF-only model, with statistical significance (DeLong p<0.001). Reclassification metrics further confirmed improvements: Net Reclassification Improvement (NRI=0.0153) and Integrated Discrimination Improvement (IDI=0.0482). Risk stratification into low-, medium-, and high-risk groups showed ECG-DiaNet achieved superior positive predictive value (PPV) in high-risk individuals. The model’s reliance on non-invasive and widely available ECG signals supports its feasibility in clinical and community health settings. By combining cardiac electrophysiology and systemic risk profiles, ECG-DiaNet addresses the multifactorial nature of T2DM and supports precision prevention. These findings highlight the value of multimodal AI in advancing early detection and prevention strategies for T2DM, particularly in underrepresented Middle Eastern populations.
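For readers unfamiliar with the two reclassification metrics reported here, the Python sketch below computes them from their standard definitions on a toy example. IDI compares the mean predicted risk of events and non-events under the old and new models; categorical NRI counts movements between risk categories. The toy numbers are made up, the abstract does not say whether its NRI is the categorical or the continuous variant, and the function names are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def idi(y, p_old, p_new):
    """Integrated Discrimination Improvement.

    (gain in mean predicted risk for events) minus (gain for non-events).
    Assumes y contains at least one event (1) and one non-event (0).
    """
    gain_events = (mean(p for p, t in zip(p_new, y) if t == 1)
                   - mean(p for p, t in zip(p_old, y) if t == 1))
    gain_nonevents = (mean(p for p, t in zip(p_new, y) if t == 0)
                      - mean(p for p, t in zip(p_old, y) if t == 0))
    return gain_events - gain_nonevents

def categorical_nri(y, cat_old, cat_new):
    """Net Reclassification Improvement over ordered risk categories (0 = lowest)."""
    events = [(o, n) for t, o, n in zip(y, cat_old, cat_new) if t == 1]
    nonevents = [(o, n) for t, o, n in zip(y, cat_old, cat_new) if t == 0]
    up_events = mean(1.0 if n > o else 0.0 for o, n in events)
    down_events = mean(1.0 if n < o else 0.0 for o, n in events)
    up_nonevents = mean(1.0 if n > o else 0.0 for o, n in nonevents)
    down_nonevents = mean(1.0 if n < o else 0.0 for o, n in nonevents)
    return (up_events - down_events) + (down_nonevents - up_nonevents)

# Toy data (illustrative only): two events followed by two non-events.
y = [1, 1, 0, 0]
p_old = [0.6, 0.4, 0.3, 0.5]
p_new = [0.8, 0.6, 0.2, 0.4]
print(idi(y, p_old, p_new))
print(categorical_nri(y, [1, 1, 1, 1], [2, 1, 0, 1]))
```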
[AI-62] Level Generation with Constrained Expressive Range
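[Quick read]: This paper addresses how expressive range analysis can be used to evaluate generative models for game level generation, and how a constrained generator can systematically traverse the space of possible creations to produce unique and interesting levels. The key to the solution is a constraint-based generator that learns from different tile patterns and explores the expressive range systematically instead of relying on random generation. This gives more complete coverage of the expressive range. By analyzing how different tile patterns affect exploration (time taken, the number of successful and failed sample generations, and the overall interestingness of the generated levels), the paper also reveals the generator's strengths and limitations.
Link: https://arxiv.org/abs/2504.05334
Authors: Mahsa Bazzaz,Seth Cooper
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: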
Abstract:Expressive range analysis is a visualization-based technique used to evaluate the performance of generative models, particularly in game level generation. It typically employs two quantifiable metrics to position generated artifacts on a 2D plot, offering insight into how content is distributed within a defined metric space. In this work, we use the expressive range of a generator as the conceptual space of possible creations. Inspired by the quality diversity paradigm, we explore this space to generate levels. To do so, we use a constraint-based generator that systematically traverses and generates levels in this space. To train the constraint-based generator we use different tile patterns to learn from the initial example levels. We analyze how different patterns influence the exploration of the expressive range. Specifically, we compare the exploration process based on time, the number of successful and failed sample generations, and the overall interestingness of the generated levels. Unlike typical quality diversity approaches that rely on random generation and hope to get good coverage of the expressive range, this approach systematically traverses the grid ensuring more coverage. This helps create unique and interesting game levels while also improving our understanding of the generator’s strengths and limitations.
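Expressive range analysis, as described in this abstract, places each generated level on a 2D grid defined by two metrics. The Python sketch below bins metric pairs into such a grid, reports coverage, and lists the cells no level has reached yet, which is the information a systematic traversal would need as its targets. The metric values are assumed to be normalized to [0, 1], the sample points are made up, and the function names are hypothetical; the paper's constraint-based generator is not reproduced.

```python
from collections import Counter

def cell_of(m1, m2, bins):
    """Map two metrics in [0, 1] to a grid cell, clamping the value 1.0 into the last bin."""
    def to_bin(value):
        return min(bins - 1, max(0, int(value * bins)))
    return (to_bin(m1), to_bin(m2))

def expressive_range(points, bins):
    """Histogram of generated levels over the 2D metric grid."""
    return Counter(cell_of(m1, m2, bins) for m1, m2 in points)

def coverage(hist, bins):
    """Fraction of grid cells that contain at least one level."""
    return len(hist) / (bins * bins)

def unvisited_cells(hist, bins):
    """Cells with no level yet: candidate targets for a systematic traversal."""
    return [(i, j) for i in range(bins) for j in range(bins) if (i, j) not in hist]

# Each pair stands for two metrics measured on one generated level (illustrative values).
levels = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.5), (1.0, 1.0)]
hist = expressive_range(levels, bins=2)
print(dict(hist), coverage(hist, 2), unvisited_cells(hist, 2))
```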
[AI-63] When is using AI the rational choice? The importance of counterfactuals in AI deployment decisions
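[Quick read]: This paper addresses how to incorporate counterfactual outcomes into expected utility assessments of AI usage decisions, so that AI developers and deployment decision makers better understand the practical impact of using AI. It notes that deployment decisions are usually driven by counterfactuals, that is, a comparison with the decisions that would have been made without AI, and that this comparison can produce very different utilities for intended beneficiaries and for stakeholders. The paper identifies several properties that emerge once counterfactuals are included: AI usage can have positive expected utility for beneficiaries while being strongly negative for stakeholders and deployment decision makers; high complementarity can lead to substantial disutility for stakeholders; small changes in how users interact with an AI capability can substantially affect stakeholder utility; and cognitive biases exacerbate the perceived frequency of costly counterfactual misses. With these insights, the paper offers an assessment approach that helps developers and decision makers weigh the impact of counterfactuals so that beneficial AI capabilities are actually used.
Link: https://arxiv.org/abs/2504.05333
Authors: Paul Lehner,Elinor Yeo
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 31 pages, 8 figures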
Abstract:Decisions to deploy AI capabilities are often driven by counterfactuals - a comparison of decisions made using AI to decisions that would have been made if the AI were not used. Counterfactual misses, which are poor decisions that are attributable to using AI, may have disproportionate disutility to AI deployment decision makers. Counterfactual hits, which are good decisions attributable to AI usage, may provide little benefit beyond the benefit of better decisions. This paper explores how to incorporate counterfactual outcomes into usage decision expected utility assessments. Several properties emerge when counterfactuals are explicitly included. First, there are many contexts where the expected utility of AI usage is positive for intended beneficiaries and strongly negative for stakeholders and deployment decision makers. Second, high levels of complementarity, where differing AI and user assessments are merged beneficially, often lead to substantial disutility for stakeholders. Third, apparently small changes in how users interact with an AI capability can substantially impact stakeholder utility. Fourth, cognitive biases such as expert overconfidence and hindsight bias exacerbate the perceived frequency of costly counterfactual misses. The expected utility assessment approach presented here is intended to help AI developers and deployment decision makers to navigate the subtle but substantial impact of counterfactuals so as to better ensure that beneficial AI capabilities are used.
[AI-64] Not someone but something: Rethinking trust in the age of medical AI
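[Quick read]: This paper addresses the central question of how trust is established as artificial intelligence (AI) enters healthcare. It argues that trust in AI is not a simple transfer from humans to machines but a dynamic, evolving relationship that must be built and maintained. The key to the solution is that trust should be earned through transparency, accountability, and alignment with the values of care rather than through mimicking empathy or intuition. The paper further argues that trust should rest on thoughtful design, responsible deployment, and clear moral responsibility, yielding a balanced view that avoids both blind optimism and reflexive fear. The ultimate aim is to treat trust in AI as something to be earned over time, not as a given.
Link: https://arxiv.org/abs/2504.05331
Authors: Jan Beger
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: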
Abstract:As artificial intelligence (AI) becomes embedded in healthcare, trust in medical decision-making is changing fast. This opinion paper argues that trust in AI isn’t a simple transfer from humans to machines – it’s a dynamic, evolving relationship that must be built and maintained. Rather than debating whether AI belongs in medicine, this paper asks: what kind of trust must AI earn, and how? Drawing from philosophy, bioethics, and system design, it explores the key differences between human trust and machine reliability – emphasizing transparency, accountability, and alignment with the values of care. It argues that trust in AI shouldn’t rely on mimicking empathy or intuition, but on thoughtful design, responsible deployment, and clear moral responsibility. The goal is a balanced view – one that avoids blind optimism and reflexive fear. Trust in AI must be treated not as a given, but as something to be earned over time.
[AI-65] VALUE: Value-Aware Large Language Model for Query Rewriting via Weighted Trie in Sponsored Search
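[Quick read]: This paper addresses two key problems with existing query-to-bidwords rewriting methods in sponsored search advertising: generated rewrites are semantically relevant but have no awareness of intrinsic value such as commercial value, and traditional preference alignment methods struggle to capture fine-grained value attributes and are prone to overfitting, which hurts the quality and effectiveness of the results. To address these challenges, the paper proposes VALUE (Value-Aware Large language model for qUery rewriting via wEighted trie). Its core idea is a weighted trie: during decoding, value information stored in the trie modulates the output probability distribution of the large language model (LLM), which constrains the generation space and guides the trajectory of text generation. The method improves semantic matching and preference alignment, substantially raises the value attribute of the generated results, and increased RPM by 1.64% in production deployment.
Link: https://arxiv.org/abs/2504.05321
Authors: Boyang Zuo,Xiao Zhang,Feng Li,Pengjie Wang,Jian Xu,Bo Zheng
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: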
Abstract:In the realm of sponsored search advertising, matching advertisements with the search intent of a user’s query is crucial. Query-to-bidwords (i.e., bidding keywords) rewriting is a vital technique that has garnered significant attention. Recently, with the prevalence of LLMs, generative retrieval methods have proven effective in producing high-relevance rewrites. However, we have identified a significant limitation in existing approaches: while fine-tuning LLMs for specific domains enhances semantic relevance, these models have no perception of the intrinsic value of their generated outputs, such as commercial value. Therefore, after SFT, an RLHF phase is often employed to address this issue. Nevertheless, traditional preference alignment methods often face challenges in aligning fine-grained values and are susceptible to overfitting, which diminishes the effectiveness and quality of the generated results. To address these challenges, we propose VALUE (Value-Aware Large language model for qUery rewriting via wEighted trie), the first framework that ensures the generation of high-value and highly relevant bidwords. Our approach utilizes a weighted trie, an innovative modification of the traditional trie data structure. By modulating the LLM’s output probability distribution with value information from the trie during the decoding process, we constrain the generation space and guide the trajectory of text production. Offline experiments demonstrate the effectiveness of our method in semantic matching and preference alignment, showing a remarkable improvement in the value attribute by more than fivefold. Online A/B tests further revealed that our Revenue Per Mille (RPM) metric increased by 1.64%. VALUE has been deployed on our advertising system since October 2024 and served the Double Eleven promotions, the biggest shopping carnival in China.
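The abstract of VALUE describes modulating the LLM's output distribution with value information from a weighted trie during decoding. The sketch below is one possible reading of that idea, not the authors' implementation: the class `WeightedTrie`, the function `modulated_distribution`, the mixing weight `alpha`, the rule "add `alpha * log(value)` to the log-probability", and the toy tokens and values are all assumptions made for illustration.

```python
import math


class WeightedTrie:
    """Trie over token sequences; each node keeps the best value of any bidword below it."""

    def __init__(self):
        self.children = {}
        self.value = 0.0
        self.is_end = False

    def insert(self, tokens, value):
        node = self
        node.value = max(node.value, value)
        for tok in tokens:
            node = node.children.setdefault(tok, WeightedTrie())
            node.value = max(node.value, value)
        node.is_end = True


def modulated_distribution(node, lm_logprobs, alpha=1.0):
    """Keep only tokens that continue a bidword in the trie and re-weight them by value.

    Assumed scoring rule: score(tok) = lm_logprob(tok) + alpha * log(child.value),
    followed by a softmax over the surviving tokens.
    """
    scores = {}
    for tok, child in node.children.items():
        if tok in lm_logprobs and child.value > 0:
            scores[tok] = lm_logprobs[tok] + alpha * math.log(child.value)
    if not scores:
        return {}
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {tok: math.exp(s - top) / norm for tok, s in scores.items()}


# Toy example: two candidate bidwords with hypothetical commercial values.
trie = WeightedTrie()
trie.insert(["red", "shoes"], value=1.0)
trie.insert(["red", "dress"], value=4.0)

# Hypothetical LLM log-probabilities for the token that follows "red".
lm_logprobs = {"shoes": math.log(0.6), "dress": math.log(0.3), "hat": math.log(0.1)}
dist = modulated_distribution(trie.children["red"], lm_logprobs)
```

In this toy case the language model alone prefers "shoes", the trie removes "hat" because it does not continue any known bidword, and the value weighting shifts probability mass toward "dress". Setting `alpha=0` would reduce the sketch to plain trie-constrained decoding.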
[AI-66] Predictive Modeling: BIM Command Recommendation Based on Large-scale Usage Logs
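[Quick read]: This paper addresses the slow adoption of Building Information Modeling (BIM) in the Architecture, Engineering, and Construction (AEC) industry, which stems mainly from the widespread perception that BIM authoring tools demand more effort than conventional 2D drafting. To improve design efficiency, it proposes a BIM command recommendation framework that predicts the best next actions from historical interactions. The key to the solution is a comprehensive filtering and enhancement method for large-scale raw BIM log data, together with a novel command recommendation model. The model builds on state-of-the-art Transformer backbones originally developed for large language models (LLMs) and adds a custom feature fusion module, a dedicated loss function, and a targeted learning strategy. It learns universal, generalizable modeling patterns from anonymous user interaction sequences and reaches a Recall@10 of about 84% when recommending the next command.
Link: https://arxiv.org/abs/2504.05319
Authors: Changyu Du, Zihan Deng, Stavros Nousias, André Borrmann
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: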
Abstract:The adoption of Building Information Modeling (BIM) and model-based design within the Architecture, Engineering, and Construction (AEC) industry has been hindered by the perception that using BIM authoring tools demands more effort than conventional 2D drafting. To enhance design efficiency, this paper proposes a BIM command recommendation framework that predicts the optimal next actions in real-time based on users’ historical interactions. We propose a comprehensive filtering and enhancement method for large-scale raw BIM log data and introduce a novel command recommendation model. Our model builds upon the state-of-the-art Transformer backbones originally developed for large language models (LLMs), incorporating a custom feature fusion module, dedicated loss function, and targeted learning strategy. In a case study, the proposed method is applied to over 32 billion rows of real-world log data collected globally from the BIM authoring software Vectorworks. Experimental results demonstrate that our method can learn universal and generalizable modeling patterns from anonymous user interaction sequences across different countries, disciplines, and projects. When generating recommendations for the next command, our approach achieves a Recall@10 of approximately 84%.
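For readers unfamiliar with the Recall@10 figure quoted above, the following minimal sketch shows how Recall@K is commonly computed for next-item recommendation when each step has exactly one ground-truth next command (in that case it equals the hit rate). The function name `recall_at_k` and the command names are hypothetical; the paper's exact evaluation protocol is not given in the abstract.

```python
def recall_at_k(ranked_lists, true_items, k=10):
    """Share of steps whose true next command appears in the top-k recommendations."""
    if len(ranked_lists) != len(true_items):
        raise ValueError("one ranked list is needed per ground-truth item")
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_items) if truth in ranked[:k])
    return hits / len(true_items)


# Hypothetical command names, for illustration only.
ranked = [["line", "wall", "door"], ["move", "copy", "wall"], ["door", "line", "slab"]]
truth = ["wall", "rotate", "door"]
score = recall_at_k(ranked, truth, k=2)  # 2 of 3 true commands are in the top 2
```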
[AI-67] Efficient Multi-Task Learning via Generalist Recommender
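[Quick read]: This paper addresses the scalability of multi-task learning (MTL) in recommender systems: as the number of tasks grows, the training and inference performance of existing MTL implementations degrades, which limits their use in production. The key contribution is the Generalist Recommender (GRec), an end-to-end efficient and scalable recommender. GRec processes multi-modal inputs with NLP heads, parallel Transformers, and a wide and deep structure, and introduces a new task-sentence level routing mechanism that scales the model across multiple tasks without compromising performance.
Link: https://arxiv.org/abs/2504.05318
Authors: Luyang Wang, Cangcheng Tang, Chongyang Zhang, Jun Ruan, Kai Huang, Jason Dai
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: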
Abstract:Multi-task learning (MTL) is a common machine learning technique that allows the model to share information across different tasks and improve the accuracy of recommendations for all of them. Many existing MTL implementations suffer from scalability issues as the training and inference performance can degrade with the increasing number of tasks, which can limit production use case scenarios for MTL-based recommender systems. Inspired by the recent advances of large language models, we developed an end-to-end efficient and scalable Generalist Recommender (GRec). GRec takes comprehensive data signals by utilizing NLP heads, parallel Transformers, as well as a wide and deep structure to process multi-modal inputs. These inputs are then combined and fed through a newly proposed task-sentence level routing mechanism to scale the model capabilities on multiple tasks without compromising performance. Offline evaluations and online experiments show that GRec significantly outperforms our previous recommender solutions. GRec has been successfully deployed on one of the largest telecom websites and apps, effectively managing high volumes of online traffic every day.
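[AI-68] Towards Adaptive Memory-Based Optimization for Enhanced Retrieval-Augmented Generation
[Quick read]: This paper addresses the challenges that existing Retrieval-Augmented Generation (RAG) methods face on open-domain question answering (QA): retrieval operations are independent, retrieved information is incorporated directly, and there is no summarizing memory or adaptive retrieval strategy, so redundant information introduces noise and information is insufficiently integrated. To address this, the paper proposes Adaptive memory-based optimization for enhanced RAG (Amber). Its key idea is to optimize the language model's memory through multi-agent collaboration, combining an Agent-based Memory Updater, an Adaptive Information Collector, and a Multi-granular Content Filter within an iterative memory updating paradigm. This yields comprehensive knowledge integration and efficient retrieval, while multi-level content filtering reduces noise and retains essential information, improving model performance.
Link: https://arxiv.org/abs/2504.05312
Authors: Qitao Qin, Yucong Luo, Yihang Lu, Zhibo Chu, Xianwei Meng
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages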
Abstract:Retrieval-Augmented Generation (RAG), by integrating non-parametric knowledge from external knowledge bases into models, has emerged as a promising approach to enhancing response accuracy while mitigating factual errors and hallucinations. This method has been widely applied in tasks such as Question Answering (QA). However, existing RAG methods struggle with open-domain QA tasks because they perform independent retrieval operations and directly incorporate the retrieved information into generation without maintaining a summarizing memory or using adaptive retrieval strategies, leading to noise from redundant information and insufficient information integration. To address these challenges, we propose Adaptive memory-based optimization for enhanced RAG (Amber) for open-domain QA tasks, which comprises an Agent-based Memory Updater, an Adaptive Information Collector, and a Multi-granular Content Filter, working together within an iterative memory updating paradigm. Specifically, Amber integrates and optimizes the language model’s memory through a multi-agent collaborative approach, ensuring comprehensive knowledge integration from previous retrieval steps. It dynamically adjusts retrieval queries and decides when to stop retrieval based on the accumulated knowledge, enhancing retrieval efficiency and effectiveness. Additionally, it reduces noise by filtering irrelevant content at multiple levels, retaining essential information to improve overall model performance. We conduct extensive experiments on several open-domain QA datasets, and the results demonstrate the superiority and effectiveness of our method and its components. The source code is available at this https URL.
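[AI-69] IterQR: An Iterative Framework for LLM-based Query Rewrite in e-Commercial Search System
[Quick read]: This paper addresses inaccurate search results in modern e-commerce search systems caused by flawed user queries, such as ambiguous input or typos. Traditional query rewriting relies on a static rewrite vocabulary that is built manually and does not interact with e-commerce domain knowledge or real-world common knowledge. The key to the solution is an iterative framework that uses the text generation ability of Large Language Models (LLMs) to produce query rewrites. Each iteration has three stages: rewrite generation using domain knowledge through Retrieval-Augmented Generation (RAG) and query understanding through Chain-of-Thoughts (CoT); online signal collection with automatic updates of positive rewrites; and post-training of the LLM with a multi-task objective to generate new rewrites. The method, named IterQR, integrates domain and real-world knowledge, updates and corrects its rewrites automatically across iterations, and has significantly improved the user experience of Meituan Delivery's search system.
Link: https://arxiv.org/abs/2504.05309
Authors: Shangyu Chen, Xinyu Jia, Yingfei Zhang, Shuai Zhang, Xiang Li, Wei Lin
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: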
Abstract:The essence of a modern e-Commercial search system lies in matching the user’s intent with available candidates based on the user’s query, providing personalized and precise service. However, the user’s query may be incorrect due to ambiguous input and typos, leading to inaccurate search. These cases may be relieved by query rewrite: modifying the query into another representation or expansion. However, traditional query rewrite relies on a static rewrite vocabulary, which is manually established and lacks interaction with both domain knowledge in the e-Commercial system and common knowledge in the real world. In this paper, with the ability of Large Language Models (LLMs) to generate text content, we provide an iterative framework to generate query rewrites. The framework incorporates a 3-stage procedure in each iteration: Rewrite Generation with domain knowledge by Retrieval-Augmented Generation (RAG) and query understanding by Chain-of-Thoughts (CoT); Online Signal Collection with automatic positive rewrite update; Post-training of the LLM with a multi-task objective to generate new rewrites. Our work (named IterQR) provides a comprehensive framework to generate Query Rewrite with both domain and real-world knowledge. It automatically updates and self-corrects the rewrites during iterations. IterQR has been deployed in Meituan Delivery’s search system (China’s leading food delivery platform), providing service for users with significant improvement.
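[AI-70] Toward Total Recall: Enhancing FAIRness through AI-Driven Metadata Standardization
[Quick read]: This paper addresses the incompleteness, inconsistency, and incorrect formatting that are common in the metadata of scientific data sets and that hinder effective data reuse and discovery. The key to the solution is combining GPT-4 with a metadata knowledge base (CEDAR) to standardize metadata so that it adheres to community standards. The method corrects and refines metadata entries to conform to established guidelines, which substantially improves search performance and recall. Using the BioSample and GEO repositories as case studies, the paper shows that standardized metadata leads to better retrieval: average recall rises from 17.65% on the baseline raw datasets to 62.87% with the proposed standardization pipeline, underscoring the value of combining advanced AI models with structured metadata curation tools for more effective and reliable data retrieval.
Link: https://arxiv.org/abs/2504.05307
Authors: Sowmya S Sundaram, Mark A Musen
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: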
Abstract:Current metadata often suffer from incompleteness, inconsistency, and incorrect formatting, hindering effective data reuse and discovery. Using GPT-4 and a metadata knowledge base (CEDAR), we devised a method that standardizes metadata in scientific data sets, ensuring the adherence to community standards. The standardization process involves correcting and refining metadata entries to conform to established guidelines, significantly improving search performance and recall metrics. The investigation uses BioSample and GEO repositories to demonstrate the impact of these enhancements, showcasing how standardized metadata lead to better retrieval outcomes. The average recall improves significantly, rising from 17.65% with the baseline raw datasets of BioSample and GEO to 62.87% with our proposed metadata standardization pipeline. This finding highlights the transformative impact of integrating advanced AI models with structured metadata curation tools in achieving more effective and reliable data retrieval.
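[AI-71] When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks
[Quick read]: This paper addresses the high computational cost of large reasoning models (LRMs) on complex reasoning tasks. Recent open-source LRMs perform strongly on such tasks, but their large parameter counts put the required compute out of reach for individual users. Model compression is an effective way to reduce this cost. However, studies of how compressed LLMs perform on complex reasoning tasks, especially for LRMs, are still lacking. Most quantization and pruning work focuses on preserving language modeling performance, and existing distillation work does not comprehensively benchmark student models by reasoning difficulty or by the impact of compression on knowledge and reasoning.
The key to the solution is compressing DeepSeek-R1 models with quantization, distillation, and pruning, and benchmarking them comprehensively on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue) that range from mathematical to multihop reasoning. The study finds that parameter count affects LRMs' knowledge memorization far more than their reasoning capability, which can guide the choice of compression technique. An empirical analysis of test-time compute further shows that, across several benchmarks, shorter model outputs generally perform better than longer ones, highlighting the need for more concise reasoning chains.
Link: https://arxiv.org/abs/2504.02010
Authors: Nan Zhang, Yusen Zhang, Prasenjit Mitra, Rui Zhang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: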
Abstract:Recent open-source large reasoning models (LRMs) exhibit strong performance on complex reasoning tasks, but their large parameter count makes them prohibitively expensive for individuals. The compression of large language models (LLMs) offers an effective solution to reduce cost of computational resources. However, systematic studies on the performance of compressed LLMs in complex reasoning tasks, especially for LRMs, are lacking. Most works on quantization and pruning focus on preserving language modeling performance, while existing distillation works do not comprehensively benchmark student models based on reasoning difficulty or compression impact on knowledge and reasoning. In this paper, we benchmark compressed DeepSeek-R1 models on four different reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue), ranging from mathematical to multihop reasoning, using quantization, distillation, and pruning methods. We benchmark 2.51-, 1.73-, and 1.58-bit R1 models that adopt dynamic quantization. We also benchmark distilled R1 models that are based on LLaMA or Qwen and run SparseGPT on them to obtain various sparsity levels. Studying the performance and behavior of compressed LRMs, we report their performance scores and test-time compute (number of tokens spent on each question). Notably, using MuSiQue, we find that parameter count has a much greater impact on LRMs’ knowledge memorization than on their reasoning capability, which can inform the choice of compression techniques. Through our empirical analysis of test-time compute, we find that shorter model outputs generally achieve better performance than longer ones across several benchmarks for both R1 and its compressed variants, highlighting the need for more concise reasoning chains.
[AI-72] Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
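[Quick read]: This paper addresses the dimensionality mismatch between the high-dimensional output features of speech foundation models and the lower-dimensional inputs that downstream task models require. The conventional remedy is a dimensionality reduction (DR) layer, but this adds parameter overhead and computational cost and risks losing valuable information. The paper proposes Nested Res2Net (Nes2Net), a lightweight back-end architecture whose key property is that it processes high-dimensional features directly, without a DR layer. Its nested structure strengthens multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. Experiments show that Nes2Net clearly outperforms existing methods on multiple datasets while substantially cutting computational cost, supporting its robustness and generalization.
Link: https://arxiv.org/abs/2504.05657
Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
Affiliations: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: This manuscript has been submitted for peer review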
Abstract:Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net’s superior robustness and generalization capabilities. The code package and pre-trained models are available at this https URL.
[AI-73] Deep Reinforcement Learning Algorithms for Option Hedging
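[Quick read]: This paper addresses dynamic hedging by comparing a broad set of Deep Reinforcement Learning (DRL) algorithms. Its key contribution is evaluating eight DRL algorithms on the dynamic hedging task: Monte Carlo Policy Gradient (MCPG), Proximal Policy Optimization (PPO), four variants of Deep Q-Learning (DQL), and two variants of Deep Deterministic Policy Gradient (DDPG); two of these variants are new applications to dynamic hedging. Using the Black-Scholes delta hedge as a baseline and a dataset simulated from a GJR-GARCH(1,1) model, the study finds that MCPG and PPO perform best in terms of the root semi-quadratic penalty, and that MCPG is the only algorithm to beat the baseline within the allotted computational budget, possibly because rewards in the environment are sparse.
Link: https://arxiv.org/abs/2504.05521
Authors: Andrei Neagu, Frédéric Godin, Leila Kosseim
Affiliations: unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: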
Abstract:Dynamic hedging is a financial strategy that consists in periodically transacting one or multiple financial assets to offset the risk associated with a correlated liability. Deep Reinforcement Learning (DRL) algorithms have been used to find optimal solutions to dynamic hedging problems by framing them as sequential decision-making problems. However, most previous work assesses the performance of only one or two DRL algorithms, making an objective comparison across algorithms difficult. In this paper, we compare the performance of eight DRL algorithms in the context of dynamic hedging; Monte Carlo Policy Gradient (MCPG), Proximal Policy Optimization (PPO), along with four variants of Deep Q-Learning (DQL) and two variants of Deep Deterministic Policy Gradient (DDPG). Two of these variants represent a novel application to the task of dynamic hedging. In our experiments, we use the Black-Scholes delta hedge as a baseline and simulate the dataset using a GJR-GARCH(1,1) model. Results show that MCPG, followed by PPO, obtain the best performance in terms of the root semi-quadratic penalty. Moreover, MCPG is the only algorithm to outperform the Black-Scholes delta hedge baseline with the allotted computational budget, possibly due to the sparsity of rewards in our environment.
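To make the baseline and the metric mentioned in this abstract more concrete, here is a small sketch. The Black-Scholes delta of a European call is the standard closed form N(d1). The function `root_semi_quadratic_penalty` is an assumption: it takes the metric to be the square root of the mean squared positive hedging shortfall, which the abstract does not spell out, so the paper's exact definition may differ. All function names and the sample numbers are illustrative.

```python
import math


def norm_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_delta(spot, strike, rate, sigma, tau):
    """Black-Scholes delta of a European call with time to maturity tau (in years)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)


def root_semi_quadratic_penalty(hedging_errors):
    """Assumed definition: sqrt(mean(max(error, 0)^2)), so only shortfalls are penalized."""
    squared_losses = [max(err, 0.0) ** 2 for err in hedging_errors]
    return math.sqrt(sum(squared_losses) / len(squared_losses))


# At-the-money call, zero rate, 20% volatility, one year to maturity: delta is about 0.54.
delta = bs_call_delta(spot=100.0, strike=100.0, rate=0.0, sigma=0.2, tau=1.0)

# Hypothetical terminal hedging errors; the gain of 2.0 does not offset the two shortfalls.
penalty = root_semi_quadratic_penalty([1.0, -2.0, 3.0])
```

Under this assumed definition, a hedger is not rewarded for gains, which is why a one-sided penalty can rank strategies differently from a symmetric mean squared error.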
[AI-74] Large-Scale Classification of Shortwave Communication Signals with Machine Learning
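[Quick read]: This paper addresses the typical challenges of classifying the shortwave spectrum: the large number of signal types, the presence of various analog modulations, and ionospheric propagation. It proposes a blind classification approach based on a deep convolutional neural network (DCNN). The key point is that no prior knowledge or manual, signal-specific feature design is needed; the network is trained on a large number of synthetically generated signals and high-quality recordings to recognize 160 typical shortwave signal classes. Evaluated on real-world signals captured by globally deployed receiver hardware, it reaches up to 90% accuracy with an observation time of only 1 second.
Link: https://arxiv.org/abs/2504.05455
Authors: Stefan Scholl
Affiliations: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: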
Abstract:This paper presents a deep learning approach to the classification of 160 shortwave radio signals. It addresses the typical challenges of the shortwave spectrum, which are the large number of different signal types, the presence of various analog modulations and ionospheric propagation. As a classifier a deep convolutional neural network is used, that is trained to recognize 160 typical shortwave signal classes. The approach is blind and therefore does not require preknowledge or special preprocessing of the signal and no manual design of discriminative features for each signal class. The network is trained on a large number of synthetically generated signals and high quality recordings. Finally, the network is evaluated on real-world radio signals obtained from globally deployed receiver hardware and achieves up to 90% accuracy for an observation time of only 1 second.
[AI-75] Non-linear Phillips Curve for India: Evidence from Explainable Machine Learning
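[Quick read]: This paper addresses the difficulty that conventional linear Phillips curve models have in producing accurate inflation forecasts in the presence of structural breaks and inherent nonlinearities. The key to the solution is incorporating machine learning methods into a New Keynesian Phillips Curve framework. Explainable machine learning techniques both improve forecast accuracy significantly and reveal that the Phillips curve relationship for inflation in India is highly nonlinear, with threshold effects and interactions among variables.
Link: https://arxiv.org/abs/2504.05350
Authors: Shovon Sengupta, Bhanu Pratap, Amit Pawar
Affiliations: unknown
Categories: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: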
Abstract:The conventional linear Phillips curve model, while widely used in policymaking, often struggles to deliver accurate forecasts in the presence of structural breaks and inherent nonlinearities. This paper addresses these limitations by leveraging machine learning methods within a New Keynesian Phillips Curve framework to forecast and explain headline inflation in India, a major emerging economy. Our analysis demonstrates that machine learning-based approaches significantly outperform standard linear models in forecasting accuracy. Moreover, by employing explainable machine learning techniques, we reveal that the Phillips curve relationship in India is highly nonlinear, characterized by thresholds and interaction effects among key variables. Headline inflation is primarily driven by inflation expectations, followed by past inflation and the output gap, while supply shocks, except rainfall, exert only a marginal influence. These findings highlight the ability of machine learning models to improve forecast accuracy and uncover complex, nonlinear dynamics in inflation data, offering valuable insights for policymakers.
[AI-76] Hyperflows: Pruning Reveals the Importance of Weights
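[Quick read]: This paper addresses the difficulty that existing network pruning methods have in accurately assessing the importance of individual weights at extreme sparsity, which degrades performance. Its key contribution is Hyperflows, a dynamic pruning method that estimates each weight's importance by observing the network's gradient response to the weight's removal. The core of the approach is a global pressure term that continuously pushes all weights toward pruning, while weights critical for accuracy are automatically regrown based on their flow, the aggregated gradient signal accumulated while they are absent. The authors also study the relationship between final sparsity and pressure and derive power-law equations similar to neural scaling laws. Empirically, Hyperflows reaches state-of-the-art pruning results with ResNet-50 and VGG-19 on CIFAR-10 and CIFAR-100.
Link: https://arxiv.org/abs/2504.05349
Authors: Eugen Barbulescu, Antonio Alexoaie
Affiliations: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: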
Abstract:Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most existing methods struggle to accurately assess the importance of individual weights due to their inherent interrelatedness, leading to poor performance, especially at extreme sparsity levels. We introduce Hyperflows, a dynamic pruning approach that estimates each weight’s importance by observing the network’s gradient response to the weight’s removal. A global pressure term continuously drives all weights toward pruning, with those critical for accuracy being automatically regrown based on their flow, the aggregated gradient signal when they are absent. We explore the relationship between final sparsity and pressure, deriving power-law equations similar to those found in neural scaling laws. Empirically, we demonstrate state-of-the-art results with ResNet-50 and VGG-19 on CIFAR-10 and CIFAR-100.
Machine Learning
[LG-0] NNN: Next-Generation Neural Networks for Marketing Mix Modeling
Link: https://arxiv.org/abs/2504.06212
Authors: Thomas Mulc, Mike Anderson, Paul Cubre, Huikun Zhang, Ivy Liu, Saket Kumar
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:We present NNN, a Transformer-based neural network approach to Marketing Mix Modeling (MMM) designed to address key limitations of traditional methods. Unlike conventional MMMs which rely on scalar inputs and parametric decay functions, NNN uses rich embeddings to capture both quantitative and qualitative aspects of marketing and organic channels (e.g., search queries, ad creatives). This, combined with its attention mechanism, enables NNN to model complex interactions, capture long-term effects, and potentially improve sales attribution accuracy. We show that L1 regularization permits the use of such expressive models in typical data-constrained settings. Evaluating NNN on simulated and real-world data demonstrates its efficacy, particularly through considerable improvement in predictive power. Beyond attribution, NNN provides valuable, complementary insights through model probing, such as evaluating keyword or creative effectiveness, enhancing model interpretability.
[LG-1] The Work Capacity of Channels with Memory: Maximum Extractable Work in Percept-Action Loops
Link: https://arxiv.org/abs/2504.06209
Authors: Lukas J. Fiderer, Paul C. Barth, Isaac D. Smith, Hans J. Briegel
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Adaptation and Self-Organizing Systems (nlin.AO); Chaotic Dynamics (nlin.CD); Quantum Physics (quant-ph)
Comments: 10+32 pages; 6+19 figures
Abstract:Predicting future observations plays a central role in machine learning, biology, economics, and many other fields. It lies at the heart of organizational principles such as the variational free energy principle and has even been shown – based on the second law of thermodynamics – to be necessary for reaching the fundamental energetic limits of sequential information processing. While the usefulness of the predictive paradigm is undisputed, complex adaptive systems that interact with their environment are more than just predictive machines: they have the power to act upon their environment and cause change. In this work, we develop a framework to analyze the thermodynamics of information processing in percept-action loops – a model of agent-environment interaction – allowing us to investigate the thermodynamic implications of actions and percepts on equal footing. To this end, we introduce the concept of work capacity – the maximum rate at which an agent can expect to extract work from its environment. Our results reveal that neither of two previously established design principles for work-efficient agents – maximizing predictive power and forgetting past actions – remains optimal in environments where actions have observable consequences. Instead, a trade-off emerges: work-efficient agents must balance prediction and forgetting, as remembering past actions can reduce the available free energy. This highlights a fundamental departure from the thermodynamics of passive observation, suggesting that prediction and energy efficiency may be at odds in active learning systems.
[LG-2] Hall Effect Thruster Forecasting using a Topological Approach for Data Assimilation
Link: https://arxiv.org/abs/2504.06157
Authors: Max M. Chumley, Firas A. Khasawneh
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 13 figures
Abstract:Hall Effect Thrusters (HETs) are electric thrusters that eject heavy ionized gas particles from the spacecraft to generate thrust. Although traditionally they were used for station keeping, recently they have been used for interplanetary space missions due to their high delta-V potential and their operational longevity in contrast to other thrusters, e.g., chemical. However, the operation of HETs involves complex processes such as ionization of gases, strong magnetic fields, and complicated solar panel power supply interactions. Therefore, their operation is extremely difficult to model, thus necessitating Data Assimilation (DA) approaches for estimating and predicting their operational states. Because the operating environment of HETs is often noisy with non-Gaussian sources, the set of applicable DA tools is significantly limited. We describe a topological approach for data assimilation that bypasses these limitations because it does not depend on the noise model, and use it to forecast spatiotemporal plume field states of HETs. Our approach is a generalization of the Topological Approach for Data Assimilation (TADA) method that allows including different forecast functions. We show how TADA can be combined with the Long Short-Term Memory network for accurate forecasting. We then apply our approach to high-fidelity Hall Effect Thruster (HET) simulation data from the Air Force Research Laboratory (AFRL) rocket propulsion division, where we demonstrate the forecast resiliency of TADA on noise-contaminated, high-dimensional data.
[LG-3] Adversarial Training of Reward Models
Link: https://arxiv.org/abs/2504.06141
Authors: Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, Tuo Zhao
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 7 figures
Abstract:Reward modeling has emerged as a promising approach for the scalable alignment of language models. However, contemporary reward models (RMs) often lack robustness, awarding high rewards to low-quality, out-of-distribution (OOD) samples. This can lead to reward hacking, where policies exploit unintended shortcuts to maximize rewards, undermining alignment. To address this challenge, we introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples – responses that receive high rewards from the target RM but are OOD and of low quality. By leveraging reinforcement learning, Adv-RM trains a policy to generate adversarial examples that reliably expose vulnerabilities in large state-of-the-art reward models such as Nemotron 340B RM. Incorporating these adversarial examples into the reward training process improves the robustness of RMs, mitigating reward hacking and enhancing downstream performance in RLHF. We demonstrate that Adv-RM significantly outperforms conventional RM training, increasing stability and enabling more effective RLHF training in both synthetic and real-data settings.
[LG-4] Accelerating Vehicle Routing via AI-Initialized Genetic Algorithms
Link: https://arxiv.org/abs/2504.06126
Authors: Ido Greenberg, Piotr Sielski, Hugo Linsenmaier, Rajesh Gandham, Shie Mannor, Alex Fender, Gal Chechik, Eli Meirom
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Vehicle Routing Problems (VRP) are an extension of the Traveling Salesperson Problem and are a fundamental NP-hard challenge in combinatorial optimization. Solving VRP in real-time at large scale has become critical in numerous applications, from growing markets like last-mile delivery to emerging use-cases like interactive logistics planning. Such applications involve solving similar problem instances repeatedly, yet current state-of-the-art solvers treat each instance on its own without leveraging previous examples. We introduce a novel optimization framework that uses a reinforcement learning agent - trained on prior instances - to quickly generate initial solutions, which are then further optimized by genetic algorithms. Our framework, Evolutionary Algorithm with Reinforcement Learning Initialization (EARLI), consistently outperforms current state-of-the-art solvers across various time scales. For example, EARLI handles vehicle routing with 500 locations within 1s, 10x faster than current solvers for the same solution quality, enabling applications like real-time and interactive routing. EARLI can generalize to new data, as demonstrated on real e-commerce delivery data of a previously unseen city. Our hybrid framework presents a new way to combine reinforcement learning and genetic algorithms, paving the road for closer interdisciplinary collaboration between AI and optimization communities towards real-time optimization in diverse domains.
[LG-5] Robo-taxi Fleet Coordination at Scale via Reinforcement Learning
Link: https://arxiv.org/abs/2504.06125
Authors: Luigi Tresca, Carolin Schmidt, James Harrison, Filipe Rodrigues, Gioele Zardini, Daniele Gammelli, Marco Pavone
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 12 pages, 6 figures, 6 tables
Abstract:Fleets of robo-taxis offering on-demand transportation services, commonly known as Autonomous Mobility-on-Demand (AMoD) systems, hold significant promise for societal benefits, such as reducing pollution, energy consumption, and urban congestion. However, orchestrating these systems at scale remains a critical challenge, with existing coordination algorithms often failing to exploit the systems’ full potential. This work introduces a novel decision-making framework that unites mathematical modeling with data-driven techniques. In particular, we present the AMoD coordination problem through the lens of reinforcement learning and propose a graph network-based framework that exploits the main strengths of graph representation learning, reinforcement learning, and classical operations research tools. Extensive evaluations across diverse simulation fidelities and scenarios demonstrate the flexibility of our approach, achieving superior system performance, computational efficiency, and generalizability compared to prior methods. Finally, motivated by the need to democratize research efforts in this area, we release publicly available benchmarks, datasets, and simulators for network-level coordination alongside an open-source codebase designed to provide accessible simulation platforms and establish a standardized validation process for comparing methodologies. Code available at: this https URL
[LG-6] Leveraging Axis-Aligned Subspaces for High-Dimensional Bayesian Optimization with Group Testing
Link: https://arxiv.org/abs/2504.06111
Authors: Erik Hellsten, Carl Hvarfner, Leonard Papenmeier, Luigi Nardi
Subjects: Machine Learning (cs.LG)
Abstract:Bayesian optimization (BO) is an effective method for optimizing expensive-to-evaluate black-box functions. While high-dimensional problems can be particularly challenging, due to the multitude of parameter choices and the potentially high number of data points required to fit the model, this limitation can be addressed if the problem satisfies simplifying assumptions. Axis-aligned subspace approaches, where few dimensions have a significant impact on the objective, motivated several algorithms for high-dimensional BO. However, the validity of this assumption is rarely verified, and the assumption is rarely exploited to its full extent. We propose a group testing (GT) approach to identify active variables to facilitate efficient optimization in these domains. The proposed algorithm, Group Testing Bayesian Optimization (GTBO), first runs a testing phase where groups of variables are systematically selected and tested on whether they influence the objective, then terminates once active dimensions are identified. To that end, we extend the well-established GT theory to functions over continuous domains. In the second phase, GTBO guides optimization by placing more importance on the active dimensions. By leveraging the axis-aligned subspace assumption, GTBO outperforms state-of-the-art methods on benchmarks satisfying the assumption of axis-aligned subspaces, while offering improved interpretability.
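The testing phase described in this abstract can be pictured with a much simpler, noise-free stand-in: perturb a group of variables away from a default point, check whether the objective moves, and bisect the groups that do. The sketch below is only an illustration under those assumptions; the function names, the `delta` perturbation, and the bisection strategy are hypothetical and are not taken from GTBO. Note that effects can cancel inside a group (for example `x[0] - x[1]` with both shifted equally), which this toy version does not handle.

```python
def group_is_active(f, default, group, delta=1.0, tol=1e-9):
    """Return True if shifting every variable in `group` changes f."""
    probe = list(default)
    for i in group:
        probe[i] += delta
    return abs(f(probe) - f(default)) > tol


def find_active(f, default, group=None):
    """Recursively bisect groups of variable indices to locate active ones."""
    if group is None:
        group = list(range(len(default)))
    if not group_is_active(f, default, group):
        return []          # the whole group looks inactive: stop testing it
    if len(group) == 1:
        return group       # a single active variable has been isolated
    mid = len(group) // 2
    return find_active(f, default, group[:mid]) + find_active(f, default, group[mid:])


if __name__ == "__main__":
    # Toy objective over six variables; only x[1] and x[4] matter.
    objective = lambda x: 3 * x[1] + x[4] ** 2
    print(find_active(objective, [0.0] * 6))
```

With few active variables, most groups are discarded after a single evaluation, which is the saving group testing is meant to provide.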
[LG-7] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
Link: https://arxiv.org/abs/2504.06095
Authors: Daiyaan Arfeen, Dheevatsa Mudigere, Ankit More, Bhargava Gopireddy, Ahmet Inci, Gregory R. Ganger
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract:LLM training is scaled up to tens of thousands of GPUs by a mix of data-parallel (DP) and model-parallel (MP) execution. Critical to achieving efficiency is tensor-parallel (TP; a form of MP) execution within tightly-coupled subsets of GPUs, referred to as a scale-up domain, and the larger the scale-up domain the better the performance. New datacenter architectures are emerging with more GPUs able to be tightly-coupled in a scale-up domain, such as moving from 8 GPUs to 72 GPUs connected via NVLink. Unfortunately, larger scale-up domains increase the blast-radius of failures, with the failure of a single GPU potentially impacting TP execution on the full scale-up domain, which can degrade overall LLM training throughput dramatically. With as few as 0.1% of GPUs being in a failed state, a high TP-degree job can experience nearly 10% reduction in LLM training throughput. We propose nonuniform-tensor-parallelism (NTP) to mitigate this amplified impact of GPU failures. In NTP, a DP replica that experiences GPU failures operates at a reduced TP degree, contributing throughput equal to the percentage of still-functional GPUs. We also propose a rack design with improved electrical and thermal capabilities in order to sustain power-boosting of scale-up domains that have experienced failures; combined with NTP, this can allow the DP replica with the reduced TP degree (i.e., with failed GPUs) to keep up with the others, thereby achieving near-zero throughput loss for large-scale LLM training.
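The failure amplification mentioned in this abstract can be illustrated with a back-of-the-envelope model. The sketch below assumes independent GPU failures and assumes that, under uniform TP, one failed GPU idles its entire scale-up domain; both are simplifications made here for illustration, not the paper's model, and the function names are hypothetical.

```python
def uniform_tp_loss(fail_frac: float, domain_size: int) -> float:
    """Expected throughput fraction lost when any failed GPU
    takes its whole scale-up domain out of service."""
    return 1.0 - (1.0 - fail_frac) ** domain_size


def ntp_loss(fail_frac: float) -> float:
    """Expected fraction lost when a replica keeps running on its
    surviving GPUs and contributes throughput proportional to them."""
    return fail_frac


if __name__ == "__main__":
    for size in (8, 72):
        print(size, round(uniform_tp_loss(0.001, size), 4), ntp_loss(0.001))
```

Under these assumptions, a 0.1% failure rate costs under 1% of throughput with 8-GPU domains but roughly 7% with 72-GPU domains, the same order of magnitude as the "nearly 10%" figure quoted above, while the proportional NTP-style loss stays at 0.1%.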
[LG-8] Collaborative Prediction: Tractable Information Aggregation via Agreement
Link: https://arxiv.org/abs/2504.06075
Authors: Natalie Collina, Ira Globus-Harris, Surbhi Goel, Varun Gupta, Aaron Roth, Mirah Shi
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
Abstract:We give efficient “collaboration protocols” through which two parties, who observe different features about the same instances, can interact to arrive at predictions that are more accurate than either could have obtained on their own. The parties only need to iteratively share and update their own label predictions, without either party ever having to share the actual features that they observe. Our protocols are efficient reductions to the problem of learning on each party’s feature space alone, and so can be used even in settings in which each party’s feature space is illegible to the other, which arises in models of human/AI interaction and in multi-modal learning. The communication requirements of our protocols are independent of the dimensionality of the data. In an online adversarial setting we show how to give regret bounds on the predictions that the parties arrive at with respect to a class of benchmark policies defined on the joint feature space of the two parties, despite the fact that neither party has access to this joint feature space. We also give simpler algorithms for the same task in the batch setting in which we assume that there is a fixed but unknown data distribution. We generalize our protocols to a decision theoretic setting with high dimensional outcome spaces, where parties communicate only “best response actions.” Our theorems give a computationally and statistically tractable generalization of past work on information aggregation amongst Bayesians who share a common and correct prior, as part of a literature studying “agreement” in the style of Aumann’s agreement theorem. Our results require no knowledge of (or even the existence of) a prior distribution and are computationally efficient. Nevertheless, we show how to lift our theorems back to this classical Bayesian setting, and in doing so, give new information aggregation theorems for Bayesian agreement.
[LG-9] PINP: Physics-Informed Neural Predictor with latent estimation of fluid flows
Link: https://arxiv.org/abs/2504.06070
Authors: Huaguan Chen, Yang Liu, Hao Sun
Subjects: Machine Learning (cs.LG)
Abstract:Accurately predicting fluid dynamics and evolution has been a long-standing challenge in physical sciences. Conventional deep learning methods often rely on the nonlinear modeling capabilities of neural networks to establish mappings between past and future states, overlooking the fluid dynamics, or only modeling the velocity field, neglecting the coupling of multiple physical quantities. In this paper, we propose a new physics-informed learning approach that incorporates coupled physical quantities into the prediction process to assist with forecasting. Central to our method is the discretization of physical equations, which are directly integrated into the model architecture and loss function. This integration enables the model to provide robust, long-term future predictions. By incorporating physical equations, our model demonstrates temporal extrapolation and spatial generalization capabilities. Experimental results show that our approach achieves state-of-the-art performance in spatiotemporal prediction across both numerical simulations and real-world extreme-precipitation nowcasting benchmarks.
[LG-10] Explainable AI for building energy retrofitting under data scarcity
Link: https://arxiv.org/abs/2504.06055
Authors: Panagiota Rempi, Sotiris Pelekis, Alexandros Menelaos Tzortzis, Evangelos Karakolis, Christos Ntanos, Dimitris Askounis
Subjects: Machine Learning (cs.LG)
Abstract:Enhancing energy efficiency in residential buildings is a crucial step toward mitigating climate change and reducing greenhouse gas emissions. Retrofitting existing buildings, which account for a significant portion of energy consumption, is critical particularly in regions with outdated and inefficient building stocks. This study presents an Artificial Intelligence (AI) and Machine Learning (ML)-based framework to recommend energy efficiency measures for residential buildings, leveraging accessible building characteristics to achieve energy class targets. Using Latvia as a case study, the methodology addresses challenges associated with limited datasets, class imbalance and data scarcity. The proposed approach integrates Conditional Tabular Generative Adversarial Networks (CTGAN) to generate synthetic data, enriching and balancing the dataset. A Multi-Layer Perceptron (MLP) model serves as the predictive model performing multi-label classification to predict appropriate retrofit strategies. Explainable Artificial Intelligence (XAI), specifically SHapley Additive exPlanations (SHAP), ensures transparency and trust by identifying key features that influence recommendations and guiding feature engineering choices for improved reliability and performance. The evaluation of the approach shows that it notably overcomes data limitations, achieving improvements up to 54% in precision, recall and F1 score. Although this study focuses on Latvia, the methodology is adaptable to other regions, underscoring the potential of AI in reducing the complexity and cost of building energy retrofitting overcoming data limitations. By facilitating decision-making processes and promoting stakeholders engagement, this work supports the global transition toward sustainable energy use in the residential building sector.
[LG-11] Trust-Region Twisted Policy Improvement
Link: https://arxiv.org/abs/2504.06048
Authors: Joery A. de Vries, Jinke He, Yaniv Oren, Matthijs T.J. Spaan
Subjects: Machine Learning (cs.LG)
Abstract:Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
[LG-12] Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models
Link: https://arxiv.org/abs/2504.05978
Authors: J.S. van Hulst, W.P.M.H. Heemels, D.J. Antunes
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted for publication
Abstract:Reinforcement learning (RL) is a powerful tool for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We propose using prior model knowledge to guide the exploration process to speed up this learning process. This model knowledge comes in the form of a model set to which the true transition kernel and reward function belong. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also show that we obtain finite-time convergence to the optimal policy under additional assumptions. We demonstrate the effectiveness of the proposed exploration strategy in a simulation study. The results indicate that the proposed method can significantly speed up the learning process in reinforcement learning.
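One way bounds on the Q-function can guide exploration is by ruling out actions that cannot be optimal: if an action's upper bound lies below another action's lower bound, it is provably worse. The sketch below shows that pruning step for a single state. The dictionary layout, the function names, and the "widest gap" tie-breaking rule are assumptions made for illustration and are not taken from the paper.

```python
def candidate_actions(q_lower, q_upper):
    """Return actions that could still be optimal in one state.

    An action is dropped when its upper bound is below the best lower
    bound, because some other action is then guaranteed to be better.
    """
    best_lower = max(q_lower.values())
    return [a for a in q_upper if q_upper[a] >= best_lower]


def explore(q_lower, q_upper):
    """Pick the surviving action with the widest bound gap (hypothetical rule)."""
    keep = candidate_actions(q_lower, q_upper)
    return max(keep, key=lambda a: q_upper[a] - q_lower[a])


if __name__ == "__main__":
    lower = {"left": 0.5, "right": 0.1, "stay": 0.0}
    upper = {"left": 0.9, "right": 0.4, "stay": 1.0}
    print(candidate_actions(lower, upper), explore(lower, upper))
```

Here "right" is never tried, since its best case (0.4) is worse than the worst case of "left" (0.5), so exploration effort goes only to actions that might still matter.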
[LG-13] MLPROP – an open interactive web interface for thermophysical property prediction with machine learning
Link: https://arxiv.org/abs/2504.05970
Authors: Marco Hoffmann, Thomas Specht, Nicolas Hayer, Hans Hasse, Fabian Jirasek
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures
Abstract:Machine learning (ML) enables the development of powerful methods for predicting thermophysical properties with unprecedented scope and accuracy. However, technical barriers like cumbersome implementation in established workflows hinder their application in practice. With MLPROP, we provide an interactive web interface for directly applying advanced ML methods to predict thermophysical properties without requiring ML expertise, thereby substantially increasing the accessibility of novel models. MLPROP currently includes models for predicting the vapor pressure of pure components (GRAPPA), activity coefficients and vapor-liquid equilibria in binary mixtures (UNIFAC 2.0, mod. UNIFAC 2.0, and HANNA), and a routine to fit NRTL parameters to the model predictions. MLPROP will be continuously updated and extended and is accessible free of charge via this https URL. MLPROP removes the barrier to learning and experimenting with new ML-based methods for predicting thermophysical properties. The source code of all models is available as open source, which allows integration into existing workflows.
[LG-14] Autoencoder-Based Detection of Anomalous Stokes V Spectra in the Flare-Producing Active Region 13663 Using Hinode/SP Observations
Link: https://arxiv.org/abs/2504.05962
Authors: Jargalmaa Batmunkh (1), Yusuke Iida (1), Takayoshi Oba (2) ((1) Niigata University, (2) Max Planck Institute for Solar System Research)
Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
Abstract:Detecting unusual signals in observational solar spectra is crucial for understanding the features associated with impactful solar events, such as solar flares. However, existing spectral analysis techniques face challenges, particularly when relying on pre-defined, physics-based calculations to process large volumes of noisy and complex observational data. To address these limitations, we applied deep learning to detect anomalies in the Stokes V spectra from the Hinode/SP instrument. Specifically, we developed an autoencoder model for spectral compression, which serves as an anomaly detection method. Our model effectively identifies anomalous spectra within spectro-polarimetric maps captured prior to the onset of the X1.3 flare on May 5, 2024, in NOAA AR 13663. These atypical spectral points exhibit highly complex profiles and spatially align with polarity inversion lines in magnetogram images, indicating their potential as sites of magnetic energy storage and possible triggers for flares. Notably, the detected anomalies are highly localized, making them particularly challenging to identify in magnetogram images using current manual methods.
[LG-15] Drought forecasting using a hybrid neural architecture for integrating time series and static data ICLR2025
Link: https://arxiv.org/abs/2504.05957
Authors: Julian Agudelo, Vincent Guigue, Cristina Manfredotti, Hadrien Piot
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, published as a workshop paper at Tackling Climate Change with Machine Learning at ICLR 2025, Tackling Climate Change with Machine Learning is a non-archival workshop
Abstract:Reliable forecasting is critical for early warning systems and adaptive drought management. Most previous deep learning approaches focus solely on homogeneous regions and rely on single-structured data. This paper presents a hybrid neural architecture that integrates time series and static data, achieving state-of-the-art performance on the DroughtED dataset. Our results illustrate the potential of designing neural models for the treatment of heterogeneous data in climate related tasks and present reliable prediction of USDM categories, an expert-informed drought metric. Furthermore, this work validates the potential of DroughtED for enabling location-agnostic training of deep learning models.
[LG-16] Evaluation of the impact of expert knowledge: How decision support scores impact the effectiveness of automatic knowledge-driven feature engineering (aKDFE)
Link: https://arxiv.org/abs/2504.05928
Authors: Olof Björneld, Tora Hammar, Daniel Nilsson, Alisa Lincke, Welf Löwe
Subjects: Machine Learning (cs.LG)
Comments: 43 pages, including the Appendix, 19 tables and 13 figures
Abstract:Adverse Drug Events (ADEs), harmful medication effects, pose significant healthcare challenges, impacting patient safety and costs. This study evaluates automatic Knowledge-Driven Feature Engineering (aKDFE) for improved ADE prediction from Electronic Health Record (EHR) data, comparing it with automated event-based Knowledge Discovery in Databases (KDD). We investigated how incorporating domain-specific ADE risk scores for prolonged heart QT interval, extracted from the Janusmed Riskprofile (Janusmed) Clinical Decision Support System (CDSS), affects prediction performance using EHR data and medication handling events. Results indicate that, while aKDFE step 1 (event-based feature generation) alone did not significantly improve ADE prediction performance, aKDFE step 2 (patient-centric transformation) enhances the prediction performance. High Area Under the Receiver Operating Characteristic curve (AUROC) values suggest strong feature correlations to the outcome, aligning with the predictive power of patients’ prior healthcare history for ADEs. Statistical analysis did not confirm that incorporating the Janusmed information (i) risk scores and (ii) medication route of administration into the model’s feature set enhanced predictive performance. However, the patient-centric transformation applied by aKDFE proved to be a highly effective feature engineering approach. Limitations include a single-project focus, potential bias from machine learning pipeline methods, and reliance on AUROC. In conclusion, aKDFE, particularly with patient-centric transformation, improves ADE prediction from EHR data. Future work will explore attention-based models, event feature sequences, and automatic methods for incorporating domain knowledge into the aKDFE framework.
[LG-17] Deep RL-based Autonomous Navigation of Micro Aerial Vehicles (MAVs) in a complex GPS-denied Indoor Environment
Link: https://arxiv.org/abs/2504.05918
Authors: Amit Kumar Singh, Prasanth Kumar Duba, P. Rajalakshmi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:The autonomy of Unmanned Aerial Vehicles (UAVs) in indoor environments poses significant challenges due to the lack of reliable GPS signals in enclosed spaces such as warehouses, factories, and indoor facilities. Micro Aerial Vehicles (MAVs) are preferred for navigating in these complex, GPS-denied scenarios because of their agility, low power consumption, and limited computational capabilities. In this paper, we propose a Reinforcement Learning based Deep-Proximal Policy Optimization (D-PPO) algorithm to enhance real-time navigation by improving computational efficiency. The end-to-end network is trained in 3D realistic meta-environments created using the Unreal Engine. With these trained meta-weights, the MAV system underwent extensive experimental trials in real-world indoor environments. The results indicate that the proposed method reduces computational latency by 91% during the training period without significant degradation in performance. The algorithm was tested on a DJI Tello drone, yielding similar results.
[LG-18] HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Link: https://arxiv.org/abs/2504.05897
Authors: Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by DAC 25
Abstract:The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it enables increasing model capacity without a proportional increase in computation. However, the large MoE model size still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead but faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, the hybrid CPU-GPU schedule for MoE is inherently complex due to the diverse expert sizes, structures, uneven workload distribution, etc. To address these challenges, in this paper, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to the state-of-the-art hybrid MoE inference framework. Our code is available at: this https URL.
[LG-19] Why do zeroes happen? A model-based approach for demand classification
Link: https://arxiv.org/abs/2504.05894
Authors: Ivan Svetunkov, Anna Sroginis
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 39 pages, 11 figures, 3 tables
Abstract:Effective demand forecasting is critical for inventory management, production planning, and decision making across industries. Selecting the appropriate model and suitable features to efficiently capture patterns in the data is one of the main challenges in demand forecasting. In reality, this becomes even more complicated when the recorded sales have zeroes, which can happen naturally or due to some anomalies, such as stockouts and recording errors. Mistreating the zeroes can lead to the application of inappropriate forecasting methods, and thus to poor decision making. Furthermore, the demand itself can have different fundamental characteristics, and being able to distinguish one type from another might bring substantial benefits in terms of accuracy and thus decision making. We propose a two-stage model-based classification framework that first identifies artificially occurring zeroes, and then classifies demand into one of the possible types: regular/intermittent, intermittent smooth/lumpy, fractional/count. The framework utilises statistical modelling and information criteria to detect anomalous zeroes and then classify demand into those categories. We then argue that different types of demand need different features, and show empirically that they tend to increase the accuracy of the forecasting methods compared to those applied directly to the dataset without the generated features and the two-stage framework. Our general practical recommendation based on that is to use the mixture approach for intermittent demand, capturing the demand sizes and demand probability separately, as it seems to improve the accuracy of different forecasting approaches.
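The closing recommendation, modelling demand sizes and demand probability separately, can be sketched with a well-known stand-in: a TSB-style (Teunter-Syntetos-Babai) exponential smoothing scheme, where the occurrence probability is updated every period and the size level only when a sale occurs. This is not the authors' model; the function name, the initial values, and the smoothing parameters `alpha` and `beta` are assumptions chosen for illustration.

```python
def tsb_forecast(sales, alpha=0.2, beta=0.1):
    """Forecast per-period demand as P(demand occurs) * expected size."""
    nonzero = [s for s in sales if s > 0]
    if not nonzero:
        return 0.0                      # no demand observed at all
    size = float(nonzero[0])            # initial size level: first non-zero sale
    prob = len(nonzero) / len(sales)    # initial occurrence probability
    for y in sales:
        occurred = 1.0 if y > 0 else 0.0
        prob += beta * (occurred - prob)   # probability is updated every period
        if y > 0:
            size += alpha * (y - size)     # size is updated only on a sale
    return prob * size


if __name__ == "__main__":
    print(tsb_forecast([0, 4, 0, 0, 6, 0]))
```

Keeping the two components apart means a long run of zeroes lowers the occurrence probability without dragging down the estimated order size, which a single smoothed average cannot do.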
[LG-20] To Give or Not to Give? The Impacts of Strategically Withheld Recourse
Link: https://arxiv.org/abs/2504.05891
Authors: Yatong Chen, Andrew Estornell, Yevgeniy Vorobeychik, Yang Liu
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.
Journal reference: Artificial Intelligence and Statistics (AISTATS 2025)
[LG-21] Energy-Conserving Neural Network Closure Model for Long-Time Accurate and Stable LES
Link: https://arxiv.org/abs/2504.05868
Authors: Toby van Gastelen, Wouter Edeling, Benjamin Sanderse
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 40 pages, 11 figures, source code can be found at this https URL
Abstract:Machine learning-based closure models for LES have shown promise in capturing complex turbulence dynamics but often suffer from instabilities and physical inconsistencies. In this work, we develop a novel skew-symmetric neural architecture as closure model that enforces stability while preserving key physical conservation laws. Our approach leverages a discretization that ensures mass, momentum, and energy conservation, along with a face-averaging filter to maintain mass conservation in coarse-grained velocity fields. We compare our model against several conventional data-driven closures (including unconstrained convolutional neural networks), and the physics-based Smagorinsky model. Performance is evaluated on decaying turbulence and Kolmogorov flow for multiple coarse-graining factors. In these test cases we observe that unconstrained machine learning models suffer from numerical instabilities. In contrast, our skew-symmetric model remains stable across all tests, though at the cost of increased dissipation. Despite this trade-off, we demonstrate that our model still outperforms the Smagorinsky model in unseen scenarios. These findings highlight the potential of structure-preserving machine learning closures for reliable long-time LES.
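Why a skew-symmetric structure helps stability can be seen from a generic energy argument. The derivation below is a textbook identity, not the paper's architecture: it assumes the closure contributes a term $K u$ to the evolution of a discrete velocity vector $u$, and the symbols $u$, $K$, and $Q$ are introduced here purely for illustration.

```latex
% Generic illustration; u (discrete velocity vector), K (closure operator)
% and Q are assumed symbols, not notation taken from the paper.
\begin{align*}
E &= \tfrac{1}{2}\, u^{\top} u, \\
\frac{\mathrm{d}E}{\mathrm{d}t}\bigg|_{\text{closure}}
  &= u^{\top} \frac{\mathrm{d}u}{\mathrm{d}t}\bigg|_{\text{closure}}
   = u^{\top} K u . \\
\text{If } K^{\top} = -K:\quad
u^{\top} K u &= (u^{\top} K u)^{\top} = u^{\top} K^{\top} u = -\,u^{\top} K u
\;\Rightarrow\; u^{\top} K u = 0 . \\
\text{If } K = -Q^{\top} Q:\quad
u^{\top} K u &= -\lVert Q u \rVert^{2} \le 0 .
\end{align*}
```

A skew-symmetric term therefore cannot inject energy, and a negative semi-definite term can only remove it, which is one way to read the abstract's trade-off between guaranteed stability and added dissipation.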
[LG-22] Adaptive Substructure-Aware Expert Model for Molecular Property Prediction
Link: https://arxiv.org/abs/2504.05844
Authors: Tianyi Jiang, Zeyu Wang, Shanqing Yu, Qi Xuan
Subjects: Machine Learning (cs.LG)
Abstract:Molecular property prediction is essential for applications such as drug discovery and toxicity assessment. While Graph Neural Networks (GNNs) have shown promising results by modeling molecules as molecular graphs, their reliance on data-driven learning limits their ability to generalize, particularly in the presence of data imbalance and diverse molecular substructures. Existing methods often overlook the varying contributions of different substructures to molecular properties, treating them uniformly. To address these challenges, we propose ASE-Mol, a novel GNN-based framework that leverages a Mixture-of-Experts (MoE) approach for molecular property prediction. ASE-Mol incorporates BRICS decomposition and significant substructure awareness to dynamically identify positive and negative substructures. By integrating a MoE architecture, it reduces the adverse impact of negative motifs while improving adaptability to positive motifs. Experimental results on eight benchmark datasets demonstrate that ASE-Mol achieves state-of-the-art performance, with significant improvements in both accuracy and interpretability.
[LG-23] Federated Unlearning Made Practical: Seamless Integration via Negated Pseudo-Gradients
Link: https://arxiv.org/abs/2504.05822
Authors: Alessio Mora, Carlo Mazzocca, Rebecca Montanari, Paolo Bellavista
Subjects: Machine Learning (cs.LG)
Abstract:The right to be forgotten is a fundamental principle of privacy-preserving regulations and extends to Machine Learning (ML) paradigms such as Federated Learning (FL). While FL enhances privacy by enabling collaborative model training without sharing private data, trained models still retain the influence of training data. Recently proposed Federated Unlearning (FU) methods often rely on assumptions that are impractical for real-world FL deployments, such as storing client update histories or requiring access to a publicly available dataset. To address these constraints, this paper introduces a novel method that leverages negated Pseudo-gradients Updates for Federated Unlearning (PUF). Our approach only uses standard client model updates, which are already exchanged during regular FL rounds, and interprets them as pseudo-gradients. When a client needs to be forgotten, we apply the negation of their pseudo-gradients, appropriately scaled, to the global model. Unlike state-of-the-art mechanisms, PUF seamlessly integrates with FL workflows, incurs no additional computational or communication overhead beyond standard FL rounds, and supports concurrent unlearning requests. We extensively evaluated the proposed method on two well-known benchmark image classification datasets (CIFAR-10 and CIFAR-100) and a real-world medical imaging dataset for segmentation (ProstateMRI), using three different neural architectures: two residual networks and a vision transformer. The experimental results across various settings demonstrate that PUF achieves state-of-the-art forgetting effectiveness and recovery time, without relying on any additional assumptions, thus underscoring its practical applicability.
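The core idea in this abstract, treating a client's model update as a pseudo-gradient and applying its negation, can be sketched on flat weight vectors. The sign convention (global weights minus client weights), the `scale` parameter, and the function names below are assumptions for illustration; how PUF actually scales and aggregates updates is not specified in the abstract.

```python
def pseudo_gradient(global_before, client_model):
    """Interpret a client update as a pseudo-gradient: w_global - w_client."""
    return [g - c for g, c in zip(global_before, client_model)]


def unlearn(global_model, client_pseudo_grad, scale=1.0):
    """Apply the negated, scaled pseudo-gradient to the global model.

    A regular server step would be w - lr * g; using -g instead moves the
    weights in the opposite direction of the client's contribution.
    """
    return [w + scale * g for w, g in zip(global_model, client_pseudo_grad)]


if __name__ == "__main__":
    before = [1.0, 2.0]
    client = [0.5, 2.5]                   # weights after the client's local training
    grad = pseudo_gradient(before, client)
    print(grad, unlearn(before, grad))
```

Because the pseudo-gradient is computed from updates the server receives anyway, no per-client history or auxiliary dataset is needed in this toy version, which mirrors the practicality argument made in the abstract.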
[LG-24] Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
Link: https://arxiv.org/abs/2504.05812
Authors: Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Subjects: Machine Learning (cs.LG)
Comments: Ongoing work
Abstract:While large language models (LLMs) have demonstrated exceptional capabilities in challenging tasks such as mathematical reasoning, existing methods to enhance reasoning ability predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data after pre-training. However, these approaches critically depend on external supervisions–such as human labelled reasoning traces, verified golden answers, or pre-trained reward models–which limits scalability and practical applicability. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. EMPO does not require any supervised information for incentivizing reasoning capabilities (i.e., neither verifiable reasoning traces, problems with golden answers, nor additional pre-trained reward models). By continuously minimizing the predictive entropy of LLMs on unlabeled user queries in a latent semantic space, EMPO enables purely self-supervised evolution of reasoning capabilities with strong flexibility and practicality. Our experiments demonstrate competitive performance of EMPO on both mathematical reasoning and free-form commonsense reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves truthfulness accuracy of Qwen2.5-7B Instruct from 87.16% to 97.25% on TruthfulQA.
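Minimizing predictive entropy in a semantic space can be pictured as follows: sample several answers to the same question, group answers that mean the same thing, and measure the entropy of the resulting group distribution. The sketch below uses normalized string equality as a crude stand-in for semantic clustering and rewards each sample by the share of its cluster; both choices, along with all function names, are assumptions for illustration rather than the EMPO procedure itself.

```python
import math
from collections import Counter


def cluster_key(answer: str) -> str:
    """Crude stand-in for semantic clustering: normalized exact match."""
    return answer.strip().lower()


def semantic_entropy(answers):
    """Shannon entropy (in nats) of the distribution over answer clusters."""
    counts = Counter(cluster_key(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def cluster_share_rewards(answers):
    """Hypothetical label-free reward: the fraction of samples in each answer's cluster."""
    counts = Counter(cluster_key(a) for a in answers)
    n = len(answers)
    return [counts[cluster_key(a)] / n for a in answers]


if __name__ == "__main__":
    samples = ["4", " 4", "5", "4 "]
    print(semantic_entropy(samples), cluster_share_rewards(samples))
```

Rewarding answers that fall in the dominant cluster pushes the sampled distribution toward agreement, which lowers this entropy without ever consulting a ground-truth label.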
[LG-25] AiGAS-dEVL-RC: An Adaptive Growing Neural Gas Model for Recurrently Drifting Unsupervised Data Streams
Link: https://arxiv.org/abs/2504.05761
Authors: Maria Arostegi, Miren Nekane Bilbao, Jesus L. Lobo, Javier Del Ser
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Concept drift and extreme verification latency pose significant challenges in data stream learning, particularly when dealing with recurring concept changes in dynamic environments. This work introduces a novel method based on the Growing Neural Gas (GNG) algorithm, designed to effectively handle abrupt recurrent drifts while adapting to incrementally evolving data distributions (incremental drifts). Leveraging the self-organizing and topological adaptability of GNG, the proposed approach maintains a compact yet informative memory structure, allowing it to efficiently store and retrieve knowledge of past or recurring concepts, even under conditions of delayed or sparse stream supervision. Our experiments highlight the superiority of our approach over existing data stream learning methods designed to cope with incremental non-stationarities and verification latency, demonstrating its ability to quickly adapt to new drifts, robustly manage recurring patterns, and maintain high predictive accuracy with a minimal memory footprint. Unlike other techniques that fail to leverage recurring knowledge, our proposed approach is proven to be a robust and efficient online learning solution for unsupervised drifting data flows.
[LG-26] Addressing Class Imbalance with Probabilistic Graphical Models and Variational Inference
Link: https://arxiv.org/abs/2504.05758
Authors: Yujia Lou, Jie Liu, Yuan Sheng, Jiawei Wang, Yiwei Zhang, Yaokun Ren
Categories: Machine Learning (cs.LG)
Abstract:This study proposes a method for imbalanced data classification based on deep probabilistic graphical models (DPGMs) to solve the problem that traditional methods have insufficient learning ability for minority class samples. To address the classification bias caused by class imbalance, we introduce variational inference optimization probability modeling, which enables the model to adaptively adjust the representation ability of minority classes and combines the class-aware weight adjustment strategy to enhance the classifier’s sensitivity to minority classes. In addition, we combine the adversarial learning mechanism to generate minority class samples in the latent space so that the model can better characterize the category boundary in the high-dimensional feature space. The experiment is evaluated on the Kaggle “Credit Card Fraud Detection” dataset and compared with a variety of advanced imbalanced classification methods (such as GAN-based sampling, BRF, XGBoost-Cost Sensitive, SAAD, HAN). The results show that the method in this study has achieved the best performance in AUC, Precision, Recall and F1-score indicators, effectively improving the recognition rate of minority classes and reducing the false alarm rate. This method can be widely used in imbalanced classification tasks such as financial fraud detection, medical diagnosis, and anomaly detection, providing a new solution for related research.
[LG-27] Interpretable Non-linear Survival Analysis with Evolutionary Symbolic Regression
Link: https://arxiv.org/abs/2504.05756
Authors: Luigi Rovito, Marco Virgolin
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract:Survival Regression (SuR) is a key technique for modeling time to event in important applications such as clinical trials and semiconductor manufacturing. Currently, SuR algorithms belong to one of three classes: non-linear black-box – allowing adaptability to many datasets but offering limited interpretability (e.g., tree ensembles); linear glass-box – being easier to interpret but limited to modeling only linear interactions (e.g., Cox proportional hazards); and non-linear glass-box – allowing adaptability and interpretability, but empirically found to have several limitations (e.g., explainable boosting machines, survival trees). In this work, we investigate whether Symbolic Regression (SR), i.e., the automated search of mathematical expressions from data, can lead to non-linear glass-box survival models that are interpretable and accurate. We propose an evolutionary, multi-objective, and multi-expression implementation of SR adapted to SuR. Our empirical results on five real-world datasets show that SR consistently outperforms traditional glass-box methods for SuR in terms of accuracy per number of dimensions in the model, while exhibiting comparable accuracy with black-box methods. Furthermore, we offer qualitative examples to assess the interpretability potential of SR models for SuR. Code at: this https URL.
[LG-28] Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment PAKDD2025
Link: https://arxiv.org/abs/2504.05716
Authors: Gen Li, Li Chen, Cheng Tang, Valdemar Švábenský, Daisuke Deguchi, Takayoshi Yamashita, Atsushi Shimada
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: To be published in Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2025)
Abstract:We explore the use of Large Language Models (LLMs) for automated assessment of open-text student reflections and prediction of academic performance. Traditional methods for evaluating reflections are time-consuming and may not scale effectively in educational settings. In this work, we employ LLMs to transform student reflections into quantitative scores using two assessment strategies (single-agent and multi-agent) and two prompting techniques (zero-shot and few-shot). Our experiments, conducted on a dataset of 5,278 reflections from 377 students over three academic terms, demonstrate that the single-agent with few-shot strategy achieves the highest match rate with human evaluations. Furthermore, models utilizing LLM-assessed reflection scores outperform baselines in both at-risk student identification and grade prediction tasks. These findings suggest that LLMs can effectively automate reflection assessment, reduce educators’ workload, and enable timely support for students who may need additional assistance. Our work emphasizes the potential of integrating advanced generative AI technologies into educational practices to enhance student engagement and academic success.
[LG-29] Dual Boost-Driven Graph-Level Clustering Network
Link: https://arxiv.org/abs/2504.05670
Authors: John Smith, Wenxuan Tu, Junlong Wu, Wenxin Zhang, Jingxin Liu, Haotian Wang, Jieren Cheng, Huajie Lei, Guangzhen Yao, Lingren Wang, Mengfei Li, Renda Han, Yu Li
Categories: Machine Learning (cs.LG)
Abstract:Graph-level clustering remains a pivotal yet formidable challenge in graph learning. Recently, the integration of deep learning with representation learning has demonstrated notable advancements, yielding performance enhancements to a certain degree. However, existing methods suffer from at least one of the following issues: (1) the original graph structure contains noise, and (2) during feature propagation and pooling, noise is gradually aggregated into the graph-level embeddings through information propagation. Consequently, these two limitations mask clustering-friendly information, leading to suboptimal graph-level clustering performance. To this end, we propose a novel Dual Boost-Driven Graph-Level Clustering Network (DBGCN) that alternately promotes graph-level clustering and filters out interference information in a unified framework. Specifically, in the pooling step, we evaluate the contribution of features at the global level and optimize them using a learnable transformation matrix to obtain a high-quality graph-level representation, such that the model’s reasoning capability can be improved. Moreover, to enable reliable graph-level clustering, we first identify and suppress information detrimental to clustering by evaluating similarities between graph-level representations, providing more accurate guidance for multi-view fusion. Extensive experiments demonstrate that DBGCN outperforms state-of-the-art graph-level clustering methods on six benchmark datasets.
[LG-30] Curved representational Bregman divergences and their applications
Link: https://arxiv.org/abs/2504.05654
Authors: Frank Nielsen
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 9 pages
Abstract:By analogy to curved exponential families, we define curved Bregman divergences as restrictions of Bregman divergences to sub-dimensional parameter subspaces, and prove that the barycenter of a finite weighted parameter set with respect to a curved Bregman divergence amounts to the Bregman projection onto the subspace induced by the constraint of the barycenter with respect to the unconstrained full Bregman divergence. We demonstrate the significance of curved Bregman divergences with two examples: (1) symmetrized Bregman divergences and (2) the Kullback-Leibler divergence between circular complex normal distributions. We then consider monotonic embeddings to define representational curved Bregman divergences and show that the α-divergences are representational curved Bregman divergences with respect to α-embeddings of the probability simplex into the positive measure cone. As an application, we report an efficient method to calculate the intersection of a finite set of α-divergence spheres.
[LG-31] TAGC: Optimizing Gradient Communication in Distributed Transformer Training
Link: https://arxiv.org/abs/2504.05638
Authors: Igor Polyakov, Alexey Dukhanov, Egor Spirin
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:The increasing complexity of large language models (LLMs) necessitates efficient training strategies to mitigate the high computational costs associated with distributed training. A significant bottleneck in this process is gradient synchronization across multiple GPUs, particularly in the zero-redundancy parallelism mode. In this paper, we introduce Transformer-Aware Gradient Compression (TAGC), an optimized gradient compression algorithm designed specifically for transformer-based models. TAGC extends the lossless homomorphic compression method by adapting it for sharded models and incorporating transformer-specific optimizations, such as layer-selective compression and dynamic sparsification. Our experimental results demonstrate that TAGC accelerates training by up to 15% compared to the standard Fully Sharded Data Parallel (FSDP) approach, with minimal impact on model quality. We integrate TAGC into the PyTorch FSDP framework; the implementation is publicly available at this https URL.
[LG-32] To Start Up a Start-Up-Embedding Strategic Demand Development in Operational On-Demand Fulfillment via Reinforcement Learning with Information Shaping
Link: https://arxiv.org/abs/2504.05633
Authors: Xinwei Chen, Marlin W. Ulmer, Barrett W. Thomas
Categories: Machine Learning (cs.LG)
Abstract:The last few years have witnessed rapid growth in the on-demand delivery market, with many start-ups entering the field. However, not all of these start-ups have succeeded, for reasons that include failing to establish a large enough customer base. In this paper, we address a problem that many on-demand transportation start-ups face: how to establish themselves in a new market. When starting, such companies often have limited fleet resources to serve demand across a city. Depending on the use of the fleet, varying service quality is observed in different areas of the city, and in turn, the service quality impacts the respective growth of demand in each area. Thus, operational fulfillment decisions drive the longer-term demand development. To integrate strategic demand development into real-time fulfillment operations, we propose a two-step approach. First, we derive analytical insights into optimal allocation decisions for a stylized problem. Second, we use these insights to shape the training data of a reinforcement learning strategy for operational real-time fulfillment. Our experiments demonstrate that combining operational efficiency with long-term strategic planning is highly advantageous. Further, we show that the careful shaping of training data is essential for the successful development of demand.
[LG-33] Maternal and Fetal Health Status Assessment by Using Machine Learning on Optical 3D Body Scans
Link: https://arxiv.org/abs/2504.05627
Authors: Ruting Cheng, Yijiang Zheng, Boyuan Feng, Chuhui Qiu, Zhuoxin Long, Joaquin A. Calderon, Xiaoke Zhang, Jaclyn M. Phillips, James K. Hahn
Categories: Machine Learning (cs.LG)
Abstract:Monitoring maternal and fetal health during pregnancy is crucial for preventing adverse outcomes. While tests such as ultrasound scans offer high accuracy, they can be costly and inconvenient. Telehealth and more accessible body shape information provide pregnant women with a convenient way to monitor their health. This study explores the potential of 3D body scan data, captured during the 18-24 gestational weeks, to predict adverse pregnancy outcomes and estimate clinical parameters. We developed a novel algorithm with two parallel streams that extract body shape features: one for supervised learning to extract sequential abdominal circumference information, and another for unsupervised learning to extract global shape descriptors, alongside a branch for demographic data. Our results indicate that 3D body shape can assist in predicting preterm labor, gestational diabetes mellitus (GDM), gestational hypertension (GH), and in estimating fetal weight. Compared to other machine learning models, our algorithm achieved the best performance, with prediction accuracies exceeding 88% and fetal weight estimation accuracy of 76.74% within a 10% error margin, outperforming conventional anthropometric methods by 22.22%.
[LG-34] Model-Agnostic Policy Explanations with Large Language Models
Link: https://arxiv.org/abs/2504.05625
Authors: Zhang Xi-Jia, Yue Guo, Shufei Chen, Simon Stepputtis, Matthew Gombolay, Katia Sycara, Joseph Campbell
Categories: Machine Learning (cs.LG)
Comments: This paper significantly extends our prior preprint [arXiv:2311.18062], which was not peer-reviewed and has since been substantially revised in methods, results, and authorship
Abstract:Intelligent agents, such as robots, are increasingly deployed in real-world, human-centric environments. To foster appropriate human trust and meet legal and ethical standards, these agents must be able to explain their behavior. However, state-of-the-art agents are typically driven by black-box models like deep neural networks, limiting their interpretability. We propose a method for generating natural language explanations of agent behavior based only on observed states and actions – without access to the agent’s underlying model. Our approach learns a locally interpretable surrogate model of the agent’s behavior from observations, which then guides a large language model to generate plausible explanations with minimal hallucination. Empirical results show that our method produces explanations that are more comprehensible and correct than those from baselines, as judged by both language models and human evaluators. Furthermore, we find that participants in a user study more accurately predicted the agent’s future actions when given our explanations, suggesting improved understanding of agent behavior.
[LG-35] Fairness in Machine Learning-based Hand Load Estimation: A Case Study on Load Carriage Tasks
Link: https://arxiv.org/abs/2504.05610
Authors: Arafat Rahman, Sol Lim, Seokhyun Chung
Categories: Machine Learning (cs.LG)
Abstract:Predicting external hand load from sensor data is essential for ergonomic exposure assessments, as obtaining this information typically requires direct observation or supplementary data. While machine learning methods have been used to estimate external hand load from worker postures or force exertion data, our findings reveal systematic bias in these predictions due to individual differences such as age and biological sex. To explore this issue, we examined bias in hand load prediction by varying the sex ratio in the training dataset. We found substantial sex disparity in predictive performance, especially when the training dataset is more sex-imbalanced. To address this bias, we developed and evaluated a fair predictive model for hand load estimation that leverages a Variational Autoencoder (VAE) with feature disentanglement. This approach is designed to separate sex-agnostic and sex-specific latent features, minimizing feature overlap. The disentanglement capability enables the model to make predictions based solely on sex-agnostic features of motion patterns, ensuring fair prediction for both biological sexes. Our proposed fair algorithm outperformed conventional machine learning methods (e.g., Random Forests) in both fairness and predictive accuracy, achieving a lower mean absolute error (MAE) difference across male and female sets and improved fairness metrics such as statistical parity (SP) and positive and negative residual differences (PRD and NRD), even when trained on imbalanced sex datasets. These findings emphasize the importance of fairness-aware machine learning algorithms to prevent potential disadvantages in workplace health and safety for certain worker populations.
[LG-36] From Fairness to Truthfulness: Rethinking Data Valuation Design
Link: https://arxiv.org/abs/2504.05563
Authors: Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:As large language models increasingly rely on external data sources, fairly compensating data contributors has become a central concern. In this paper, we revisit the design of data markets through a game-theoretic lens, where data owners face private, heterogeneous costs for data sharing. We show that commonly used valuation methods–such as Leave-One-Out and Data Shapley–fail to ensure truthful reporting of these costs, leading to inefficient market outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We demonstrate that the Myerson payment is the minimal truthful payment mechanism, optimal from the buyer’s perspective, and that VCG and Myerson payments coincide in unconstrained allocation settings. Our findings highlight the importance of incorporating incentive compatibility into data valuation, paving the way for more robust and efficient data markets.
[LG-37] Federated Hierarchical Reinforcement Learning for Adaptive Traffic Signal Control
Link: https://arxiv.org/abs/2504.05553
Authors: Yongjie Fu, Lingyun Zhong, Zifan Li, Xuan Di
Categories: Machine Learning (cs.LG)
Abstract:Multi-agent reinforcement learning (MARL) has shown promise for adaptive traffic signal control (ATSC), enabling multiple intersections to coordinate signal timings in real time. However, in large-scale settings, MARL faces constraints due to extensive data sharing and communication requirements. Federated learning (FL) mitigates these challenges by training shared models without directly exchanging raw data, yet traditional FL methods such as FedAvg struggle with highly heterogeneous intersections. Different intersections exhibit varying traffic patterns, demands, and road structures, so performing FedAvg across all agents is inefficient. To address this gap, we propose Hierarchical Federated Reinforcement Learning (HFRL) for ATSC. HFRL employs clustering-based or optimization-based techniques to dynamically group intersections and perform FedAvg independently within groups of intersections with similar characteristics, enabling more effective coordination and scalability than standard FedAvg. Our experiments on synthetic and real-world traffic networks demonstrate that HFRL not only outperforms both decentralized and standard federated RL approaches but also identifies suitable grouping patterns based on network structure or traffic demand, resulting in a more robust framework for distributed, heterogeneous systems.
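The "FedAvg within groups" step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the group assignment is taken as given (the abstract mentions clustering-based or optimization-based grouping without detail), and the function name and the list-of-floats weight format are hypothetical.

```python
from collections import defaultdict

def grouped_fedavg(agent_weights, groups):
    """Average model weights only among agents assigned to the same group.
    agent_weights: {agent_id: [float, ...]}; groups: {agent_id: group_id}."""
    members = defaultdict(list)
    for agent, group in groups.items():
        members[group].append(agent_weights[agent])
    # Element-wise mean of the weight vectors inside each group.
    group_avg = {
        group: [sum(col) / len(ws) for col in zip(*ws)]
        for group, ws in members.items()
    }
    # Every agent receives a copy of its own group's averaged model.
    return {agent: list(group_avg[group]) for agent, group in groups.items()}

weights = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [10.0, 10.0]}
assignment = {"a": 0, "b": 0, "c": 1}  # hypothetical grouping of intersections
updated = grouped_fedavg(weights, assignment)
```

Here intersections "a" and "b" share one averaged model, while "c", which sits alone in its group, keeps its own weights.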
[LG-38] L3GS: Layered 3D Gaussian Splats for Efficient 3D Scene Delivery
Link: https://arxiv.org/abs/2504.05517
Authors: Yi-Zhen Tsai, Xuechen Zhang, Zheng Li, Jiasi Chen
Categories: Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract:Traditional 3D content representations include dense point clouds that consume large amounts of data and hence network bandwidth, while newer representations such as neural radiance fields suffer from poor frame rates due to their non-standard volumetric rendering pipeline. 3D Gaussian splats (3DGS) can be seen as a generalization of point clouds that offers the best of both worlds, with high visual quality and efficient rendering for real-time frame rates. However, delivering 3DGS scenes from a hosting server to client devices is still challenging due to high network data consumption (e.g., 1.5 GB for a single scene). The goal of this work is to create an efficient 3D content delivery framework that allows users to view high quality 3D scenes with 3DGS as the underlying data representation. The main contributions of the paper are: (1) Creating new layered 3DGS scenes for efficient delivery, (2) Scheduling algorithms to choose what splats to download at what time, and (3) Trace-driven experiments from users wearing virtual reality headsets to evaluate the visual quality and latency. Our system for Layered 3D Gaussian Splats delivery, L3GS, demonstrates high visual quality, achieving 16.9% higher average SSIM compared to baselines, and also works with other compressed 3DGS representations.
[LG-39] Neural network-enhanced integrators for simulating ordinary differential equations
Link: https://arxiv.org/abs/2504.05493
Authors: Amine Othmane, Kathrin Flaßkamp
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:Numerous applications necessitate the computation of numerical solutions to differential equations across a wide range of initial conditions and system parameters, which feeds the demand for efficient yet accurate numerical integration. This study proposes a neural network (NN) enhancement of classical numerical integrators. NNs are trained to learn integration errors, which are then used as additive correction terms in numerical schemes. The performance of these enhanced integrators is compared with well-established methods through numerical studies, with a particular emphasis on computational efficiency. Analytical properties are examined in terms of local errors and backward error analysis. Embedded Runge-Kutta schemes are then employed to develop enhanced integrators that mitigate generalization risk, ensuring that the neural network’s evaluation in previously unseen regions of the state space does not destabilize the integrator. It is guaranteed that the enhanced integrators perform at least as well as the desired classical Runge-Kutta schemes. The effectiveness of the proposed approaches is demonstrated through extensive numerical studies using a realistic model of a wind turbine, with parameters derived from the established simulation framework OpenFast.
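The additive-correction idea can be sketched with explicit Euler on the scalar test equation x' = -x. This is an illustration, not the paper's method: no neural network is trained here, and the `correction` function is a hand-written stand-in (the known leading error term of Euler for this equation) playing the role that a learned error model would play.

```python
import math

def euler_step(f, x, h):
    """Classical explicit Euler step."""
    return x + h * f(x)

def corrected_step(f, correction, x, h):
    """Euler step plus an additive correction term for the integration error."""
    return x + h * f(x) + correction(x, h)

def integrate(step, x0, h, n):
    x = x0
    for _ in range(n):
        x = step(x, h)
    return x

f = lambda x: -x
# Stand-in for a trained network: for x' = -x the leading local error of
# Euler is (h^2 / 2) * x, so we add exactly that term back.
correction = lambda x, h: 0.5 * h * h * x

h, n = 0.1, 10
exact = math.exp(-1.0)  # true solution at t = 1 for x(0) = 1
plain_err = abs(integrate(lambda x, h: euler_step(f, x, h), 1.0, h, n) - exact)
corrected_err = abs(
    integrate(lambda x, h: corrected_step(f, correction, x, h), 1.0, h, n) - exact
)
```

With the same step size, the corrected scheme lands much closer to the exact value than plain Euler, which is the efficiency argument for learning the error term.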
[LG-40] Optimal Bayesian Affine Estimator and Active Learning for the Wiener Model
Link: https://arxiv.org/abs/2504.05490
Authors: Sasan Vakili, Manuel Mazo Jr., Peyman Mohajerin Esfahani
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 23 pages, 4 figures
Abstract:This paper presents a Bayesian estimation framework for Wiener models, focusing on learning nonlinear output functions under known linear state dynamics. We derive a closed-form optimal affine estimator for the unknown parameters, characterized by the so-called “dynamic basis statistics (DBS).” Several features of the proposed estimator are studied, including Bayesian unbiasedness, closed-form posterior statistics, error monotonicity in trajectory length, and consistency condition (also known as persistent excitation). In the special case of Fourier basis functions, we demonstrate that the closed-form description is computationally available, as the Fourier DBS enjoys explicit expression. Furthermore, we identify an inherent inconsistency in single-trajectory measurements, regardless of input excitation. Leveraging the closed-form estimation error, we develop an active learning algorithm synthesizing input signals to minimize estimation error. Numerical experiments validate the efficacy of our approach, showing significant improvements over traditional regularized least-squares methods.
[LG-41] Graph Neural Networks for Enhancing Ensemble Forecasts of Extreme Rainfall ICLR2025 WWW
Link: https://arxiv.org/abs/2504.05471
Authors: Christopher Bülte, Sohir Maskey, Philipp Scholl, Jonas von Berg, Gitta Kutyniok
Categories: Machine Learning (cs.LG)
Comments: Accepted paper at ICLR 2025 - Tackling Climate Change with Machine Learning Workshop (this https URL)
Abstract:Climate change is increasing the occurrence of extreme precipitation events, threatening infrastructure, agriculture, and public safety. Ensemble prediction systems provide probabilistic forecasts but exhibit biases and difficulties in capturing extreme weather. While post-processing techniques aim to enhance forecast accuracy, they rarely focus on precipitation, which exhibits complex spatial dependencies and tail behavior. Our novel framework leverages graph neural networks to post-process ensemble forecasts, specifically modeling the extremes of the underlying distribution. This allows the model to capture spatial dependencies and improves forecast accuracy for extreme events, thus leading to more reliable forecasts and mitigating risks of extreme precipitation and flooding.
[LG-42] Intermediate Layer Classifiers for OOD generalization ICLR2025
Link: https://arxiv.org/abs/2504.05461
Authors: Arnas Uselis, Seong Joon Oh
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025
Abstract:Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network’s last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce Intermediate Layer Classifiers (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at this https URL
[LG-43] Handling Weather Uncertainty in Air Traffic Prediction through an Inverse Approach
Link: https://arxiv.org/abs/2504.05366
Authors: G. Lancia, D. Falanga, S. Alam, G. Lulli
Categories: Machine Learning (cs.LG)
Abstract:Adverse weather conditions, particularly convective phenomena, pose significant challenges to Air Traffic Management, often requiring real-time rerouting decisions that impact efficiency and safety. This study introduces a 3-D Gaussian Mixture Model to predict long lead-time flight trajectory changes, incorporating comprehensive weather and traffic data. Utilizing high-resolution meteorological datasets, including convective weather maps and wind data, alongside traffic records, the model demonstrates robust performance in forecasting reroutes up to 60 minutes. The novel 3-D Gaussian Mixture Model framework employs a probabilistic approach to capture uncertainty while providing accurate forecasts of altitude, latitude, and longitude. Extensive evaluation revealed a Mean Absolute Percentage Error below 0.02 across varying lead times, highlighting the model’s accuracy and scalability. By integrating explainability techniques such as the Vanilla Gradient algorithm, the study provides insights into feature contributions, showing that they contribute to improving Air Traffic Management strategies to mitigate weather-induced disruptions.
[LG-44] Deep Learning for Double Auction
Link: https://arxiv.org/abs/2504.05355
Authors: Jiayin Liu, Chenglong Zhang
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
Abstract:Auctions are important mechanisms extensively implemented in various markets, e.g., search engines’ keyword auctions, antique auctions, etc. Finding an optimal auction mechanism is extremely difficult due to the constraints of imperfect information, incentive compatibility (IC), and individual rationality (IR). In addition to the traditional economic methods, some recent work has attempted to find the optimal (single) auction using deep learning methods. Unlike those attempts focusing on single auctions, we develop deep learning methods for double auctions, where imperfect information exists on both the demand and supply sides. Previous attempts on single auctions do not directly apply to our setting, and they additionally suffer from limited generalizability, inefficiency in ensuring the constraints, and learning fluctuations. We design deep learning models that solve the more complex problem and also address these three limitations of previous models. Specifically, we achieve generalizability by leveraging a transformer-based architecture to model market participants as sequences for varying market sizes; we utilize the numerical features of the constraints and pre-treat them for a higher learning efficiency; we develop a gradient-conflict-elimination scheme to address the problem of learning fluctuation. Extensive experimental evaluations demonstrate the superiority of our approach over classical and machine learning baselines.
[LG-45] Structuring Multiple Simple Cycle Reservoirs with Particle Swarm Optimization
Link: https://arxiv.org/abs/2504.05347
Authors: Ziqiang Li, Robert Simon Fong, Kantaro Fujiwara, Kazuyuki Aihara, Gouhei Tanaka
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract:Reservoir Computing (RC) is a time-efficient computational paradigm derived from Recurrent Neural Networks (RNNs). The Simple Cycle Reservoir (SCR) is an RC model that stands out for its minimalistic design, offering extremely low construction complexity and proven capability of universally approximating time-invariant causal fading memory filters, even in the linear dynamics regime. This paper introduces Multiple Simple Cycle Reservoirs (MSCRs), a multi-reservoir framework that extends Echo State Networks (ESNs) by replacing a single large reservoir with multiple interconnected SCRs. We demonstrate that optimizing MSCR using Particle Swarm Optimization (PSO) outperforms existing multi-reservoir models, achieving competitive predictive performance with a lower-dimensional state space. By modeling interconnections as a weighted Directed Acyclic Graph (DAG), our approach enables flexible, task-specific network topology adaptation. Numerical simulations on three benchmark time-series prediction tasks confirm these advantages over rival algorithms. These findings highlight the potential of MSCR-PSO as a promising framework for optimizing multi-reservoir systems, providing a foundation for further advancements and applications of interconnected SCRs for developing efficient AI devices.
[LG-46] ZeroED: Hybrid Zero-shot Error Detection through Large Language Model Reasoning
Link: https://arxiv.org/abs/2504.05345
Authors: Wei Ni, Kaihang Zhang, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Yaoshu Wang, Jianwei Yin
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 12 pages
Abstract:Error detection (ED) in tabular data is crucial yet challenging due to diverse error types and the need for contextual understanding. Traditional ED methods often rely heavily on manual criteria and labels, making them labor-intensive. Large language models (LLMs) can minimize human effort but struggle with errors requiring a comprehensive understanding of data context. In this paper, we propose ZeroED, a novel hybrid zero-shot error detection framework, which combines LLM reasoning ability with the manual label-based ED pipeline. ZeroED operates in four steps, i.e., feature representation, error labeling, training data construction, and detector training. Initially, to enhance error distinction, ZeroED generates rich data representations using error reason-aware binary features, pre-trained embeddings, and statistical features. Then, ZeroED employs an LLM to label errors holistically through in-context learning, guided by a two-step reasoning process for detailed error detection guidelines. To reduce token costs, LLMs are applied only to representative data selected via clustering-based sampling. High-quality training data is constructed through in-cluster label propagation and LLM augmentation with verification. Finally, a classifier is trained to detect all errors. Extensive experiments on seven public datasets demonstrate that ZeroED substantially outperforms state-of-the-art methods, with up to a 30% improvement in F1 score and up to a 90% reduction in token cost.
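The cost-saving idea in the abstract above (label only cluster representatives with the LLM, then propagate labels within each cluster) can be illustrated with a small sketch. This is not the authors' implementation: the cluster assignments are assumed to come from an earlier clustering step, `llm_label` is a hypothetical stand-in for a real LLM call, and picking the first member as the representative is a simplification.

```python
from collections import defaultdict


def llm_label(value):
    """Hypothetical stand-in for an LLM call: flags non-numeric ages as errors."""
    return not value.isdigit()


def label_with_propagation(values, cluster_ids, labeler):
    """Label one representative per cluster, then copy its label to the whole cluster."""
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    labels = [None] * len(values)
    calls = 0
    for members in clusters.values():
        representative = members[0]      # simplest possible choice (assumption)
        is_error = labeler(values[representative])
        calls += 1                       # one labeler call per cluster, not per cell
        for idx in members:              # in-cluster label propagation
            labels[idx] = is_error
    return labels, calls


values = ["34", "27", "n/a", "unknown", "51"]
cluster_ids = [0, 0, 1, 1, 0]            # assumed output of a clustering step
labels, calls = label_with_propagation(values, cluster_ids, llm_label)
```

Here five cells are labeled with two labeler calls. The sketch stops before the final step described in the abstract, where the propagated labels serve as training data for a classifier.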
[LG-47] Impact of Price Inflation on Algorithmic Collusion Through Reinforcement Learning Agents
Link: https://arxiv.org/abs/2504.05335
Authors: Sebastián Tinoco, Andrés Abeliuk, Javier Ruiz del Solar
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Abstract:Algorithmic pricing is increasingly shaping market competition, raising concerns about its potential to compromise competitive dynamics. While prior work has shown that reinforcement learning (RL)-based pricing algorithms can lead to tacit collusion, less attention has been given to the role of macroeconomic factors in shaping these dynamics. This study examines the role of inflation in influencing algorithmic collusion within competitive markets. By incorporating inflation shocks into an RL-based pricing model, we analyze whether agents adapt their strategies to sustain supra-competitive profits. Our findings indicate that inflation reduces market competitiveness by fostering implicit coordination among agents, even without direct collusion. However, despite achieving sustained higher profitability, agents fail to develop robust punishment mechanisms to deter deviations from equilibrium strategies. The results suggest that inflation amplifies non-competitive dynamics in algorithmic pricing, emphasizing the need for regulatory oversight in markets where AI-driven pricing is prevalent.
[LG-48] Document clustering with evolved multiword search queries
Link: https://arxiv.org/abs/2504.05320
Authors: Laurence Hirsch, Robin Hirsch, Bayode Ogunleye
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages
Abstract:Text clustering holds significant value across various domains due to its ability to identify patterns and group related information. Current approaches, which rely heavily on a computed similarity measure between documents, are often limited in accuracy and interpretability. We present a novel approach to the problem based on a set of evolved search queries. Clusters are formed as the set of documents matched by a single search query in the set of queries. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where queries contain more than one word, they are interpreted disjunctively. We have found it useful to assign one word to be the root and constrain the query construction such that the set of documents returned by any additional query word intersects with the set returned by the root word. Some documents in a collection are not returned by any of the search queries in a set, so once the search query evolution is completed, a second stage is performed whereby a KNN algorithm is applied to assign all unassigned documents to their nearest cluster. We describe the method and present results using 8 text datasets, comparing effectiveness with well-known existing algorithms. We note that, as well as achieving the highest accuracy on these datasets, the search query format provides the qualitative benefits of being interpretable and modifiable whilst providing a causal explanation of cluster construction.
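A minimal sketch of the query mechanics described in the abstract above, under stated assumptions: documents are modeled as sets of words, the first word of a query is treated as its root, "intersects" is read as "shares at least one document", and the `fitness` function (documents matched by exactly one query minus documents matched by several) is an illustrative scoring, not the paper's objective. The evolutionary search and the KNN second stage are omitted.

```python
def docs_with(word, docs):
    """Indices of the documents that contain the word."""
    return {i for i, words in enumerate(docs) if word in words}


def run_query(query, docs):
    """Disjunctive query: a document matches if it contains any query word."""
    hits = set()
    for word in query:
        hits |= docs_with(word, docs)
    return hits


def is_valid(query, docs):
    """Root constraint: every extra word must share a document with the root (query[0])."""
    root_hits = docs_with(query[0], docs)
    return all(docs_with(word, docs) & root_hits for word in query[1:])


def fitness(queries, docs):
    """Assumed score: documents hit by exactly one query minus documents hit by several."""
    counts = [0] * len(docs)
    for query in queries:
        for i in run_query(query, docs):
            counts[i] += 1
    unique = sum(1 for c in counts if c == 1)
    overlap = sum(1 for c in counts if c > 1)
    return unique - overlap


docs = [
    {"goal", "match", "team"},
    {"team", "league", "coach"},
    {"stock", "market", "shares"},
    {"market", "bank", "rates"},
    {"team", "market"},          # matched by both queries below
]
queries = [("team", "goal"), ("market", "bank")]
score = fitness(queries, docs)
```

Each query doubles as a readable description of its cluster, which is the interpretability benefit the abstract points to.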
[LG-49] A Systematic Survey on Federated Sequential Recommendation
Link: https://arxiv.org/abs/2504.05313
Authors: Yichen Li, Qiyu Qin, Gaoyang Zhu, Wenchao Xu, Haozhao Wang, Yuhua Li, Rui Zhang, Ruixuan Li
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract:Sequential recommendation is an advanced recommendation technique that utilizes the sequence of user behaviors to generate personalized suggestions by modeling the temporal dependencies and patterns in user preferences. However, it requires a server to centrally collect users’ data, which poses a threat to the data privacy of different users. In recent years, federated learning has emerged as a distributed architecture that allows participants to train a global model while keeping their private data local. This survey pioneers Federated Sequential Recommendation (FedSR), where each user joins as a participant in federated training to achieve a recommendation service that balances data privacy and model performance. We begin with an introduction to the background and unique challenges of FedSR. Then, we review existing solutions from two levels, each of which includes two specific techniques. Additionally, we discuss the critical challenges and future research directions in FedSR.
[LG-50] Cache-Aware Reinforcement Learning in Large-Scale Recommender Systems
Link: https://arxiv.org/abs/2404.14961
Authors: Xiaoshuang Chen, Gengrui Zhang, Yao Wang, Yulin Wu, Shuo Su, Kaiqiao Zhan, Ben Wang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 8 pages, 8 figures
Abstract:Modern large-scale recommender systems are built upon computation-intensive infrastructure and usually suffer from a huge difference in traffic between peak and off-peak periods. In peak periods, it is challenging to perform real-time computation for each request due to the limited budget of computational resources. The recommendation with a cache is a solution to this problem, where a user-wise result cache is used to provide recommendations when the recommender system cannot afford a real-time computation. However, the cached recommendations are usually suboptimal compared to real-time computation, and it is challenging to determine the items in the cache for each user. In this paper, we provide a cache-aware reinforcement learning (CARL) method to jointly optimize the recommendation by real-time computation and by the cache. We formulate the problem as a Markov decision process with user states and a cache state, where the cache state represents whether the recommender system performs recommendations by real-time computation or by the cache. The computational load of the recommender system determines the cache state. We perform reinforcement learning based on such a model to improve user engagement over multiple requests. Moreover, we show that the cache will introduce a challenge called critic dependency, which deteriorates the performance of reinforcement learning. To tackle this challenge, we propose an eigenfunction learning (EL) method to learn independent critics for CARL. Experiments show that CARL can significantly improve the users’ engagement when considering the result cache. CARL has been fully launched in Kwai app, serving over 100 million users.
[LG-51] Fractal and Regular Geometry of Deep Neural Networks
Link: https://arxiv.org/abs/2504.06250
Authors: Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna
Categories: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We study the geometric properties of random neural networks by investigating the boundary volumes of their excursion sets for different activation functions, as the depth increases. More specifically, we show that, for activations which are not very regular (e.g., the Heaviside step function), the boundary volumes exhibit fractal behavior, with their Hausdorff dimension monotonically increasing with the depth. On the other hand, for activations which are more regular (e.g., ReLU, logistic and $\tanh$), as the depth increases, the expected boundary volumes can either converge to zero, remain constant or diverge exponentially, depending on a single spectral parameter which can be easily computed. Our theoretical results are confirmed in some numerical experiments based on Monte Carlo simulations.
[LG-52] Accurate Ab-initio Neural-network Solutions to Large-Scale Electronic Structure Problems
Link: https://arxiv.org/abs/2504.06087
Authors: Michael Scherbela, Nicholas Gao, Philipp Grohs, Stephan Günnemann
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 13 pages, 5 figures + 9 pages supplementary information
Abstract:We present finite-range embeddings (FiRE), a novel wave function ansatz for accurate large-scale ab-initio electronic structure calculations. Compared to contemporary neural-network wave functions, FiRE reduces the asymptotic complexity of neural-network variational Monte Carlo (NN-VMC) by $\sim n_\text{el}$, the number of electrons. By restricting electron-electron interactions within the neural network, FiRE accelerates all key operations (sampling, pseudopotentials, and Laplacian computations), resulting in a real-world $10\times$ acceleration in now-feasible 180-electron calculations. We validate our method’s accuracy on various challenging systems, including biochemical compounds, conjugated hydrocarbons, and organometallic compounds. On these systems, FiRE’s energies are consistently within chemical accuracy of the most reliable data, including experiments, even in cases where high-accuracy methods such as CCSD(T), AFQMC, or contemporary NN-VMC fall short. With these improvements in both runtime and accuracy, FiRE represents a new ‘gold-standard’ method for fast and accurate large-scale ab-initio calculations, potentially enabling new computational studies in fields like quantum chemistry, solid-state physics, and material design.
[LG-53] Actuarial Learning for Pension Fund Mortality Forecasting
Link: https://arxiv.org/abs/2504.05881
Authors: Eduardo Fraga L. de Melo, Helton Graziadei, Rodrigo Targino
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 27 pages, 12 figures
Abstract:For the assessment of the financial soundness of a pension fund, it is necessary to take into account mortality forecasting so that longevity risk is consistently incorporated into future cash flows. In this article, we employ machine learning models applied to actuarial science (actuarial learning) to make mortality predictions for a relevant sample of pension funds’ participants. Actuarial learning represents an emerging field that involves the application of machine learning (ML) and artificial intelligence (AI) techniques in actuarial science. This encompasses the use of algorithms and computational models, such as regression trees, random forest, boosting, XGBoost, CatBoost, and neural networks (e.g., FNN, LSTM, and MHA), to analyze large sets of actuarial data. Our results indicate that some ML/AI algorithms present competitive out-of-sample performance when compared to the classical Lee-Carter model. This may indicate interesting alternatives for consistent liability evaluation and effective pension fund risk management.
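For readers unfamiliar with the baseline named in the abstract above, the classical Lee-Carter model is usually written as follows. The notation is the common textbook one and does not come from the abstract: $m_{x,t}$ is the central death rate at age $x$ in year $t$, $a_x$ is the average log-mortality at age $x$, $k_t$ is a period index, and $b_x$ is the age-specific sensitivity to that index.

```latex
\ln m_{x,t} = a_x + b_x k_t + \varepsilon_{x,t},
\qquad \sum_x b_x = 1, \qquad \sum_t k_t = 0 .
% Forecasts are typically obtained by modelling k_t as a random walk with drift:
k_t = k_{t-1} + d + e_t .
```

The two sum constraints are the usual identifiability conditions; without them the parameters are only determined up to a rescaling and shift.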
[LG-54] Improved Inference of Inverse Ising Problems under Missing Observations in Restricted Boltzmann Machines
Link: https://arxiv.org/abs/2504.05643
Authors: Kaiji Sekimoto, Muneki Yasuda
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Abstract:Restricted Boltzmann machines (RBMs) are energy-based models analogous to the Ising model and are widely applied in statistical machine learning. The standard inverse Ising problem with a complete dataset requires computing both data and model expectations and is computationally challenging because model expectations have a combinatorial explosion. Furthermore, in many applications, the available datasets are partially incomplete, making it difficult to compute even data expectations. In this study, we propose an approximation framework for these expectations in practical inverse Ising problems that integrates mean-field approximation or persistent contrastive divergence to generate refined initial points and spatial Monte Carlo integration to enhance estimator accuracy. We demonstrate that the proposed method effectively and accurately tunes the model parameters in comparison to the conventional method.
[LG-55] Cross-functional transferability in universal machine learning interatomic potentials
Link: https://arxiv.org/abs/2504.05565
Authors: Xu Huang, Bowen Deng, Peichen Zhong, Aaron D. Kaplan, Kristin A. Persson, Gerbrand Ceder
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Abstract:The rapid development of universal machine learning interatomic potentials (uMLIPs) has demonstrated the possibility for generalizable learning of the universal potential energy surface. In principle, the accuracy of uMLIPs can be further improved by bridging the model from lower-fidelity datasets to high-fidelity ones. In this work, we analyze the challenge of this transfer learning problem within the CHGNet framework. We show that significant energy scale shifts and poor correlations between GGA and r$^2$SCAN pose challenges to cross-functional data transferability in uMLIPs. By benchmarking different transfer learning approaches on the MP-r$^2$SCAN dataset of 0.24 million structures, we demonstrate the importance of elemental energy referencing in the transfer learning of uMLIPs. By comparing the scaling law with and without the pre-training on a low-fidelity dataset, we show that significant data efficiency can still be achieved through transfer learning, even with a target dataset of sub-million structures. We highlight the importance of proper transfer learning and multi-fidelity learning in creating next-generation uMLIPs on high-fidelity data.
[LG-56] Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces
Link: https://arxiv.org/abs/2504.05534
Authors: Arnau Marin-Llobet, Arnau Manasanch, Sergio Sanchez-Manso, Lluc Tresserras, Xinhe Zhang, Yining Hua, Hao Zhao, Melody Torao-Angosto, Maria V Sanchez-Vives, Leonardo Dalla Porta
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: Preprint
Abstract:This study investigates the application of Riemannian geometry-based methods for brain decoding using invasive electrophysiological recordings. Although such methods have previously been employed with non-invasive recordings, their utility for invasive datasets, which are typically smaller and scarcer, remains less explored. Here, we propose a Minimum Distance to Mean (MDM) classifier using a Riemannian geometry approach based on covariance matrices extracted from intracortical Local Field Potential (LFP) recordings across various regions during different brain state dynamics. For benchmarking, we evaluated the performance of our approach against Convolutional Neural Networks (CNNs) and Euclidean MDM classifiers. Our results indicate that the Riemannian geometry-based classification not only achieves a superior mean F1 macro-averaged score across different channel configurations but also requires up to two orders of magnitude less computational training time. Additionally, the geometric framework reveals distinct spatial contributions of brain regions across varying brain states, suggesting a state-dependent organization that traditional time series-based methods often fail to capture. Our findings align with previous studies supporting the efficacy of geometry-based methods and extend their application to invasive brain recordings, highlighting their potential for broader clinical use, such as brain-computer interface applications.
[LG-57] Quantum Mechanics and Neural Networks
Link: https://arxiv.org/abs/2504.05462
Authors: Christian Ferko, James Halverson
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Probability (math.PR); Quantum Physics (quant-ph)
Comments: 67 pages, 8 figures
Abstract:We demonstrate that any Euclidean-time quantum mechanical theory may be represented as a neural network, ensured by the Kosambi-Karhunen-Loève theorem, mean-square path continuity, and finite two-point functions. The additional constraint of reflection positivity, which is related to unitarity, may be achieved by a number of mechanisms, such as imposing neural network parameter space splitting or the Markov property. Non-differentiability of the networks is related to the appearance of non-trivial commutators. Neural networks acting on Markov processes are no longer Markov, but still reflection positive, which facilitates the definition of deep neural network quantum systems. We illustrate these principles in several examples using numerical implementations, recovering classic quantum mechanical results such as Heisenberg uncertainty, non-trivial commutators, and the spectrum.
[LG-58] Survey on Algorithms for multi-index models
Link: https://arxiv.org/abs/2504.05426
Authors: Joan Bruna, Daniel Hsu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity. In many cases, a gap is observed between the sample complexity of the best known computationally efficient methods and the information-theoretical minimum. We also review algorithms based on estimating the span of gradients using nonparametric methods, and algorithms based on fitting neural networks using gradient descent.
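As background for the abstract above, a multi-index model is commonly formulated as below. The symbols are illustrative choices rather than the survey's notation: $x \in \mathbb{R}^d$ is the input, $U \in \mathbb{R}^{d \times k}$ with $k \ll d$ spans the index space, and $g$ is an unknown link function.

```latex
\mathbb{E}[\, y \mid x \,] = g\!\left(U^{\top} x\right),
\qquad x \sim \mathcal{N}(0, I_d), \qquad U \in \mathbb{R}^{d \times k},\; k \ll d .
% The estimation target is the index space \operatorname{span}(U),
% i.e. the column space of U, not U itself.
```

Only the subspace is identifiable: replacing $U$ by $UA$ for an invertible $A$ and adjusting $g$ accordingly gives the same model. This is why the abstract speaks of estimating the index space. The Gaussian input assumption matches the "Gaussian space" setting the abstract mentions.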
[LG-59] Quantum Adaptive Self-Attention for Quantum Transformer Models
Link: https://arxiv.org/abs/2504.05336
Authors: Chi-Sheng Chen, En-Jui Kuo
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Transformer models have revolutionized sequential learning across various domains, yet their self-attention mechanism incurs quadratic computational cost, posing limitations for real-time and resource-constrained tasks. To address this, we propose Quantum Adaptive Self-Attention (QASA), a novel hybrid architecture that enhances classical Transformer models with a quantum attention mechanism. QASA replaces dot-product attention with a parameterized quantum circuit (PQC) that adaptively captures inter-token relationships in the quantum Hilbert space. Additionally, a residual quantum projection module is introduced before the feedforward network to further refine temporal features. Our design retains classical efficiency in earlier layers while injecting quantum expressiveness in the final encoder block, ensuring compatibility with current NISQ hardware. Experiments on synthetic time-series tasks demonstrate that QASA achieves faster convergence and superior generalization compared to both standard Transformers and reduced classical variants. Preliminary complexity analysis suggests potential quantum advantages in gradient computation, opening new avenues for efficient quantum deep learning models.
Information Retrieval
[IR-0] Knowledge Graph Completion with Relation-Aware Anchor Enhancement
Link: https://arxiv.org/abs/2504.06129
Authors: Duanyang Yuan, Sihang Zhou, Xiaoshu Chen, Dong Wang, Ke Liang, Xinwang Liu, Jian Huang
Categories: Information Retrieval (cs.IR)
Abstract:Text-based knowledge graph completion methods take advantage of pre-trained language models (PLMs) to enhance intrinsic semantic connections of raw triplets with detailed text descriptions. Typical methods in this branch map an input query (textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, respectively, and then maximize the probability of valid triples. These methods are achieving promising performance and attracting increasing attention owing to the rapid development of large language models. According to the property of the language models, the more related and specific context information the input query provides, the more discriminative the resultant embedding will be. In this paper, through observation and validation, we find a neglected fact: the relation-aware neighbors of the head entities in queries can act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, in our method, to provide a reference for what the target entity might be like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching. The results of our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that by integrating our relation-aware anchor enhancement strategy, the performance of current leading methods can be notably enhanced without substantial modifications.
[IR-1] Widening the Role of Group Recommender Systems with CAJO SIGIR
Link: https://arxiv.org/abs/2504.05934
Authors: Francesco Ricci, Amra Delić
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: 10.5 pages opinion paper + references, submitted to ACM SIGIR Forum for the upcoming issue in June 2025
Abstract:Group Recommender Systems (GRSs) have been studied and developed for more than twenty years. However, their application and usage have not grown. They can even be labeled as failures, if compared to the very successful and common recommender systems (RSs) used on all the major ecommerce and social platforms. As a result, the RSs that we all use now target only individual users, aiming at choosing an item exclusively for themselves; no choice support is provided to groups trying to select a service, a product, an experience, or a person that serves all the group members equally well. In this opinion article, we discuss why the success of group recommender systems is lagging, and we propose a research program centered on the analysis and development of new forms of collaboration between humans and intelligent systems. We define a set of roles, named CAJO, that GRSs should play in order to become more useful tools for group decision making.
[IR-2] Why is Normalization Necessary for Linear Recommenders? SIGIR2025
Link: https://arxiv.org/abs/2504.05805
Authors: Seongmin Park, Mincheol Yoon, Hye-young Kim, Jongwuk Lee
Categories: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025
Abstract:Despite their simplicity, linear autoencoder (LAE)-based models have shown comparable or even better performance with faster inference speed than neural recommender models. However, LAEs face two critical challenges: (i) popularity bias, which tends to recommend popular items, and (ii) neighborhood bias, which overly focuses on capturing local item correlations. To address these issues, this paper first analyzes the effect of two existing normalization methods for LAEs, i.e., random-walk and symmetric normalization. Our theoretical analysis reveals that normalization highly affects the degree of popularity and neighborhood biases among items. Inspired by this analysis, we propose a versatile normalization solution, called Data-Adaptive Normalization (DAN), which flexibly controls the popularity and neighborhood biases by adjusting item- and user-side normalization to align with unique dataset characteristics. Owing to its model-agnostic property, DAN can be easily applied to various LAE-based models. Experimental results show that DAN-equipped LAEs consistently improve existing LAE-based models across six benchmark datasets, with significant gains of up to 128.57% and 12.36% for long-tail items and unbiased evaluations, respectively. Refer to our code in this https URL.
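The two existing normalization schemes named in the abstract above can be illustrated on a toy item-item co-occurrence matrix. This sketch shows the generic graph-style definitions (random-walk: $D^{-1}G$; symmetric: $D^{-1/2} G D^{-1/2}$, with $D$ the diagonal matrix of row sums). Exactly how the paper applies them inside LAE models, and the form of DAN itself, are not given in the abstract and are not reproduced here. The matrix `X` and all function names are illustrative.

```python
def gram(X):
    """Item-item co-occurrence matrix G = X^T X for a binary user-item matrix X."""
    n_items = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n_items)]
            for i in range(n_items)]


def random_walk_norm(G):
    """D^{-1} G: divide each row by its row sum."""
    return [[v / sum(row) for v in row] for row in G]


def symmetric_norm(G):
    """D^{-1/2} G D^{-1/2}: divide entry (i, j) by sqrt(d_i * d_j)."""
    d = [sum(row) for row in G]
    n = len(G)
    return [[G[i][j] / (d[i] * d[j]) ** 0.5 for j in range(n)] for i in range(n)]


# Rows are users, columns are items; item 0 is the most popular.
X = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
G = gram(X)
rw = random_walk_norm(G)
sym = symmetric_norm(G)
```

In this toy case the raw weight linking the long-tail item 2 to the popular item 0 is 1, against 4 for the popular item's self-weight; both normalizations rescale entries by item degree, which is the lever the abstract connects to popularity bias.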
[IR-3] StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Link: https://arxiv.org/abs/2504.05804
Authors: Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
Categories: Information Retrieval (cs.IR)
Abstract:The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial ranking attack that manipulates LLM-driven product recommendation systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs): adversarial text sequences embedded within product descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target products while avoiding explicit manipulation traces that can be easily detected. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven recommendation systems.
[IR-4] Unified Generative Search and Recommendation
Link: https://arxiv.org/abs/2504.05730
Authors: Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Enyun Yu
Categories: Information Retrieval (cs.IR)
Abstract:Modern commercial platforms typically offer both search and recommendation functionalities to serve diverse user needs, making joint modeling of these tasks an appealing direction. While prior work has shown that integrating search and recommendation can be mutually beneficial, it also reveals a performance trade-off: enhancements in one task often come at the expense of the other. This challenge arises from their distinct information requirements: search emphasizes semantic relevance between queries and items, whereas recommendation depends more on collaborative signals among users and items. Effectively addressing this trade-off requires tackling two key problems: (1) integrating both semantic and collaborative signals into item representations, and (2) guiding the model to distinguish and adapt to the unique demands of search and recommendation. The emergence of generative retrieval with Large Language Models (LLMs) presents new possibilities. This paradigm encodes items as identifiers and frames both search and recommendation as sequential generation tasks, offering the flexibility to leverage multiple identifiers and task-specific prompts. In light of this, we introduce GenSAR, a unified generative framework for balanced search and recommendation. Our approach designs dual-purpose identifiers and tailored training strategies to incorporate complementary signals and align with task-specific objectives. Experiments on both public and commercial datasets demonstrate that GenSAR effectively reduces the trade-off and achieves state-of-the-art performance on both tasks.
[IR-5] xMTF: A Formula-Free Model for Reinforcement-Learning-Based Multi-Task Fusion in Recommender Systems WWW2025
Link: https://arxiv.org/abs/2504.05669
Authors: Yang Cao, Changhao Zhang, Xiaoshuang Chen, Kaiqiao Zhan, Ben Wang
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 8 figures; accepted at WWW 2025
Abstract:Recommender systems need to optimize various types of user feedback, e.g., clicks, likes, and shares. A typical recommender system handling multiple types of feedback has two components: a multi-task learning (MTL) module, predicting feedback such as click-through rate and like rate; and a multi-task fusion (MTF) module, integrating these predictions into a single score for item ranking. MTF is essential for ensuring user satisfaction, as it directly influences recommendation outcomes. Recently, reinforcement learning (RL) has been applied to MTF tasks to improve long-term user satisfaction. However, existing RL-based MTF methods are formula-based methods, which only adjust limited coefficients within pre-defined formulas. The pre-defined formulas restrict the RL search space and become a bottleneck for MTF. To overcome this, we propose a formula-free MTF framework. We demonstrate that any suitable fusion function can be expressed as a composition of single-variable monotonic functions, as per the Sprecher Representation Theorem. Leveraging this, we introduce a novel learnable monotonic fusion cell (MFC) to replace pre-defined formulas. We call this new MFC-based model eXtreme MTF (xMTF). Furthermore, we employ a two-stage hybrid (TSH) learning strategy to train xMTF effectively. By expanding the MTF search space, xMTF outperforms existing methods in extensive offline and online experiments.
[IR-6] Simplifying Data Integration: SLM-Driven Systems for Unified Semantic Queries Across Heterogeneous Databases
Link: https://arxiv.org/abs/2504.05634
Authors: Teng Lin
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Abstract:The integration of heterogeneous databases into a unified querying framework remains a critical challenge, particularly in resource-constrained environments. This paper presents a novel Small Language Model (SLM)-driven system that synergizes advancements in lightweight Retrieval-Augmented Generation (RAG) and semantic-aware data structuring to enable efficient, accurate, and scalable query resolution across diverse data formats. By integrating MiniRAG’s semantic-aware heterogeneous graph indexing and topology-enhanced retrieval with SLM-powered structured data extraction, our system addresses the limitations of traditional methods in handling Multi-Entity Question Answering (Multi-Entity QA) and complex semantic queries. Experimental results demonstrate superior performance in accuracy and efficiency, while the introduction of semantic entropy as an unsupervised evaluation metric provides robust insights into model uncertainty. This work pioneers a cost-effective, domain-agnostic solution for next-generation database systems.
[IR-7] Stratified Expert Cloning with Adaptive Selection for User Retention in Large-Scale Recommender Systems
Link: https://arxiv.org/abs/2504.05628
Authors: Chengzhi Lin, Annan Xie, Shuchang Liu, Wuhong Wang, Chuyuan Wang, Yongqi Liu
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 8 figures, 4 tables
Abstract:User retention has emerged as a critical challenge in large-scale recommender systems, significantly impacting the long-term success of online platforms. Existing methods often focus on short-term engagement metrics, failing to capture the complex dynamics of user preferences and behaviors over extended periods. While reinforcement learning (RL) approaches have shown promise in optimizing long-term rewards, they face difficulties in credit assignment, sample efficiency, and exploration when applied to the user retention problem. In this work, we propose Stratified Expert Cloning (SEC), a novel imitation learning framework that effectively leverages abundant logged data from high-retention users to learn robust recommendation policies. SEC introduces three key innovations: 1) a multi-level expert stratification strategy that captures the nuances in expert user behaviors at different retention levels; 2) an adaptive expert selection mechanism that dynamically assigns users to the most suitable policy based on their current state and historical retention level; and 3) an action entropy regularization technique that promotes recommendation diversity and mitigates the risk of policy collapse. Through extensive offline experiments and online A/B tests on two major video platforms, Kuaishou and Kuaishou Lite, with hundreds of millions of daily active users, we demonstrate SEC’s significant improvements over state-of-the-art methods in user retention. The results demonstrate significant improvements in user retention, with cumulative lifts of 0.098% and 0.122% in active days on Kuaishou and Kuaishou Lite respectively, additionally bringing tens of thousands of daily active users to each platform.
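The third ingredient listed in the abstract above, action entropy regularization, is often implemented as an entropy bonus subtracted from an imitation loss. The sketch below shows that common form only. The abstract does not give SEC's actual objective, so the loss shape, the weight `beta`, and the function names are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_loss(probs, expert_action, beta=0.1):
    """Behavior-cloning loss minus an entropy bonus (assumed form, not SEC's)."""
    imitation = -math.log(probs[expert_action])   # negative log-likelihood of the expert action
    return imitation - beta * entropy(probs)


collapsed = [1.0, 0.0, 0.0, 0.0]   # policy that always recommends item 0
diverse = [0.7, 0.1, 0.1, 0.1]     # policy that keeps some probability elsewhere

h_collapsed = entropy(collapsed)
h_diverse = entropy(diverse)
```

A collapsed policy has zero entropy and earns no bonus, so a positive `beta` pushes the learned policy toward keeping some probability mass on alternative recommendations.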
[IR-8] User Feedback Alignment for LLM-powered Exploration in Large-scale Recommendation Systems
Link: https://arxiv.org/abs/2504.05522
Authors: Jianling Wang,Yifan Liu,Yinghao Sun,Xuejian Ma,Yueqi Wang,He Ma,Steven Su,Ed H. Chi,Minmin Chen,Lichan Hong,Ningren Han,Haokai Lu
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Exploration, the act of broadening user experiences beyond their established preferences, is challenging in large-scale recommendation systems due to feedback loops and limited signals on user exploration patterns. Large Language Models (LLMs) offer potential by leveraging their world knowledge to recommend novel content outside these loops. A key challenge is aligning LLMs with user preferences while preserving their knowledge and reasoning. While using LLMs to plan for the next novel user interest, this paper introduces a novel approach combining hierarchical planning with LLM inference-time scaling to improve recommendation relevancy without compromising novelty. We decouple novelty and user-alignment, training separate LLMs for each objective. We then scale up the novelty-focused LLM’s inference and select the best-of-n predictions using the user-aligned LLM. Live experiments demonstrate efficacy, showing significant gains in both user satisfaction (measured by watch activity and active user counts) and exploration diversity.
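The best-of-n step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' system: the candidate list stands in for n samples from the novelty-focused LLM, and the `toy_alignment` scores stand in for the user-aligned LLM. Both are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], alignment_score: Callable[[str], float]) -> str:
    """Return the candidate that the user-aligned scorer rates highest."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    # max() returns the first maximal element, so ties keep sampling order.
    return max(candidates, key=alignment_score)


# Hypothetical stand-in for n samples drawn from a novelty-focused model.
novel_candidates = ["urban gardening", "free diving", "synth repair", "bread baking"]

# Hypothetical stand-in for scores produced by a user-aligned model.
toy_alignment = {
    "urban gardening": 0.41,
    "free diving": 0.12,
    "synth repair": 0.77,
    "bread baking": 0.58,
}

chosen = best_of_n(novel_candidates, lambda c: toy_alignment[c])
```

Keeping generation and selection in separate components mirrors the decoupling the abstract describes: sampling more candidates scales the novelty side, while the scorer alone decides relevance.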
[IR-9] Balancing Benefits and Risks: RL Approaches for Addiction-Aware Social Media Recommenders
Link: https://arxiv.org/abs/2504.05322
Authors: Luca Bolis,Stefano Livella,Sabrina Patania,Dimitri Ognibene,Matteo Papini,Kenji Morita
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: 5 pages, presented at RLDM 2025
Abstract:Social media platforms provide valuable opportunities for users to gather information, interact with friends, and enjoy entertainment. However, their addictive potential poses significant challenges, including overuse and negative psychological or behavioral impacts [4, 2, 8]. This study explores strategies to mitigate compulsive social media usage while preserving its benefits and ensuring economic sustainability, focusing on recommenders that promote balanced usage. We analyze user behaviors arising from intrinsic diversities and environmental interactions, offering insights for next-generation social media recommenders that prioritize well-being. Specifically, we examine the temporal predictability of overuse and addiction using measures available to recommenders, aiming to inform mechanisms that prevent addiction while avoiding user disengagement [7]. Building on RL-based computational frameworks for addiction modelling [6], our study introduces: (1) a recommender system adapting to user preferences, introducing non-stationary and non-Markovian dynamics; (2) differentiated state representations for users and recommenders to capture nuanced interactions; (3) distinct usage conditions (light and heavy use), addressing RL’s limitations in distinguishing prolonged from healthy engagement; and (4) complexity in overuse impacts, highlighting their role in user adaptation [7]. Simulations demonstrate how model-based (MB) and model-free (MF) decision-making interact with environmental dynamics to influence user behavior and addiction. Results reveal the significant role of recommender systems in shaping addiction tendencies or fostering healthier engagement. These findings support ethical, adaptive recommender design, advancing sustainable social media ecosystems [9, 1].
Keywords: multi-agent systems, recommender systems, addiction, social media
[IR-10] Coherency Improved Explainable Recommendation via Large Language Model AAAI2025
Link: https://arxiv.org/abs/2504.05315
Authors: Shijie Liu,Ruixing Ding,Weihai Lu,Jun Wang,Mo Yu,Xiaoming Shi,Wei Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by AAAI 2025, with 9 pages
Abstract:Explainable recommender systems are designed to elucidate the explanation behind each recommendation, enabling users to comprehend the underlying logic. Previous works perform rating prediction and explanation generation in a multi-task manner. However, these works suffer from incoherence between predicted ratings and explanations. To address the issue, we propose a novel framework that employs a large language model (LLM) to generate a rating, transforms it into a rating vector, and finally generates an explanation based on the rating vector and user-item information. Moreover, we propose utilizing publicly available LLMs and pre-trained sentiment analysis models to automatically evaluate the coherence without human annotations. Extensive experimental results on three datasets of explainable recommendation show that the proposed framework is effective, outperforming state-of-the-art baselines with improvements of 7.3% in explainability and 4.4% in text quality.
[IR-11] GRIT: Graph-based Recall Improvement for Task-oriented E-commerce Queries WWW2025
Link: https://arxiv.org/abs/2504.05310
Authors: Hrishikesh Kulkarni,Surya Kallumadi,Sean MacAvaney,Nazli Goharian,Ophir Frieder
Subjects: Information Retrieval (cs.IR)
Comments: LLM4ECommerce at WWW 2025
Abstract:Many e-commerce search pipelines have four stages, namely: retrieval, filtering, ranking, and personalized-reranking. The retrieval stage must be efficient and yield high recall because relevant products missed in the first stage cannot be considered in later stages. This is challenging for task-oriented queries (queries with actionable intent) where user requirements are contextually intensive and difficult to understand. To foster research in the domain of e-commerce, we created a novel benchmark for Task-oriented Queries (TQE) using an LLM, which operates over the existing ESCI product search dataset. Furthermore, we propose a novel method ‘Graph-based Recall Improvement for Task-oriented queries’ (GRIT) to address the most crucial first-stage recall improvement needs. GRIT leads to robust and statistically significant improvements over state-of-the-art lexical, dense, and learned-sparse baselines. Our system supports both traditional and task-oriented e-commerce queries, yielding up to 6.3% recall improvement. In the indexing stage, GRIT first builds a product-product similarity graph using user clicks or manual annotation data. During retrieval, it locates neighbors with higher contextual and action relevance and prioritizes them over the less relevant candidates from the initial retrieval. This leads to a more comprehensive and relevant first-stage result set that improves overall system recall. Overall, GRIT leverages the locality relationships and contextual insights provided by the graph using neighboring nodes to enrich the first-stage retrieval results. We show that the method is not only robust across all introduced parameters, but also works effectively on top of a variety of first-stage retrieval methods.
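The retrieval-time idea in the abstract (pull in graph neighbors of strong candidates and let them displace weaker initial results) can be sketched roughly as below. This is an illustrative simplification under stated assumptions, not the GRIT algorithm itself: the function `graph_expand`, the `seeds` and `budget` parameters, and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def graph_expand(
    initial: Sequence[str],
    graph: Dict[str, List[str]],
    score: Callable[[str], float],
    budget: int,
    seeds: int = 2,
) -> List[str]:
    """Merge first-stage candidates with graph neighbors of the top `seeds`
    candidates, then keep the `budget` highest-scoring products."""
    pool = list(initial)
    seen = set(pool)
    for product in initial[:seeds]:
        for neighbor in graph.get(product, []):
            if neighbor not in seen:
                seen.add(neighbor)
                pool.append(neighbor)
    # list.sort is stable, so tied products keep their first-stage order.
    pool.sort(key=score, reverse=True)
    return pool[:budget]


# Hypothetical product-product similarity graph (e.g. built from click data).
toy_graph = {"p1": ["p4"], "p2": ["p3", "p5"]}
# Hypothetical query-product relevance scores.
toy_scores = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "p4": 0.7, "p5": 0.1}

result = graph_expand(["p1", "p2", "p3"], toy_graph, lambda p: toy_scores[p], budget=3)
```

In this toy run the neighbor `p4` replaces the weaker initial candidate `p3`, while the result set stays at the same size as the first-stage list.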
[IR-12] RARe: Raising Ad Revenue Framework with Context-Aware Reranking
Link: https://arxiv.org/abs/2504.05308
Authors: Ekaterina Solodneva,Alexandra Khirianova,Aleksandr Katrutsa,Roman Loginov,Andrey Tikhanov,Egor Samosvat,Yuriy Dorn
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Modern recommender systems excel at optimizing search result relevance for e-commerce platforms. While maintaining this relevance, platforms seek opportunities to maximize revenue through search result adjustments. To address the trade-off between relevance and revenue, we propose the RARe (Raising Advertisement Revenue) framework. RARe stacks a click model and a reranking model. We train the RARe framework with a loss function to find revenue and relevance trade-offs. According to our experience, the click model is crucial in the RARe framework. We propose and compare two different click models that take into account the context of items in a search result. The first click model is a Gradient-Boosting Decision Tree with Concatenation (GBDT-C), which includes a context in the traditional GBDT model for click prediction. The second model, SAINT-Q, adapts the Sequential Attention model to capture influences between search results. Our experiments indicate that the proposed click models outperform baselines and improve the overall quality of our framework. Experiments on the industrial dataset, which will be released publicly, show RARe's significant revenue improvements while preserving a high relevance.