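Today's arXiv Papers | 2025-06-27

This post lists the latest papers retrieved from arXiv.org on 2025-06-27. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.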

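[Quick read]: This paper addresses hallucination in vision-language segmentation models, where a model produces segmentation masks that are not supported by the image content or incorrectly labels irrelevant regions. Existing evaluation protocols focus mainly on label or textual hallucinations and do not manipulate the visual context, which limits their ability to diagnose critical failures. The key to the solution is HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. It comprises a dataset of 1340 counterfactual instance pairs and new metrics that quantify hallucination sensitivity under visually coherent scene edits. The results reveal that vision-driven hallucinations are more prevalent than label-driven ones and highlight the need for counterfactual reasoning to diagnose grounding fidelity.

Link: https://arxiv.org/abs/2506.21546
Authors: Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Affiliation: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Project webpage: this https URL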

Abstract:Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

[NLP-1] Data Efficacy for Language Model Training

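[Quick read]: This paper addresses how to optimize both data efficiency and data efficacy in language model (LM) training, aiming to improve model performance by organizing training data better. The key to the solution is a general paradigm, DELT, with three core components: Data Scoring, Data Selection, and Data Ordering. Within it, Learnability-Quality Scoring (LQS) and Folding Ordering (FO) are the two new methods: LQS assesses the learnability and quality of each data sample from a gradient-consistency perspective, and FO addresses model forgetting and data distribution bias.

Link: https://arxiv.org/abs/2506.21545
Authors: Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
Affiliation: Microsoft Research
Categories: Computation and Language (cs.CL)
Notes: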

Abstract:Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
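
The three DELT components named above (scoring, selection, ordering) can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: `organize_training_data`, `score_fn`, `keep_ratio`, and `num_folds` are hypothetical names, the score function is a placeholder rather than LQS, and the fold-based ordering is one assumed reading of "folding", not the paper's definition of Folding Ordering.

```python
from typing import Callable, List, Sequence


def organize_training_data(
    samples: Sequence[str],
    score_fn: Callable[[str], float],
    keep_ratio: float = 1.0,
    num_folds: int = 1,
) -> List[str]:
    """Score, select, then order samples (illustrative sketch only)."""
    if num_folds < 1:
        raise ValueError("num_folds must be at least 1")
    # 1) Data Scoring: attach a scalar score to every sample.
    scored = [(score_fn(s), s) for s in samples]
    # 2) Data Selection: keep the top `keep_ratio` fraction by score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(scored) * keep_ratio))
    selected = scored[:keep]
    # 3) Data Ordering: sort ascending by score, then (assumption) deal the
    #    sorted samples into `num_folds` interleaved folds, so every fold
    #    sweeps the whole score range once. num_folds=1 is a plain sort.
    selected.sort(key=lambda pair: pair[0])
    folds = [selected[i::num_folds] for i in range(num_folds)]
    return [sample for fold in folds for _, sample in fold]
```

With `keep_ratio=1.0` only the order of the data changes, which matches the abstract's point that data efficacy does not require shrinking the dataset. Setting `keep_ratio` below 1 adds selection on top.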

[NLP-2] “What's Up Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets ALT

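[Quick read]: This paper addresses the fact that the nature and risks of conversations in which users seek healthcare information from large language models (LLMs) via interactive chatbots remain largely unexplored. The key to the solution is HealthChat-11K, a curated dataset of 11,000 real-world conversations comprising 25,000 user messages. Combined with a clinician-driven taxonomy, it supports a systematic study of how users interact with LLMs across 21 distinct health specialties, revealing how and why users seek health information as well as instances of incomplete context, affective behaviors, and interactions that can induce sycophancy.

Link: https://arxiv.org/abs/2506.21532
Authors: Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: 25 pages, 6 figures, 4 tables, corresponds to initial HealthChat-11K dataset release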

Abstract:People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: this https URL

[NLP-3] Potemkin Understanding in Large Language Models

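[Quick read]: This paper asks what justifies inferring the capabilities of large language models (LLMs) from their answers to a curated set of questions. It argues that benchmarks such as AP exams are valid tests of LLMs only if LLMs misunderstand concepts in ways that mirror human misunderstandings. If their error patterns differ from those of humans, success on a benchmark may demonstrate only "potemkin understanding": an apparent understanding that is irreconcilable with how any human would interpret the concept. The key to the solution is identifying and quantifying potemkins, which the paper does with two procedures: one uses a specially designed benchmark in three domains, and the other is a general procedure that provides a lower bound on their prevalence. The study finds that potemkins are ubiquitous across models, tasks, and domains, and that these failures reflect not just incorrect understanding but deeper internal incoherence in concept representations.

Link: https://arxiv.org/abs/2506.21521
Authors: Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: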

Abstract:Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM’s capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs – such as AP exams – are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

[NLP-4] skLEP: A Slovak General Language Understanding Benchmark ACL2025

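[Quick read]: This paper addresses the lack of a comprehensive benchmark for evaluating Slovak natural language understanding (NLU) models. It proposes skLEP, the first comprehensive benchmark designed specifically for this purpose. The key is a benchmark of nine diverse tasks spanning token-level, sentence-pair, and document-level challenges, built from new, original datasets created for Slovak and carefully translated English NLU resources. The work also presents the first systematic evaluation of a range of Slovak-specific, multilingual, and English pre-trained language models, and releases the complete benchmark data, an open-source toolkit, and a public leaderboard to support reproducibility and future research.

Link: https://arxiv.org/abs/2506.21508
Authors: Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: ACL 2025 Findings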

Abstract:In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at this https URL in the hopes of fostering reproducibility and driving future research in Slovak NLU.

[NLP-5] Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

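[Quick read]: This paper addresses the lack of effective evaluation benchmarks and methods for agentic search systems on long-horizon, complex, and time-varying tasks. Existing evaluation largely assumes short search horizons and static answers, so it cannot keep up with the growing complexity and open-endedness of agentic search. The key to the solution is the Mind2Web 2 benchmark, with 130 realistic, high-quality, long-horizon tasks, together with a novel Agent-as-a-Judge framework that uses task-specific judge agents and a tree-structured rubric to automatically assess answer correctness and source attribution, which handles time-varying and complex answers.

Link: https://arxiv.org/abs/2506.21506
Authors: Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Project Homepage: this https URL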

Abstract:Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing a great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
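
To make the idea of a tree-structured rubric more concrete, here is a minimal sketch of how leaf-level judgments might be rolled up into a single score. Everything in it is an assumption for illustration: `RubricNode`, the `critical` flag, and the plain averaging rule are hypothetical and are not taken from the Mind2Web 2 paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    name: str
    score: Optional[float] = None  # set on leaves only, in [0, 1]
    critical: bool = False         # hypothetical: a failed critical child zeroes its parent
    children: List["RubricNode"] = field(default_factory=list)


def evaluate(node: RubricNode) -> float:
    """Aggregate leaf scores bottom-up (assumed rule: unweighted mean)."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    child_scores = [(child, evaluate(child)) for child in node.children]
    # Assumed gating rule: any critical child scoring 0 fails the parent.
    if any(child.critical and s == 0.0 for child, s in child_scores):
        return 0.0
    return sum(s for _, s in child_scores) / len(child_scores)
```

In this sketch a judge agent would fill in the leaf scores (for example "answer is correct" and "cited source supports the claim"), and the tree shape decides how partial credit propagates.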

[NLP-6] Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments

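[Quick read]: This paper addresses how to improve user engagement in socially-driven dialogue. Prior methods optimize models to reason over relevant knowledge or to plan a dialogue act flow, but the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee engagement. The key to the solution is letting interactive large language models (LLMs) learn user engagement from signals in the future development of the conversation: the user's reaction related to the dialogue intention after the interaction serves as the reward for aligning the interactive LLM. To this end, the authors develop a user simulator and explore interactions between the user and the interactive LLM system with i×MCTS (Monte Carlo Tree Search for interaction), collect pairs of higher- and lower-quality experiences, and improve user engagement through direct preference optimization (DPO).

Link: https://arxiv.org/abs/2506.21497
Authors: Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
Affiliation: The Hong Kong Polytechnic University; Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Notes: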

Abstract:Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user’s reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
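
The final alignment step uses direct preference optimization on pairs of higher- and lower-quality experiences. As a rough illustration, the sketch below computes the commonly used DPO loss for a single pair from sequence log-probabilities. The function name, argument names, and the default `beta` are assumptions for this example, and the paper's actual training setup may differ.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Inputs are summed token log-probabilities under the policy being
    trained and under a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it shrinks as the policy raises the higher-quality response relative to the lower-quality one.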

[NLP-7] Bridging Offline and Online Reinforcement Learning for LLMs

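[Quick read]: This paper studies how effective reinforcement learning methods are for fine-tuning large language models when moving from offline to semi-online to fully online regimes, on both verifiable and non-verifiable tasks. The key is a comparison of online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, which finds that these variants perform and converge similarly and all strongly outperform offline methods. The paper also analyzes training dynamics and hyperparameter selection strategies in detail, and shows that multi-task learning with verifiable and non-verifiable rewards jointly improves performance on both task types.

Link: https://arxiv.org/abs/2506.21495
Authors: Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
Affiliation: Meta
Categories: Computation and Language (cs.CL)
Notes: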

Abstract:We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
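
The abstract does not spell out the group-based objective, so the following is only a sketch of the group-normalized advantage that is commonly associated with GRPO-style training: several responses are sampled for the same prompt, and each reward is standardized against the group. The function name and the `eps` constant are assumptions for this example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a verifiable task the rewards could be 0/1 correctness checks; with a non-verifiable task they could come from a reward model, which is one way to picture the multi-task setting the abstract mentions.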

[NLP-8] Logios: An open source Greek Polytonic Optical Character Recognition system

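[Quick read]: This paper addresses accurate recognition and digitization of Greek polytonic text, where traditional OCR methods have limitations with such complex scripts. The key to the solution is combining convolutional layers for feature extraction with recurrent layers for sequence learning, which addresses the unique challenges of Greek polytonic scripts with the aim of significantly improving recognition accuracy and efficiency.

Link: https://arxiv.org/abs/2506.21474
Authors: Perifanos Konstantinos, Goutsos Dionisis
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: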

Abstract:In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.

[NLP-9] TopK Language Models

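[Quick read]: This paper addresses limitations of sparse autoencoders (SAEs) for analyzing and interpreting the activation space of transformer-based language models (LMs): because SAEs are trained post hoc, their utility and internal validity are diminished, and their features are unstable. To address this, the paper modifies the transformer architecture by incorporating a TopK activation function at chosen layers, which makes the model's hidden states equivalent to the latent features of a TopK SAE. The key is building sparsity directly into the model, so that interpretability requires no post-hoc training while the model keeps its capabilities and the interpretations become more stable and reliable.

Link: https://arxiv.org/abs/2506.21468
Authors: Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
Affiliation: Tohoku University; RIKEN; MBZUAI
Categories: Computation and Language (cs.CL)
Notes: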

Abstract:Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE’s side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAEs features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model’s hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
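
A TopK activation keeps only the k largest entries of a vector and zeroes the rest. The sketch below shows that operation on a plain Python list. It is a minimal illustration, assuming selection by value rather than by magnitude, and the function name is hypothetical. How the paper applies it inside a transformer layer is not shown here.

```python
from typing import List


def topk_activation(hidden: List[float], k: int) -> List[float]:
    """Keep the k largest values of `hidden`; set every other entry to 0."""
    if not 0 < k <= len(hidden):
        raise ValueError("k must be between 1 and len(hidden)")
    # Indices of the k largest entries (ties resolved by position).
    top = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)[:k]
    keep = set(top)
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]
```

Because at most k entries are non-zero, each surviving dimension can be inspected or intervened on individually, which is the kind of neuron-level steering the abstract describes.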

[NLP-10] Aligning Spoken Dialogue Models from User Interactions ICML2025

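[Quick read]: This paper addresses preference alignment of real-time spoken dialogue models from user interactions. Current preference learning methods mainly target text-based language models and are not suited to real-time speech interaction, which has richer dynamics (e.g., interruption, interjection) and no explicit segmentation between speaker turns. The key to the solution is a large-scale dataset of more than 150,000 preference pairs built from raw multi-turn speech conversations and annotated with AI feedback, covering preferences over both linguistic content and temporal context variations. Offline alignment methods are then used to fine-tune a full-duplex autoregressive speech-to-speech model, and experiments show that feedback on generic conversations consistently helps spoken dialogue models produce more factual, safer, and more contextually aligned interactions.

Link: https://arxiv.org/abs/2506.21463
Authors: Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Accepted at ICML 2025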

Abstract:We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.

[NLP-11] Spatial Mental Modeling from Limited Views

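[Quick read]: This paper addresses the inability of Vision Language Models (VLMs) to build a full understanding of a scene from limited views the way humans do. Existing VLMs perform near random on such tasks, indicating a significant gap in building robust spatial mental models. The key solution is a synergistic "map-then-reason" approach that jointly trains the model to first generate a cognitive map and then reason upon it, which significantly improves understanding of unobservable space. Experiments show accuracy rising from 37.8% to 60.8%, and further to 70.7% with reinforcement learning.

Link: https://arxiv.org/abs/2506.21458
Authors: Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: Preprint version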

Abstract:Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for “what-if” movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, “map-then-reason”, that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.

[NLP-12] Text2Cypher Across Languages: Evaluating Foundational Models Beyond English

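[Quick read]: This paper addresses performance differences of foundational large language models (LLMs) on the Text2Cypher task across languages, given that languages other than English have received limited evaluation. The key to the solution is a multilingual test set built by translating English questions into Spanish and Turkish while preserving the original Cypher queries, which enables fair cross-lingual comparison and an evaluation of multiple foundational LLMs in each language.

Link: https://arxiv.org/abs/2506.21445
Authors: Makbule Gulcin Ozsoy, William Tai
Affiliation: Neo4j
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: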

Abstract:Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.
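
The test-set design described above, in which each question exists in several languages but shares one reference Cypher query, could be represented roughly as follows. This is a sketch under assumptions: the record layout, the sample question and query, and the whitespace-normalized exact match are all hypothetical stand-ins, since the abstract does not specify the paper's data format or metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Text2CypherExample:
    questions: Dict[str, str]  # language code -> question text
    cypher: str                # one reference query shared by all languages


def normalize(query: str) -> str:
    """Collapse whitespace so formatting differences do not count as errors."""
    return " ".join(query.split())


def accuracy_by_language(
    examples: List[Text2CypherExample],
    generate: Callable[[str], str],  # hypothetical model call: question -> Cypher
) -> Dict[str, float]:
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for ex in examples:
        for lang, question in ex.questions.items():
            totals[lang] = totals.get(lang, 0) + 1
            if normalize(generate(question)) == normalize(ex.cypher):
                hits[lang] = hits.get(lang, 0) + 1
    return {lang: hits.get(lang, 0) / n for lang, n in totals.items()}


# Made-up example record, for illustration only.
EXAMPLE = Text2CypherExample(
    questions={
        "en": "Which movies did Christopher Nolan direct?",
        "es": "¿Qué películas dirigió Christopher Nolan?",
        "tr": "Christopher Nolan hangi filmleri yönetti?",
    },
    cypher="MATCH (p:Person {name: 'Christopher Nolan'})-[:DIRECTED]->(m:Movie) RETURN m.title",
)
```

Keeping the reference query fixed means any gap between languages comes from how the model handles the question text, not from differences in the target queries.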

[NLP-13] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

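[Quick read]: This paper addresses the difficulty of detecting deceptive conversations on dynamic platforms, in particular the challenge posed by evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts. These shifts can obscure malicious intent or mimic normal dialogue, which makes classification hard. The key to the solution is a Domain Knowledge (DK)-enhanced Large Language Model (LLM) framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. Its core components are a DK-LLM module that detects fake or deceptive conversations, a drift detection unit (OCDD) that determines whether a semantic shift has occurred, and a second DK-LLM module that classifies the nature of the drift.

Link: https://arxiv.org/abs/2506.21443
Authors: Ali Şenol, Garima Agrawal, Huan Liu
Affiliation: Arizona State University; Tarsus University; Minerva CQ; HumaConn AI Consulting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: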

Abstract:Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.
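
The three components listed in the abstract can be read as a small pipeline. The sketch below wires them together with plain callables. It is an assumption-laden illustration: the function and parameter names are hypothetical, the components are passed in as stubs rather than real DK-LLM or OCDD implementations, and running the drift classifier only when drift is detected is an assumed wiring that the abstract does not state.

```python
from typing import Callable, Dict, Optional, Sequence

Turns = Sequence[str]


def analyze_conversation(
    turns: Turns,
    is_deceptive: Callable[[Turns], bool],   # stands in for the first DK-LLM module
    drift_detected: Callable[[Turns], bool], # stands in for the drift detection unit
    classify_drift: Callable[[Turns], str],  # stands in for the second DK-LLM module
) -> Dict[str, Optional[object]]:
    """Run the three stages and collect their outputs."""
    deceptive = is_deceptive(turns)
    drift = drift_detected(turns)
    # Assumed wiring: only classify the drift if one was detected.
    drift_type = classify_drift(turns) if drift else None  # "benign" or "fraudulent"
    return {"deceptive": deceptive, "drift": drift, "drift_type": drift_type}
```

Separating detection of a shift from the judgment about whether the shift is harmful mirrors the abstract's split between the drift detection unit and the second DK-LLM module.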

[NLP-14] Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference UAI2025

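[Quick read]: This paper addresses uncertainty quantification for large language models (LLMs), which hallucinate incorrect information and are poorly calibrated, a particular concern in high-stakes domains such as autonomy and healthcare. The key to the solution is a scalable Bayesian low-rank adaptation method (ScalaBL) that performs Bayesian inference in a low-dimensional subspace and repurposes the low-rank adaptation (LoRA) parameters as projection matrices, mapping samples from the subspace into the full weight space so that all parameters of the approach can be learned with stochastic variational inference. Despite the low-dimensional subspace, the method achieves performance competitive with state-of-the-art approaches while requiring only about 1000 additional parameters, and it scales to the largest Bayesian LLM to date.

Link: https://arxiv.org/abs/2506.21408
Authors: Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
Affiliation: SRI International; Computer Science Laboratory; Neuro-Symbolic Computing and Intelligence Research Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted at UAI 2025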

Abstract:Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs due to requiring further additional parameters compared to LoRA. In this work we present Scalable Bayesian Low-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an r-dimensional subspace, for LoRA rank r. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ~1000 additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as many base parameters as prior work.
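
One way to picture "inference in an r-dimensional subspace with LoRA parameters as projection matrices" is sketched below. This is an assumed reading of the abstract, not the paper's stated parameterization: the symbols $W_0$, $B$, $A$, $s$, $\mu$, and $\sigma$ are introduced here for illustration, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ the usual LoRA factors of a frozen weight $W_0 \in \mathbb{R}^{d \times k}$.

```latex
% Assumed illustration: a Gaussian variational distribution over an
% r-dimensional vector s, projected into weight space by the LoRA factors.
\begin{aligned}
  s &\sim \mathcal{N}\!\left(\mu,\ \operatorname{diag}(\sigma^{2})\right),
      \qquad \mu,\ \sigma \in \mathbb{R}^{r}, \\
  W &= W_0 + B \,\operatorname{diag}(s)\, A .
\end{aligned}
```

Under this reading, each adapted weight matrix adds only the $2r$ variational parameters $\mu$ and $\sigma$ on top of LoRA, which would be consistent in spirit with the small additional-parameter count the abstract reports. The exact construction in the paper may differ.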

[NLP-15] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation SIGIR2025

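[Quick read]: This paper addresses the challenges that live retrieval-augmented generation (RAG) systems face with user queries that are noisy, ambiguous, and contain multiple intents. Existing RAG systems are typically trained or evaluated on cleaner data and struggle with such real-world inputs. The key to the solution is the Omni-RAG framework, built from three core modules: (1) Deep Query Understanding and Decomposition, which uses LLMs with tailored prompts to denoise queries and decompose multi-intent queries; (2) Intent-Aware Knowledge Retrieval, which retrieves from a corpus for each sub-query and aggregates the results; and (3) Reranking and Generation, where a reranker (e.g., BGE) refines document selection and a large language model (e.g., Falcon-10B) generates the final response.

Link: https://arxiv.org/abs/2506.21384
Authors: Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
Affiliation: Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: Accepted at SIGIR 2025 LiveRAG Workshop (Oral Presentation)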

Abstract:Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.
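
The three modules described above compose naturally into a pipeline. The following is a minimal sketch of that control flow only, assuming each module is supplied as a callable. The names `answer_query`, `decompose`, `retrieve`, `rerank`, `generate`, and `top_n`, and the simple de-duplication used for aggregation, are hypothetical and do not reflect the actual Omni-RAG code, prompts, or retrieval backend.

```python
from typing import Callable, List


def answer_query(
    raw_query: str,
    decompose: Callable[[str], List[str]],          # module 1: denoise and split intents
    retrieve: Callable[[str], List[str]],           # module 2: per-sub-query retrieval
    rerank: Callable[[str, List[str]], List[str]],  # module 3a: reorder candidates
    generate: Callable[[str, List[str]], str],      # module 3b: produce the answer
    top_n: int = 5,
) -> str:
    sub_queries = decompose(raw_query)
    # Aggregate results across sub-queries, dropping duplicates (assumption).
    candidates: List[str] = []
    seen = set()
    for sub_query in sub_queries:
        for doc in retrieve(sub_query):
            if doc not in seen:
                seen.add(doc)
                candidates.append(doc)
    ranked = rerank(raw_query, candidates)[:top_n]
    return generate(raw_query, ranked)
```

Retrieving per sub-query before reranking against the original query lets a multi-intent question pull evidence for each intent while the final document selection still reflects the whole request.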

[NLP-16] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models

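[Quick read]: This paper addresses the limited ability of Large Language Models (LLMs) to provide professional literary criticism for works with profound thoughts and complex narratives. The key to the solution is GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on the Greimas Semiotic Square (GSS) that strengthens LLMs' capacity for in-depth literary analysis. GLASS enables rapid dissection of narrative structures and deep meanings, and the paper introduces quantitative metrics based on the LLM-as-a-judge paradigm to evaluate the resulting criticism.

Link: https://arxiv.org/abs/2506.21360
Authors: Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
Affiliation: Sun Yat-sen University; Central Queensland University
Categories: Computation and Language (cs.CL)
Notes: Accepted in CogSci 2025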

Abstract:Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework’s results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.
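
For readers unfamiliar with the semiotic square, it arranges two opposed terms and their negations into four positions linked by contrariety, contradiction, and implication. The sketch below encodes that standard structure as a small data type. The class and field names are hypothetical, the life/death example is a textbook illustration rather than one from the paper, and nothing here reflects how GLASS prompts an LLM.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SemioticSquare:
    s1: str      # first term
    s2: str      # its contrary
    not_s1: str  # contradictory of s1
    not_s2: str  # contradictory of s2

    def relations(self) -> Dict[str, List[Tuple[str, str]]]:
        """List the pairwise relations that define the square."""
        return {
            "contrariety": [(self.s1, self.s2)],
            "sub_contrariety": [(self.not_s2, self.not_s1)],
            "contradiction": [(self.s1, self.not_s1), (self.s2, self.not_s2)],
            "implication": [(self.not_s2, self.s1), (self.not_s1, self.s2)],
        }
```

A structured record like this is one way an analysis could be stored and compared, for example when an LLM judge scores a generated square against an expert's.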

[NLP-17] Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts

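Quick read: This paper addresses the severe load imbalance in Mixture-of-Experts (MoE) architectures during training and inference, where only a small number of experts are consistently activated, wasting much of the model's capacity and compute. The key to the solution is to revisit expert routing from a clustering perspective and propose Latent Prototype Routing (LPR), a new routing framework that balances expert utilization without sacrificing downstream performance, thereby improving resource utilization.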

Link: https://arxiv.org/abs/2506.21328
Authors: Jiajie Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures


Abstract:Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models – including DeepSeek-V3, Qwen3-MoE, and Mixtral – demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average, improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.
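The two balance metrics quoted in the abstract can be computed from per-expert token counts. The following is a minimal sketch using the standard Gini formula on sorted values; the function names and the toy load vectors are illustrative assumptions, not taken from the paper's code.

```python
def gini(loads):
    """Gini coefficient of non-negative loads: 0 means perfectly even."""
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form on ascending values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def min_max_ratio(loads):
    """Ratio of the least-loaded to the most-loaded expert (1.0 is ideal)."""
    hi = max(loads)
    return min(loads) / hi if hi > 0 else 0.0


balanced = [10, 10, 10, 10]   # hypothetical tokens routed to each of 4 experts
collapsed = [0, 0, 0, 40]     # hypothetical: one expert receives everything
print(gini(balanced), min_max_ratio(balanced))
print(gini(collapsed), min_max_ratio(collapsed))
```

With n experts, the Gini coefficient of a fully collapsed router is (n - 1) / n, so values close to 0 together with a min-max ratio close to 1 indicate that all experts are being used.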

[NLP-18] Exploring Adapter Design Tradeoffs for Low Resource Music Generation

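Quick read: This paper studies how to adapt generative AI music models to low-resource music genres using Parameter-Efficient Fine-Tuning (PEFT), focusing on how adapter design choices affect model performance. The key to the solution is a systematic study of adapter configurations (architecture, placement, and size) on two music models, MusicGen and Mustango, to determine how to reduce the number of trainable parameters and improve generation quality while preserving model performance.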

Link: https://arxiv.org/abs/2506.21298
Authors: Atharva Mehta,Shivam Chauhan,Monojit Choudhury
Affiliations: Mohamed bin Zayed University of AI
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 5 figures


Abstract:Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given case of low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt while lacking in providing stability in notes, rhythm alignment, and aesthetics. Also, it is computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training and are more efficient, and can produce better quality output in comparison, but have slightly higher redundancy in their generations. 

[NLP-19] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models ACL2025

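Quick read: This paper addresses the extraction of referring expressions from visually grounded dialogue, asking whether linguistic context alone suffices to identify mentions that have a visible referent. The key to the solution is to use a pretrained large language model (LLM) to mark mention spans in the dialogue at a coarse granularity via next-token prediction, detecting referring expressions without relying on visual information.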

Link: https://arxiv.org/abs/2506.21294
Authors: Bram Willemsen,Gabriel Skantze
Affiliations: KTH Royal Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at XLLM @ ACL 2025


Abstract:In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

[NLP-20] Small Encoders Can Rival Large Decoders in Detecting Groundedness

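Quick read: This paper addresses the tendency of large language models (LLMs) to produce ungrounded speculation or fall back on internal knowledge when the provided context lacks sufficient information, which harms answer accuracy and trustworthiness. The key to the solution is to check, before the costly answer generation by the LLM, whether the query is grounded in the given document, using lightweight, task-specific encoder models such as RoBERTa and NomicBERT. This improves efficiency and reduces resource consumption while reaching detection accuracy comparable to state-of-the-art LLMs.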

Link: https://arxiv.org/abs/2506.21288
Authors: Istabrak Abbes,Gabriele Prato,Quentin Fournier,Fernando Rodriguez,Alaa Boukhary,Adam Elwood,Sarath Chandar
Affiliations: Chandar Research Lab; Mila – Quebec AI Institute; Université de Montréal; Polytechnique Montréal; Aily Labs; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:


Abstract:Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness, that is, generating responses strictly supported by the context, is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: this https URL

[NLP-21] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

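Quick read: This paper addresses the limited ability of slow-thinking large language models (LLMs) to generate informative critiques and refine their earlier solutions. The core challenge is improving these models' capacity for self-critique and iterative refinement during reasoning. The key to the solution is the Double-Checker framework: by fine-tuning on 1,730 curated self-critical instances, it enables long chain-of-thought (long-CoT) LLMs to critique and refine their outputs during inference until their solutions are judged correct under self-generated critiques.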

Link: https://arxiv.org/abs/2506.21285
Authors: Xin Xu,Tianhao Chen,Fan Zhang,Wanlong Liu,Pengxiang Li,Ajay Kumar Jaiswal,Yuchen Yan,Jishan Hu,Yang Wang,Hao Chen,Shiwei Liu,Shizhe Diao,Can Yang,Lu Yin
Affiliations: The Hong Kong University of Science and Technology; University of Electronic Science and Technology of China; Dalian University of Technology; University of Texas at Austin; Zhejiang University; University of Oxford; NVIDIA; University of Surrey
Categories: Computation and Language (cs.CL)
Comments: 10 pages


Abstract:While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
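The inference-time behavior described above (critique, refine, stop once the critique accepts the solution) can be sketched as a simple control loop. This is only an illustration under stated assumptions: `solve`, `critique`, and `refine` are hypothetical callables standing in for model calls, and the `max_rounds` cap is an added safeguard that the abstract does not mention.

```python
from typing import Callable, Tuple


def refine_until_accepted(
    problem: str,
    solve: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return the final solution and the number of refinement rounds used."""
    solution = solve(problem)
    for round_idx in range(max_rounds):
        # The critique step returns a verdict plus free-text feedback.
        is_correct, feedback = critique(problem, solution)
        if is_correct:
            return solution, round_idx
        # Otherwise produce a revised solution conditioned on the feedback.
        solution = refine(problem, solution, feedback)
    return solution, max_rounds


# Toy stand-ins for the model calls (hypothetical, for demonstration only).
answer, rounds = refine_until_accepted(
    "2 + 2",
    solve=lambda p: "3",
    critique=lambda p, s: (s == "4", "off by one"),
    refine=lambda p, s, fb: str(int(s) + 1),
)
print(answer, rounds)
```

In a real system all three callables would be the same fine-tuned model prompted in different roles; the loop structure is the only part this sketch is meant to convey.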

[NLP-22] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

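Quick read: This paper addresses two key problems that multimodal large language models face in understanding human intentions: insufficient multimodal context understanding and the shortcut problem. The former means the model misreads multimodal information and answers incorrectly; the latter means the model ignores crucial clues in the multimodal input and responds to the query directly. The key to the solution is to strengthen the model's understanding of the global context in multimodal inputs, by introducing a context reward judged by a large language model alongside format and accuracy rewards, and by using the LLM to assess a logical reward, so that multimodal information and logical reasoning are integrated effectively. The paper also proposes IntentBench, a multimodal reasoning benchmark for evaluating how well models understand complex human intentions and emotions.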

Link: https://arxiv.org/abs/2506.21277
Authors: Qize Yang,Shimin Yao,Weixuan Chen,Shenghao Fu,Detao Bai,Jiaxing Zhao,Boyuan Sun,Bowen Yin,Xihan Wei,Jingren Zhou
Affiliations: Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:


Abstract:With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

[NLP-23] Cat and Mouse – Can Fake Text Generation Outpace Detector Systems?

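Quick read: This paper addresses the detection of "fake text" produced by generative AI in domains such as academic writing, product reviews, and political news. The key to the solution is to identify fake text with statistical classifiers. The study finds that relatively simple classifiers can maintain a good level of detection accuracy even as models grow in parameters, training data, and energy use, suggesting that reliable detection of fake text may remain feasible for larger models.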

Link: https://arxiv.org/abs/2506.21274
Authors: Andrea McGlinchey,Peter J Barclay
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: (Submitted for publication)


Abstract:Large language models can produce convincing “fake text” in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless “arms race”, we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models’ ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify “fake text” in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.

[NLP-24] DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

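Quick read: This paper addresses how to train large language models (LLMs) with more than 100 billion parameters efficiently in a distributed fashion over low-bandwidth networks. Traditional approaches depend heavily on fast, reliable centralized clusters, whereas the proposed DiLoCoX framework combines pipeline parallelism with a dual optimizer policy, one-step-delay overlap of communication and local training, and an adaptive gradient compression scheme to cut communication overhead and improve training efficiency. The key to the solution is showing, through theoretical analysis and experiments, that the one-step-delay overlap and adaptive gradient compression significantly speed up training while preserving model convergence.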

Link: https://arxiv.org/abs/2506.21263
Authors: Ji Qi,WenPeng Zhu,Li Li,Ming Wu,YingJun Wu,Wu He,Xun Gao,Jason Zeng,Michael Heinrich
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.

[NLP-25] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents ACL2025

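Quick read: This paper addresses the weak self-correction and generalization of multimodal large language models (MLLMs) in real-world tasks, caused by a lack of external feedback. The key to the solution is Agent-RewardBench, a reward benchmark targeted at agents that evaluates the reward modeling ability of MLLMs. The benchmark has three core features: evaluation across multiple dimensions and real-world agent scenarios, step-level reward evaluation, and appropriate difficulty control with high-quality data assurance, providing a more fine-grained and reliable assessment of reward modeling for agents.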

Link: https://arxiv.org/abs/2506.21252
Authors: Tianyi Men,Zhuoran Jin,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main


Abstract:As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but it is unclear how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenge, and perform manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available on GitHub.

[NLP-26] Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval

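Quick read: This paper addresses how to make effective use of large language models (LLMs) for Automatic Term Extraction (ATE). Although LLMs perform well across many NLP tasks, their potential for ATE has been little explored. The key to the proposed solution is a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations by syntactic rather than semantic similarity; the method is domain-agnostic and gives more reliable guidance for capturing term boundaries.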

Link: https://arxiv.org/abs/2506.21222
Authors: Yongchan Chun,Minhyuk Kim,Dongjun Kim,Chanjun Park,Heuiseok Lim
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:


Abstract:Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
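To make the idea of selecting demonstrations by syntactic rather than semantic similarity concrete, here is a small sketch. The abstract does not say how syntactic similarity is measured, so the Jaccard overlap of part-of-speech bigrams used below is a stand-in chosen for illustration; the tag sequences are assumed to be precomputed by some tagger, and all names are hypothetical.

```python
def pos_bigrams(tags):
    """Set of adjacent part-of-speech tag pairs in a sentence."""
    return set(zip(tags, tags[1:]))


def syntactic_similarity(tags_a, tags_b):
    """Jaccard overlap of POS bigrams (an assumed, simplified measure)."""
    a, b = pos_bigrams(tags_a), pos_bigrams(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_demonstrations(query_tags, pool, k=2):
    """Return the texts of the k pool items most syntactically similar to the query."""
    ranked = sorted(
        pool,
        key=lambda item: syntactic_similarity(query_tags, item["tags"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# Hypothetical annotated pool; the words themselves play no role in the ranking.
pool = [
    {"text": "s1", "tags": ["DET", "ADJ", "NOUN", "VERB"]},
    {"text": "s2", "tags": ["PRON", "VERB", "DET", "NOUN"]},
    {"text": "s3", "tags": ["DET", "ADJ", "NOUN"]},
]
print(retrieve_demonstrations(["DET", "ADJ", "NOUN", "VERB"], pool, k=2))
```

Because the ranking ignores vocabulary entirely, the same pool can serve queries from a different domain, which is the sense in which such retrieval is domain-agnostic.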

[NLP-27] Complexity-aware fine-tuning

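Quick read: This paper addresses the inefficiency of improving general-purpose large language models (LLMs) in specific domains through supervised fine-tuning (SFT), which demands large amounts of data and compute. The key to the solution is to identify complex data by entropy and apply reasoning only to that data, yielding an efficient fine-tuning pipeline. By splitting the training data into complexity categories according to single-token answer entropy and combining SFT with knowledge distillation, the method matches the performance of standard distillation (0.55 average accuracy) while using 62% less data.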

Link: https://arxiv.org/abs/2506.21220
Authors: Andrey Goncharov,Daniil Vyazhev,Petr Sychev,Edvard Khalafyan,Alexey Zaytsev
Affiliations: Skolkovo Institute of Science and Technology (Skoltech); National Research University Higher School of Economics; Moscow Institute of Physics and Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:


Abstract:General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models (≈3B), we split the training data into complexity categories by single-token answer entropy (ROC AUC 0.73), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach (0.55 vs. 0.43 average accuracy) and performs comparably to distillation while using 62% less data (0.55 average accuracy for both). We publish our code and data to facilitate further research in this direction.
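The entropy-based split can be illustrated as follows: compute the Shannon entropy of the model's distribution over single-token answers and bucket each example by that value. This is a rough sketch; the three bucket names and the two thresholds are assumptions made for the example, since the abstract does not state how the categories are defined.

```python
import math


def answer_entropy(probs):
    """Shannon entropy (in nats) of a distribution over candidate answer tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_by_entropy(examples, low, high):
    """Bucket (name, probs) pairs by entropy; `low` and `high` are assumed thresholds."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for name, probs in examples:
        h = answer_entropy(probs)
        key = "easy" if h < low else "hard" if h > high else "medium"
        buckets[key].append(name)
    return buckets


# Hypothetical answer-token probabilities for three four-option questions.
examples = [
    ("confident", [1.0, 0.0, 0.0, 0.0]),
    ("uniform", [0.25, 0.25, 0.25, 0.25]),
    ("leaning", [0.7, 0.1, 0.1, 0.1]),
]
print(split_by_entropy(examples, low=0.5, high=1.2))
```

Under a pipeline of the kind the abstract describes, only the high-entropy bucket would be sent through the more expensive reasoning distillation, while the rest would receive plain SFT.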

[NLP-28] Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? NEURIPS2024

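Quick read: This paper addresses the finding that current large language models (LLMs) are limited to shallow (level-1) causal reasoning and lack human-like deep (level-2) causal reasoning. The key to the solution is G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into the LLM's causal reasoning process, improving its causal reasoning in fresh and counterfactual contexts.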

Link: https://arxiv.org/abs/2506.21215
Authors: Haoang Chi,He Li,Wenjing Yang,Feng Liu,Long Lan,Xiaoguang Ren,Tongliang Liu,Bo Han
Affiliations: National University of Defense Technology; University of Melbourne; University of Sydney; Hong Kong Baptist University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, accepted at NeurIPS 2024


Abstract:Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal QA benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs’ causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs’ causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.

[NLP-29] Prompt-Guided Turn-Taking Prediction SIGDIAL2025

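Quick read: This paper addresses dynamic control of turn-taking prediction in dialogue systems, i.e., how to adjust conversational behavior flexibly through textual instructions to suit different conversational partners and contexts. The key to the solution is a Transformer-based model that incorporates textual prompt embeddings into both the channel-wise Transformers and a cross-channel Transformer, giving explicit control over turn-taking prediction; this improves prediction accuracy and lets turn-taking timing vary according to the textual prompt.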

Link: https://arxiv.org/abs/2506.21191
Authors: Koji Inoue,Mikey Elmers,Yahui Fu,Zi Haur Pang,Divesh Lala,Keiko Ochi,Tatsuya Kawahara
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author’s version of the work


Abstract:Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer”, adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

[NLP-30] Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

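Quick read: This paper addresses the engineering challenges of keeping the Massive Text Embedding Benchmark (MTEB) reproducible and extensible over time. The key to the solution is maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess the generalizability of benchmark results, together with design choices that improve reproducibility and usability. The paper also discusses strategies for handling community contributions and extending the benchmark with new tasks and datasets; these engineering practices have helped MTEB become more comprehensive while maintaining its quality and relevance to the field.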

Link: https://arxiv.org/abs/2506.21182
Authors: Isaac Chung,Imene Kerboua,Marton Kardos,Roman Solomatin,Kenneth Enevoldsen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:


Abstract:The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB’s continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results’ generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: this https URL

[NLP-31] Compressed and Smooth Latent Space for Text Diffusion Modeling

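Quick read: This paper addresses two core problems of autoregressive language models in text generation: slow decoding and difficulty maintaining global coherence. Diffusion models offer parallel generation and flexible control, but their application to text is limited by the high dimensionality of token-level representations. The key to the proposed solution is Cosmos, which generates text in a compressed, smooth latent space. This space is learned by an autoencoder trained jointly for token-level reconstruction and for alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations.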

Link: https://arxiv.org/abs/2506.21170
Authors: Viacheslav Meshchaninov,Egor Chimbulatov,Alexander Shabalin,Aleksandr Abramov,Dmitry Vetrov
Affiliations: HSE University; Constructor University; SaluteDevices
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by 8× while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than 2× faster inference.

[NLP-32] Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models ICONIP2024

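Quick read: This paper addresses inefficient allocation of computing resources when fine-tuning large Transformer language models. As models grow, standard fine-tuning and most parameter-efficient fine-tuning methods ignore the unequal contributions of different Transformer blocks and update the same number of parameters as the initial size, wasting resources. The key to the solution is Progtuning, a framework that combines fine-tuning with progressive learning and gradually reduces the number of updated blocks according to each block's contribution. This optimizes resource allocation, reduces the number of updated parameters by about 25% while maintaining competitive performance, and adapts well to parameter-efficient fine-tuning methods.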

Link: https://arxiv.org/abs/2506.21119
Authors: Xiaoshuang Ji,Zhendong Zhao,Xiaojun Chen,Xin Zhao,Zeyao Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICONIP 2024


Abstract:Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, a novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated transformer blocks based on the contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. It also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.

[NLP-33] Learning to Skip the Middle Layers of Transformers

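Quick read: This paper addresses the computational efficiency of Transformers, specifically reducing compute by dynamically skipping middle layers. The key to the solution is a new architecture that, depending on the input, dynamically skips a variable number of layers from the middle outward: a learned gating mechanism decides whether to bypass a symmetric span of central blocks, a gated attention mechanism prevents subsequent tokens from attending to skipped token positions, residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity is controlled with an adaptive regularization loss.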

Link: https://arxiv.org/abs/2506.21103
Authors: Tim Lawson,Laurence Aitchison
Affiliations: University of Bristol
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures


Abstract:Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a ‘sandwich’ or ‘perilayernorm’ scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for ‘simpler’ tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at this https URL.
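The phrase "a symmetric span of central blocks" can be made concrete with a small indexing sketch. It only shows which layer indices would still run for a given skip width; the learned gates, the gated attention, and the per-token decisions described in the abstract are not modeled, and both the function name and the even-depth restriction are assumptions made for illustration.

```python
def executed_layers(n_layers, half_width):
    """Indices of layers that still run when a symmetric central span is skipped.

    half_width = 0 skips nothing; half_width = n_layers // 2 skips everything.
    """
    if n_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    mid = n_layers // 2
    if not 0 <= half_width <= mid:
        raise ValueError("half_width must lie in [0, n_layers // 2]")
    # The skipped span grows outward from the middle in both directions.
    skipped = range(mid - half_width, mid + half_width)
    return [i for i in range(n_layers) if i not in skipped]


for width in range(0, 5):
    print(width, executed_layers(8, width))
```

In this picture a wider span corresponds to an input judged to be simpler, so fewer middle layers are spent on it while the earliest and latest layers always run.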

[NLP-34] ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry ACL2025

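[Quick read]: This paper addresses how industrial community question answering (CQA) platforms can effectively use historical interactions and domain knowledge for real-time question answering. Existing methods tend to underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited to industrial deployment. The proposed solution, ComRAG, relies on a centroid-based memory mechanism that integrates static knowledge with dynamic historical QA pairs, unifying retrieval, generation, and efficient storage.

Link: https://arxiv.org/abs/2506.21098
Authors: Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
Affiliations: East China Normal University; Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures. Accepted at ACL 2025 Industry Track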

Abstract:Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.
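
To make the idea of a centroid-based memory concrete, here is a minimal sketch of a generic centroid store: a new QA vector joins the nearest cluster when its cosine similarity to that centroid reaches a threshold, and otherwise opens a new cluster. The abstract does not describe ComRAG's actual mechanism in detail, so the class name, the threshold value, and the running-mean update are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidMemory:
    """Hypothetical centroid store: one centroid per cluster of QA vectors."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.centroids: List[List[float]] = []
        self.members: List[List[str]] = []

    def add(self, qa_id: str, vec: Sequence[float]) -> int:
        """Insert a QA vector and return the index of the cluster it joined."""
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best >= 0 and best_sim >= self.threshold:
            n = len(self.members[best])
            old = self.centroids[best]
            # Running mean keeps the centroid equal to the average of its members.
            self.centroids[best] = [(c * n + v) / (n + 1) for c, v in zip(old, vec)]
            self.members[best].append(qa_id)
            return best
        self.centroids.append(list(vec))
        self.members.append([qa_id])
        return len(self.centroids) - 1


if __name__ == "__main__":
    memory = CentroidMemory(threshold=0.8)
    for qa_id, vec in [("q1", [1.0, 0.0]), ("q2", [0.9, 0.1]), ("q3", [0.0, 1.0])]:
        print(qa_id, "-> cluster", memory.add(qa_id, vec))
```

Grouping near-duplicate questions under one centroid is one plausible way to slow the growth of stored chunks, but this sketch makes no claim about the paper's reported numbers.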

[NLP-35] DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning ACL2025

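[Quick read]: This paper addresses two key problems in multimodal sentence representation learning: cross-modal misalignment bias and intra-modal semantic divergence, both of which significantly degrade sentence representation quality. The key to the solution is Dual-level Alignment Learning (DALR), which achieves fine-grained cross-modal alignment through a consistency learning module and combines ranking distillation with global intra-modal alignment learning to capture the complex relationships between sentences more precisely, thereby improving representation quality.

Link: https://arxiv.org/abs/2506.21096
Authors: Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
Affiliations: Wuhan University
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 Findings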

Abstract:Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.

[NLP-36] Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph

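[Quick read]: This paper addresses how to effectively improve the ability of large language models (LLMs) to use tools, thereby strengthening their problem-solving ability and range of applications. Traditional methods rely on LLMs to generate instruction data, but the quality of that data is often insufficient. The key to the proposed solution is to use knowledge graphs to generate high-quality instruction data: query pathways are extracted from the knowledge graph and converted into user queries, relationships between entities are turned into actionable tools, and each query pathway is parsed into detailed solution steps, yielding high-quality training data. Experiments show that only a small amount of this synthetic data is needed to significantly improve LLMs' tool use and overall performance.

Link: https://arxiv.org/abs/2506.21071
Authors: Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
Affiliations: Ant Group; School of Computer Science and Technology, Xi’an Jiaotong University; School of Distance Education, Xi’an Jiaotong University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 12 figures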

Abstract:Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. In this paper, we propose a new method that uses knowledge graphs to generate high-quality instruction data for LLMs. Knowledge graphs are manually curated datasets rich in semantic information. We begin by extracting various query pathways from a given knowledge graph, which are transformed into a broad spectrum of user queries. We then translate the relationships between entities into actionable tools and parse the pathways of each query into detailed solution steps, thereby creating high-quality instruction data. Our experiments show that fine-tuning on just a small sample of this synthetic data can significantly improve the tool utilization and overall capabilities of LLMs.
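
The pipeline described above (query pathway, relations as tools, pathway as solution steps) can be sketched on a toy graph. Everything below is illustrative: the triples, the `path_to_sample` helper, and the output schema are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_index(triples: List[Triple]) -> Dict[Tuple[str, str], str]:
    """Map (head, relation) to tail so each relation can act like a tool call."""
    return {(head, rel): tail for head, rel, tail in triples}


def path_to_sample(start: str, relations: List[str],
                   index: Dict[Tuple[str, str], str]) -> dict:
    """Turn one query pathway into a query, a list of tool-call steps, and an answer."""
    steps = []
    current = start
    for rel in relations:
        result = index[(current, rel)]  # KeyError if the pathway is not in the graph
        steps.append({"tool": rel, "argument": current, "result": result})
        current = result
    query = f"Starting from {start}, follow " + " then ".join(relations) + "."
    return {"query": query, "steps": steps, "answer": current}


if __name__ == "__main__":
    # Toy knowledge graph, invented for this example.
    triples = [("Alice", "works_at", "Acme"), ("Acme", "located_in", "Paris")]
    sample = path_to_sample("Alice", ["works_at", "located_in"], build_index(triples))
    for step in sample["steps"]:
        print(f'{step["tool"]}({step["argument"]}) -> {step["result"]}')
    print("answer:", sample["answer"])
```

In a real pipeline the templated query string would presumably be rewritten into natural language; that step is left out here.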

[NLP-37] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

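[Quick read]: This paper addresses stance detection in multi-target, multi-turn conversations on social media. Traditional methods usually analyze individual instances and struggle to model the multi-party discussions found in real scenarios. The key solution is an LLM-enhanced Conversational Relational Attention Network (LLM-CRAN), which uses the reasoning capabilities of large language models to improve conversational understanding and thus meet the new challenges posed by the MT2-CSD dataset.

Link: https://arxiv.org/abs/2506.21053
Authors: Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang
Affiliations: Shenzhen Technology University; University of Washington; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: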

Abstract:In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

[NLP-38] A Semi-supervised Scalable Unified Framework for E-commerce Query Classification ACL2025

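[Quick read]: This paper addresses the challenges of query classification in e-commerce: queries are short and lack context, information between labels cannot be exploited, which leaves too little prior information, and existing methods build training samples from user click behavior, which leads to a Matthew vicious cycle. In addition, the subtasks of query classification lack a unified framework, making algorithm optimization inefficient. The proposed solution is a Semi-supervised Scalable Unified Framework (SSUF), whose key idea is to unify the tasks through three enhancement modules: a knowledge-enhanced module that uses world knowledge to enrich query representations, a label-enhanced module that uses label semantics and semi-supervised signals to reduce dependence on posterior labels, and a structure-enhanced module that strengthens label representations based on complex label relations. Each module is highly pluggable, and input features can be adjusted flexibly for each subtask.

Link: https://arxiv.org/abs/2506.21049
Authors: Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law
Affiliations: JD.COM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by ACL 2025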

Abstract:Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users’ posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.

[NLP-39] Large Language Models Acing Chartered Accountancy

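[Quick read]: This paper addresses the limited ability of current large language models (LLMs) to capture and apply domain-specific knowledge in finance. The key to the solution is CA-Ben, a benchmark tailored to the Indian financial context. It is derived from the rigorous examinations of the Institute of Chartered Accountants of India (ICAI), covers foundational, intermediate, and advanced curriculum stages, and evaluates LLMs' financial, legal, and quantitative reasoning. With this benchmark, researchers can measure the performance of existing LLMs more accurately and expose their limitations in numerical computation and legal interpretation.

Link: https://arxiv.org/abs/2506.21031
Authors: Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
Affiliations: Sharda University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at MoStart 2025: International Conference on Digital Transformation in Education and Applications of Artificial Intelligence, Bosnia and Herzegovina, 2025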

Abstract:Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs (GPT-4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4) were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.

[NLP-40] SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control

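[Quick read]: This paper addresses two main limitations of existing large language models (LLMs) when simulating human personality: reliance on the coarse-grained Big Five (OCEAN) framework, and the lack of a mechanism for controlling trait intensity. The key to the solution is to extend the Machine Personality Inventory (MPI) with the 16 Personality Factor (16PF) model, enabling expressive control over sixteen distinct traits, and to propose a structured framework, Specific Attribute Control (SAC), which dynamically adjusts trait intensity through adjective-based semantic anchoring and behavioural questions across five intensity factors (Frequency, Depth, Threshold, Effort, Willingness). The method models trait intensity as a continuous spectrum, which gives more consistent and controllable personality expression than binary trait toggling.

Link: https://arxiv.org/abs/2506.20993
Authors: Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
Affiliations: BITS Pilani
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under review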

Abstract:Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.

[NLP-41] SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes

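[Quick read]: This paper addresses fine-tuning vision language models (VLMs) on memory-constrained, inference-only edge devices, which cannot support the backpropagation (BP) that conventional methods need to obtain model gradients. The key to the solution is a hybrid Sharpness-aware Zeroth-order optimization method (SharpZO), whose core idea is to improve zeroth-order (ZO) fine-tuning through sharpness-aware warm-up training. SharpZO uses a two-stage optimization process: the first stage applies a sharpness-aware evolutionary strategy (ES) to explore globally and smooth the loss landscape, building a strong initialization; the second stage performs a fine-grained local search with sparse ZO optimization. The whole process relies only on forward passes.

Link: https://arxiv.org/abs/2506.20990
Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
Affiliations: University of California, Santa Barbara; Amazon AGI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: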

Abstract:Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making them unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization, and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via a sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization, followed by a fine-grained local search via sparse ZO optimization. The entire optimization relies solely on forward passes. Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to 7% average gain over state-of-the-art forward-only methods.
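
For readers unfamiliar with forward-only optimization, the sketch below shows the standard two-point zeroth-order gradient estimate, g ≈ (f(x + μu) − f(x − μu)) / (2μ) · u with a random direction u, applied to a toy quadratic. It is only the generic ZO building block: it does not include SharpZO's sharpness-aware ES warm-up or its sparsity, and the step size, smoothing radius, and test function are assumptions chosen for the example.

```python
import random
from typing import Callable, List, Sequence


def zo_gradient(f: Callable[[Sequence[float]], float], x: Sequence[float],
                mu: float, rng: random.Random) -> List[float]:
    """Two-point zeroth-order gradient estimate using two forward evaluations."""
    u = [rng.gauss(0.0, 1.0) for _ in x]  # random Gaussian probe direction
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)  # directional derivative along u
    return [scale * ui for ui in u]


def zo_sgd(f: Callable[[Sequence[float]], float], x0: Sequence[float],
           steps: int = 500, lr: float = 0.02, mu: float = 1e-3,
           seed: int = 0) -> List[float]:
    """Plain gradient descent driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = zo_gradient(f, x, mu, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def sphere(x: Sequence[float]) -> float:
    """Toy objective standing in for a model's loss."""
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    start = [1.0, -2.0, 0.5, 3.0, -1.5]
    end = zo_sgd(sphere, start)
    print("initial loss:", sphere(start))
    print("final loss:", sphere(end))
```

Each step costs two forward evaluations and no backward pass, which is the property that makes this family of methods attractive on inference-only devices.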

[NLP-42] Can Gradient Descent Simulate Prompting?

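[Quick read]: This paper asks how fine-tuning can be made to emulate the effect of prompting, so that a model adapts better to new information without adding long-term storage cost. The key to the solution is to meta-train a language model (LM) so that its gradient updates emulate the effect of conditioning on new information. The method uses tools from gradient-based meta-learning and takes the model's own prompted predictions as targets, removing the need for ground-truth labels. After a single gradient update, it recovers some, and occasionally all, of the prompted model's performance, indicating that gradient descent is highly expressive given an appropriate initialization.

Link: https://arxiv.org/abs/2506.20989
Authors: Eric Zhang, Leshem Choshen, Jacob Andreas
Affiliations: MIT CSAIL
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures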

Abstract:There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM’s own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance, showing improvement on the “reversal curse” tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.

[NLP-43] Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

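[Quick read]: This paper addresses the potential negative impact of language-model agents on high-stakes societal decisions, in particular the far-reaching effects of their advice as it propagates through macroscopic societal systems over time. The key to the solution is a proof-of-concept framework that simulates how model-generated advice propagates through societal systems, together with a dataset of 100 indirect harm scenarios for evaluating a model's ability to foresee non-obvious adverse outcomes. The method achieves over 20% improvement on the new dataset and an average win rate above 70% on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), indicating its potential for improving agent safety.

Link: https://arxiv.org/abs/2506.20949
Authors: Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: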

Abstract:Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models’ ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.

[NLP-44] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

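[Quick read]: This paper addresses the balance between model performance and parameter count in general-purpose text embedding, that is, achieving strong representation capability while keeping the model compact. The key elements of the solution are: first, removing the causal attention mask and adopting a fully bidirectional Transformer with simple mean pooling, so that the architecture better fits representation learning; second, a multi-stage training pipeline consisting of pre-training on large-scale weakly supervised corpora, fine-tuning on high-quality retrieval and non-retrieval data, and model parameter averaging for better generalization; and third, a focal-style reweighting mechanism and an online hard-negative mixing strategy, which improve learning efficiency on difficult samples and the diversity of hard negatives. Together these innovations drive the model's strong results on the MTEB benchmark.

Link: https://arxiv.org/abs/2506.20923
Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Tongji University
Categories: Computation and Language (cs.CL)
Comments: Technical Report; 26 pages 12 tables 1 figure. arXiv admin note: substantial text overlap with arXiv:2501.01028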

Abstract:In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
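
Two of the ingredients named above are easy to show in isolation: masked mean pooling, which turns per-token vectors into one fixed-length embedding, and a focal-style weight that shrinks the contribution of easy samples. The sketch assumes the standard focal-loss form (1 − p)^γ; the abstract does not give the paper's exact weighting formula, and both function names are hypothetical.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def focal_weight(p_correct: float, gamma: float = 2.0) -> float:
    """Focal-style weight: close to 0 for easy samples, close to 1 for hard ones.

    p_correct is the model's probability for the correct (positive) item.
    The (1 - p) ** gamma form is an assumption borrowed from focal loss.
    """
    return (1.0 - p_correct) ** gamma


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last token is padding
    mask = [1, 1, 0]
    print(mean_pool(tokens, mask))
    print(focal_weight(0.9), focal_weight(0.1))
```

Because the attention is bidirectional, every token has seen the whole sentence, which is why a plain average over tokens is a reasonable sentence summary in this setting.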

[NLP-45] FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language

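[Quick read]: This paper addresses the challenge of building datasets for training multilingual large language models (LLMs), in particular the difficulty of tailoring filtering and deduplication pipelines to a large number of languages. The key to the solution is a pre-training dataset curation pipeline based on FineWeb that adapts automatically to any language, validated through extensive ablations on nine diverse languages. The method also introduces a dataset rebalancing strategy that considers both duplication count and quality, which further improves model performance. Finally, the pipeline is scaled to over 1000 languages, producing FineWeb2, a 20 TB (5 billion document) multilingual dataset.

Link: https://arxiv.org/abs/2506.20920
Authors: Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
Affiliations: Hugging Face; EPFL
Categories: Computation and Language (cs.CL)
Comments: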

Abstract:Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

[NLP-46] Optimising Language Models for Downstream Tasks: A Post-Training Perspective

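[Quick read]: This paper addresses the inefficiency, limited robustness, and high computational cost of adapting language models (LMs) to specific tasks. Its key solutions are: a novel continued pre-training technique that extracts task-relevant knowledge from unlabelled data more effectively; a parameter-efficient fine-tuning method that substantially reduces memory and compute costs; improved supervised fine-tuning methods that help models follow instructions better when labelled data is scarce; and new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, for assessing model capabilities more comprehensively. Together these methods improve the robustness, efficiency, and generalization of language models.

Link: https://arxiv.org/abs/2506.20917
Authors: Zhengyan Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: PhD Thesis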

Abstract:Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.

[NLP-47] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

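[Quick read]: This paper addresses the limitations of automatic fact-checking systems in real medical use, especially the challenges of verifying real medical claims drawn from social media. The study reveals fundamental difficulties for end-to-end fact-checking in medicine: connecting real-world claims to scientific evidence such as clinical trials, ambiguity arising from underspecified claims and mismatched intentions, and the subjectivity of veracity labels. The key proposal is to treat fact-checking as an interactive communication problem rather than a purely end-to-end process, so as to handle the complexity of medical information and the diversity of user needs more effectively.

Link: https://arxiv.org/abs/2506.20876
Authors: Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li
Affiliations: The University of Texas at Austin; Massachusetts Institute of Technology; Indiana University School of Medicine; King’s College London; The University of Texas MD Anderson Cancer Center; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: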

Abstract:Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.

[NLP-48] Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA

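[Quick read]: This paper addresses memorization in the fine-tuning of large language models (LLMs), in particular the balance between memorization risk and task performance under LoRA (Low-Rank Adaptation) fine-tuning. Existing work focuses mainly on memorization during pre-training, with little study of fine-tuning, especially LoRA fine-tuning. Using a more relaxed similarity-based memorization metric, the paper shows that LoRA fine-tuning significantly reduces memorization risk while retaining strong downstream task performance, making it a safer option for parameter-efficient fine-tuning.

Link: https://arxiv.org/abs/2506.20856
Authors: Fei Wang, Baochun Li
Affiliations: University of Toronto
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: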

Abstract:Memorization in large language models (LLMs) makes them vulnerable to data extraction attacks. While pre-training memorization has been extensively studied, fewer works have explored its impact in fine-tuning, particularly for LoRA fine-tuning, a widely adopted parameter-efficient method. In this work, we re-examine memorization in fine-tuning and uncover a surprising divergence from prior findings across different fine-tuning strategies. Factors such as model scale and data duplication, which strongly influence memorization in pre-training and full fine-tuning, do not follow the same trend in LoRA fine-tuning. Using a more relaxed similarity-based memorization metric, we demonstrate that LoRA significantly reduces memorization risks compared to full fine-tuning, while still maintaining strong task performance.
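
A "relaxed similarity-based" metric can be contrasted with exact-match memorization in a few lines. The sketch below flags a generated continuation as memorized when its token-level similarity to the true training suffix reaches a threshold. The abstract does not say which similarity measure the authors use, so Python's `difflib.SequenceMatcher` and the 0.8 threshold are stand-in assumptions.

```python
from difflib import SequenceMatcher


def similarity(generated: str, reference: str) -> float:
    """Token-level similarity in [0, 1] between two whitespace-tokenized strings."""
    return SequenceMatcher(None, generated.split(), reference.split()).ratio()


def is_memorized_exact(generated: str, reference: str) -> bool:
    """Strict criterion: the continuation must reproduce the suffix verbatim."""
    return generated.split() == reference.split()


def is_memorized_relaxed(generated: str, reference: str,
                         threshold: float = 0.8) -> bool:
    """Relaxed criterion: near-verbatim continuations also count as memorized."""
    return similarity(generated, reference) >= threshold


if __name__ == "__main__":
    truth = "the cat sat on the mat yesterday"
    output = "the cat sat on the mat today"
    print("exact:", is_memorized_exact(output, truth))
    print("relaxed:", is_memorized_relaxed(output, truth))
```

The relaxed check catches near-copies that differ by a word or two, which an exact-match criterion would miss.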

[NLP-49] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes

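[Quick read]: This paper addresses a research gap concerning how large language models (LLMs) judge and respond to morally ambiguous real-world scenarios, especially those involving violent content. The key to the solution is to adopt a validated social science instrument, the Violent Behavior Vignette Questionnaire (VBVQ), and to use persona-based prompting to evaluate models across demographic characteristics, revealing potential biases in LLMs' violent tendencies and where they diverge from established social science findings.

Link: https://arxiv.org/abs/2506.20822
Authors: Quintin Myers, Yanjun Gao
Affiliations: University of Colorado Anschutz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review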

Abstract:Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs’ surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.

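[NLP-50] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering

[Quick read]: This paper tackles cross-modal question answering (QA) over financial documents such as 10-Ks, 10-Qs, and investor presentations. Such questions typically require joint reasoning over text, tables, and images, which traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines handle poorly because of token limits, layout loss, and fragmented cross-modal context. The key to the solution is the MultiFinRAG framework, which combines multimodal extraction, modality-aware embedding and indexing, and a tiered fallback strategy to integrate and retrieve multimodal information from financial documents efficiently and precisely, improving accuracy on complex financial QA tasks.

Link: https://arxiv.org/abs/2506.20821
Authors: Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Affiliations: S&P Global Ratings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Preprint Copy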

Abstract:Financial documents–such as 10-Ks, 10-Qs, and investor presentations–span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
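
The tiered fallback described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the threshold values, the `min_hits` cutoff, the `Chunk` type, and the precomputed similarity scores are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-modality similarity thresholds (not taken from the paper).
THRESHOLDS = {"text": 0.70, "table": 0.65, "image": 0.55}

@dataclass
class Chunk:
    modality: str   # "text", "table", or "image"
    content: str
    score: float    # similarity to the query, assumed to be precomputed

def select(chunks, modalities):
    """Keep chunks of the given modalities that pass their own threshold."""
    kept = [c for c in chunks
            if c.modality in modalities and c.score >= THRESHOLDS[c.modality]]
    return sorted(kept, key=lambda c: c.score, reverse=True)

def tiered_context(chunks, min_hits=2):
    """Try text-only context first; escalate to all modalities if too few hits."""
    text_hits = select(chunks, {"text"})
    if len(text_hits) >= min_hits:
        return "text-only", text_hits
    return "text+table+image", select(chunks, {"text", "table", "image"})

chunks = [
    Chunk("text", "Revenue grew year over year.", 0.82),
    Chunk("text", "Risk factors overview.", 0.40),
    Chunk("table", "Segment revenue table (JSON summary).", 0.71),
    Chunk("image", "Margin trend chart (text summary).", 0.50),
]
tier, context = tiered_context(chunks)
```

Here only one text chunk clears its threshold, so the sketch escalates to the combined tier and keeps the text and table chunks while dropping the low-scoring image summary.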

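[NLP-51] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

[Quick read]: This paper asks whether research ideas generated by generative AI lead to better research outcomes. Its core aim is to verify whether AI-generated ideas retain their novelty and effectiveness once actually executed, rather than merely appearing innovative. The key to the approach is an execution study in which 43 expert researchers each carry out a research idea either written by experts or generated by an LLM, with the executed results blind-reviewed so that the two can be compared objectively.

Link: https://arxiv.org/abs/2506.20803
Authors: Chenglei Si, Tatsunori Hashimoto, Diyi Yang
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: main paper is 14 pages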

Abstract:Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

[NLP-52] Multi-lingual Functional Evaluation for Large Language Models

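[Quick read]: This paper addresses the problem that existing static multilingual benchmarks do not adequately reflect the practical performance and robustness of large language models in multilingual settings. The key to its solution is to build multilingual functional benchmarks, Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval), by translating existing functional benchmark templates from English into five languages (French, Spanish, Hindi, Arabic, and Yoruba) that span the range of resource levels in NLP.

Link: https://arxiv.org/abs/2506.20793
Authors: Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian
Affiliations: Brown University; University of California, Berkeley
Categories: Computation and Language (cs.CL)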

Abstract:Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks, Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval), by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly there is a 15-24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) being the most consistently well performing across evaluation iterations.

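[NLP-53] Towards Probabilistic Question Answering Over Tabular Data

[Quick read]: This paper addresses probabilistic question answering (QA) over tabular data. Existing approaches such as natural-language-to-SQL (NL2SQL) systems perform well on factual questions whose answers can be retrieved directly from tables, but fall short on probabilistic questions that require reasoning under uncertainty. The key to the solution is a new benchmark, LUCARIO, together with a framework that induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers, yielding a hybrid symbolic-neural reasoning approach.

Link: https://arxiv.org/abs/2506.20747
Authors: Chen Shen, Sajjadur Rahman, Estevam Hruschka
Affiliations: Megagon Labs; Adobe
Categories: Computation and Language (cs.CL)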

Abstract:Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
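
To make the idea of a "probabilistic query over a table" concrete, the sketch below estimates a conditional probability by counting rows, which is the kind of quantity a Bayesian Network's conditional probability tables encode. It is not the LUCARIO framework: structure learning and the natural-language translation step are omitted, and the toy table, column names, and `conditional_probability` helper are hypothetical.

```python
# Toy table (hypothetical data, for illustration only).
rows = [
    {"smoker": "yes", "exercise": "low",  "illness": "yes"},
    {"smoker": "yes", "exercise": "high", "illness": "no"},
    {"smoker": "yes", "exercise": "low",  "illness": "yes"},
    {"smoker": "no",  "exercise": "high", "illness": "no"},
    {"smoker": "no",  "exercise": "low",  "illness": "yes"},
    {"smoker": "no",  "exercise": "high", "illness": "no"},
]

def conditional_probability(rows, target, evidence):
    """Estimate P(target | evidence) by counting matching rows.

    `target` and `evidence` are dicts mapping column names to values.
    Returns None when no row matches the evidence.
    """
    matches = [r for r in rows if all(r[k] == v for k, v in evidence.items())]
    if not matches:
        return None
    hits = sum(all(r[k] == v for k, v in target.items()) for r in matches)
    return hits / len(matches)

# "How likely is illness for a smoker?" expressed as a probabilistic query.
p = conditional_probability(rows, {"illness": "yes"}, {"smoker": "yes"})
```

A plain SQL lookup would return the matching rows; the probabilistic query instead returns an estimate (here two of the three smoker rows report illness), and an LLM could then phrase that estimate as an answer.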

[NLP-54] MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation

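[Quick read]: This paper examines whether agents based on large language models (LLMs) understand and protect contextual privacy in multi-agent collaboration. Existing benchmarks mainly target single-turn, low-complexity tasks and cannot assess privacy preservation in realistic settings. The key contribution is MAGPIE, a benchmark of 158 real-life high-stakes scenarios across 15 domains, designed so that completely excluding private data impedes task completion while unrestricted information sharing could cause substantial losses. Using this benchmark, the paper evaluates how well current state-of-the-art LLMs understand contextually private data and collaborate without violating user privacy, revealing notable shortcomings in existing models.

Link: https://arxiv.org/abs/2506.20737
Authors: Gurusha Juneja, Alon Albalak, Wenyue Hua, William Yang Wang
Affiliations: University of California, Santa Barbara
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)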

Abstract:The proliferation of LLM-based agents has led to increasing deployment of inter-agent collaboration for tasks like scheduling, negotiation, and resource allocation. In such systems, privacy is critical, as agents often access proprietary tools and domain-specific databases requiring strict confidentiality. This paper examines whether LLM-based agents demonstrate an understanding of contextual privacy and whether, when instructed, they preserve inference-time user privacy in non-adversarial multi-turn conversations. Existing benchmarks to evaluate contextual privacy in LLM-agents primarily assess single-turn, low-complexity tasks where private information can be easily excluded. We first present a benchmark, MAGPIE, comprising 158 real-life high-stakes scenarios across 15 domains. These scenarios are designed such that complete exclusion of private data impedes task completion yet unrestricted information sharing could lead to substantial losses. We then evaluate the current state-of-the-art LLMs on (a) their understanding of contextually private data and (b) their ability to collaborate without violating user privacy. Empirical experiments demonstrate that current models, including GPT-4o and Claude-2.7-Sonnet, lack robust understanding of contextual privacy, misclassifying private data as shareable 25.2% and 43.6% of the time. In multi-turn conversations, these models disclose private information in 59.9% and 50.5% of cases even under explicit privacy instructions. Furthermore, multi-agent systems fail to complete tasks in 71% of scenarios. These results underscore that current models are not aligned towards both contextual privacy preservation and collaborative task-solving.

[NLP-55] CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment

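[Quick read]: This paper addresses the challenges of automatic fluency assessment for non-native speakers, in particular capturing speech rhythm, pauses, and disfluencies. The key to the solution is a chunk-based approach that combines self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) with a hierarchical CNN-BiLSTM framework, using speech chunking and feature fusion for finer-grained fluency analysis.

Link: https://arxiv.org/abs/2506.20243
Authors: Papa Séga Wade, Mihai Andries, Ioannis Kanellos, Thierry Moudenc
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, accepted for presentation at EUSIPCO 2025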

Abstract:Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing this http URL-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
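
The "learnable weighted mechanism" for fusing SSL embeddings could look something like the sketch below. It assumes the weights are a softmax over one scalar per encoder, which the abstract does not state; the embedding values and the `fuse` helper are hypothetical, and no training loop is shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, fusion_logits):
    """Weighted sum of equal-length embeddings, weights = softmax(fusion_logits)."""
    weights = softmax(fusion_logits)
    dim = len(embeddings[0])
    return [sum(w * emb[i] for w, emb in zip(weights, embeddings))
            for i in range(dim)]

# Hypothetical 4-dimensional chunk embeddings from three SSL encoders.
wav2vec2 = [1.0, 0.0, 2.0, 0.0]
hubert   = [0.0, 1.0, 2.0, 0.0]
wavlm    = [0.0, 0.0, 2.0, 3.0]

# Equal logits give equal weights; training would adjust these values.
fused = fuse([wav2vec2, hubert, wavlm], fusion_logits=[0.0, 0.0, 0.0])
```

With equal logits each encoder contributes one third; raising one logit during training would shift the fused vector toward that encoder's embedding.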

[NLP-56] Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings

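[Quick read]: This paper addresses Arabic dialect recognition in speech technology, which is difficult because of Arabic's linguistic diversity and the scarcity of large annotated datasets, especially for underrepresented dialects. The key to the solution is a hybrid modeling strategy that combines classical signal processing with deep learning architectures for low-resource scenarios. Two hybrid models are developed and evaluated: Mel-Frequency Cepstral Coefficients (MFCC) with a Convolutional Neural Network (CNN), and Discrete Wavelet Transform (DWT) features with a Recurrent Neural Network (RNN). MFCC + CNN performs better, supporting the effectiveness of pairing spectral features with convolutional models.

Link: https://arxiv.org/abs/2506.21386
Authors: Ghazal Al-Shwayyat, Omer Nezih Gerek
Affiliations: Eskisehir Technical University; Department of Electrical and Electronics Engineering
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD); Signal Processing (eess.SP)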

Abstract:Arabic dialect recognition presents a significant challenge in speech technology due to the linguistic diversity of Arabic and the scarcity of large annotated datasets, particularly for underrepresented dialects. This research investigates hybrid modeling strategies that integrate classical signal processing techniques with deep learning architectures to address this problem in low-resource scenarios. Two hybrid models were developed and evaluated: (1) Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered subset of the Common Voice Arabic dataset, with dialect labels assigned based on speaker metadata. Experimental results demonstrate that the MFCC + CNN architecture achieved superior performance, with an accuracy of 91.2% and strong precision, recall, and F1-scores, significantly outperforming the Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These findings highlight the effectiveness of leveraging spectral features with convolutional models for Arabic dialect recognition, especially when working with limited labeled data. The study also identifies limitations related to dataset size, potential regional overlaps in labeling, and model optimization, providing a roadmap for future research. Recommendations for further improvement include the adoption of larger annotated corpora, integration of self-supervised learning techniques, and exploration of advanced neural architectures such as Transformers. Overall, this research establishes a strong baseline for future developments in Arabic dialect recognition within resource-constrained environments.

Computer Vision

[CV-0] Whole-Body Conditioned Egocentric Video Prediction

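[Quick read]: This paper addresses video prediction that models complex real-world environments and embodied agent behavior from a human perspective. The core challenge is to use past video and an action represented by relative 3D body pose to generate physically plausible, context-aware egocentric video. The key to the solution is conditioning on kinematic pose trajectories structured by the body's joint hierarchy, so that the model learns how physical human actions shape the environment from a first-person view. An auto-regressive conditional diffusion transformer is trained on the Nymeria dataset, supporting a comprehensive evaluation of embodied prediction and control abilities.

Link: https://arxiv.org/abs/2506.21552
Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
Affiliations: UC Berkeley; FAIR, Meta; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: Project Page: this https URL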

Abstract:We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model’s embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.

[CV-1] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

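[Quick read]: This paper addresses 3D anomaly detection and segmentation (ADS) that integrates multiview and multimodal information, in particular the generalization challenge of single-instance anomaly detection in manufacturing. The key to the solution is the SiM3D benchmark, the first multiview and multimodal dataset for this task, which includes high-resolution images, point clouds, and CAD models, and provides manually annotated anomalous test samples to support research on generalizing from synthetic to real data.

Link: https://arxiv.org/abs/2506.21549
Authors: Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
Affiliations: CVLab, University of Bologna, Italy; SACMI Imola, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)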

Abstract:We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.

[CV-2] SAM4D: Segment Anything in Camera and LiDAR Streams ICCV2025

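[Quick read]: This paper addresses promptable segmentation of multi-modal (camera and LiDAR) data in autonomous driving, along with the annotation bottleneck caused by manual labeling. The key to the solution is SAM4D, which introduces Unified Multi-modal Positional Encoding (UMPE) to align camera and LiDAR features in a shared 3D space, enabling cross-modal prompting and interaction. It also adds Motion-aware Cross-modal Memory Attention (MCMA), which uses ego-motion compensation to improve temporal consistency and long-horizon feature retrieval for robust segmentation in dynamic scenes. To overcome annotation limits, a multi-modal automated data engine generates pseudo-labels efficiently while preserving semantic fidelity.

Link: https://arxiv.org/abs/2506.21547
Authors: Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
Affiliations: CaiNiao Inc., Alibaba Group; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICCV2025, Project Page: this https URL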

Abstract:We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.

[CV-3] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion

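[Quick read]: This paper addresses the inconsistent views and degraded 3D reconstruction quality that real-world occlusions cause when reconstructing 3D objects from a single image. The key solution is an end-to-end occlusion-aware multi-view generation framework that directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without prior inpainting or manual annotation. A self-supervised training pipeline built on the Pix2Gestalt dataset uses occluded-unoccluded image pairs and pseudo-ground-truth views to teach structure-aware completion and view consistency, and the view synthesis model is fine-tuned without architectural changes to learn completion and multi-view generation jointly.

Link: https://arxiv.org/abs/2506.21544
Authors: Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, Rongrong Ji
Affiliations: Xiamen University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)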

Abstract:Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at this https URL.

[CV-4] StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning ICCV2025

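[Quick read]: This paper addresses two key problems of Mamba-based methods in point cloud representation learning: SSM processing destroys the adjacency of 3D points, and long-sequence memory is not retained as input length grows. The key to the solution is StruMamba3D, which designs spatial states as proxies for spatial dependencies among points, enhances the SSM with a state-wise update strategy and a lightweight convolution to facilitate interaction between spatial states, and adopts a sequence length-adaptive strategy to reduce the sensitivity of pre-trained models to varying input lengths.

Link: https://arxiv.org/abs/2506.21541
Authors: Chuxin Wang, Yixin Zha, Wenfei Yang, Tianzhu Zhang
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025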

Abstract:Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.

[CV-5] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval ACL2025

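[Quick read]: This paper addresses cross-modal image-text retrieval, where content from different modalities can be associated in diverse ways and traditional single-vector embeddings struggle to capture nuanced, varied cross-modal relationships. Set-based representations improve expressiveness by using multiple embeddings per sample, but still suffer from sparse supervision and set collapse. The key solution is Maximal Pair Assignment Similarity, which optimizes one-to-one matching between embedding sets to preserve semantic diversity within each set, combined with a Global Discriminative Loss and an Intra-Set Divergence Loss to further strengthen the representations.

Link: https://arxiv.org/abs/2506.21538
Authors: Hani Alomari, Anushka Sivakumar, Andrew Zhang, Chris Thomas
Affiliations: Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Main)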

Abstract:Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limits their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets which preserve semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.
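
One way to picture a similarity based on one-to-one matching between two embedding sets is the brute-force sketch below. The paper's exact formulation of Maximal Pair Assignment Similarity may differ; the mean-cosine scoring, the toy vectors, and the function names here are assumptions, and enumerating permutations is only practical for very small sets (an assignment solver such as the Hungarian algorithm would normally be used).

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_assignment_similarity(set_a, set_b):
    """Best mean cosine similarity over all one-to-one pairings of two equal-size sets."""
    n = len(set_a)
    best = -math.inf
    for perm in permutations(range(n)):
        score = sum(cosine(set_a[i], set_b[j]) for i, j in enumerate(perm)) / n
        best = max(best, score)
    return best

# Hypothetical 2-element embedding sets for one image and one caption.
image_set = [[1.0, 0.0], [0.0, 1.0]]
text_set  = [[0.0, 2.0], [3.0, 0.0]]
sim = max_assignment_similarity(image_set, text_set)

# A collapsed set (identical embeddings) cannot match both text embeddings.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
sim_collapsed = max_assignment_similarity(collapsed, text_set)
```

In this toy case the diverse set can be paired perfectly, while the collapsed set scores lower because each embedding may be used only once, which illustrates why one-to-one matching rewards diversity within a set.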

[CV-6] WAFT: Warping-Alone Field Transforms for Optical Flow

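[Quick read]: This paper addresses the trade-off between accuracy and computational efficiency in optical flow estimation, in particular the high memory cost and complexity of constructing cost volumes. The key to the solution is Warping-Alone Field Transforms (WAFT), which replaces the cost volume with high-resolution warping, achieving better accuracy at lower memory cost and challenging the conventional wisdom that cost volumes are necessary for strong performance.

Link: https://arxiv.org/abs/2506.21526
Authors: Yihan Wang, Jia Deng
Affiliations: Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV)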

Abstract:We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at this https URL.

[CV-7] MADrive: Memory-Augmented Driving Scene Modeling

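[Quick read]: This paper addresses the difficulty existing scene reconstruction methods have in supporting photorealistic synthesis of significantly altered or novel driving scenarios in autonomous driving (AD) environments. The key to the solution is the MADrive framework, which extends existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets from a large-scale external memory bank. A retrieval module finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting, giving complete multi-view representations of vehicles.

Link: https://arxiv.org/abs/2506.21520
Authors: Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk
Affiliations: Yandex; Yandex Research; HSE University; Skoltech
Categories: Computer Vision and Pattern Recognition (cs.CV)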

Abstract:Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ~70K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: this https URL

[CV-8] G2D: Boosting Multimodal Learning with Gradient-Guided Distillation ICCV2025

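[Quick read]: This paper addresses modality imbalance in multimodal learning, where optimization favors strong modalities and weak modalities are underrepresented. The key to the solution is Gradient-Guided Distillation (G^2D), a knowledge distillation framework whose custom loss function fuses unimodal and multimodal objectives, and which introduces dynamic sequential modality prioritization (SMP) so that each modality gets to lead the learning process, preventing stronger modalities from overshadowing weaker ones.

Link: https://arxiv.org/abs/2506.21514
Authors: Mohammed Rakib, Arunkumar Bagavathi
Affiliations: Oklahoma State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025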

Abstract:Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G^2D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G^2D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G^2D on multiple real-world datasets and show that G^2D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at this https URL.

[CV-9] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation ICCV2025

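[Quick read]: This paper addresses high-quality, generalizable speech-driven 3D talking head generation, in particular the challenges of large head rotations and out-of-distribution (OOD) audio, and the time-consuming identity-specific training required by previous methods. The key to the solution is combining generalizable priors with identity-specific adaptation: a two-stage Prior-Adaptation training strategy learns generalizable head priors and then adapts to individual characteristics, with Audio-Expression and Expression-Visual priors capturing universal lip movement patterns and the general distribution of head textures, improving 3D consistency, lip-sync accuracy, and rendering quality.

Link: https://arxiv.org/abs/2506.21513
Authors: Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Hui Tian
Affiliations: Beijing University of Posts and Telecommunications; Kuaishou Technology; Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project page: this https URL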

Abstract:Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.

[CV-10] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

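[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), that is, generated text that contradicts the visual input. Existing training-free decoding strategies have key limitations: static constraints that do not adapt to semantic drift, inefficiency from multiple forward passes, and loss of detail from overly rigid intervention rules. The proposed solution, Dynamic Logits Calibration (DLC), uses CLIP step by step during decoding to assess semantic alignment between the input image and the generated text sequence, adjusts output logits dynamically via the Relative Visual Advantage (RVA) of candidate tokens, and applies adaptive weighting informed by a real-time context alignment score to balance visual guidance and text quality, reducing hallucinations while keeping inference efficient.

Link: https://arxiv.org/abs/2506.21509
Authors: Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
Affiliations: Zhejiang University; Zhejiang University of Technology; Fudan University; Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)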

Abstract:Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination: the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC employs CLIP step by step to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs in practical use. Code will be released on GitHub.
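
The logit adjustment described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function names, the additive update rule, the fixed `weight`, and the toy alignment scores are all assumptions made for illustration. In the paper the scores come from CLIP and the weight is adapted at each step.

```python
import math

def calibrate_logits(logits, alignment_scores, baseline, weight):
    """Shift each candidate logit by its advantage over a contextual baseline.

    logits           -- raw next-token logits for the candidate tokens
    alignment_scores -- hypothetical image-text alignment score per candidate
    baseline         -- alignment score of the text generated so far
    weight           -- strength of the visual guidance (assumed fixed here)
    """
    return [l + weight * (s - baseline) for l, s in zip(logits, alignment_scores)]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the second candidate is better grounded in the image.
logits = [2.0, 1.5, 0.5]
scores = [0.20, 0.30, 0.25]
calibrated = calibrate_logits(logits, scores, baseline=0.25, weight=10.0)
probs = softmax(calibrated)
```

In this toy case the second candidate overtakes the first after calibration, because its alignment score is above the baseline while the first one falls below it.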

[CV-11] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection

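Quick read: This paper addresses inaccurate confidence estimation in object detectors and their inability to quantify uncertainty in regions without detections (confidence calibration and uncertainty quantification outside detected bounding boxes). The key idea is an object detection model grounded in spatial statistics, in which bounding box data are modeled as a marked point process. This statistical framework enables likelihood-based training and yields well-defined confidence estimates for whether a region is drivable, i.e., free of objects.

Link: https://arxiv.org/abs/2506.21486
Authors: Tobias J. Riedlinger,Kira Maag,Hanno Gottschalk
Affiliations: Technical University of Berlin; Heinrich-Heine-University Düsseldorf
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR)
Comments: 15 pages, 4 figures, 3 tables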

Abstract:Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model’s uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.
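
To see how a point process can assign a probability to empty space, consider the simplest special case. The sketch below assumes a plain Poisson point process, for which the probability of no points in a region A is exp(-∫_A λ). The paper's conditional marked point process is more general, and the grid discretization, function name, and intensity values here are illustrative assumptions.

```python
import math

def empty_region_probability(intensity_grid, cell_area, region):
    """Probability that no object center falls inside `region`.

    Assumes a Poisson point process (an assumption, not the paper's model).
    intensity_grid -- 2D list: expected object centers per unit area in each cell
    cell_area      -- area of one grid cell
    region         -- iterable of (row, col) cells making up the region
    """
    expected_count = sum(intensity_grid[r][c] * cell_area for r, c in region)
    return math.exp(-expected_count)

# Toy intensity map: the top row is nearly empty, the bottom row is busy.
grid = [[0.0, 0.1],
        [0.5, 2.0]]
p_free = empty_region_probability(grid, cell_area=1.0, region=[(0, 0), (0, 1)])
p_busy = empty_region_probability(grid, cell_area=1.0, region=[(1, 0), (1, 1)])
```

A region with low predicted intensity gets a probability of being free close to 1, and a region with high intensity gets a value close to 0. This is the kind of per-region statement a box-level confidence score cannot make.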

[CV-12] TITAN: Query-Token based Domain Adaptive Adversarial Learning ICCV2025

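Quick read: This paper tackles source-free domain adaptive object detection (SF-DAOD), where the model must adapt to an unlabeled target domain without access to source data. Most existing methods use self-supervision in a student-teacher (ST) framework, fine-tuning on pseudo-labels generated by a source-pretrained model, but high noise in the pseudo-labels (caused by domain bias, discrepancies, and significant domain shift) sharply degrades the student model. The proposed Target-based Iterative Query-Token Adversarial Network (TITAN) splits the target images into a source-similar (easy) subset and a source-dissimilar (hard) subset using a variance estimation strategy, which yields more reliable pseudo-labels, and adds query-token-based adversarial modules to reduce the domain gap between feature representations.

Link: https://arxiv.org/abs/2506.21484
Authors: Tajamul Ashraf,Janibul Bashir
Affiliations: MBZUAI; National Institute of Technology Srinagar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025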

Abstract:We focus on the source-free domain adaptive object detection (SF-DAOD) problem when source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model, primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those similar to the source (easy) and those dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. Also, we incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively.
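
The easy/hard split can be sketched as follows. The abstract only states that a variance estimate is used and that higher detection variance indicates greater similarity to the source domain. Using the variance of per-image confidence scores and a median threshold are assumptions made here for illustration, as are all names in the code.

```python
from statistics import median, pvariance

def partition_by_variance(detections):
    """Split target images into easy (source-like) and hard subsets.

    detections -- dict mapping image id to a list of detection confidence
                  scores from a source-pretrained detector (hypothetical input)
    Returns (easy_ids, hard_ids), each sorted by image id.
    """
    variances = {
        image_id: (pvariance(scores) if scores else 0.0)
        for image_id, scores in detections.items()
    }
    threshold = median(variances.values())  # assumed split rule
    easy = sorted(i for i, v in variances.items() if v >= threshold)
    hard = sorted(i for i, v in variances.items() if v < threshold)
    return easy, hard

detections = {
    "a": [0.9, 0.1, 0.5],  # high variance
    "b": [0.5, 0.5],       # zero variance
    "c": [0.8, 0.6],       # moderate variance
    "d": [],               # no detections
}
easy, hard = partition_by_variance(detections)
```

With these toy scores, images "a" and "c" land in the easy subset and "b" and "d" in the hard subset.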

[CV-13] Global and Local Entailment Learning for Natural World Imagery ICCV2025

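Quick read: This paper addresses learning the hierarchical structure of data in vision-language models. Existing methods approach this through entailment learning but do not explicitly model the transitivity of entailment, and so fail to establish the relationship between order and semantics in the representation space. The proposed Radial Cross-Modal Embeddings (RCME) framework explicitly models transitivity-enforced entailment and optimizes for the partial order of concepts in vision-language models. With it, the authors build a hierarchical vision-language foundation model that can represent the hierarchy of the Tree of Life.

Link: https://arxiv.org/abs/2506.21476
Authors: Srikumar Sastry,Aayush Dhakal,Eric Xing,Subash Khanal,Nathan Jacobs
Affiliations: Washington University in St. Louis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025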

Abstract:Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at this https URL.

[CV-14] Evaluation of Traffic Signals for Daily Traffic Pattern

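Quick read: This paper addresses how to optimize signal timing for different traffic flow patterns in traffic signal design, in order to improve intersection throughput and reduce congestion. It proposes three configuration methods (dynamic, static, and hybrid), uses a vision-based tracking system to estimate the turning movement counts (TMC) of six intersections, and evaluates the signals in simulation software. By analyzing traffic distributions across time periods and zones, it shows that the hybrid method, which switches between modes for peak and off-peak hours, adapts well and manages traffic flow more effectively.

Link: https://arxiv.org/abs/2506.21469
Authors: Mohammad Shokrolah Shirazi,Hung-Fu Chang
Affiliations: Marian University; University of Indianapolis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: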

Abstract:The turning movement count data is crucial for traffic signal design, intersection geometry planning, traffic flow, and congestion analysis. This work proposes three methods called dynamic, static, and hybrid configuration for TMC-based traffic signals. A vision-based tracking system is developed to estimate the TMC of six intersections in Las Vegas using traffic cameras. The intersection design, route (e.g. vehicle movement directions), and signal configuration files with compatible formats are synthesized and imported into Simulation of Urban MObility for signal evaluation with realistic data. The initial experimental results based on estimated waiting times indicate that the cycle time of 90 and 120 seconds works best for all intersections. In addition, four intersections show better performance for dynamic signal timing configuration, and the other two with lower performance have a lower ratio of total vehicle count to total lanes of the intersection leg. Since daily traffic flow often exhibits a bimodal pattern, we propose a hybrid signal method that switches between dynamic and static methods, adapting to peak and off-peak traffic conditions for improved flow management. So, a built-in traffic generator module creates vehicle routes for 4 hours, including peak hours, and a signal design module produces signal schedule cycles according to static, dynamic, and hybrid methods. Vehicle count distributions are weighted differently for each zone (i.e., West, North, East, South) to generate diverse traffic patterns. The extended experimental results for 6 intersections with 4 hours of simulation time imply that zone-based traffic pattern distributions affect signal design selection. Although the static method works great for evenly zone-based traffic distribution, the hybrid method works well for highly weighted traffic at intersection pairs of the West-East and North-South zones.

[CV-15] Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency

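Quick read: This paper addresses the oversaturation and unrealistic artifacts that high guidance scales cause when Classifier-free Guidance (CFG) is used in generative diffusion models. Taking a low-frequency signal perspective, it identifies the accumulation of redundant information as the core cause and proposes Low-frequency Improved Classifier-free Guidance (LF-CFG), which locates redundant information with an adaptive threshold-based measurement, sets the threshold from the change rate of low-frequency information, and down-weights the redundant information to reduce its impact.

Link: https://arxiv.org/abs/2506.21452
Authors: Kaiyu Song,Hanjiang Lai
Affiliations: Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: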

Abstract:Classifier-free guidance (CFG) succeeds in conditional diffusion models that use a guidance scale to balance the influence of conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, the high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
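
For context, the standard CFG combination that the paper starts from can be written in a few lines. The sketch below shows only the usual rule, eps = eps_uncond + w * (eps_cond - eps_uncond), on plain lists. The low-frequency thresholding and down-weighting of LF-CFG are not reproduced, and the function name and toy values are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance on two noise predictions.

    eps_uncond     -- unconditional noise prediction (list of floats)
    eps_cond       -- conditional noise prediction (list of floats)
    guidance_scale -- w; 0 gives the unconditional term, 1 the conditional term
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.5]
at_one = cfg_combine(eps_uncond, eps_cond, 1.0)
at_high = cfg_combine(eps_uncond, eps_cond, 7.5)
```

At a scale of 7.5 the combined prediction is an extrapolation well beyond the conditional prediction itself. The paper links such high scales to oversaturation and targets the low-frequency part of the signal.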

[CV-16] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario

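Quick read: This paper addresses insufficient emergency response capability in underground mining operations, in particular the reliability of miner detection. The key contribution is a thermal imaging dataset built specifically for training and validating miner detection systems, supporting the development of thermal-based automated miner detection algorithms and providing a foundation for future emergency applications.

Link: https://arxiv.org/abs/2506.21451
Authors: Cyrus Addy,Ajay Kumar Gurumadaiah,Yixiang Gao,Kwame Awuah-Offei
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: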

Abstract:Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

[CV-17] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models

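Quick read: This paper addresses placing objects at a precise location and orientation in image editing, a task that typically requires carefully crafted inpainting masks or prompts. The key is a carefully designed visual map that, combined with coarse object masks, suffices for high-quality object placement. The method builds a conditioning signal that resolves ambiguities while staying flexible enough to allow changes in shape or object orientation, and it builds on an inpainting model so that the background stays intact, unlike methods that model objects and background jointly.

Link: https://arxiv.org/abs/2506.21446
Authors: Mohamed Omran,Dimitris Kalatzis,Jens Petersen,Amirhossein Habibian,Auke Wiggers
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: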

Abstract:Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.

[CV-18] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation

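Quick read: This paper addresses the classification of atypical mitotic figures (AMFs) in breast cancer, which is hard because of low prevalence, subtle morphological differences, low inter-rater agreement among pathologists, and class imbalance in datasets. The key is transfer learning and model fine-tuning, in particular low-rank adaptation (LoRA) of the Virchow line of foundation models, which achieves high balanced accuracy on both in-domain and out-of-domain datasets, indicating that modern model adaptation strategies can effectively improve AMF classification.

Link: https://arxiv.org/abs/2506.21444
Authors: Sweta Banerjee,Viktoria Weiss,Taryn A. Donovan,Rutger A. Fick,Thomas Conrad,Jonas Ammeling,Nils Porsche,Robert Klopfleisch,Christopher Kaltenecker,Katharina Breininger,Marc Aubreville,Christof A. Bertram
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: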

Abstract:Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow-line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make available all code and data used in this paper in this GitHub repository: this https URL.
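
Because the abstract reports balanced accuracy on class-imbalanced data, it helps to recall how that metric differs from plain accuracy. The sketch below uses the usual definition (the mean of per-class recall). The labels are made-up toy data, not results from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which weights rare and common classes equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Toy imbalanced set: 8 normal mitoses (0) and 2 atypical ones (1).
y_true = [0] * 8 + [1] * 2
always_normal = [0] * 10            # classifier that never predicts "atypical"
one_hit = [0] * 8 + [1, 0]          # finds one of the two atypical figures

plain_accuracy = sum(t == p for t, p in zip(y_true, always_normal)) / len(y_true)
bal_always_normal = balanced_accuracy(y_true, always_normal)
bal_one_hit = balanced_accuracy(y_true, one_hit)
```

A classifier that always answers "normal" reaches 0.8 plain accuracy on this toy set but only 0.5 balanced accuracy, which is why balanced accuracy is the more informative number for rare classes.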

[CV-19] HyperSORT: Self-Organising Robust Training with hyper-networks MICCAI2025

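Quick read: This paper addresses heterogeneous biases in medical imaging datasets, including erroneous labels and inconsistent annotation styles, which can hurt the performance of deep segmentation networks. The proposed HyperSORT framework uses a hyper-network to predict UNet parameters from latent vectors representing image and annotation variability. The hyper-network parameters and the collection of latent vectors, one per training sample, are learned jointly, so the framework learns a complex distribution of UNet parameters in which low-density regions capture noise-specific patterns while high-density regions robustly segment organs in differentiated but meaningful ways.

Link: https://arxiv.org/abs/2506.21430
Authors: Samuel Joutard,Marijn Stollenga,Marc Balle Sanchez,Mohammad Farid Azampour,Raphael Prevost
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025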

Abstract:Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets’ parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: this https URL

[CV-20] EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting

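Quick read: This paper addresses the degraded performance of 3D Gaussian Splatting (3DGS) based simultaneous localization and mapping (SLAM) in endoscopic scenes, caused by photometric inconsistencies from non-Lambertian surfaces and dynamic disturbances from breathing motion. The key is an optical flow loss used as a geometric constraint on both the 3D scene structure and the camera motion, a depth regularization strategy to mitigate photometric inconsistencies, and an improved 3DGS refinement strategy that focuses on keyframe viewpoints with poor rendering quality, improving the SLAM system's scene representation.

Link: https://arxiv.org/abs/2506.21420
Authors: Taoyu Wu,Yiyi Miao,Zhuoxiao Li,Haocheng Zhao,Kang Dang,Jionglong Su,Limin Yu,Haoang Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: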

Abstract:Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods rely only on appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing degrade the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes with suboptimal rendering quality, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.

[CV-21] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

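Quick read: This paper addresses the loss of editability and coherence in Diffusion Transformers (DiTs) when text-to-image generation requires fine-grained control over the identity and semantic attributes (such as pose, style, and lighting) of multiple subjects. Existing methods often introduce artifacts or suffer from attribute entanglement. The proposed multi-subject controlled generation model, XVerse, converts reference images into offsets for token-specific text-stream modulation, giving precise and independent control over specific subjects without disturbing image latents or features, and thereby enabling high-fidelity, editable multi-subject image synthesis.

Link: https://arxiv.org/abs/2506.21416
Authors: Bowen Chen,Mengyi Zhao,Haomiao Sun,Li Chen,Xu Wang,Kang Du,Xinglong Wu
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL; GitHub link: this https URL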

Abstract:Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

[CV-22] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction ICCV2025

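Quick read: This paper addresses the reconstruction of 3D parametric curves directly from multi-view edge maps. In contrast to existing two-stage methods (an "edge point cloud reconstruction and parametric curve fitting" pipeline), it proposes a one-stage end-to-end framework that optimizes 3D parametric curves directly from 2D edge maps, avoiding the error accumulation caused by disconnected stages. Because parametric curves are ill-suited to rendering-based multi-view optimization, the authors introduce a bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components, forming a curve-aware Gaussian representation (CurveGaussian) that enables differentiable rendering and direct optimization from multi-view evidence. The paper also proposes a dynamically adaptive topology optimization framework that refines curve structures during training through linearization, merging, splitting, and pruning.

Link: https://arxiv.org/abs/2506.21401
Authors: Zhirui Gao,Renjiao Yi,Yaqiao Dai,Xuening Zhu,Wei Chen,Chenyang Zhu,Kai Xu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL. Accepted by ICCV 2025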

Abstract:This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method’s superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.

[CV-23] FastRef:Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection

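Quick read: This paper addresses insufficient prototype representativeness caused by data scarcity in few-shot industrial anomaly detection (FS-IAD). Existing methods focus on deriving prototypes from limited normal samples but do not systematically incorporate query image statistics to improve prototype representativeness. The proposed FastRef refines prototypes through an iterative two-stage process: the first stage transfers characteristics of the query features to the prototypes via an optimizable transformation matrix; the second stage suppresses anomalies through prototype alignment, using optimal transport (OT) to measure and minimize the distance for non-Gaussian sampled features, which makes the prototypes more robust and discriminative.

Link: https://arxiv.org/abs/2506.21398
Authors: Long Tian,Yufei Li,Yuyang Dai,Wenchao Chen,Xiyang Liu,Bo Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures, 6 tables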

Abstract:Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on deriving prototypes from limited normal samples, they typically neglect to systematically incorporate query image statistics to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process: (1) characteristic transfer from query features to prototypes via an optimizable transformation matrix, and (2) anomaly suppression through prototype alignment. The characteristic transfer is achieved through linear reconstruction of query features from prototypes, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable. Therefore, we employ optimal transport (OT) for non-Gaussian sampled features to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with three competitive prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and computational efficiency of our approach under 1/2/4-shots.

[CV-24] GenFlow: Interactive Modular System for Image Generation

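Quick read: This paper addresses the high technical barrier that limits the practical potential of generative art. The proposed GenFlow is a modular framework with a node-based editor for seamless workflow customization and an intelligent assistant powered by natural language processing, turning complex workflow creation into an intuitive experience that lowers technical barriers and improves creative efficiency.

Link: https://arxiv.org/abs/2506.21369
Authors: Duc-Hung Nguyen,Huu-Phuc Huynh,Minh-Triet Tran,Trung-Nghia Le
Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: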

Abstract:Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow’s ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

[CV-25] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection ICCV2025

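Quick read: This paper addresses two problems in image-to-point cloud registration: degraded matching caused by differences in feature channel attention, and redundant cross-modal correspondences caused by similar structures in the scene. It proposes a Channel Adaptive Adjustment Module (CAA) and a Global Optimal Selection Module (GOS): CAA enhances intra-modal features and suppresses cross-modal sensitivity to improve the feature representation, while GOS replaces local selection with global optimization to improve correspondence quality.

Link: https://arxiv.org/abs/2506.21364
Authors: Zhixin Cheng,Jiacheng Deng,Xinjun Li,Xiaotian Yin,Bohao Liao,Baoqun Yin,Wenfei Yang,Tianzhu Zhang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCV 2025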

Abstract:Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.

[CV-26] ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations

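Quick read: This paper addresses efficient 3D cuboid annotation of vehicles without expensive, carefully calibrated camera-LiDAR or stereo setups. The proposed ToosiCubix performs accurate 3D cuboid annotation using only monocular images and camera intrinsics: with a few user clicks (about 10 per vehicle) it estimates vehicle position, orientation, and dimensions accurately, combining geometric constraint optimization with probabilistic size priors to handle common ambiguities such as scale and unobserved dimensions.

Link: https://arxiv.org/abs/2506.21358
Authors: Behrooz Nasihatkon,Hossein Resani,Amirreza Mehrzadian
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: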

Abstract:Many existing methods for 3D cuboid annotation of vehicles rely on expensive and carefully calibrated camera-LiDAR or stereo setups, limiting their accessibility for large-scale data collection. We introduce ToosiCubix, a simple yet powerful approach for annotating ground-truth cuboids using only monocular images and intrinsic camera parameters. Our method requires only about 10 user clicks per vehicle, making it highly practical for adding 3D annotations to existing datasets originally collected without specialized equipment. By annotating specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, we accurately estimate each vehicle’s position, orientation, and dimensions up to a scale ambiguity (8 DoF). The geometric constraints are formulated as an optimization problem, which we solve using a coordinate descent strategy, alternating between Perspective-n-Points (PnP) and least-squares subproblems. To handle common ambiguities such as scale and unobserved dimensions, we incorporate probabilistic size priors, enabling 9 DoF cuboid placements. We validate our annotations against the KITTI and Cityscapes3D datasets, demonstrating that our method offers a cost-effective and scalable solution for high-quality 3D cuboid annotation.

[CV-27] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations

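[Quick read]: This paper addresses the lack of accurate and complete data for current 2D scene graphs. The key to the solution is CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. The paper also introduces two new concepts, parametric relations and proto-relations. The former enrich relations with additional parameters such as angles or distances for a finer-grained representation; the latter encode how hypothetical relations would form in a scene, which improves scene understanding and reasoning.

Link: https://arxiv.org/abs/2506.21357
Authors: Julian Lorenz,Mrunmai Phatak,Robin Schön,Katja Ludwig,Nico Hörmann,Annemarie Friedrich,Rainer Lienhart
Affiliations: University of Augsburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: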

Abstract:2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.
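
As a rough illustration of what a parametric relation might look like as a data structure, the sketch below attaches a distance and an angle to a scene-graph edge. The class name, the `next_to` predicate, the field names, and the 2D coordinates are all hypothetical; the CoPa-SG annotation format is not described in this digest.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ParametricRelation:
    """A scene-graph edge enriched with continuous parameters (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str
    params: dict = field(default_factory=dict)


def relate(subject, subject_xy, obj, obj_xy, predicate="next_to"):
    """Build a relation whose parameters are the distance and direction to obj."""
    dx = obj_xy[0] - subject_xy[0]
    dy = obj_xy[1] - subject_xy[1]
    return ParametricRelation(
        subject,
        predicate,
        obj,
        {
            "distance": math.hypot(dx, dy),
            "angle_deg": math.degrees(math.atan2(dy, dx)),
        },
    )


edge = relate("cup", (0.0, 0.0), "laptop", (3.0, 4.0))
print(edge)
```

A traditional scene graph would keep only the triple (`cup`, `next_to`, `laptop`); the extra parameters are what make the relation fine-grained.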

[CV-28] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

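[Quick read]: This paper addresses the shortcomings of current vision-language models (VLMs) in understanding the nuanced cinematography grammar embedded in films, a gap that limits fine-grained visual comprehension and the precision of AI-assisted video generation. The key to the solution is ShotBench, a comprehensive benchmark for cinematic language understanding, together with ShotQA, a large-scale multimodal dataset. On this basis the authors develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization; it significantly outperforms existing open-source and proprietary models.

Link: https://arxiv.org/abs/2506.21356
Authors: Hongbo Liu,Jingwen He,Yi Jin,Dian Zheng,Yuhao Dong,Fan Zhang,Ziqi Huang,Yinan He,Yangguang Li,Weichao Chen,Yu Qiao,Wanli Ouyang,Shengjie Zhao,Ziwei Liu
Affiliations: Tongji University; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: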

Abstract:Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.

[CV-29] Generalizable Neural Electromagnetic Inverse Scattering

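[Quick read]: This paper addresses Electromagnetic Inverse Scattering Problems (EISP), that is, reconstructing the relative permittivity from the scattered electromagnetic field. The problem is inherently ill-posed and highly nonlinear, which makes it challenging. The key to the solution is to revisit EISP from a physics-informed perspective and reformulate it as a two-stage inverse transmission-scattering process. This reveals the induced current as a generalizable intermediate representation that decouples the nonlinear scattering process from the ill-posed inverse problem. On this basis the paper proposes the first generalizable physics-driven framework for EISP, consisting of a current estimator and a permittivity solver. It supports end-to-end data-driven training and feed-forward prediction that generalizes to unseen data, while remaining strongly robust to transmitter sparsity.

Link: https://arxiv.org/abs/2506.21349
Authors: Yizhe Cheng,Chunxun Tian,Haoru Wang,Wentao Zhu,Xiaoxuan Ma,Yizhou Wang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: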

Abstract:Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from the scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two-stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered fields, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

[CV-30] PanSt3R: Multi-view Consistent Panoptic Segmentation ICCV2025

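[Quick read]: This paper addresses panoptic segmentation of 3D scenes, that is, segmenting and classifying object instances in a dense 3D reconstruction, especially when only unposed 2D images are available. Existing methods typically use off-the-shelf 2D panoptic segmentation models to obtain per-frame results and then optimize an implicit geometric representation (such as NeRF) to fuse the 2D predictions, which is likely suboptimal for a problem that is inherently 3D and multi-view. The key to the solution is PanSt3R, a unified and integrated approach that jointly predicts 3D geometry and multi-view panoptic segmentation in a single forward pass, removing the need for test-time optimization. The method builds on MUSt3R, adds semantic awareness and multi-view panoptic segmentation capabilities, and introduces a more principled post-processing procedure for multi-view segmentation, yielding efficient, scalable, and strong 3D panoptic segmentation.

Link: https://arxiv.org/abs/2506.21348
Authors: Lojze Zust,Yohann Cabon,Juliette Marrie,Leonid Antsfeld,Boris Chidlovskii,Jerome Revaud,Gabriela Csurka
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025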

Abstract:Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.

[CV-31] Automatic Reviewers Assignment to a Research Paper Based on Allied References and Publications Weight

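[Quick read]: This paper addresses how to efficiently select the most suitable referees to review a research paper. As new research fields emerge and the number of papers grows rapidly, manual referee selection struggles, because journals can usually assign only a small team of referees who are not experts in every area. The key to the proposed strategy is to analyze the paper's references, extract author information and research topic keywords, score and rank potential referees using their h-index, i10-index, and citation counts, automatically retrieve their email addresses, and exclude co-authors and colleagues, leaving the candidates most likely to be qualified.

Link: https://arxiv.org/abs/2506.21331
Authors: Tamim Al Mahmud,B M Mainul Hossain,Dilshad Ara
Affiliations: Green University of Bangladesh; University of Dhaka; Dhaka International University
Categories: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Conference Proceedings (5 Pages)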

Abstract:Every day, a vast stream of research documents is submitted to conferences, anthologies, journals, newsletters, annual reports, daily papers, and various periodicals. Many such publications use independent external specialists to review submissions. This process is called peer review, and the reviewers are called referees. However, it is not always possible to pick the best referee for reviewing. Moreover, new research fields are emerging in every sector, and the number of research papers is increasing dramatically. To review all these papers, every journal assigns a small team of referees who may not be experts in all areas. For example, a research paper in communication technology should be reviewed by an expert from the same field. Thus, efficiently selecting the best reviewer or referee for a research paper is a big challenge. In this research, we propose and implement a program that uses a new strategy to automatically select the best reviewers for a research paper. Every research paper contains references at the end, usually from the same area. First, we collect the references and count authors who have at least one paper in the references. Then, we automatically browse the web to extract research topic keywords. Next, we search for top researchers in the specific topic and count their h-index, i10-index, and citations for the first n authors. Afterward, we rank the top n authors based on a score and automatically browse their homepages to retrieve email addresses. We also check their co-authors and colleagues online and discard them from the list. The remaining top n authors, generally professors, are likely the best referees for reviewing the research paper.
Journal reference: 2018 4th International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 2018, pp. 1-5. Related DOI: https://doi.org/10.1109/CCAA.2018.8777730
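
The ranking and conflict-filtering steps described in the abstract could be sketched as follows. The abstract does not give the scoring formula, so the weighted sum, its weights, and the candidate data here are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    h_index: int
    i10_index: int
    citations: int


def score(candidate, w_h=1.0, w_i10=0.5, w_cit=0.01):
    """Hypothetical weighted score; the paper's actual formula is not given here."""
    return (
        w_h * candidate.h_index
        + w_i10 * candidate.i10_index
        + w_cit * candidate.citations
    )


def rank_reviewers(candidates, conflicts, top_n=3):
    """Drop co-authors and colleagues, then keep the top_n highest-scoring candidates."""
    eligible = [c for c in candidates if c.name not in conflicts]
    return sorted(eligible, key=score, reverse=True)[:top_n]


# Made-up candidates collected from a paper's reference list.
candidates = [
    Candidate("Reviewer A", h_index=40, i10_index=90, citations=8000),
    Candidate("Reviewer B", h_index=25, i10_index=50, citations=3000),
    Candidate("Reviewer C", h_index=60, i10_index=150, citations=20000),
    Candidate("Reviewer D", h_index=30, i10_index=70, citations=5000),
]
conflicts = {"Reviewer C"}  # e.g. a co-author of the submitting authors

shortlist = rank_reviewers(candidates, conflicts, top_n=2)
print([c.name for c in shortlist])
```

The highest-scoring candidate is excluded because of the conflict of interest, which mirrors the final filtering step in the abstract.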

[CV-32] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

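[Quick read]: This paper addresses surgical workflow analysis in robot-assisted surgery, in particular the inefficiency of Transformer-based methods on long surgical videos caused by their quadratic attention mechanism. The key to the solution is a hierarchical input-dependent state space model that exploits the linear scaling of state space models to make decisions on full-length videos while still capturing both local and global dynamics. The model has two core modules: a local-aggregation state space model block that captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video.

Link: https://arxiv.org/abs/2506.21330
Authors: Haoyang Wu,Tsun-Hsuan Wang,Mathias Lechner,Ramin Hasani,Jennifer A. Eckhoff,Paul Pak,Ozanan R. Meireles,Guy Rosman,Yutong Ban,Daniela Rus
Affiliations: UM-SJTU Joint Institute, Shanghai Jiao Tong University; Massachusetts Institute of Technology; Liquid AI; University Hospital of Cologne; Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: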

Abstract:Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
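
The linear-scaling property that the abstract relies on can be seen in a toy recurrence. The sketch below is a scalar-state simplification, not the paper's hierarchical model; `selective_scan`, the sigmoid gating, and the sample inputs are assumptions chosen for illustration. Each time step does a constant amount of work, so the cost grows linearly with video length instead of quadratically as in full attention.

```python
import math


def selective_scan(inputs, decay_fn, gain_fn):
    """Run h_t = a(x_t) * h_{t-1} + b(x_t) * x_t in a single pass over the sequence."""
    state = 0.0
    states = []
    for x in inputs:
        # The decay and gain depend on the current input ("input-dependent" SSM).
        state = decay_fn(x) * state + gain_fn(x) * x
        states.append(state)
    return states


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Fixed coefficients: an ordinary linear recurrence.
fixed = selective_scan([1.0, 1.0, 1.0], lambda x: 0.5, lambda x: 1.0)

# Input-dependent coefficients: the input decides how much history to keep.
adaptive = selective_scan(
    [0.2, -1.0, 3.0], sigmoid, lambda x: 1.0 - sigmoid(x)
)
print(fixed, adaptive)
```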

[CV-33] Multimodal LLMs for Visualization Reconstruction and Understanding

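[Quick read]: This paper addresses the shortcomings of current multimodal large models in understanding visualizations: they cannot accurately decode data-to-visual mapping rules or extract structured information. The key to the solution is to build a new dataset and train multimodal visualization LLMs designed specifically for visualization understanding. By combining chart images with their corresponding vectorized representations, encoding schemes, and data features, the approach achieves accurate reconstruction of visualization content and accurate data extraction.

Link: https://arxiv.org/abs/2506.21319
Authors: Can Liu,Chunlin Da,Xiaoxiao Long,Yuxiao Yang,Yu Zhang,Yong Wang
Affiliations: Nanyang Technological University; ByteDance Inc.; Nanjing University; Tsinghua University; University of Oxford
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: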

Abstract:Visualizations are crucial for data communication, yet understanding them requires comprehension of both visual elements and their underlying data relationships. Current multimodal large models, while effective in natural image understanding, struggle with visualization due to their inability to decode the data-to-visual mapping rules and extract structured information. To address these challenges, we present a novel dataset and train multimodal visualization LLMs specifically designed for understanding. Our approach combines chart images with their corresponding vectorized representations, encoding schemes, and data features. The proposed vector format enables compact and accurate reconstruction of visualization content. Experimental results demonstrate significant improvements in both data extraction accuracy and chart reconstruction quality.

[CV-34] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning

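[Quick read]: This paper addresses the poor performance of current Vision-Language Models (VLMs) on complex visual tasks involving human poses and actions, which stems from the lack of specialized vision-language instruction-following data. The key to the solution is to generate high-quality instruction-following data by integrating human keypoints with traditional visual features such as captions and bounding boxes, improving the model's understanding of human-centric scenes. The method builds a dataset of 200,328 samples, fine-tunes LLaVA-1.5-7B on it, and achieves significant gains on the Extended Human Pose and Action Understanding Benchmark (E-HPAUB).

Link: https://arxiv.org/abs/2506.21317
Authors: Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2409.09306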

Abstract:Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at this https URL.

[CV-35] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

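[Quick read]: This paper addresses visual grounding in text-rich document images, a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. The core of the solution is DrishtiKon, a multi-granular visual grounding framework that integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans precisely at block, line, word, and point levels, improving the interpretability and trustworthiness of VQA systems.

Link: https://arxiv.org/abs/2506.21316
Authors: Badri Vishal Kasuba,Parag Chaudhuri,Ganesh Ramakrishnan
Affiliations: IIT Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress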

Abstract:Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at this https URL.
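
To make line-level grounding concrete, here is a generic token-overlap matcher that maps an answer string to the OCR line most likely to contain it. It is not the paper's region matching algorithm, which this digest does not describe; the function names, the OCR line format, and the sample document are hypothetical.

```python
def token_overlap(answer, line_text):
    """Fraction of answer tokens that also appear in the OCR line (case-insensitive)."""
    answer_tokens = answer.lower().split()
    line_tokens = set(line_text.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(token in line_tokens for token in answer_tokens) / len(answer_tokens)


def ground_answer(answer, ocr_lines):
    """Return the OCR line, with its bounding box, that best matches the answer."""
    return max(ocr_lines, key=lambda line: token_overlap(answer, line["text"]))


# Made-up OCR output: one entry per text line with an (x0, y0, x1, y1) box.
ocr_lines = [
    {"text": "Circular No. 12/2023", "bbox": (40, 30, 420, 60)},
    {"text": "Date of issue: 5 March 2023", "bbox": (40, 70, 380, 100)},
    {"text": "Subject: revised fee structure", "bbox": (40, 110, 460, 140)},
]

best = ground_answer("5 March 2023", ocr_lines)
print(best["text"], best["bbox"])
```

Returning the bounding box alongside the text is what lets a VQA system show where in the document its answer came from.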

[CV-36] Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing

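[Quick read]: This paper addresses catastrophic forgetting in continual learning (CL) methods for remote sensing (RS) when new tasks arrive, especially when large numbers of labeled samples are unavailable and conventional methods struggle to improve robustness. The key to the solution is CoSMAE, a continual self-supervised learning method based on masked autoencoders (MAE) with two components: data mixup and model mixup knowledge distillation. Data mixup interpolates images from the current task with those from previous tasks to retain information about earlier data distributions. Model mixup knowledge distillation interpolates the weights of past and current models to form a teacher, so that knowledge from both is distilled jointly. Together they regularize the MAE at the data and model levels, improving generalization across tasks and reducing the risk of catastrophic forgetting.

Link: https://arxiv.org/abs/2506.21312
Authors: Lars Möllenbrok,Behnood Rasti,Begüm Demir
Affiliations: Technische Universität Berlin; BIFOLD - Berlin Institute for the Foundations of Learning and Data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Geoscience and Remote Sensing Letters. Our code is available at this https URL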

Abstract:The development of continual learning (CL) methods, which aim to learn new tasks in a sequential manner from the training data acquired continuously, has gained great attention in remote sensing (RS). The existing CL methods in RS, while learning new tasks, enhance robustness towards catastrophic forgetting. This is achieved by using a large number of labeled training samples, which is costly and not always feasible to gather in RS. To address this problem, we propose a novel continual self-supervised learning method in the context of masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two components: i) data mixup; and ii) model mixup knowledge distillation. Data mixup is associated with retaining information on previous data distributions by interpolating images from the current task with those from the previous tasks. Model mixup knowledge distillation is associated with distilling knowledge from past models and the current model simultaneously by interpolating their model weights to form a teacher for the knowledge distillation. The two components complement each other to regularize the MAE at the data and model levels to facilitate better generalization across tasks and reduce the risk of catastrophic forgetting. Experimental results show that CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art CL methods applied to MAE. Our code is publicly available at: this https URL.
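
The two interpolation steps named in the abstract can be sketched in a few lines. This is a minimal illustration under assumptions: the abstract states only that images and model weights are interpolated, so the convex-combination form, the coefficients `alpha` and `beta`, and the flat-list representation of images and weights are hypothetical choices.

```python
def data_mixup(current_image, previous_image, alpha):
    """Pixel-wise interpolation of a current-task image with a previous-task image."""
    return [
        alpha * cur + (1.0 - alpha) * prev
        for cur, prev in zip(current_image, previous_image)
    ]


def model_mixup(current_weights, past_weights, beta):
    """Interpolate two models' weights, parameter by parameter, to form a teacher."""
    return {
        name: [
            beta * cur + (1.0 - beta) * past
            for cur, past in zip(current_weights[name], past_weights[name])
        ]
        for name in current_weights
    }


# Toy "images" and "weights" as flat lists of floats.
mixed_image = data_mixup([1.0, 0.0], [0.0, 1.0], alpha=0.25)
teacher = model_mixup({"w": [2.0, 4.0]}, {"w": [0.0, 0.0]}, beta=0.5)
print(mixed_image, teacher)
```

In the method described above, the interpolated teacher would then supply distillation targets for the current model; that training loop is omitted here.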

[CV-37] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation MICCAI2025

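[Quick read]: This paper addresses the lack of consistency with surgical actions and phases, and the lack of fine-grained guidance, in surgical video generation; most existing methods are unconditional and cannot simulate realistic scenarios. The key to the solution is HieraSurg, a framework of two specialized diffusion models that uses surgical information in a hierarchy-aware way, including surgical phases, action triplets, and panoptic segmentation maps. It first predicts coarse-grained semantic changes, then a second-stage model adds fine-grained visual features to generate high-quality video, giving more precise texture rendering and integration of semantic information.

Link: https://arxiv.org/abs/2506.21287
Authors: Diego Biagini,Nassir Navab,Azade Farshad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025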

Abstract:Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

[CV-38] WordCon: Word-level Typography Control in Scene Text Rendering

[Quick read]: This paper addresses the persistent challenge of precise word-level typography control in generated images. The key to the solution is a word-level controlled scene text dataset together with the Text-Image Alignment (TIA) framework, which uses the cross-modal correspondence between text and local image regions provided by grounding models to improve Text-to-Image (T2I) model training. The paper also proposes WordCon, a hybrid Parameter-Efficient Fine-Tuning (PEFT) method that reparameterizes selected key parameters to improve efficiency and portability.

Link: https://arxiv.org/abs/2506.21276
Authors: Wenda Shi,Yiren Song,Zihan Rao,Dengming Zhang,Jiaming Liu,Xingxing Zou
Affiliations: The Hong Kong Polytechnic University; National University of Singapore; Chongqing University; Zhejiang University; Tiamat AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we newly construct a word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages cross-modal correspondence between text and local image regions provided by grounding models to enhance the Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selective key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, the masked loss at the latent level is applied to guide the model to concentrate on learning the text region in the image, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. The datasets and source code will be available for academic use.
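
The idea of a masked loss that concentrates learning on the text region can be illustrated with a masked mean squared error. This is a sketch under assumptions: the abstract does not specify the loss form or its normalization, so the squared-error choice, the division by mask area, and the toy values below are hypothetical.

```python
def masked_mse(prediction, target, mask):
    """Mean squared error computed only where mask == 1 (the text region)."""
    total = sum(
        m * (p - t) ** 2 for p, t, m in zip(prediction, target, mask)
    )
    area = sum(mask)
    # Normalizing by the mask area is an assumption, not a detail from the paper.
    return total / area if area else 0.0


# Toy flattened "latents": positions 1 and 3 belong to the text region.
prediction = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 0.0, 3.0, 0.0]
text_mask = [0, 1, 0, 1]

loss = masked_mse(prediction, target, text_mask)
print(loss)
```

Errors outside the mask contribute nothing, so gradients from such a loss would come only from the text pixels.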

[CV-39] FairyGen: Storied Cartoon Video from a Single Child-Drawn Character

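[Quick read]: This paper addresses the automatic generation of story-driven cartoon videos from a single child's drawing while faithfully preserving its unique artistic style. The key to the solution is to explicitly disentangle character modeling from stylized background generation and to introduce cinematic shot design to support expressive and coherent storytelling. The system uses a multimodal large language model to generate a structured storyboard, a style propagation adapter to keep visual consistency, a shot design module to increase visual diversity, and a two-stage motion customization adapter to produce physically plausible, personalized motion, resulting in animation that is stylistically faithful, narratively structured, and natural in motion.

Link: https://arxiv.org/abs/2506.21272
Authors: Jiayi Zheng,Xiaodong Cun
Affiliations: GVC Lab; Great Bay University; Guangdong
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Project Page: this https URL ; Code: this https URL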

Abstract:We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child’s drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character’s visual style and applies it to the background, faithfully retaining the character’s full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at this https URL

[CV-40] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

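[Quick read]: This paper addresses spatial-temporal consistency and the preservation of garment details across consecutive frames in video virtual try-on. Applying image-based try-on models frame by frame causes severe inconsistency, and diffusion-based video try-on methods that insert temporal attention still show consistency problems. The paper proposes ViTI (Video Try-on Inpainter), which formulates video virtual try-on as a conditional video inpainting task. The key is to start from a video generation problem rather than an image-based try-on problem, which gives better spatial-temporal consistency from the outset. Specifically, ViTI builds a video inpainting framework on a Diffusion Transformer with full 3D spatial-temporal attention, then progressively adapts it to garment inpainting through a set of masking strategies and multi-stage training, so that the garment region is inpainted with high-quality spatial-temporal consistency.

Link: https://arxiv.org/abs/2506.21270
Authors: Cheng Zou,Senlin Cheng,Bolei Xu,Dandan Zheng,Xiaobo Li,Jingdong Chen,Ming Yang
Affiliations: Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures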

Abstract:Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all the frames. Naively applying image-based try-on methods frame by frame gives poor results due to severe inconsistency. The few recent diffusion-based video try-on methods converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start with a video generation problem instead of an image-based try-on problem, which from the beginning has better spatial-temporal consistency. Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting, with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

[CV-41] DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic ICCV2025

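[Quick read]: This paper addresses the challenge that real-world object detection systems must keep learning new object categories while adapting to changing environments. The existing settings, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD), each cover only class increments or domain increments, and suffer from poor generalization or catastrophic forgetting respectively. The paper proposes the Dual Incremental Object Detection (DuIOD) setting and, as the key to the solution, the DuET framework, a Task Arithmetic-based model merging method that uses a Directional Consistency Loss to stabilize incremental learning and mitigate sign conflicts. DuET is detector-agnostic and allows models such as YOLO11 and RT-DETR to serve as real-time incremental detectors.

Link: https://arxiv.org/abs/2506.21260
Authors: Munish Monga,Vishal Chudasama,Pankaj Wasnik,Biplab Banerjee
Affiliations: Sony Research India; Indian Institute of Technology, Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025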

Abstract:Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET’s effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
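
Plain task arithmetic, the family of model-merging methods that DuET is described as building on, can be sketched as follows. This is the generic technique only: it does not implement the Directional Consistency Loss, and the weight layout, the `scale` coefficient, and the toy numbers are assumptions. The sign-conflict counter shows the kind of disagreement between task vectors that the abstract says the loss is meant to mitigate.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base, task_vectors, scale=1.0):
    """Add the scaled sum of all task vectors to the base weights."""
    merged = {}
    for k in base:
        summed = [sum(tv[k][i] for tv in task_vectors) for i in range(len(base[k]))]
        merged[k] = [b + scale * s for b, s in zip(base[k], summed)]
    return merged


def sign_conflicts(tv_a, tv_b):
    """Count coordinates where two task vectors push in opposite directions."""
    return sum(
        1 for k in tv_a for a, b in zip(tv_a[k], tv_b[k]) if a * b < 0
    )


# Toy weights: a base detector and two models fine-tuned on different tasks.
base = {"w": [1.0, 1.0, 1.0]}
model_a = {"w": [1.5, 0.5, 1.0]}
model_b = {"w": [1.25, 1.5, 1.0]}

tv_a, tv_b = task_vector(model_a, base), task_vector(model_b, base)
merged = merge(base, [tv_a, tv_b])
conflicts = sign_conflicts(tv_a, tv_b)
print(merged, conflicts)
```

In the second coordinate the two updates cancel out in the merged model, which is the failure mode that sign conflicts cause in naive merging.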

[CV-42] Temporal Rate Reduction Clustering for Human Motion Segmentation ICCV2025

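[Quick read]: This paper addresses Human Motion Segmentation (HMS), i.e., partitioning a video into non-overlapping human motions. Existing methods rely mainly on subspace clustering, which assumes that high-dimensional temporal data follow a Union-of-Subspaces (UoS) distribution; video frames with cluttered backgrounds may not fit that distribution. The proposed solution is Temporal Rate Reduction Clustering (TR²C), whose key idea is to jointly learn structured representations and affinity so that the representations stay temporally consistent and align better with a UoS structure, improving HMS performance.

Link: https://arxiv.org/abs/2506.21249
Authors: Xianghan Meng,Zhengyu Tong,Zhiyuan Huang,Chun-Guang Li
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: The paper is accepted by ICCV 2025. The first two authors contributed equally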

Abstract:Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering (TR²C), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by TR²C remain temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.

[CV-43] DiMPLe – Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

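[Quick read]: This paper addresses the entanglement of invariant and spurious features across the vision and language modalities in multi-modal learning, in particular the drop in out-of-distribution (OOD) performance caused by spurious correlations in visual data. The key to its solution is DiMPLe (Disentangled Multi-Modal Prompt Learning), which disentangles features within and across modalities while maintaining consistent alignment, improving generalization to novel classes and robustness to distribution shifts. The method combines three core objectives: (1) minimizing mutual information between invariant and spurious features, (2) regularizing spurious features, and (3) contrastive learning on invariant features.

Link: https://arxiv.org/abs/2506.21237
Authors: Umaima Rahman,Mohammad Yaqub,Dwarikanath Mahapatra
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: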

Abstract:We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments show that DiMPLe outperforms CoOp-OOD when averaged across 11 diverse datasets, achieving absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

[CV-44] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping

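[Quick read]: This paper addresses converting monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop robot arm. The key to its solution is the core module of the end-to-end ESFP pipeline, HPSTM (a sequence-to-sequence Transformer with self-attention), which combines long-range temporal context with a differentiable forward-kinematics decoder, enforces constant bone lengths and anatomical plausibility, and jointly predicts joint means and full covariances, yielding high-quality motion estimation and trajectory generation.

Link: https://arxiv.org/abs/2506.21234
Authors: Qifei Cui,Yuang Zhou,Ruichen Deng
Affiliations: University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: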

Abstract:This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM’s uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm’s polar workspace, preserving wrist orientation.
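The variance-weighted filtering in step (3) can be pictured with a small sketch. The abstract does not give the exact weighting scheme, so this assumes inverse-variance weights over a centered sliding window; the function name `variance_weighted_filter`, the `window` size, and the sample data are hypothetical.

```python
from typing import List


def variance_weighted_filter(values: List[float], variances: List[float],
                             window: int = 3, eps: float = 1e-8) -> List[float]:
    """Smooth a 1-D trajectory with inverse-variance weights over a centered window."""
    assert len(values) == len(variances)
    half = window // 2
    out = []
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        # Frames with high predicted variance get small weights.
        weights = [1.0 / (variances[i] + eps) for i in range(lo, hi)]
        total = sum(weights)
        out.append(sum(w * values[i] for w, i in zip(weights, range(lo, hi))) / total)
    return out


# Hypothetical wrist x-coordinates; the middle frame is a noisy, uncertain outlier.
xs = [0.0, 0.0, 5.0, 0.0, 0.0]
var = [0.01, 0.01, 100.0, 0.01, 0.01]
smoothed = variance_weighted_filter(xs, var)
```

Because the outlier frame carries a large variance, its neighbours dominate the weighted average and the spike is almost entirely suppressed.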

[CV-45] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation ICCV2025

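[Quick read]: This paper addresses the limited performance of training-free open-vocabulary semantic segmentation (OVS), which stems from dependence on the capability of the underlying models or on reference sets of suboptimal quality. The key to its solution is to emphasize data quality: it proposes a data-quality-oriented framework comprising a data pipeline that builds a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval method, which significantly improves OVS performance.

Link: https://arxiv.org/abs/2506.21233
Authors: Xiwei Xuan,Ziquan Deng,Kwan-Liu Ma
Affiliations: University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ICCV 2025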

Abstract:Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at this https URL .
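The "simple similarity-based retrieval" over a reference set can be sketched as a nearest-neighbour lookup. This assumes cosine similarity between a query segment embedding and reference embeddings paired with text labels; the names `cosine` and `retrieve_label` and the two-dimensional toy embeddings are hypothetical, and the paper's actual retrieval may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_label(query: Sequence[float],
                   reference: List[Tuple[Sequence[float], str]]) -> str:
    """Return the text label of the reference embedding most similar to the query."""
    best_embedding, best_label = max(reference, key=lambda item: cosine(query, item[0]))
    return best_label


# Hypothetical reference set of (segment embedding, label) pairs.
reference_set = [([1.0, 0.0], "sky"), ([0.0, 1.0], "grass")]
label = retrieve_label([0.9, 0.1], reference_set)
```

With retrieval this simple, result quality depends almost entirely on how well the reference embeddings are paired with their labels, which is the data-centric point the abstract makes.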

[CV-46] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models

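[Quick read]: This paper addresses the problem that, as outputs of generative image models spread across the Internet, they may be reused as training data and cause model collapse. The key to its solution is BitMark, a robust bitwise watermarking framework that, during image generation by the Infinity model, embeds a human-imperceptible yet detectable signal directly into the token stream across multiple scales. This enables reliable identification of generated content while preserving visual fidelity and generation speed, and remains robust against a range of removal techniques.

Link: https://arxiv.org/abs/2506.21209
Authors: Louis Kerner,Michel Meintz,Bihe Zhao,Franziska Boenisch,Adam Dziedzic
Affiliations: CISPA Helmholtz Center for Information Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: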

Abstract:State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models’ own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity’s image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model’s outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

[CV-47] MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

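[Quick read]: This paper addresses the fact that current medical image analysis systems are strongly task-specific, need separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. The key to its solution is the MedPrompt framework, which combines a few-shot prompted large language model (Llama-4-17B) for high-level task planning with a modular convolutional neural network (DeepFusionLab) for low-level image processing, and dynamically routes task-specific pretrained weights to make the system scalable and easy to deploy.

Link: https://arxiv.org/abs/2506.21199
Authors: Shadman Sobhan,Kazi Abrar Mahmud,Abduz Zami
Affiliations: Bangladesh University of Engineering and Technology; Rajshahi University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Notes: 40 pages, 8 Tables, 9 Figures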

Abstract:Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks: only task-specific weights are required, enhancing scalability and deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves a 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.
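The weight-routing idea, where structured LLM output selects which pretrained weights to load, might look roughly like the sketch below. The JSON schema, the `WEIGHT_REGISTRY` table, the file paths, and the `route` function are all hypothetical; the abstract does not specify the actual output format or how weights are stored.

```python
import json
from typing import List

# Hypothetical registry: (task, target) -> path of task-specific pretrained weights.
WEIGHT_REGISTRY = {
    ("segmentation", "lungs"): "weights/seg_lungs.pt",
    ("classification", "tuberculosis"): "weights/cls_tb.pt",
}


def route(llm_output: str) -> List[str]:
    """Parse the LLM's structured plan and return the weight files to load, in order."""
    plan = json.loads(llm_output)
    paths = []
    for step in plan["steps"]:
        key = (step["task"], step["target"])
        if key not in WEIGHT_REGISTRY:
            raise KeyError(f"no weights registered for {key}")
        paths.append(WEIGHT_REGISTRY[key])
    return paths


# Hypothetical structured output for "segment the lungs, then check for tuberculosis".
example_plan = json.dumps({"steps": [
    {"task": "segmentation", "target": "lungs"},
    {"task": "classification", "target": "tuberculosis"},
]})
selected = route(example_plan)
```

Under this design, supporting a new task only requires a new registry entry and its weight file, which matches the abstract's claim that the whole framework need not be retrained.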

[CV-48] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation ICCV2025

[Quick read]: This paper addresses omni-context perception in panoramic image processing, which is constrained by distortions, perspective occlusions, and limited annotations. Whereas previous unsupervised domain adaptation methods depend on source-domain pinhole data for knowledge transfer, the paper introduces a more practical task, Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and proposes its first solution, UNLOCK. The key lies in two core modules, Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning, which let the model achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning without source data or target labels.

Link: https://arxiv.org/abs/2506.21198
Authors: Yihong Cao,Jiaming Zhang,Xu Zheng,Hao Shi,Kunyu Peng,Hang Liu,Kailun Yang,Hui Zhang
Affiliations: Hunan University; Hunan Normal University; Karlsruhe Institute of Technology; ETH Zürich; HKUST(GZ); INSAIT, Sofia University “St. Kliment Ohridski”; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Notes: Accepted to ICCV 2025. All data and code will be made publicly available at this https URL

Abstract:Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at this https URL.

[CV-49] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

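[Quick read]: This paper addresses the inability of existing 3D visual grounding (3DVG) methods to extract and use the temporal information in text instructions for sequential grounding in 3D point clouds (SG3D). SG3D instructions often contain referring expressions such as “it”, “here” and “the same”, so a grounding method must understand context and retrieve relevant information from previous steps. The key to the solution is the GroundFlow module, which integrates short-term and long-term step information to build a comprehensive view of the history, improving the performance of 3DVG models on SG3D.

Link: https://arxiv.org/abs/2506.21188
Authors: Zijun Lin,Shuting He,Cheston Tan,Bihan Wen
Affiliations: Nanyang Technological University; Shanghai University of Finance and Economics; Centre for Frontier AI Research, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: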

Abstract:Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as “it”, “here” and “the same” to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow – a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

[CV-50] Out-of-Distribution Semantic Occupancy Prediction

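[Quick read]: This paper addresses the weak handling of Out-of-Distribution (OoD) objects in 3D semantic occupancy prediction. Existing methods perform well on in-distribution scenes but are not robust to OoD objects and long-tail distributions, which raises the risk of undetected anomalies and misinterpretations and thus creates safety hazards. The key to its solution is OccOoD, a framework that integrates OoD detection into 3D semantic occupancy prediction and uses Voxel-BEV Progressive Fusion (VBPF), with an RWKV-based branch for geometry-semantic fusion, to improve OoD detection.

Link: https://arxiv.org/abs/2506.21185
Authors: Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Ruiping Liu,Fei Teng,Kai Luo,Zhiyong Li,Kailun Yang
Affiliations: Hunan University; Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Notes: The established datasets and source code will be made publicly available at this https URL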

Abstract:3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at this https URL.

[CV-51] Task-Aware KV Compression For Cost-Effective Long Video Understanding

[Quick read]: This paper addresses the prohibitive computational cost of long-video understanding (LVU) with existing multimodal large language models (MLLMs). Existing methods try to mitigate this with key-value (KV) compression but often lose information at high compression ratios. The key to the proposed Video-X^2L is bi-level KV compression plus selective KV re-loading: low-compression and high-compression KVs are generated in the pre-filling stage, and in the decoding stage the low-compression KVs are selectively re-loaded according to the importance of each video chunk, making full use of task-relevant video information while keeping the overall representation compact.

Link: https://arxiv.org/abs/2506.21184
Authors: Minghao Qin,Yan Shu,Peitian Zhang,Kun Lun,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu
Affiliations: Beijing Academy of Artificial Intelligence; Shanghai Jiao Tong University; University of Trento; Renmin University of China; Beijing University of Posts and Telecommunications; Hong Kong Polytechnic University; Institute of Automation, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 14 pages, 3 figures, 6 tables

Abstract:Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM’s pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM’s decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a large margin while substantially reducing the computation cost.
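Selective KV re-loading can be pictured as choosing, for each video chunk, which cached KV to hand to the decoder. A minimal sketch, assuming each chunk already has a task-relevance score and that a fixed budget of `top_k` chunks receives L-KVs; the function name `select_kv`, the scoring, and the string placeholders standing in for KV tensors are hypothetical, not the paper's implementation.

```python
from typing import Any, List, Sequence


def select_kv(chunk_scores: Sequence[float], l_kvs: Sequence[Any],
              h_kvs: Sequence[Any], top_k: int) -> List[Any]:
    """Use low-compression KVs for the top_k most relevant chunks, high-compression KVs elsewhere."""
    ranked = sorted(range(len(chunk_scores)), key=lambda i: chunk_scores[i], reverse=True)
    critical = set(ranked[:top_k])
    # Chunk order is preserved so the temporal layout of the video stays intact.
    return [l_kvs[i] if i in critical else h_kvs[i] for i in range(len(chunk_scores))]


# Hypothetical relevance of four video chunks to the current question.
scores = [0.1, 0.9, 0.3, 0.8]
l_cache = ["L0", "L1", "L2", "L3"]  # stand-ins for low-compression KVs
h_cache = ["H0", "H1", "H2", "H3"]  # stand-ins for high-compression KVs
mixed = select_kv(scores, l_cache, h_cache, top_k=2)
```

Only the two highest-scoring chunks contribute detailed KVs, so the decoding context stays close to the size of the fully compressed cache.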

[CV-52] Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition

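[Quick read]: This paper addresses the simulation-to-reality (Sim2Real) domain gap for 3D object point clouds, which arises mainly from differences in data acquisition and limits the generalization of point classifiers. The key to its solution is the Topology-Aware Modeling (TAM) framework, which mitigates the gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. It also proposes an advanced self-training strategy that combines cross-domain contrastive learning with self-training to reduce the impact of noisy pseudo-labels and make adaptation more robust.

Link: https://arxiv.org/abs/2506.21165
Authors: Longkun Zou,Kangjun Liu,Ke Chen,Kailing Guo,Kui Jia,Yaowei Wang
Affiliations: Pengcheng Laboratory; South China University of Technology; Chinese University of Hong Kong, Shenzhen; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: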

Abstract:Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at this https URL.

[CV-53] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

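[Quick read]: This paper addresses generating realistic 3D objects from a single-view image with natural appearance, 3D consistency, and multiple plausible interpretations of unseen regions; existing methods fall short on multiview consistency and geometric detail. The key to its solution is a new method that needs no additional model training and seamlessly integrates geometry and perception priors: three Gaussian branches are initialized from the geometry prior, the perception prior, and Gaussian noise respectively, and are refined through mutual interaction between the two priors and a reprojection-based strategy that enforces depth consistency, yielding high-quality 3D reconstruction.

Link: https://arxiv.org/abs/2506.21152
Authors: Pufan Li,Bi’an Du,Wei Hu
Affiliations: Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 5 figures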

Abstract:Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments show that our method produces higher-fidelity reconstructions and outperforms existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.

[CV-54] Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels MICCAI2025

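[Quick read]: This paper addresses accurate segmentation of myocardial scars in cardiac MRI, which is essential for clinical assessment and treatment planning. The key to its solution is a robust deep-learning pipeline that fine-tunes state-of-the-art models for fully automated myocardial scar detection and segmentation. By using a Kullback-Leibler loss and extensive data augmentation, the method handles label noise from semi-automatic annotations, data heterogeneity, and class imbalance.

Link: https://arxiv.org/abs/2506.21151
Authors: Aida Moafi,Danial Moafi,Evgeny M. Mirkes,Gerry P. McCann,Abbas S. Alatrany,Jayanth R. Arnold,Mostafa Mehdipour Ghazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: MICCAI 2025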

Abstract:The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model’s performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.

[CV-55] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation

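[Quick read]: This paper addresses the fact that commonly used learning methods for biomedical segmentation penalize all errors equally and therefore cannot exploit inter-class semantics in the label space. The key to its solution is two tree-based semantic loss functions that exploit the hierarchical organization of the labels and are incorporated into a training approach for sparse, background-free annotations. On a hyperspectral imaging dataset with a clinically defined semantic tree structure, this reaches state-of-the-art segmentation performance and effectively detects out-of-distribution pixels.

Link: https://arxiv.org/abs/2506.21150
Authors: Junwen Wang,Oscar Maccormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: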

Abstract:Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising 107 classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.
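To see why a label tree lets a loss penalize errors unequally, consider one simple possibility: the expected tree distance between the predicted class and the true class. The abstract does not define the paper's two losses, so this is only an illustrative stand-in; the `PARENT` tree, the class names, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical label tree given as child -> parent (the root has parent None).
PARENT: Dict[str, Optional[str]] = {
    "tissue": None,
    "tumour": "tissue", "healthy": "tissue",
    "glioma": "tumour", "meningioma": "tumour",
    "dura": "healthy",
}


def path_to_root(node: Optional[str]) -> List[str]:
    """List of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges on the path between two nodes of the label tree."""
    depth_in_a = {n: i for i, n in enumerate(path_to_root(a))}
    for j, n in enumerate(path_to_root(b)):
        if n in depth_in_a:  # first common ancestor
            return depth_in_a[n] + j
    raise ValueError("nodes are not in the same tree")


def expected_tree_distance(probs: Dict[str, float], target: str) -> float:
    """Loss: tree distance to the target, averaged under the predicted class probabilities."""
    return sum(p * tree_distance(c, target) for c, p in probs.items())


# Confusing two tumour subtypes costs less than confusing tumour with healthy tissue.
near_miss = expected_tree_distance({"glioma": 0.5, "meningioma": 0.5}, "glioma")
far_miss = expected_tree_distance({"glioma": 0.5, "dura": 0.5}, "glioma")
```

A flat cross-entropy would score both predictions identically, since each puts probability 0.5 on the correct class. The tree-aware quantity separates them.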

[CV-56] Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion

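[Quick read]: This paper addresses the challenges that heterogeneity in data, computation, and communication poses for federated learning (FL), and in particular the fact that existing text-prompt-based federated prompt-learning methods overlook joint label-domain distribution shifts. The key to its solution is pFedDC, a personalized FL framework based on dual-prompt learning and cross fusion: each client maintains global and local prompts in both the vision and language modalities, capturing knowledge shared across the federation and client-specific semantics and domain characteristics respectively, and a cross-fusion module adaptively integrates prompts from different levels to produce personalized representations aligned with each client's data distribution.

Link: https://arxiv.org/abs/2506.21144
Authors: Yuguang Zhang,Kuangpu Guo,Zhihe Lu,Yunbo Wang,Jian Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: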

Abstract:Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, but is challenged by heterogeneity in data, computation, and communication. Pretrained vision-language models (VLMs), with their strong generalization and lightweight tuning via prompts, offer a promising solution. However, existing federated prompt-learning methods rely only on text prompts and overlook joint label-domain distribution shifts. In this paper, we propose a personalized FL framework based on dual-prompt learning and cross fusion, termed pFedDC. Specifically, each client maintains both global and local prompts across vision and language modalities: global prompts capture common knowledge shared across the federation, while local prompts encode client-specific semantics and domain characteristics. Meanwhile, a cross-fusion module is designed to adaptively integrate prompts from different levels, enabling the model to generate personalized representations aligned with each client’s unique data distribution. Extensive experiments across nine datasets with various types of heterogeneity show that pFedDC consistently outperforms state-of-the-art methods.

[CV-57] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection

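[Quick read]: This paper addresses surface defect detection in industrial scenarios, which is hard because of diverse defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Existing AI-based detectors commonly suffer from redundant features, limited sensitivity to detail, and weak robustness under multiscale conditions. The key to its solution is the YOLO-FDA framework, which integrates fine-grained detail enhancement and attention-guided feature fusion. It adopts a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation in the YOLOv5 backbone, introduces a Detail-directional Fusion Module (DDFM) to enrich spatial detail and enhance semantic consistency, and designs two attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF), to improve contextual representation and reduce feature noise.

Link: https://arxiv.org/abs/2506.21135
Authors: Jiawei Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 6 figures. Submitted to The 8th Chinese Conference on Pattern Recognition and Computer Vision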

Abstract:Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF) to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.

[CV-58] Learning to See in the Extremely Dark ICCV2025

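[Quick read]: This paper addresses the weak performance of learning-based RAW image enhancement in extremely dark scenes (environmental illuminance as low as 0.0001 lux), which is mainly due to the lack of corresponding datasets. The key to its solution is a paired-to-paired data synthesis pipeline that generates well-calibrated extremely low-light RAW images at three precise illuminance ranges (0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux) together with high-quality sRGB references, forming a large-scale paired dataset named See-in-the-Extremely-Dark (SIED). It also proposes a diffusion-based framework that uses the generative ability and intrinsic denoising property of diffusion models, together with an Adaptive Illumination Correction Module and a color consistency loss, to restore visually pleasing results from extremely low-SNR RAW inputs.

Link: https://arxiv.org/abs/2506.21132
Authors: Hai Jiang,Binhao Guan,Zhen Liu,Xiaohong Liu,Jian Yu,Zheng Liu,Songchen Han,Shuaicheng Liu
Affiliations: School of Aeronautics and Astronautics, Sichuan University; University of Electronic Science and Technology of China; Shanghai Jiao Tong University; National Innovation Center for UHD Video Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICCV 2025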

Abstract:Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability to extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at this https URL.

[CV-59] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction ICML2025

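[Quick read]: This paper addresses trajectory prediction for surrounding traffic agents in autonomous driving, which is challenging because of its inherent uncertainty and underlying multimodality. Whereas prevailing data-driven methods rely mainly on supervised learning, the paper proposes a Graph-oriented Inverse Reinforcement Learning (GoIRL) framework. Its key idea is to use vectorized context representations and a feature adaptor that aggregates lane-graph features into grid space, so that they integrate seamlessly with the maximum entropy inverse reinforcement learning paradigm to infer the reward distribution and obtain a policy that can be sampled to produce multiple plausible plans. Conditioned on the sampled plans, a hierarchical parameterized trajectory generator with a refinement module and a probability fusion strategy further improves prediction accuracy and confidence.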

Link: https://arxiv.org/abs/2506.21121
Authors: Muleilan Pei,Shaoshuai Shi,Lu Zhang,Peiliang Li,Shaojie Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICML 2025

Abstract:Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase that our approach not only achieves state-of-the-art performance on the large-scale Argoverse and nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.
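
To make the "reward map to sampleable policy" step concrete, here is a minimal sketch of finite-horizon soft value iteration on a toy grid, which is one standard way to obtain a maximum entropy policy from a reward map. It is not the GoIRL implementation: the 4x4 grid, the reward values, the horizon, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
NEG = -1e9  # stand-in value for "goal not reachable yet"


def soft_value_iteration(reward, goal, horizon=30):
    """Finite-horizon soft value iteration; returns policy[r, c, a]."""
    h, w = reward.shape
    v = np.full((h, w), NEG)
    v[goal] = 0.0
    for _ in range(horizon):
        # Off-grid moves keep q = -inf, so they get probability exactly 0.
        q = np.full((h, w, len(ACTIONS)), -np.inf)
        for a, (dr, dc) in enumerate(ACTIONS):
            for r in range(h):
                for c in range(w):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < h and 0 <= nc < w:
                        q[r, c, a] = reward[r, c] + v[nr, nc]
        # Soft maximum (log-sum-exp) over actions, computed stably.
        m = q.max(axis=2, keepdims=True)
        soft_v = (m + np.log(np.exp(q - m).sum(axis=2, keepdims=True)))[..., 0]
        v = soft_v.copy()
        v[goal] = 0.0  # the goal cell is terminal
    return np.exp(q - soft_v[..., None])  # softmax over actions per cell


def sample_plan(policy, start, goal, rng, max_steps=30):
    """Roll out the stochastic policy to get one plausible grid plan."""
    path = [start]
    r, c = start
    while (r, c) != goal and len(path) <= max_steps:
        a = rng.choice(len(ACTIONS), p=policy[r, c])
        r, c = r + ACTIONS[a][0], c + ACTIONS[a][1]
        path.append((r, c))
    return path


# Hypothetical reward map: cheaper cells along the top row and right column.
reward = np.full((4, 4), -3.0)
reward[0, :] = -1.5
reward[:, 3] = -1.5

policy = soft_value_iteration(reward, goal=(3, 3))
rng = np.random.default_rng(0)
plans = [sample_plan(policy, (0, 0), (3, 3), rng) for _ in range(3)]
```

Because the policy is stochastic, repeated rollouts can yield different plans, which is the property the abstract relies on for multimodal prediction.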

[CV-60] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization ICCV2025

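[Quick read]: This paper addresses efficient updating of scene representations in dynamic 3D environments, in particular how to maintain high-quality reconstructions as a scene changes over time without re-optimizing the entire scene. The key to the solution is CL-Splats, which integrates a robust change-detection module that segments the updated and static parts of the scene, enabling local optimization that avoids unnecessary re-computation, and which supports storing and recovering previous scene states.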

Link: https://arxiv.org/abs/2506.21117
Authors: Jan Ackermann,Jonas Kulhanek,Shengqu Cai,Haofei Xu,Marc Pollefeys,Gordon Wetzstein,Leonidas Guibas,Songyou Peng
Affiliations: ETH Zürich; Stanford University; CTU Prague; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project Page: this https URL

Abstract:In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.

[CV-61] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

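[Quick read]: This paper addresses the performance bottleneck of video large language models (VideoLLMs) in multi-shot scenarios, such as instance identity forgetting and key frame negligence in video clips with varying camera angles or scene changes. The key to the solution is a new dataset, MultiClip-Bench, with dense descriptions and instruction-based question-answering pairs that better support multi-shot tasks, and a new model, IPFormer-VideoLLM, whose core idea is to inject instance-level features as instance prompts through an efficient attention-based connector, enabling aggregation of instance-specific information across scenes.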

Link: https://arxiv.org/abs/2506.21116
Authors: Yujia Liang,Jile Jiao,Zhicheng Wang,Xuetao Feng,Zixuan Ye,Yuan Wang,Hao Lu
Affiliations: School of AIA, Huazhong University of Science and Technology; Deepeleph Intelligent Technology; JD Explore Academy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can cause failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance the multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.

[CV-62] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection

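[Quick read]: This paper addresses the problem that deep learning models for remote sensing change detection have grown in complexity and computational demand while delivering limited accuracy gains, and proposes a more efficient solution that meets the strict resource constraints of on-satellite processing. The key is a lightweight model, FlickCD, which introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases and suppress irrelevant variations, and whose decoder uses Local-Global Fusion Blocks that combine Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA), maintaining high detection accuracy while substantially reducing computational and storage overhead.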

Link: https://arxiv.org/abs/2506.21109
Authors: Luosheng Xu,Dalin Zhang,Zhaohui Song
Affiliations: Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) to efficiently capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (1% F1) accuracy trade-off. The implementation code is publicly available at this https URL.

[CV-63] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography ICCV2025

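[Quick read]: This paper addresses the difficulty of interpreting the many undeciphered characters in Oracle Bone Script (OBS), whose complex structure and abstract imagery pose significant challenges for traditional methods. The proposed solution, OracleFusion, is a two-stage semantic typography framework based on a Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR). Its key is to combine visual localization of key components with structural vector fusion to generate vector fonts that are semantically enriched while preserving the integrity of the glyph structure, improving readability and aesthetic quality and giving experts an assistive analysis tool that offers expert-like insights.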

Link: https://arxiv.org/abs/2506.21101
Authors: Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,AndyPian Wu,Chaoyang Wang,Chengjie Wang,Taisong Jin,SevenShu,Yunsheng Wu,Yongge Liu,Rongrong Ji
Affiliations: Xiamen University; Tencent; Anyang Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025

Abstract:As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

[CV-64] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching

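[Quick read]: This paper addresses the problem that, in cost-volume-based stereo matching, small-scale cost volumes carry too little information for accurate disparity estimation, while real-time performance must still be maintained. The key to the solution is the Enhanced Shuffle Mixer (ESM), which restores critical details by integrating primary features into the disparity upsampling unit, fuses features through shuffling and layer splitting, and refines them with a compact feature-guided hourglass network, improving disparity map accuracy at low computational cost.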

Link: https://arxiv.org/abs/2506.21091
Authors: Mahmoud Tahmasebi,Saif Huq,Kevin Meehan,Marion McAfee
Affiliations: Atlantic Technological University; York College of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under peer review

Abstract:Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real-time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. These features are mixed by shuffling and layer splitting, then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map in real time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.

[CV-65] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception ICCV2025

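[Quick read]: This paper addresses the high computational cost of modern perception models for multisensory egocentric tasks, which makes them hard to deploy in resource-constrained environments. The key to the solution is the EgoAdapt framework, which adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks (such as egocentric action recognition, active speaker localization, and behavior anticipation), significantly reducing computation (GMACs), parameter count, and energy consumption while matching or exceeding the performance of existing state-of-the-art models.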

Link: https://arxiv.org/abs/2506.21080
Authors: Sanjoy Chowdhury,Subrata Biswas,Sayan Nag,Tushar Nagarajan,Calvin Murdock,Ishwarya Ananthabhotla,Yijun Qian,Vamsi Krishna Ithapu,Dinesh Manocha,Ruohan Gao
Affiliations: University of Maryland, College Park; Meta Reality Labs; Worcester Polytechnic Institute; University of Toronto; FAIR, Meta AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICCV 2025

Abstract:Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets, EPIC-Kitchens, EasyCom, and Aria Everyday Activities, demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.

[CV-66] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image

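[Quick read]: This paper addresses the problem that, in 3D character modeling, self-occlusion and viewpoint issues cause the pose standardization stage to produce distorted and degraded images, which in turn harms the geometric quality of the subsequent 3D reconstruction. The key to the solution is the PoseMaster framework, which unifies pose transformation and 3D character generation in a flow-based 3D native generation framework, uses the 3D body bones in the skeleton of an animatable character as the pose condition to achieve accurate arbitrary-pose control, and randomly empties the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control.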

Link: https://arxiv.org/abs/2506.21076
Authors: Hongyu Yan,Kunming Luo,Weiyu Li,Yixun Liang,Shengming Li,Jingwei Huang,Chunchao Guo,Ping Tan
Affiliations: Hong Kong University of Science and Technology; Tencent Hunyuan; Tencent IEG
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. Finally, we create a high-quality pose-control dataset derived from realistic character animation data to make the model learn the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.
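
The "randomly empty the pose condition and the image condition during training" step can be sketched as independent condition dropout. This is only an illustration of the general idea: the drop probabilities, the use of all-zero arrays as the "empty" condition, and the function name are assumptions, not details taken from the paper.

```python
import numpy as np


def drop_conditions(image_cond, pose_cond, rng, p_drop_image=0.1, p_drop_pose=0.1):
    """Independently replace each condition with an empty (all-zero) one.

    The probabilities and the zero placeholder are hypothetical choices.
    """
    if rng.random() < p_drop_image:
        image_cond = np.zeros_like(image_cond)
    if rng.random() < p_drop_pose:
        pose_cond = np.zeros_like(pose_cond)
    return image_cond, pose_cond


rng = np.random.default_rng(0)
image_cond = np.ones((2, 3))  # stand-in for image features
pose_cond = np.ones((4, 3))   # stand-in for 3D bone coordinates
image_in, pose_in = drop_conditions(image_cond, pose_cond, rng)
```

Training on a mix of full, partially emptied, and fully emptied conditions is what lets a model be queried later with either condition on its own.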

[CV-67] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification

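[Quick read]: This paper addresses retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description, a task complicated by limited 3D scene context, distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. The key to the solution is SAMURAI, a multimodal retrieval method that combines CLIP-based semantic matching with shape-guided re-ranking based on binary silhouettes, adds a robust majority voting strategy, and improves mask quality through a dedicated preprocessing pipeline, thereby fusing language and shape cues for robust open-world 3D object retrieval.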

Link: https://arxiv.org/abs/2506.21056
Authors: Dinh-Khoi Vo,Van-Loc Nguyen,Minh-Triet Tran,Trung-Nghia Le
Affiliations: University of Science, VNU-HCM; Vietnam National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
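
Two of the steps named in the abstract, keeping the largest connected component of a mask and majority voting over candidate results, are simple enough to sketch directly. The code below is an illustrative stand-in, not the SAMURAI pipeline: it assumes 4-connectivity, and the object identifiers in the usage lines are made up.

```python
from collections import Counter, deque

import numpy as np


def largest_component(mask):
    """Keep only the largest 4-connected foreground region of a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                # Breadth-first search over one connected region.
                comp, queue = [], deque([(r, c)])
                seen[r, c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = np.zeros_like(mask)
    for y, x in best:
        out[y, x] = True
    return out


def majority_vote(candidates):
    """Return the candidate id that appears most often."""
    return Counter(candidates).most_common(1)[0][0]


mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 1:3] = True  # main object region
mask[4, 4] = True      # isolated noise pixel
clean = largest_component(mask)
winner = majority_vote(["chair_12", "lamp_3", "chair_12"])  # hypothetical ids
```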

[CV-68] Class-Agnostic Region-of-Interest Matching in Document Images ICDAR2025

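[Quick read]: This paper addresses the problem that existing document analysis solutions work only with fixed category definitions and granularities and cannot support flexible, user-customized applications. The key to the solution is a new task, Class-Agnostic Region-of-Interest Matching (RoI-Matching), which takes a visual prompt and target document images as input and outputs the corresponding bounding boxes in the target document images, enabling flexible, efficient, multi-granularity, open-set region matching. To this end, the authors build the benchmark RoI-Matching-Bench, propose macro and micro evaluation metrics, and design the RoI-Matcher framework, which uses a siamese network to extract multi-level features in the reference and target domains and cross-attention layers to align similar semantics across domains.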

Link: https://arxiv.org/abs/2506.21055
Authors: Demin Zhang,Jiahao Lyu,Zhijie Shen,Yu Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICDAR2025

Abstract:Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at this https URL.
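
The cross-attention step, in which target-document features attend to reference-region features, can be sketched as plain scaled dot-product attention. This is a generic illustration rather than the RoI-Matcher architecture: it omits learned query, key, and value projections, and the feature shapes are arbitrary.

```python
import numpy as np


def cross_attention(target_feats, ref_feats):
    """Scaled dot-product cross-attention without learned projections.

    target_feats: (n_target, d) queries; ref_feats: (n_ref, d) keys and values.
    """
    d = target_feats.shape[-1]
    scores = target_feats @ ref_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ ref_feats, weights


rng = np.random.default_rng(0)
target = rng.normal(size=(5, 8))     # 5 target locations, 8-dim features
reference = rng.normal(size=(3, 8))  # 3 reference-region tokens
attended, weights = cross_attention(target, reference)
```

Each row of `weights` says how strongly one target location matches each reference token, which is the signal a matching head could turn into bounding boxes.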

[CV-69] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features ICCV2025

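[Quick read]: This paper addresses the insufficient black-box transferability of adversarial examples across models, with the core idea of using self-supervised Vision Transformer (ViT) representations to improve the generalization of adversarial perturbations. The key to the solution is dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), and trains a generator using these joint features and the attention mechanism of the ViTs, yielding effective transfer attacks against models of various architectures.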

Link: https://arxiv.org/abs/2506.21046
Authors: Shangbo Wu,Yu-an Tan,Ruinan Ma,Wencong Ma,Dehua Zhu,Yuanzhang Li
Affiliations: Beijing Institute of Technology; School of Cyberspace Science and Technology; School of Computer Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 14 pages, 9 figures, to appear in ICCV 2025

Abstract:The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperforms the state of the art. Code available at this https URL.

[CV-70] Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling

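[Quick read]: This paper addresses the inherent trade-off between editability and faithfulness that text-guided diffusion models face in image editing. The key to the solution is Faithfulness Guidance and Scheduling (FGS), which introduces faithfulness guidance to strengthen the preservation of input image information and a scheduling strategy to resolve the misalignment between editability and faithfulness, markedly improving faithfulness while maintaining high editability.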

Link: https://arxiv.org/abs/2506.21045
Authors: Hansam Cho,Seoung Bum Kim
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract:Text-guided diffusion models have become essential for high-quality image synthesis, enabling dynamic image editing. In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. However, achieving optimal results is challenging because of the inherent trade-off between editability and faithfulness. To address this, we propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. FGS incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability. Moreover, its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks.

[CV-71] Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness Generalization and Transferability ICCV2025

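[Quick read]: This paper addresses the performance drop detectors suffer when there is a domain gap between training and testing data. The key to the solution is to extract intermediate features from a single-step diffusion process and improve feature collection and fusion, which cuts inference time by 75% and improves performance on source domains (Fitness). An object-centered auxiliary branch uses box-masked images with class prompts to extract robust, domain-invariant features, and a consistency loss aligns the auxiliary and ordinary branches, balancing Fitness and Generalization, preventing overfitting, and improving performance on target domains. In addition, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains and unlabeled target domains, improving cross-domain detection performance (Transferability).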

Link: https://arxiv.org/abs/2506.21042
Authors: Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu
Affiliations: Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025. arXiv admin note: text overlap with arXiv:2503.02101

Abstract:Detectors often suffer from a performance drop due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply a consistency loss to align the auxiliary and ordinary branches, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at this https URL (Fitness-Generalization-Transferability).

[CV-72] V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling

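[Quick read]: This paper addresses robust planning and decision-making for autonomous driving in urban environments under rare, diverse, and visually degraded long-tail scenarios, especially in cooperative settings where vehicles and infrastructure must jointly perceive and reason about complex environments. The key to the solution is V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning and three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that uses foundation models to synthesize realistic long-tail conditions such as snow and fog; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability.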

Link: https://arxiv.org/abs/2506.21041
Authors: Junwei You,Pei Li,Zhuoyu Jiang,Zilin Huang,Rui Gan,Haotian Shi,Bin Ran
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ensuring robust planning and decision-making under rare, diverse, and visually degraded long-tail scenarios remains a fundamental challenge for autonomous driving in urban environments. This issue becomes more critical in cooperative settings, where vehicles and infrastructure jointly perceive and reason across complex environments. To address this challenge, we propose V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning for robust cooperative autonomous driving under long-tail scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that leverages foundation models to synthesize realistic long-tail conditions such as snow and fog across vehicle- and infrastructure-side views, enriching training diversity efficiently; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability. Extensive experiments demonstrate that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy under complex, challenging driving conditions, advancing the scalability of end-to-end cooperative autonomous driving.

[CV-73] RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment ICCV2025

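[Quick read]: This paper addresses the high computational and storage overhead and the data redundancy involved in training on large-scale datasets; the core challenge is improving data efficiency without sacrificing model performance. The key to the solution is the concept of epsilon-sample cover, which quantifies inter-sample relationships to capture the intrinsic structure of the dataset. Data selection is reformulated as a reinforcement learning (RL) process in which a lightweight RL agent optimizes the selection policy, using the epsilon-sample cover derived from the evolving dataset distribution as a reward signal, leading to more efficient training and better generalization.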

Link: https://arxiv.org/abs/2506.21037
Authors: Suorong Yang,Peijia Li,Furao Shen,Jian Zhao
Affiliations: Nanjing University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise to mitigate redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. We introduce the concept of epsilon-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process and propose RL-Selector, where a lightweight RL agent optimizes the selection policy by leveraging epsilon-sample cover derived from evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency.

[CV-74] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

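[Quick read]: This paper addresses the noisy, incomplete depth maps that commercial RGB-D cameras produce for non-Lambertian objects, and the difficulty traditional depth completion methods have in generalizing because of the limited diversity and scale of training data. The key to the solution is DidSee, a diffusion-based framework that introduces a rescaled noise scheduler to eliminate signal leakage bias, a noise-agnostic single-step training formulation to alleviate error accumulation, and a semantic enhancer that jointly optimizes depth completion and semantic segmentation, improving depth prediction accuracy in non-Lambertian regions.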

Link: https://arxiv.org/abs/2506.21034
Authors: Wenzhou Lyu,Jialing Lin,Wenqi Ren,Ruihao Xia,Feng Qian,Yang Tang
Affiliations: East China University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic this http URL page: this https URL
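
A noise schedule with zero terminal signal-to-noise ratio can be obtained by shifting and rescaling the square root of the cumulative alpha products, a widely used recipe for this purpose. The sketch below shows that recipe on a generic linear beta schedule; whether DidSee uses exactly this rescaling, and the schedule endpoints and step count chosen here, are assumptions.

```python
import numpy as np


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final cumulative alpha is exactly zero."""
    alphas_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value is 0, then scale so the first value is unchanged.
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)
    alphas_bar = alphas_bar_sqrt ** 2
    # Convert cumulative products back to per-step alphas, then to betas.
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas


betas = np.linspace(1e-4, 0.02, 1000)  # hypothetical linear schedule
new_betas = rescale_zero_terminal_snr(betas)
new_alphas_bar = np.cumprod(1.0 - new_betas)
```

With the final cumulative alpha at zero, the last training timestep contains pure noise, so no trace of the clean signal leaks into what the model sees at that step.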

[CV-75] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

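Quick read: This paper addresses the high computational cost of modeling high-resolution images by improving image tokenization, making image and multimodal understanding and generation more efficient. The key to its solution is 1D binary image latents: each image is represented as a sequence of binary vectors rather than traditional one-hot codebook tokens, which preserves high-resolution detail while keeping 1D latents compact. With only 128 discrete tokens, the method achieves up to a 32-fold reduction in token count compared with standard VQ-VAEs, supports a global batch size of 4096 on a single GPU node, and markedly speeds up training and inference.

Link: https://arxiv.org/abs/2506.21022
Authors: Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu
Affiliations: AMD GenAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: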

Abstract:Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
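
The 32-fold figure can be checked with a little arithmetic. The sketch below assumes a 2D tokenizer that downsamples each side by a factor of 16; that factor is not stated in the abstract and is chosen here only because it is a common setting and reproduces the quoted ratio.

```python
def grid_token_count(image_size, downsample_factor):
    """Number of tokens a 2D-grid tokenizer emits for a square image."""
    side = image_size // downsample_factor
    return side * side


image_size = 1024          # largest resolution mentioned in the abstract
downsample_factor = 16     # assumed VQ-VAE downsampling factor (not from the abstract)
tokens_1d = 128            # token budget quoted in the abstract

tokens_2d = grid_token_count(image_size, downsample_factor)  # 64 * 64 = 4096
reduction = tokens_2d // tokens_1d                           # 4096 / 128 = 32
```

Under this assumption a 1024x1024 image costs 4096 grid tokens, so 128 tokens is a 32-fold reduction in sequence length.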

[CV-76] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

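Quick read: This paper addresses the complex training and heavy computational overhead that multimodal object detection incurs when multiple feature-level fusion units are stacked. The key to its solution is a new fusion detection baseline built on a single feature-level fusion unit, on top of which the authors construct a lightweight attention-guided self-modulation feature fusion network (LASFNet). Its attention-guided self-modulation feature fusion (ASFF) module adaptively adjusts the global and local responses of the fused features using attention information from the different modalities, producing more comprehensive and enriched features. LASFNet also places a lightweight feature attention transformation module (FATM) at its neck to sharpen the focus on fused features and reduce information loss.

Link: https://arxiv.org/abs/2506.21018
Authors: Lei Hao, Lina Xu, Chang Liu, Yanni Dong
Affiliations: China University of Geosciences; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: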

Abstract:Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at this https URL.

[CV-77] Multimodal Prompt Alignment for Facial Expression Recognition ICCV2025

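Quick read: This paper addresses the difficulty that vision-language model (VLM) based facial expression recognition (FER) methods have in capturing fine-grained textual-visual relationships, which limits their ability to distinguish subtle differences between expressions. The key to its solution is a multimodal prompt alignment framework (MPA-FER). It uses a multi-granularity hard prompt generation strategy in which a large language model (LLM) produces detailed descriptions of each facial expression, and it injects this external knowledge by minimizing the feature discrepancy between soft prompts and hard prompts. Combined with prototype-guided visual feature alignment and a cross-modal global-local alignment module, this improves the alignment of textual and visual features and yields more precise and interpretable representations.

Link: https://arxiv.org/abs/2506.21017
Authors: Fuyan Ma, Yiran He, Bin Sun, Shutao Li
Affiliations: Chinese Academy of Military Science; Changchun University of Science and Technology; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To appear in ICCV2025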

Abstract:Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

[CV-78] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation

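Quick read: This paper addresses the shortage of high-quality data for training machine learning models for skin disease detection, in particular the class imbalance, privacy concerns, and object bias of skin disease datasets. The key to its solution is a novel classical-quantum latent space fusion technique that overcomes the restriction of existing quantum-based image generation methods to low-quality grayscale images, yielding the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. The method outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and the classification gain from data augmentation, while using fewer parameters and training epochs, which suggests a promising future for quantum image generation as hardware advances.

Link: https://arxiv.org/abs/2506.21015
Authors: Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: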

Abstract:Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield low-quality grayscale images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.

[CV-79] FedSC: Federated Learning with Semantic-Aware Collaboration KDD2025

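Quick read: This paper addresses the difficulty of training models in federated learning (FL) under data heterogeneity, that is, biased labeling preferences across clients. The key to its solution is FedSC, a semantic-aware federated learning framework that constructs relational prototypes and consistent prototypes at the semantic level to capture client-specific and class-relevant knowledge, providing rich class-level knowledge and stable convergence signals. FedSC combines an inter-contrastive learning strategy with a discrepancy aggregation mechanism to strengthen collaboration across heterogeneous clients and improve generalization and convergence stability.

Link: https://arxiv.org/abs/2506.21012
Authors: Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen
Affiliations: University of Wollongong; The University of Sydney; The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, KDD 2025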

Abstract:Federated learning (FL) aims to train models collaboratively across clients without sharing data, thereby preserving privacy. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning the global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at the semantic level, aiming to provide rich underlying class knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via discrepancy aggregation, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.
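
To make the "closer to prototypes with the same semantics, away from distinct classes" idea concrete, here is a minimal sketch of a generic prototype-based contrastive loss (an InfoNCE-style cross-entropy over cosine similarities). It is an assumption about the general shape of such a loss, not the FedSC objective; the function name, the temperature value, and the toy prototypes are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def prototype_contrastive_loss(embedding, label, prototypes, temperature=0.5):
    """Cross-entropy over cosine similarities between one embedding and all class prototypes."""
    z = _normalize(embedding)
    logits = {c: _dot(z, _normalize(p)) / temperature for c, p in prototypes.items()}
    # Numerically stable log-sum-exp over all classes.
    max_logit = max(logits.values())
    log_denominator = max_logit + math.log(
        sum(math.exp(v - max_logit) for v in logits.values())
    )
    # Low when the embedding is close to its own prototype and far from the others.
    return log_denominator - logits[label]


# Toy two-class prototypes, for illustration only.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
embedding = [0.9, 0.1]
loss_same = prototype_contrastive_loss(embedding, "cat", prototypes)
loss_other = prototype_contrastive_loss(embedding, "dog", prototypes)
```

Here `loss_same` is smaller than `loss_other`, so gradient descent on this loss pulls an embedding toward the prototype of its own class and pushes it away from the rest.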

[CV-80] Bridging Video Quality Scoring and Justification via Large Multimodal Models

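Quick read: This paper addresses the limitation that classical video quality assessment (VQA) methods produce only a single numerical score, which cannot describe a video's complex quality dimensions and so restricts their applicability. The key to its solution is video quality-centric instruction data, generated automatically by the Score-based Instruction Generation (SIG) pipeline as high-quality instruction-response pairs. SIG first scores multiple quality dimensions of an unlabeled video and maps the scores to text-defined levels. It then applies a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the reasoning process of the human visual system, which makes data generation scalable and efficient.

Link: https://arxiv.org/abs/2506.21011
Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
Affiliations: Tsinghua University; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures, 8 tables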

Abstract:Classical video quality assessment (VQA) methods generate a numerical score to judge a video’s perceived visual fidelity and clarity. Yet, a score fails to describe the video’s complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system’s reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs’ quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
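
The first SIG step, mapping per-dimension scores to text-defined levels, might look roughly like the sketch below. The five level names, the 0 to 100 score range, the equal-width bins, and the dimension names are all assumptions for illustration; the abstract does not specify them.

```python
# Hypothetical five-level scale; the paper's actual level definitions are not given here.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, low=0.0, high=100.0):
    """Map a numeric quality score to a text level using equal-width bins."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    fraction = (score - low) / (high - low)
    # The top of the range falls into the last bin rather than past it.
    index = min(int(fraction * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]


def describe(dimension_scores):
    """Turn per-dimension scores into per-dimension text levels."""
    return {name: score_to_level(s) for name, s in dimension_scores.items()}


# Hypothetical dimension names and scores.
levels = describe({"sharpness": 75.0, "noise": 20.0, "exposure": 50.0})
```

A dictionary like `levels` could then be slotted into an instruction-response template, which is the kind of text a language model can be tuned on.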

[CV-81] User-in-the-Loop View Sampling with Error Peaking Visualization ICIP25 ICIP2025

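Quick read: This paper addresses two drawbacks of existing augmented reality (AR) guidance for novel view synthesis: users must align their captures with 3D annotations, and scene exploration is restricted, which makes data collection mentally demanding and limits the capture area. The key to its solution is to use locally reconstructed light fields and to visualize the errors that inserting new views would remove, reducing the reliance on 3D annotations, improving the user experience, and widening the range of scene exploration.

Link: https://arxiv.org/abs/2506.21009
Authors: Ayaka Yasunaga, Hideo Saito, Shohei Mori
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ICIP 2025, Project Page: this https URL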

Abstract:Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding and limits capture areas to pre-defined small areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.

[CV-82] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion

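Quick read: This paper addresses the fact that conventional facial aging methods model aging as a single deterministic path and fail to account for how external factors (such as environment, health, and lifestyle) shape the aging trajectory. The key to its solution is a training-free diffusion-based method that uses attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits, producing multiple plausible and controllable aging trajectories.

Link: https://arxiv.org/abs/2506.21008
Authors: Bang Gong, Luchao Qi, Jiaye Wu, Zhicheng Fu, Chunbo Song, David W. Jacobs, John Nicholson, Roni Sengupta
Affiliations: UNC Chapel Hill; University of Maryland; Lenovo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: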

Abstract:We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

[CV-83] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning

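Quick read: This paper addresses the accuracy of intraoperative specimen margin assessment in breast cancer lumpectomy. The current method, 2D specimen radiography (SR), has limited accuracy, so nearly a quarter of patients require additional surgery. The key to its solution is a deep learning framework that combines the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy that uses both local and global contrastive learning for patch-level classification of SR images. A ResNet-18 backbone pre-trained with FFCL classifies margin status, and the resulting coarse binary masks prompt SAM for refined tumor margin segmentation, improving both the speed and the accuracy of intraoperative margin assessment.

Link: https://arxiv.org/abs/2506.21006
Authors: Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran
Affiliations: University of Kentucky; The University of Alabama at Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures, 3 tables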

Abstract:Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at this https URL.
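
For readers unfamiliar with the Dice similarity used to report segmentation quality, the standard definition is Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks; the toy masks and the convention of returning 1.0 for two empty masks are illustrative choices, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as flat lists of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 4-pixel masks: one pixel overlaps, each mask has two foreground pixels.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_coefficient(pred, target)  # 2 * 1 / (2 + 2) = 0.5
```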

[CV-84] VisionGuard: Synergistic Framework for Helmet Violation Detection

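Quick read: This paper addresses the challenges of detecting motorcycle helmet violations: environmental variability, differing camera angles, and inconsistent data make motorcycle and rider detection unreliable and object classification inconsistent. The key to its solution is VisionGuard, a framework that integrates an Adaptive Labeling module and a Contextual Expander module. The former uses a tracking algorithm to keep labels consistent across frames, and the latter generates virtual bounding boxes to raise recall for underrepresented classes, mitigating data imbalance.

Link: https://arxiv.org/abs/2506.21005
Authors: Lam-Huy Nguyen, Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Minh-Triet Tran, Trung-Nghia Le
Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: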

Abstract:Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
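
One simple way to "assign persistent labels across frames" is confidence-weighted voting within each track. The sketch below is an assumed stand-in for the Adaptive Labeling module, not the authors' algorithm; the detection tuple layout, the label names, and the voting rule are hypothetical.

```python
from collections import defaultdict


def assign_persistent_labels(detections):
    """Relabel every detection in a track with that track's confidence-weighted majority label.

    detections: list of (track_id, label, confidence) tuples, one per frame-level detection.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for track_id, label, confidence in detections:
        votes[track_id][label] += confidence

    # Pick the label with the highest accumulated confidence for each track.
    winners = {
        track_id: max(label_votes, key=label_votes.get)
        for track_id, label_votes in votes.items()
    }
    return [(track_id, winners[track_id], conf) for track_id, _, conf in detections]


# Toy frame-level detections for two tracked riders.
detections = [
    (1, "helmet", 0.9),
    (1, "no_helmet", 0.4),   # likely a frame-level misclassification
    (1, "helmet", 0.8),
    (2, "no_helmet", 0.7),
]
refined = assign_persistent_labels(detections)
```

In `refined`, the low-confidence `no_helmet` detection in track 1 is corrected to `helmet`, while track 2 keeps its single label.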

[CV-85] Inverse Scene Text Removal

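Quick read: This paper addresses the potential misuse of scene text removal (STR) through Inverse STR (ISTR), which detects whether an image has undergone STR and localizes the removed text regions. The key to its solution is to use deep learning for high-accuracy binary classification (deciding whether an image has undergone STR) and for localizing removed text regions, which enables detection of potential misuse and can improve STR itself. The authors also attempt to recover the removed text by training a text recognizer, in order to gauge how difficult recovery is.

Link: https://arxiv.org/abs/2506.21002
Authors: Takumi Yoshimatsu, Shumpei Takezaki, Seiichi Uchida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages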

Abstract:Scene text removal (STR) aims to erase textual elements from images. It was originally intended for removing privacy-sensitive or undesired texts from natural scene images, but is now also applied to typographic images. STR typically detects text regions and then inpaints them. Although STR has advanced through neural networks and synthetic data, misuse risks have increased. This paper investigates Inverse STR (ISTR), which analyzes STR-processed images and focuses on binary classification (detecting whether an image has undergone STR) and localizing removed text regions. We demonstrate in experiments that these tasks are achievable with high accuracies, enabling detection of potential misuse and improving STR. We also attempt to recover the removed text content by training a text recognizer to understand its difficulty.

[CV-86] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology

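Quick read: This paper addresses the challenges of training robust neural networks for abnormal cell detection in cytopathology, including the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles. The key to its solution is a style-aligned image composition (SAIC) method. It selects suitable candidates from an abnormal cell bank through attribute guidance, uses high-frequency feature reconstruction to compose abnormal cells and pathological backgrounds in a style-aligned, high-fidelity way, and finally uses a large vision-language model to filter for high-quality synthetic images, improving the effectiveness and robustness of detection models.

Link: https://arxiv.org/abs/2506.21001
Authors: Qiuyi Qi, Xin Li, Ming Kong, Zikang Xu, Bingdi Chen, Qiang Zhu, S Kevin Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MIDL 2025 Oral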

Abstract:Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at this https URL.

[CV-87] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting CVPR

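Quick read: This paper addresses novel view synthesis of dynamic scenes from blurry monocular video. Existing methods rely on high-resolution images or on strong assumptions about static geometry and rigid scene priors, so they are unstable and lose visual fidelity in real-world environments with dynamic objects and camera motion. The key to the proposed solution is dynamic view synthesis via Sparse-Controlled Gaussian Splatting: the method generates dense 3D Gaussians, restores sharpness from blurry video, and reconstructs the detailed 3D geometry of scenes affected by dynamic motion, achieving robust novel view synthesis in dynamic blurry scenes.

Link: https://arxiv.org/abs/2506.20998
Authors: Yeon-Ji Song, Jaein Kim, Byung-Ju Kim, Byoung-Tak Zhang
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPRW 2025, Neural Fields Beyond Conventional Cameras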

Abstract:Novel view synthesis is the task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, these approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

[CV-88] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

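Quick read: This paper addresses how, in video-to-audio generation, to capture and synthesize multiple semantically distinct audio tracks so as to produce higher-quality composite audio. The key to its solution is a step-by-step generation method in which each step is formulated as a video-to-audio synthesis task conditioned on a text prompt and the previously generated audio tracks, mirroring traditional Foley workflows so that all sound events in the video are captured comprehensively.

Link: https://arxiv.org/abs/2506.20995
Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
Affiliations: Sony AI; Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: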

Abstract:We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
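
The "concept negation" idea the abstract borrows from compositional generation is often written as classifier-free guidance with an extra term subtracted for the concept to avoid. The sketch below shows that generic form on plain Python lists; it is an assumption about the general mechanism, not the paper's exact guidance rule, and the weights and toy vectors are hypothetical.

```python
def negated_guidance(uncond, text_cond, audio_cond, text_weight=4.0, negative_weight=1.0):
    """Combine model outputs: steer toward the text prompt and away from earlier audio.

    uncond:     model output with no conditioning
    text_cond:  model output conditioned on the target text prompt
    audio_cond: model output conditioned on previously generated audio tracks
    """
    return [
        u + text_weight * (t - u) - negative_weight * (a - u)
        for u, t, a in zip(uncond, text_cond, audio_cond)
    ]


# Toy 2-dimensional outputs, for illustration only.
uncond = [0.0, 0.0]
text_cond = [1.0, 0.0]
audio_cond = [0.0, 1.0]
guided = negated_guidance(uncond, text_cond, audio_cond)  # [4.0, -1.0]
```

Setting `negative_weight` to 0 recovers ordinary classifier-free guidance; a positive value pushes each new track away from sounds that earlier steps already produced.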

[CV-89] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation

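Quick read: This paper addresses the weak performance of existing 3D vision-language models (3D VLMs) on point-level tasks such as segmentation. The root cause is missing direct 3D-text alignment, which limits the models' ability to link local 3D features with textual context. The key to its solution is TSDASeg, which contains a direct cross-modal alignment module and a memory module. It establishes explicit alignment between 3D point clouds and textual/2D image data, stores text features, visual features, and their cross-modal correspondence mappings in multiple dedicated memory banks, and uses self-attention and cross-attention to update scene-specific features dynamically, mitigating inconsistent interactive segmentation results across scenarios.

Link: https://arxiv.org/abs/2506.20991
Authors: Chade Li, Pengju Zhang, Yihong Wu
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: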

Abstract:The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.

[CV-90] Segment Anything in Pathology Images with Natural Language

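Quick read: This paper addresses two major challenges for pathology image segmentation in clinical applications: limited annotated data and restricted category definitions. The key to its solution is PathSegmentor, the first text-prompted segmentation foundation model designed for pathology images, together with PathSeg, the largest dataset for pathology segmentation. With natural language prompts, PathSegmentor performs semantic segmentation without laborious spatial inputs such as points or boxes, which improves the model's accuracy and applicability.

Link: https://arxiv.org/abs/2506.20988
Authors: Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen
Affiliations: The Hong Kong University of Science and Technology; University of Science and Technology of China; The Chinese University of Hong Kong; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: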

Abstract:Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 17 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor’s outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.

[CV-91] EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning

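Quick read: This paper addresses two problems in Compositional Zero-Shot Learning (CZSL): a simple composition-prototype mapping yields inadequate primitive feature representations, and cross-modal primitive matching neglects compositional divergence, both of which hurt fine-grained image-composition alignment. The key to its solution is the EVA framework, which uses domain-expert adaption to achieve token-aware learning and build high-quality primitive representations, and introduces semantic variant alignment to select semantically relevant representations for image-primitive matching, improving compositional generalization.

Link: https://arxiv.org/abs/2506.20986
Authors: Xiao Zhang, Yongqiang Ma, Haodong Jing, Nanning Zheng
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: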

Abstract:Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the all-to-one cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaption, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representations for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.

[CV-92] Rethink Sparse Signals for Pose-guided Text-to-image Generation ICCV2025

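[Quick read]: This paper addresses the editing difficulties and potential inconsistencies with text prompts that arise when dense signals (e.g., depth maps, DensePose) are used for pose guidance in text-to-image generation. The key to its solution is to revisit sparse signals (e.g., OpenPose) for pose guidance and to propose Spatial-Pose ControlNet (SP-Ctrl), which gives sparse signals robust controllability. Specifically, the method extends OpenPose to a learnable spatial representation that makes keypoint embeddings more discriminative and expressive, and introduces keypoint concept learning to improve pose alignment. Experiments show that, under sparse-pose guidance, the method outperforms existing spatially controllable text-to-image generation methods and is on par with dense-signal-based methods.

Link: https://arxiv.org/abs/2506.20983
Authors: Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Affiliations: Wuhan University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCV 2025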

Abstract:Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at this https URL.

[CV-93] 3D Scene-Camera Representation with Joint Camera Photometric Optimization

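[Quick read]: This paper aims to address the degradation of 3D scene representation quality caused by photometric distortions inherent to camera imaging in multi-view images. The key to its solution is a 3D scene-camera representation with joint camera photometric optimization: internal and external photometric models are combined into a full photometric model, the parameters of the camera representation are optimized simultaneously so that scene-unrelated information is separated out, and depth regularization keeps the 3D scene representation from fitting scene-unrelated information.

Link: https://arxiv.org/abs/2506.20979
Authors: Weichen Dai, Kangcheng Ma, Jiaxin Wang, Kecen Pan, Yuhang Ming, Hua Zhang, Wanzeng Kong
Affiliations: Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: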

Abstract:Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in the camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric models, we propose a full photometric model and corresponding camera representation. Based on simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.

[CV-94] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

[Quick read]: This paper addresses the trade-off between age accuracy and identity preservation in face aging, which the authors call the Age-ID trade-off. Existing methods struggle to produce realistic, seamless transformations across large age gaps or extreme head poses. The key to its solution is Cradle2Cane, a two-pass face aging framework built on few-step text-to-image (T2I) diffusion models. The first pass secures age accuracy through an adaptive noise injection (AdaNI) mechanism; the second pass strengthens identity preservation by conditioning on two identity-aware embeddings (IDEmb), retaining age-specific features while improving identity consistency.

Link: https://arxiv.org/abs/2506.20977
Authors: Tao Liu, Dafeng Zhang, Gengchen Li, Shizhuo Liu, Yongqi Song, Senmao Li, Shiqi Yang, Boqian Li, Kai Wang, Yaxing Wang
Affiliations: VCIP, College of Computer Science, Nankai University; Samsung Research China - Beijing (SRC-B); School of Electrical and Information Engineering, Zhengzhou University; SB Intuitions, SoftBank; School of Computer, Zhengzhou University of Aeronautics; Computer Vision Center, Universitat Autónoma de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 12 figures

Abstract:Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation–what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.

[CV-95] ThermalDiffusion: Visual-to-Thermal Image-to-Image Translation for Autonomous Navigation ICRA2025

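[Quick read]: This paper addresses the data scarcity that keeps thermal imaging from being widely adopted in robotics and automation. The key to its solution is to use conditional diffusion models to convert existing RGB images into synthetic thermal images, with self-attention used to learn the thermal properties of real-world objects, thereby augmenting multi-modal datasets with thermal imagery.

Link: https://arxiv.org/abs/2506.20969
Authors: Shruti Bansal, Wenshan Wang, Yifei Liu, Parv Maheshwari
Affiliations: Carnegie Mellon University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Thermal Infrared in Robotics (TIRO) Workshop, ICRA 2025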

Abstract:Autonomous systems rely on sensors to estimate the environment around them. However, cameras, LiDARs, and RADARs have their own limitations. In nighttime or degraded environments such as fog, mist, or dust, thermal cameras can provide valuable information regarding the presence of objects of interest due to their heat signature. They make it easy to identify humans and vehicles that are usually at higher temperatures compared to their surroundings. In this paper, we focus on the adaptation of thermal cameras for robotics and automation, where the biggest hurdle is the lack of data. Several multi-modal datasets are available for driving robotics research in tasks such as scene segmentation, object detection, and depth estimation, which are the cornerstone of autonomous systems. However, they are found to be lacking in thermal imagery. Our paper proposes a solution to augment these datasets with synthetic thermal data to enable widespread and rapid adaptation of thermal cameras. We explore the use of conditional diffusion models to convert existing RGB images to thermal images using self-attention to learn the thermal properties of real-world objects.

[CV-96] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

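[Quick read]: This paper aims to reduce the heavy computational overhead of video editing on Video Diffusion Transformers (Video DiTs). Applying existing video editing methods to Video DiTs usually requires resource-intensive attention modification or fine-tuning, which is inefficient. The key to its solution is DFVEdit, which operates directly on clean latents via flow transformation, avoiding both attention modification and fine-tuning and thus substantially improving computational efficiency.

Link: https://arxiv.org/abs/2506.20967
Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang, Kejie Huang
Affiliations: Zhejiang University; Tongyi Lab; DAMO Academy; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Zero-shot video editing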

Abstract:The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) – a theoretically unbiased estimation of DFV – and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

[CV-97] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

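[Quick read]: This paper aims to address limitations of multimodal large language models (MLLMs) in computational pathology: insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous diagnostic reasoning. The key to its solution is PathChat+, a new-generation MLLM designed for human pathology and trained on more than 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question-answer turns, which substantially improves performance on pathology analysis. The paper also introduces SlideSeek, a system that uses PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) and reaches high differential-diagnosis accuracy through iterative, hierarchical diagnostic reasoning.

Link: https://arxiv.org/abs/2506.20964
Authors: Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, Faisal Mahmood
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: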

Abstract:Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly-interpretable summary reports.

[CV-98] OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual Auditory and Textual Inputs

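[Quick read]: This paper addresses how to evaluate the ability of omni-modal models to construct and understand coherence from context spanning visual, auditory, and textual inputs. The key to its solution is the OmniEval benchmark, which evaluates such models comprehensively through full-modal collaboration, video diversity, and diverse, fine-grained tasks, including a more granular video localization task called Grounding that makes the evaluation more precise and in-depth.

Link: https://arxiv.org/abs/2506.20960
Authors: Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han
Affiliations: Huawei Noah’s Ark Lab; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: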

Abstract:In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Codes and data could be found at this https URL.

[CV-99] Hierarchical Sub-action Tree for Continuous Sign Language Recognition

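[Quick read]: This paper aims to address the shortage of training data in Continuous Sign Language Recognition (CSLR) caused by the lack of large datasets and precise annotations. The key to its solution is the Hierarchical Sub-action Tree (HST), which combines gloss knowledge with visual representation learning so that textual information is used more effectively. Specifically, the method builds an HST to represent textual information, aligns the visual and textual modalities step by step, uses the tree structure to reduce computational complexity, and adds a contrastive alignment enhancement to narrow the gap between the two modalities.

Link: https://arxiv.org/abs/2506.20947
Authors: Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu
Affiliations: Peking University; State Key Laboratory of General Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: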

Abstract:Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.

[CV-100] Consistent Zero-shot 3D Texture Synthesis Using Geometry-aware Diffusion and Temporal Video Models

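[Quick read]: This paper aims to address the inconsistencies that existing texture synthesis methods produce for lack of global context and geometric understanding, as well as the spatial and temporal incoherence of 3D textures. The key to its solution is the VideoTex framework, which uses video generation models to handle spatial and temporal inconsistencies in 3D textures, adds geometry-aware conditions so that 3D mesh structure is used precisely, and proposes a structure-wise UV diffusion strategy that improves generation in occluded regions while preserving semantic information, yielding smoother and more coherent textures.

Link: https://arxiv.org/abs/2506.20946
Authors: Donggoo Kang, Jangyeong Kim, Dasol Jeong, Junyoung Choi, Jeonga Wi, Hyunmin Lee, Joonho Gwon, Joonki Paik
Affiliations: Chung-Ang University; NCSOFT; University of Seoul
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: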

Abstract:Current texture synthesis methods, which generate textures from fixed viewpoints, suffer from inconsistencies due to the lack of global context and geometric understanding. Meanwhile, recent advancements in video generation models have demonstrated remarkable success in achieving temporally consistent videos. In this paper, we introduce VideoTex, a novel framework for seamless texture synthesis that leverages video generation models to address both spatial and temporal inconsistencies in 3D textures. Our approach incorporates geometry-aware conditions, enabling precise utilization of 3D mesh structures. Additionally, we propose a structure-wise UV diffusion strategy, which enhances the generation of occluded areas by preserving semantic information, resulting in smoother and more coherent textures. VideoTex not only achieves smoother transitions across UV boundaries but also ensures high-quality, temporally stable textures across video frames. Extensive experiments demonstrate that VideoTex outperforms existing methods in texture fidelity, seam blending, and stability, paving the way for dynamic real-time applications that demand both visual quality and temporal coherence.

[CV-101] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather A Dataset and Benchmark

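[Quick read]: This paper addresses the absence, in aviation weather research, of a large, publicly available dataset for atmospheric visibility estimation that is labeled with visibility estimates and suitable for supervised learning. The key to its solution is a new dataset produced by a year-long data collection campaign over the U.S. Federal Aviation Administration (FAA) weather camera network, intended to support supervised machine-learning approaches in this area. The paper also presents a benchmark in which three commonly used approaches and a general-purpose baseline are trained and tested on three publicly available datasets in addition to the new one, and compared against a recently ratified ASTM standard.

Link: https://arxiv.org/abs/2506.20939
Authors: Chad Mourning, Zhewei Wang, Justin Murray
Affiliations: Ohio University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, meant as citation for dataset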

Abstract:Machine Learning for aviation weather is a growing area of research for providing low-cost alternatives for traditional, expensive weather sensors; however, in the area of atmospheric visibility estimation, publicly available datasets, tagged with visibility estimates, of distances relevant for aviation, of diverse locations, of sufficient size for use in supervised learning, are absent. This paper introduces a new dataset which represents the culmination of a year-long data collection campaign of images from the FAA weather camera network suitable for this purpose. We also present a benchmark when applying three commonly used approaches and a general-purpose baseline when trained and tested on three publicly available datasets, in addition to our own, when compared against a recently ratified ASTM standard.

[CV-102] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling ICCV2025

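[Quick read]: This paper aims to address the shortcomings of conventional Linear Blend Skinning (LBS) in animation, articulated object reconstruction, motion transfer, and 4D generation, such as volume loss, unnatural deformations, and the inability to model elastic materials (e.g., soft tissue, fur, and flexible appendages). The key to its solution is PhysRig, a differentiable physics-based skinning and rigging framework that embeds the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh) and simulates it as a deformable soft-body structure driven by the animated skeleton, giving more physically plausible results. The method draws on continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion.

Link: https://arxiv.org/abs/2506.20936
Authors: Hao Zhang, Haolan Xu, Chun Feng, Varun Jampani, Narendra Ahuja
Affiliations: University of Illinois Urbana Champaign; Stability AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025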

Abstract:Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task highlighting its versatility for articulated object modeling.
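
For context on the baseline this abstract criticizes, Linear Blend Skinning moves each vertex to a weighted average of where each bone's rigid transform would send it: v' = Σ_i w_i (R_i v + t_i). The sketch below implements only that baseline formula, not PhysRig; the function names are illustrative assumptions.

```python
import math


def apply_transform(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation as nested lists, 3-vector translation) to a 3D point."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def linear_blend_skinning(point, weights, transforms):
    """Blend the per-bone transformed positions of `point` using skinning weights.

    weights: one non-negative weight per bone, expected to sum to 1.
    transforms: one (rotation, translation) pair per bone.
    """
    if len(weights) != len(transforms):
        raise ValueError("need exactly one weight per bone")
    blended = [0.0, 0.0, 0.0]
    for w, (rotation, translation) in zip(weights, transforms):
        moved = apply_transform(rotation, translation, point)
        for axis in range(3):
            blended[axis] += w * moved[axis]
    return blended


def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Blending an identity transform with a half-turn `rotation_z(math.pi)` at equal weights sends the point (1, 0, 0) to approximately the origin, that is, onto the rotation axis. This collapse is an instance of the volume-loss artifact the abstract attributes to LBS.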

[CV-103] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization ICCV

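[Quick read]: This paper aims to address pixel-level forgery localization in digital image tampering detection, where existing deep-learning methods often suffer from high computational overhead and limited representation power, particularly for subtle or complex tampering. The key to its solution is the M2SFormer framework, which unifies multi-frequency and multi-scale attention in the skip connections and uses global context to better capture diverse forgery artifacts. It also uses a global prior map, a curvature metric indicating localization difficulty, to guide a difficulty-guided attention module that better preserves subtle manipulation details.

Link: https://arxiv.org/abs/2506.20922
Authors: Ju-Hyeon Nam, Dong-Hyun Moon, Sang-Chul Lee
Affiliations: Inha University; DeepCardio
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in International Conference on Computer Vision (ICCV) 2025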

Abstract:Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.

[CV-104] FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

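[Quick read]: This paper addresses the difficulty of achieving both cost efficiency and accuracy in multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." The key to its solution is a neurosymbolic agent that combines fast, high-level subtask planning by large language models (LLMs) with slow, accurate tool use and local A* search per subtask to find a cost-efficient toolpath. By performing inductive reasoning with LLMs over previously successful toolpaths, it continuously extracts and refines frequently used subroutines and reuses them as new tools for future tasks, giving an adaptive fast-slow planner that considerably reduces exploration cost for the same types of subtasks on similar images.

Link: https://arxiv.org/abs/2506.20911
Authors: Advait Gupta, Rishie Raj, Dang Nguyen, Tianyi Zhou
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: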

Abstract:We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A* search per subtask to find a cost-efficient toolpath – a sequence of calls to AI tools. To save the cost of A* on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A* search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA*": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A* search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
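
The fast-slow idea in this abstract (try a stored subroutine first, fall back to A* search only when none works) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the state names, the `works` callback, and the caching rule are hypothetical, and the paper's LLM-based subroutine mining, cost model, and heuristic are not reproduced.

```python
import heapq
from itertools import count


def a_star(start, goal, neighbors, heuristic):
    """Standard A* search. neighbors(state) returns (tool_name, next_state, cost) triples."""
    tie = count()  # tiebreaker so the heap never has to compare states or paths
    frontier = [(heuristic(start), next(tie), 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, cost
        if cost > best_cost[state]:
            continue  # stale entry: a cheaper route to this state was found later
        for tool, nxt, step_cost in neighbors(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                entry = (new_cost + heuristic(nxt), next(tie), new_cost, nxt, path + [tool])
                heapq.heappush(frontier, entry)
    return None, float("inf")


def plan_subtask(subtask, start, goal, subroutines, neighbors, heuristic, works):
    """Fast path: reuse a stored toolpath if it works. Slow path: run A* and cache the result."""
    stored = subroutines.get(subtask)
    if stored is not None and works(stored):
        return stored, "fast"
    path, _ = a_star(start, goal, neighbors, heuristic)
    if path is not None:
        subroutines[subtask] = path  # hypothetical caching rule for later reuse
    return path, "slow"
```

With a zero heuristic the search reduces to uniform-cost search, which is still optimal; a tighter admissible heuristic only reduces the number of states expanded.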

[CV-105] The Role of Cyclopean-Eye in Stereo Vision

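[Quick read]: This paper examines how 3D structure and human-inspired perception together contribute to accurate depth reconstruction in modern stereo vision systems, particularly under geometric challenges such as occlusions and depth discontinuities. The key to its solution is to revisit the Cyclopean Eye model and propose new geometric constraints, while also evaluating the stereo feature matching quality of deep learning models and the role of attention mechanisms in recovering meaningful 3D surfaces. Through theoretical analysis and empirical studies on real datasets, it shows that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.

Link: https://arxiv.org/abs/2506.20900
Authors: Sherlon Almeida da Silva, Davi Geiger, Luiz Velho, Moacir Antonelli Ponti
Affiliations: Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (ICMC-USP), São Carlos, SP, Brazil; Courant Institute of Mathematical Sciences, New York University (NYU), New York, NY, United States; Instituto de Matemática Pura e Aplicada (IMPA), Rio de Janeiro, RJ, Brazil
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2502.21280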

Abstract:This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.

[CV-106] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

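[Quick read]: This paper addresses the difficulty of generating images of multiple humans while preserving their facial identities, a major obstacle being the lack of a dedicated benchmark. The key to its solution is MultiHuman-Testbench, a new benchmark of 1800 samples with text prompts describing simple to complex human actions, matched with 5,550 unique face images chosen for diversity across age, ethnic background, and gender. The paper also proposes a multi-faceted evaluation suite with four key metrics that quantify face count, ID similarity, prompt alignment, and action detection, and introduces techniques based on human segmentation and Hungarian matching that significantly improve ID similarity.

Link: https://arxiv.org/abs/2506.20879
Authors: Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: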

Abstract:Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.
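
The abstract mentions Hungarian matching without detail. The sketch below illustrates the underlying one-to-one assignment problem under stated assumptions: a square matrix where `similarity[i][j]` scores reference face i against generated face j, which is a hypothetical setup and not the paper's pipeline. For brevity it uses brute-force enumeration rather than the Hungarian algorithm itself; both return a maximum-total assignment, but enumeration is only practical for a handful of faces. In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would normally be used.

```python
from itertools import permutations


def best_identity_assignment(similarity):
    """Return (assignment, total) maximizing the summed similarity of a one-to-one matching.

    similarity[i][j]: similarity between reference face i and generated face j
    (square matrix). assignment[i] is the generated face matched to reference i.
    """
    n = len(similarity)
    if any(len(row) != n for row in similarity):
        raise ValueError("similarity must be a square matrix")
    best_perm, best_total = None, float("-inf")
    # Enumerate every one-to-one matching; stand-in for the Hungarian algorithm.
    for perm in permutations(range(n)):
        total = sum(similarity[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total
```

The point of solving the assignment globally is that a greedy per-face choice can be suboptimal: taking the single best pair first may force a very poor match for the remaining faces.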

[CV-107] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion

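[Quick read]: This paper aims to address the fact that conventional monocular depth estimation relies on implicit learning and overlooks the explicit monocular cues the human visual system depends on, such as occlusion boundaries, shading, and perspective. The key to its solution is ThirdEye, a cue-aware pipeline that supplies each cue explicitly through specialised, pre-trained, frozen networks, fuses the cues in a three-stage cortical hierarchy (V1-V2-V3) equipped with a key-value working-memory module that weights them by reliability, and produces a high-resolution disparity map through an adaptive-bins transformer head.

Link: https://arxiv.org/abs/2506.20877
Authors: Calin Teodor Ioan
Affiliations: DartLabs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: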

Abstract:Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1-V2-V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.

[CV-108] 3DGH: 3D Head Generation with Composable Hair and Face SIGGRAPH2025

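[Quick read]: This paper addresses the modeling difficulty caused by entangled hair and face geometry in 3D human head generation, with the goal of composable 3D hairstyle editing and unconditional full-head image synthesis. The key to its solution is a data representation based on template-based 3D Gaussian Splatting that separates hair from face and introduces deformable hair geometry to capture geometric variation across hairstyles, together with a 3D GAN-based architecture with dual generators and a cross-attention mechanism that models the inherent correlation between hair and face.

Link: https://arxiv.org/abs/2506.20875
Authors: Chengan He, Junxuan Li, Tobias Kirschstein, Artem Sevastopolsky, Shunsuke Saito, Qingyang Tan, Javier Romero, Chen Cao, Holly Rushmeier, Giljoo Nam
Affiliations: Yale University; Meta Codec Avatars Lab; Technical University of Munich
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH 2025. Project page: this https URL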

Abstract:We present 3DGH, an unconditional generative model for 3D human heads with composable hair and face components. Unlike previous work that entangles the modeling of hair and face, we propose to separate them using a novel data representation with template-based 3D Gaussian Splatting, in which deformable hair geometry is introduced to capture the geometric variations across different hairstyles. Based on this data representation, we design a 3D GAN-based architecture with dual generators and employ a cross-attention mechanism to model the inherent correlation between hair and face. The model is trained on synthetic renderings using carefully designed objectives to stabilize training and facilitate hair-face separation. We conduct extensive experiments to validate the design choice of 3DGH, and evaluate it both qualitatively and quantitatively by comparing with several state-of-the-art 3D GAN methods, demonstrating its effectiveness in unconditional full-head image synthesis and composable 3D hairstyle editing. More details will be available on our project page: this https URL.

[CV-109] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation

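[Quick Read]: This paper aims to improve recognition of ambiguous facial expressions in dynamic facial expression recognition (DFER), a form of uncertainty that appears frequently in in-the-wild data. The key to the solution is MIDAS, a data augmentation method that uses soft labels representing the probabilities of multiple emotion classes and convexly combines pairs of video frames together with their corresponding emotion labels. This extends mixup to soft-labeled video data and offers a simple yet effective way to handle ambiguity in DFER.

Link: https://arxiv.org/abs/2506.20867
Authors: Ryosuke Kawamura, Hideaki Hayashi, Shunsuke Otake, Noriko Takemura, Hajime Nagahara
Affiliations: Fujitsu Research of America, Inc.; University of Osaka; Kyushu Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: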

Abstract:Dynamic facial expression recognition (DFER) is a task that estimates emotions from facial expression video sequences. For practical applications, accurately recognizing ambiguous facial expressions – frequently encountered in in-the-wild data – is essential. In this study, we propose MIDAS, a data augmentation method designed to enhance DFER performance for ambiguous facial expression data using soft labels representing probabilities of multiple emotion classes. MIDAS augments training data by convexly combining pairs of video frames and their corresponding emotion class labels. This approach extends mixup to soft-labeled video data, offering a simple yet highly effective method for handling ambiguity in DFER. To evaluate MIDAS, we conducted experiments on both the DFEW dataset and FERV39k-Plus, a newly constructed dataset that assigns soft labels to an existing DFER dataset. The results demonstrate that models trained with MIDAS-augmented data achieve superior performance compared to the state-of-the-art method trained on the original dataset.
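
The convex combination described in the abstract can be illustrated with a minimal mixup-style sketch. This is not the authors' implementation: the function name `mixup_soft_labels`, the Beta-distributed mixing coefficient with `alpha=0.4`, and the clip shape `(frames, height, width, channels)` are all assumptions made here for illustration.

```python
import numpy as np

def mixup_soft_labels(clip_a, label_a, clip_b, label_b, alpha=0.4, rng=None):
    """Convexly combine two video clips and their soft labels (mixup-style).

    alpha parameterises a Beta(alpha, alpha) prior on the mixing weight;
    this choice follows the usual mixup recipe and is an assumption here.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_clip, mixed_label, lam

# Toy data: two 8-frame clips and soft labels over three emotion classes.
rng = np.random.default_rng(0)
clip_a = rng.random((8, 32, 32, 3))
clip_b = rng.random((8, 32, 32, 3))
label_a = np.array([0.7, 0.2, 0.1])
label_b = np.array([0.1, 0.1, 0.8])

mixed_clip, mixed_label, lam = mixup_soft_labels(
    clip_a, label_a, clip_b, label_b, rng=rng
)
```

Because both inputs are probability vectors and the weights sum to one, the mixed label remains a valid probability vector, which is what lets soft labels pass through the augmentation unchanged in form.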

[CV-110] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision ICCV2025

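[Quick Read]: This paper addresses the application of contrastive learning (CL) to pixel-wise representation learning, in particular preserving pixel-wise feature correlation in medical vision. Standard CL formulates self-supervised pretraining as a binary optimization problem (binary CL), where the excessive pursuit of feature dispersion breaks pixel-wise feature correlation and thus disrupts the intra-class distribution. The key to the solution is to reformulate CL as a vector regression problem, quantifying dispersion in pixel-wise pretraining by modeling feature distances when regressing displacement vectors. To this end, the authors propose the COntrast in VEctor Regression (COVER) framework, which uses a vector pyramid architecture for granularity adaptation and enforces a consistent optimization flow from vector regression to distance modeling, thereby preserving pixel-wise feature correlations during self-supervised pretraining.

Link: https://arxiv.org/abs/2506.20850
Authors: Yuting He, Shuo Li
Affiliations: Case Western Reserve University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025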

Abstract:Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models, however, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.

[CV-111] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization

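[Quick Read]: This paper addresses the limited ability of semi-supervised domain generalization (SSDG) methods to generalize to out-of-distribution data when only a few labels are available. Existing methods combine semi-supervised learning with regularization terms but do not explicitly learn domain-invariant representations across all domains, which is a core goal of domain generalization. To address this, the authors propose FixCLR, whose key idea is to adapt contrastive learning for explicit domain-invariance regularization by using class information from pseudo-labels and only a repelling term. FixCLR can also be combined with most existing SSDG and semi-supervised methods to improve performance.

Link: https://arxiv.org/abs/2506.20841
Authors: Ha Min Son, Shahbaz Rezaei, Xin Liu
Affiliations: University of California, Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: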

Abstract:Semi-supervised domain generalization (SSDG) aims to solve the problem of generalizing to out-of-distribution data when only a few labels are available. Due to label scarcity, applying domain generalization methods often underperform. Consequently, existing SSDG methods combine semi-supervised learning methods with various regularization terms. However, these methods do not explicitly regularize to learn domains invariant representations across all domains, which is a key goal for domain generalization. To address this, we introduce FixCLR. Inspired by success in self-supervised learning, we change two crucial components to adapt contrastive learning for explicit domain invariance regularization: utilization of class information from pseudo-labels and using only a repelling term. FixCLR can also be added on top of most existing SSDG and semi-supervised methods for complementary performance improvements. Our research includes extensive experiments that have not been previously explored in SSDG studies. These experiments include benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. Overall, FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods.
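
To make the "repelling term only" idea concrete, here is a rough sketch of a contrastive loss that uses pseudo-labels and has no attracting term. The exact FixCLR loss is not given in the abstract, so the formulation below (a log-sum-exp over cosine similarities to samples with a different pseudo-label), the name `repel_only_loss`, and the `temperature` value are assumptions for illustration only.

```python
import numpy as np

def repel_only_loss(features, pseudo_labels, temperature=0.1):
    """Penalise similarity between samples whose pseudo-labels differ.

    Hypothetical formulation: for each anchor, take the log-sum-exp of its
    scaled cosine similarities to all different-class samples. There is no
    term pulling same-class samples together.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    negatives = pseudo_labels[:, None] != pseudo_labels[None, :]
    losses = []
    for i in range(len(z)):
        neg_sims = sim[i, negatives[i]]
        if neg_sims.size:
            m = neg_sims.max()  # subtract the max for numerical stability
            losses.append(m + np.log(np.exp(neg_sims - m).sum()))
    return float(np.mean(losses)) if losses else 0.0

labels = np.array([0, 0, 1, 1])
separated = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]])
collapsed = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

loss_separated = repel_only_loss(separated, labels)
loss_collapsed = repel_only_loss(collapsed, labels)
```

In this toy setup the loss is lower when the two pseudo-classes point in opposite directions than when all features collapse onto one direction, which is the behaviour a repelling term is meant to encourage.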

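[CV-112] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models

[Quick Read]: This paper addresses trustworthiness assessment of generated images in super-resolution (SR), in particular how to efficiently and automatically select the single most trustworthy output when a diffusion model generates a diverse set of SR images. The key to the solution is to exploit the semantic reasoning capabilities of vision-language models (VLMs), which are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. A new Trustworthiness Score (TWS) is proposed to assess the validity of the VLM-selected samples, enabling reliable selection among diffusion-generated SR samples.

Link: https://arxiv.org/abs/2506.20832
Authors: Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan
Affiliations: Koç University; TUBITAK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures, 5 tables, accepted to IEEE Transactions on Circuits and Systems for Video Technology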

Abstract:Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

[CV-113] Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers

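[Quick Read]: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial inputs with limited noise budgets, and in particular the fact that existing detection methods are either ineffective against state-of-the-art attacks or computationally inefficient. The key to the solution is a universal and efficient adversarial example detection method that analyzes the varying degrees of impact an attack has on different DNN layers: a lightweight regression model is trained to predict deeper-layer features from early-layer features, and its prediction error is used to identify adversarial samples.

Link: https://arxiv.org/abs/2506.20816
Authors: Furkan Mumcu, Yasin Yilmaz
Affiliations: University of South Florida
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.17442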

Abstract:Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input designs with limited noise budgets. While numerous successful attacks with subtle modifications to original input have been proposed, defense techniques against these attacks are relatively understudied. Existing defense approaches either focus on improving DNN robustness by negating the effects of perturbations or use a secondary model to detect adversarial data. Although equally important, the attack detection approach, which is studied in this work, provides a more practical defense compared to the robustness approach. We show that the existing detection methods are either ineffective against the state-of-the-art attack techniques or computationally inefficient for real-time processing. We propose a novel universal and efficient method to detect adversarial examples by analyzing the varying degrees of impact of attacks on different DNN layers. Our method trains a lightweight regression model that predicts deeper-layer features from early-layer features, and uses the prediction error to detect adversarial samples. Through theoretical arguments and extensive experiments, we demonstrate that our detection method is highly effective, computationally efficient for real-time processing, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.
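
The detection idea in the abstract (predict deeper-layer features from early-layer features, then threshold the prediction error) can be sketched on synthetic data. Everything below is an illustrative assumption rather than the paper's setup: the features are random stand-ins produced by the hypothetical helper `fake_features`, the regressor is plain least squares, and the threshold is the 95th percentile of clean scores.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(16, 8))  # hidden early-to-deep relationship (synthetic)

def fake_features(n, deep_shift=0.0):
    """Synthetic stand-in for (early-layer, deep-layer) activations.

    deep_shift > 0 mimics an attack that disturbs deep features far more
    than early ones; this is an assumption of the toy setup.
    """
    early = rng.normal(size=(n, 16))
    deep = early @ W_true + 0.05 * rng.normal(size=(n, 8))
    deep = deep + deep_shift * rng.normal(size=(n, 8))
    return early, deep

# Fit a lightweight linear regressor on clean data only.
early_train, deep_train = fake_features(500)
W_hat, *_ = np.linalg.lstsq(early_train, deep_train, rcond=None)

def detection_score(early, deep):
    """Per-sample prediction error; larger means more suspicious."""
    return np.linalg.norm(early @ W_hat - deep, axis=1)

clean_scores = detection_score(*fake_features(200))
threshold = np.quantile(clean_scores, 0.95)

adv_scores = detection_score(*fake_features(200, deep_shift=1.0))
flagged = adv_scores > threshold
```

The detector never sees adversarial samples during fitting; it only learns what the early-to-deep relationship looks like on clean inputs, which is why the same score can in principle be reused across attack types.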

[CV-114] Model-Based Real-Time Pose and Sag Estimation of Overhead Power Lines Using LiDAR for Drone Inspection

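[Quick Read]: This paper addresses the challenges of localizing a drone relative to energized overhead power lines with an onboard LiDAR sensor: conductors offer little reflective surface, so few conductor points appear in a scan; not all conductors are consistently detected; and LiDAR points from conductors are hard to distinguish from those of other objects such as trees and pylons. The key to the solution is an estimation approach that minimizes the error between the LiDAR measurements and a single geometric model representing the entire conductor array, rather than tracking each conductor separately.

Link: https://arxiv.org/abs/2506.20812
Authors: Alexandre Girard, Steven A. Parkison, Philippe Hamelin
Affiliations: Universite de Sherbrooke; Hydro-Québec Research Institute (IREQ)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE case 2025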

Abstract:Drones can inspect overhead power lines while they remain energized, significantly simplifying the inspection process. However, localizing a drone relative to all conductors using an onboard LiDAR sensor presents several challenges: (1) conductors provide minimal surface for LiDAR beams limiting the number of conductor points in a scan, (2) not all conductors are consistently detected, and (3) distinguishing LiDAR points corresponding to conductors from other objects, such as trees and pylons, is difficult. This paper proposes an estimation approach that minimizes the error between LiDAR measurements and a single geometric model representing the entire conductor array, rather than tracking individual conductors separately. Experimental results, using data from a power line drone inspection, demonstrate that this method achieves accurate tracking, with a solver converging under 50 ms per frame, even in the presence of partial observations, noise, and outliers. A sensitivity analysis shows that the estimation approach can tolerate up to twice as many outlier points as valid conductors measurements.

[CV-115] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

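[Quick Read]: This paper addresses efficient non-verbal human-robot communication in noisy environments such as agile production, focusing on the challenge of dynamic, full-body gesture recognition. The key idea is to investigate whether Vision Foundation Models (VFMs) and Vision Language Models (VLMs) can replace dedicated task-specific modules and thereby reduce system complexity. The study compares V-JEPA, Gemini Flash 2.0, and HD-GCN, and introduces the NUGGET dataset for evaluation. The skeleton-based HD-GCN performs best, but V-JEPA comes close with a simple task-specific classification head, indicating its potential as a shared multi-task model.

Link: https://arxiv.org/abs/2506.20795
Authors: Stephanie Käs, Anton Burenko, Louis Markert, Onur Alp Culha, Dennis Mack, Timm Linder, Bastian Leibe
Affiliations: RWTH Aachen University; Robert Bosch GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: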

Abstract:Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves best performance, but V-JEPA comes close with a simple, task-specific classification head - thus paving a possible way towards reducing system complexity, by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need of further research on suitable input representations for gestures.

[CV-116] AI-Driven MRI-based Brain Tumour Segmentation Benchmarking

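[Quick Read]: This paper addresses the lack of systematic evaluation and comparison of different models across a range of prompt qualities in medical image segmentation. The key to the study is running zero-shot inference with Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net on the BraTS 2023 adult glioma and pediatrics datasets, and evaluating each model under point and bounding-box prompts of varying quality, thereby revealing the strengths and limitations of each model in practical use.

Link: https://arxiv.org/abs/2506.20786
Authors: Connor Ludwig, Khashayar Namdar, Farzad Khalvati
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: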

Abstract:Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2 achieving scores of up to 0.894 and 0.893, respectively when given extremely accurate bounding box prompts which exceeds nnU-Net’s segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.

[CV-117] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations

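[Quick Read]: This paper addresses insufficient alignment between visual and tactile modalities during feature fusion. Conventional methods fuse the modalities by direct combination, such as feature addition and concatenation, which leads to poor feature integration. The key to the solution is a Contrastive Embedding Conditioning (CEC) mechanism, which uses a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings couple visual-tactile feature fusion through cross-modal attention, improving feature alignment and downstream task performance.

Link: https://arxiv.org/abs/2506.20757
Authors: Zhiyuan Wu, Yongqiang Zhao, Shan Luo
Affiliations: King’s College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: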

Abstract:Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.

[CV-118] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

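[Quick Read]: This paper addresses the different temporal-consistency requirements of dynamic and static regions in video depth estimation. Conventional methods treat video depth estimation as a simple extension of image depth estimation, ignoring that static regions (such as backgrounds) and dynamic regions differ fundamentally in their consistency requirements. The key to the solution is StereoDiff, a two-stage video depth estimator that applies stereo matching in static regions to obtain stronger global 3D cues and uses video depth diffusion in dynamic regions to keep depth transitions smooth, combining the strengths of both.

Link: https://arxiv.org/abs/2506.20756
Authors: Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu
Affiliations: University of Pennsylvania; HKUST (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work done in Nov. 2024. Project page: this https URL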

Abstract:Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. While the consistency for dynamic regions still should be learned from large-scale video depth data to ensure smooth transitions, due to the violation of triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff’s SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.

[CV-119] OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

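[Quick Read]: This paper addresses the modeling of pathological heterogeneity for survival prediction from whole slide images (WSIs). Existing multiple instance learning (MIL) methods struggle to explicitly capture the global long-tailed morphological distribution and the local tile-level prediction uncertainty within WSIs. The key to the solution is OTSurv, a framework built from an optimal transport (OT) perspective that introduces two constraints: a global long-tail constraint that regulates transport mass allocation to avoid both mode collapse and excessive uniformity, and a local uncertainty-aware constraint that prioritizes high-confidence patches and suppresses noise by progressively raising the total transport mass.

Link: https://arxiv.org/abs/2506.20741
Authors: Qin Ren, Yifan Wang, Ruogu Fang, Haibin Ling, Chenyu You
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: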

Abstract:Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally – through long-tailed morphological distributions, and locally through – tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our codes are available at this https URL.
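
As background for the "matrix scaling algorithm" mentioned in the abstract, the sketch below shows the classical Sinkhorn iteration for balanced entropic optimal transport. It is deliberately not the unbalanced formulation the paper uses; the function name `sinkhorn`, the regularisation strength `epsilon=0.5`, and the toy patch-to-prototype setup are assumptions for illustration.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, epsilon=0.5, n_iters=500):
    """Balanced entropic OT via alternating row/column scaling.

    Returns a transport plan whose row and column sums approximate the
    given marginals. The paper solves an unbalanced variant instead.
    """
    K = np.exp(-cost / epsilon)  # Gibbs kernel
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (K @ v)    # rescale rows
        v = col_marginal / (K.T @ u)  # rescale columns
    return u[:, None] * K * v[None, :]

# Toy example: 6 patches assigned to 3 prototypes with a long-tailed prior.
rng = np.random.default_rng(0)
cost = rng.random((6, 3))
patch_mass = np.full(6, 1.0 / 6.0)
prototype_prior = np.array([0.6, 0.3, 0.1])

plan = sinkhorn(cost, patch_mass, prototype_prior)
```

The loop consists only of matrix-vector products and element-wise divisions, which is why this family of solvers is commonly described as hardware-friendly.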

[CV-120] Generative Blocks World: Moving Things Around in Pictures

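[Quick Read]: This paper addresses efficient, precise, and visually faithful editing of the scene in a generated image. The key to the solution is Generative Blocks World, which represents a scene as an assembly of convex 3D primitives so that an editor can move either whole structures or small details, and then generates the image with a flow-based method conditioned on depth and a texture hint. The texture hint accounts for the modified 3D primitives and provides texture consistency beyond that of existing key-value caching techniques, enabling accurate object and camera moves while largely preserving the identity of the depicted objects.

Link: https://arxiv.org/abs/2506.20703
Authors: Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, D.A. Forsyth, Anand Bhattad
Affiliations: University of Illinois Urbana-Champaign; Toyota Technological Institute at Chicago
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 16 figures, 2 tables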

Abstract:We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding texture-consistency provided by existing key-value caching techniques. These texture hints (a) allow accurate object and camera moves and (b) largely preserve the identity of objects depicted. Quantitative and qualitative experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.

[CV-121] SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving

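[Quick Read]: This paper addresses online scene perception and topology reasoning for autonomous vehicles in mapless driving systems, where the inherent limits of onboard sensors degrade scene understanding in long-range or occluded scenarios. The key to the solution is the Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which incorporates SD maps as prior knowledge into existing perception and reasoning pipelines. The framework introduces a hybrid feature fusion strategy that combines SD maps with Bird’s-Eye-View (BEV) features, considers both rasterized and vectorized representations, and mitigates potential misalignment between SD maps and BEV feature spaces. An auxiliary intersection-aware keypoint detection task further improves overall scene understanding.

Link: https://arxiv.org/abs/2505.12246
Authors: Muleilan Pei, Jiayao Shan, Peiliang Li, Jieqi Shi, Jing Huo, Yang Gao, Shaojie Shen
Affiliations: Hong Kong University of Science and Technology; Zhuoyu Technology; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IEEE Robotics and Automation Letters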

Abstract:Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environments, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird’s-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.

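[CV-122] Transferring disentangled representations: bridging the gap between synthetic and real images NEURIPS

[Quick Read]: This paper addresses why Disentangled Representation Learning has not fully shown its potential on real images; the main obstacles are correlated generative factors, image resolution, and limited access to ground-truth labels. The key to the solution is to learn general-purpose disentangled representations from synthetic data and to study the effect of fine-tuning and which disentanglement properties are preserved after transfer to real data. An extensive empirical study, together with a new interpretable intervention-based metric, indicates that some level of disentanglement can be transferred from synthetic to real data effectively.

Link: https://arxiv.org/abs/2409.18017
Authors: Jacopo Dapueto, Nicoletta Noceti, Francesca Odone
Affiliations: MaLGa-DIBRIS, Università degli studi di Genova, Genova, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS, 2024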

Abstract:Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, their resolution and limited access to ground truth labels. Specifically on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric, to measure the quality of factors encoding in the representation. Our results indicate that some level of disentanglement, transferring a representation from synthetic to real data, is possible and effective.

[CV-123] ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers ICCV

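[Quick Read]: This paper explores neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks, in the setting of quantum machine learning. The key to the solution is ResQ, a new framework that exploits the properties of analog Rydberg atom quantum computers and optimizes their dynamics to solve machine learning classification problems with analog quantum neural ODEs.

Link: https://arxiv.org/abs/2506.21537
Authors: Nicholas S. DiBrita, Jason Han, Tirthak Patel
Affiliations: Rice University
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: ResQ will appear in the Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025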

Abstract:Research in quantum machine learning has recently proliferated due to the potential of quantum computing to accelerate machine learning. An area of machine learning that has not yet been explored is neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks using the principles of ordinary differential equations. In this work, we present our insights about why analog Rydberg atom quantum computers are especially well-suited for ResNets. We also introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom quantum computers to solve classification problems in machine learning using analog quantum neural ODEs.
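
For readers unfamiliar with the neural ODE view of ResNets mentioned in the abstract, the standard relationship (general background, not taken from the ResQ paper itself) is that a residual block is a forward-Euler step of an ODE. The symbols $x_k$ (features after block $k$), $f$ (the block's learned function), $\theta_k$ (its parameters), and $h$ (step size) are notation introduced here for illustration:

```latex
x_{k+1} = x_k + h\, f(x_k, \theta_k)
\quad\xrightarrow{\;h \to 0\;}\quad
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = f\bigl(x(t), \theta(t)\bigr)
```

With $h = 1$ the left-hand update is the usual residual connection. How ResQ maps the continuous-time dynamics onto a Rydberg atom system is described in the paper and is not sketched here.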

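[CV-124] Exploring the Design Space of 3D MLLMs for CT Report Generation

[Quick Read]: This paper addresses the automation of Radiology Report Generation (RRG) in medical imaging, specifically for 3D CT images. The key to the solution is a systematic exploration of the design space of Multimodal Large Language Models (MLLMs), covering visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques, together with two knowledge-based report augmentation methods that improve report quality. The study also shows that, under the same training protocol, RRG performance is largely independent of LLM size, and that combining a segmentation mask with the CT volume further improves performance.

Link: https://arxiv.org/abs/2506.21535
Authors: Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: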

Abstract:Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving the 2nd place on the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of LLM under the same training protocol. We also show that larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at this https URL

[CV-125] Lightweight Physics-Informed Zero-Shot Ultrasound Plane Wave Denoising

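[Quick Read]: This paper addresses the loss of image contrast and the noise sensitivity that arise in ultrasound Coherent Plane Wave Compounding (CPWC) imaging when the number of transmission angles is reduced. The key to the solution is a zero-shot denoising framework that divides the available transmission angles into two disjoint subsets, forms compound images with higher noise levels from each, and trains a deep model with a self-supervised residual learning scheme, suppressing incoherent noise while preserving anatomical structures without external training data. Because angle-dependent artifacts differ between the subsets while the tissue response stays consistent, the network can disentangle artifacts from the true signal, and the method adapts across anatomical regions and acquisition setups.

Link: https://arxiv.org/abs/2506.21499
Authors: Hojat Asgariandehkordi, Mostafa Sharifzadeh, Hassan Rivaz
Affiliations: Concordia University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: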

Abstract:Ultrasound Coherent Plane Wave Compounding (CPWC) enhances image contrast by combining echoes from multiple steered transmissions. While increasing the number of angles generally improves image quality, it drastically reduces the frame rate and can introduce blurring artifacts in fast-moving targets. Moreover, compounded images remain susceptible to noise, particularly when acquired with a limited number of transmissions. We propose a zero-shot denoising framework tailored for low-angle CPWC acquisitions, which enhances contrast without relying on a separate training dataset. The method divides the available transmission angles into two disjoint subsets, each used to form compound images that include higher noise levels. The new compounded images are then used to train a deep model via a self-supervised residual learning scheme, enabling it to suppress incoherent noise while preserving anatomical structures. Because angle-dependent artifacts vary between the subsets while the underlying tissue response is similar, this physics-informed pairing allows the network to learn to disentangle the inconsistent artifacts from the consistent tissue signal. Unlike supervised methods, our model requires no domain-specific fine-tuning or paired data, making it adaptable across anatomical regions and acquisition setups. The entire pipeline supports efficient training with low computational cost due to the use of a lightweight architecture, which comprises only two convolutional layers. Evaluations on simulation, phantom, and in vivo data demonstrate superior contrast enhancement and structure preservation compared to both classical and deep learning-based denoising methods.
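
The angle-splitting step described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the per-angle images are real-valued arrays that are averaged (actual CPWC compounds beamformed data coherently), the split is a random even partition, and the function name `split_and_compound` is hypothetical.

```python
import numpy as np

def split_and_compound(per_angle_images, rng=None):
    """Split angles into two disjoint subsets and compound each subset.

    per_angle_images has shape (n_angles, height, width). A random even
    split is assumed; the paper may partition the angles differently.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_angles = per_angle_images.shape[0]
    order = rng.permutation(n_angles)
    subset_a = np.sort(order[: n_angles // 2])
    subset_b = np.sort(order[n_angles // 2 :])
    compound_a = per_angle_images[subset_a].mean(axis=0)
    compound_b = per_angle_images[subset_b].mean(axis=0)
    return (subset_a, compound_a), (subset_b, compound_b)

# Toy data: one shared "tissue" image plus independent noise per angle.
rng = np.random.default_rng(0)
tissue = rng.random((64, 64))
angles = np.stack(
    [tissue + 0.2 * rng.normal(size=tissue.shape) for _ in range(8)]
)

(idx_a, img_a), (idx_b, img_b) = split_and_compound(angles, rng=rng)
full = angles.mean(axis=0)  # compound over all angles, for comparison
```

Each half-compound is noisier than the full compound but shares the same underlying tissue signal, which is the property a self-supervised scheme of this kind can exploit when one half serves as the input and the other as the target.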

[CV-126] ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

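[Quick read]: This paper addresses the difficulty, in video-to-audio generation, of producing high-fidelity audio that faithfully captures the details of the visual content, and in particular how to reason about visual dynamics, acoustic environments, and temporal relationships during generation. The key to its solution is the ThinkSound framework, which uses Chain-of-Thought (CoT) reasoning to drive a staged audio generation and editing pipeline: foundational foley generation, object-centric refinement based on user interaction, and targeted editing guided by natural-language instructions. The result is more precise and contextually consistent audio synthesis.

Link: https://arxiv.org/abs/2506.21448
Authors: Huadai Liu,Jialei Wang,Kaicheng Luo,Wen Wang,Qian Chen,Zhou Zhao,Wei Xue
Affiliations: Tongyi Lab, Alibaba Group; Hong Kong University of Science and Technology (HKUST); Zhejiang University
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: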

Abstract:While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at this https URL.

[CV-127] GANet-Seg: Adversarial Learning for Brain Tumor Segmentation with Hybrid Generative Models

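[Quick read]: This paper addresses the weak generalization and limited segmentation accuracy of brain tumor segmentation models caused by scarce annotated data. The key to its solution is a framework that combines pre-trained Generative Adversarial Networks (GANs) with a Unet architecture, uses a global anomaly detection module and a refined mask generation network to locate tumor-sensitive regions precisely, and iteratively improves segmentation accuracy under adversarial loss constraints. Multi-modal MRI data and synthetic image augmentation further improve robustness.

Link: https://arxiv.org/abs/2506.21245
Authors: Qifei Cui,Xinyu Lu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: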

Abstract:This work introduces a novel framework for brain tumor segmentation leveraging pre-trained GANs and Unet architectures. By combining a global anomaly detection module with a refined mask generation network, the proposed model accurately identifies tumor-sensitive regions and iteratively enhances segmentation precision using adversarial loss constraints. Multi-modal MRI data and synthetic image augmentation are employed to improve robustness and address the challenge of limited annotated datasets. Experimental results on the BraTS dataset demonstrate the effectiveness of the approach, which achieves higher sensitivity and accuracy than the baseline in both lesion-wise Dice and HD95 metrics. This scalable method minimizes the dependency on fully annotated data, paving the way for practical real-world applications in clinical settings.

[CV-128] Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations

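[Quick read]: This paper addresses JPEG image quality enhancement, specifically the high computational cost and low efficiency of existing methods that work in the pixel domain. The key to its solution is identifying two critical types of correlation among the DCT coefficients of JPEG images and building on them an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method. AJQE fully exploits these correlations and adapts many well-established pixel-domain models to the DCT domain, maintaining high performance while reducing computational complexity.

Link: https://arxiv.org/abs/2506.21171
Authors: Jing Yang,Qunliang Xing,Mai Xu,Minglang Qiao
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: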

Abstract:Joint Photographic Experts Group (JPEG) achieves data compression by quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably introduces compression artifacts. Most existing JPEG quality enhancement methods operate in the pixel domain, suffering from the high computational costs of decoding. Consequently, direct enhancement of JPEG images in the DCT domain has gained increasing attention. However, current DCT-domain methods often exhibit limited performance. To address this challenge, we identify two critical types of correlations within the DCT coefficients of JPEG images. Building on this insight, we propose an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method that fully exploits these correlations. The AJQE method enables the adaptation of numerous well-established pixel-domain models to the DCT domain, achieving superior performance with reduced computational complexity. Compared to the pixel-domain counterparts, the DCT-domain models derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5% increase in enhancement throughput on average.

[CV-129] Development of MR spectral analysis method robust against static magnetic field inhomogeneity

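[Quick read]: This paper addresses the effect of static magnetic field (B0) inhomogeneity on the accuracy of spectral analysis. The key to its solution is a deep learning model trained on modeled spectra generated from the B0 map and the metabolite ratios of the healthy human brain. The B0 map is divided into subregions, and the separately estimated metabolite and baseline components are averaged and then integrated, which improves the accuracy of spectral analysis.

Link: https://arxiv.org/abs/2506.20897
Authors: Shuki Maruyama,Hidenori Takeshima
Affiliations: Canon Medical Systems Corporation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures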

Abstract:Purpose: To develop a method that enhances the accuracy of spectral analysis in the presence of static magnetic field B0 inhomogeneity. Methods: The authors proposed a new spectral analysis method utilizing a deep learning model trained on modeled spectra that consistently represent the spectral variations induced by B0 inhomogeneity. These modeled spectra were generated from the B0 map and metabolite ratios of the healthy human brain. The B0 map was divided into patch-sized subregions, and the separately estimated metabolites and baseline components were averaged and then integrated. The quality of the modeled spectra was visually and quantitatively evaluated against the measured spectra. The analysis models were trained using measured, simulated, and modeled spectra. The performance of the proposed method was assessed using mean squared errors (MSEs) of metabolite ratios. The mean absolute percentage errors (MAPEs) of the metabolite ratios were also compared to LCModel when analyzing the phantom spectra acquired under two types of B0 inhomogeneity. Results: The modeled spectra exhibited broadened and narrowed spectral peaks depending on the B0 inhomogeneity and were quantitatively close to the measured spectra. The analysis model trained using measured spectra with modeled spectra improved MSEs by 49.89% compared to that trained using measured spectra alone, and by 26.66% compared to that trained using measured spectra with simulated spectra. The performance improved as the number of modeled spectra increased from 0 to 1,000. This model showed significantly lower MAPEs than LCModel under both types of B0 inhomogeneity. Conclusion: A deep learning model for spectral analysis, trained using the modeled spectra, was developed. The results suggest that the proposed method has the potential to improve the accuracy of spectral analysis by increasing the training samples of spectra.

[CV-130] U-R-VEDA: Integrating UNET Residual Links Edge and Dual Attention and Vision Transformer for Accurate Semantic Segmentation of CMRs

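[Quick read]: This paper addresses accurate automatic segmentation of cardiac structures in cardiac magnetic resonance (CMR) images, a prerequisite for automating the diagnosis and management of cardiac disorders. The key to its solution is U-R-Veda, a deep-learning-based enhanced UNet model that integrates convolution transformations, a vision transformer, residual links, channel attention, and spatial attention with edge-detection-based skip connections to improve semantic segmentation accuracy. Embedding channel and spatial attention in the convolution blocks lets the model identify important features and their spatial locations, while using edge information as skip connections reduces the information lost during convolution, significantly improving semantic segmentation of CMR images.

Link: https://arxiv.org/abs/2506.20689
Authors: Racheal Mukisa,Arvind K. Bansal
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures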

Abstract:Artificial intelligence, including deep learning models, will play a transformative role in automated medical image analysis for the diagnosis of cardiac disorders and their management. Automated accurate delineation of cardiac images is a necessary first step for the quantification and automated diagnosis of cardiac disorders. In this paper, we propose a deep learning based enhanced UNet model, U-R-Veda, which integrates convolution transformations, vision transformer, residual links, channel-attention, and spatial attention, together with edge-detection based skip-connections for an accurate fully-automated semantic segmentation of cardiac magnetic resonance (CMR) images. The model extracts local features and their interrelationships using a stack of combination convolution blocks, with embedded channel and spatial attention in the convolution block, and vision transformers. Deep embedding of channel and spatial attention in the convolution block identifies important features and their spatial localization. The combined edge information with channel and spatial attention as skip connection reduces information loss during convolution transformations. The overall model significantly improves the semantic segmentation of CMR images necessary for improved medical image analysis. An algorithm for the dual attention module (channel and spatial attention) has been presented. Performance results show that U-R-Veda achieves an average accuracy of 95.2%, based on DSC metrics. The model outperforms the accuracy attained by other models, based on DSC and HD metrics, especially for the delineation of right-ventricle and left-ventricle-myocardium.

[CV-131] Global and Local Contrastive Learning for Joint Representations from Cardiac MRI and ECG MICCAI2025

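[Quick read]: This paper addresses the fact that a conventional electrocardiogram (ECG) cannot directly measure cardiac functional parameters such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR), the current gold standard, provides detailed structural and functional information but is expensive and less accessible. To bridge this gap, the paper proposes PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework. Its key idea is to enrich ECG representations with spatio-temporal information from CMR, using a patient-level global contrastive loss and a temporal-level local contrastive loss, so that ECG performs better on cardiac phenotype retrieval and on predicting CMR-derived functional parameters without introducing new learnable weights.

Link: https://arxiv.org/abs/2506.20683
Authors: Alexander Selivanov,Philip Müller,Özgün Turgut,Nil Stolt-Ansó,Daniel Rückert
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: accepted to MICCAI 2025 (Springer LNCS)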

Abstract:An electrocardiogram (ECG) is a widely used, cost-effective tool for detecting electrical abnormalities in the heart. However, it cannot directly measure functional parameters, such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR) is the gold standard for these measurements, providing detailed structural and functional insights, but is expensive and less accessible. To bridge this gap, we propose PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal information from CMR. PTACL uses global patient-level contrastive loss and local temporal-level contrastive loss. The global loss aligns patient-level representations by pulling ECG and CMR embeddings from the same patient closer together, while pushing apart embeddings from different patients. Local loss enforces fine-grained temporal alignment within each patient by contrasting encoded ECG segments with corresponding encoded CMR frames. This approach enriches ECG representations with diagnostic information beyond electrical activity and transfers more insights between modalities than global alignment alone, all without introducing new learnable weights. We evaluate PTACL on paired ECG-CMR data from 27,951 subjects in the UK Biobank. Compared to baseline approaches, PTACL achieves better performance in two clinically relevant tasks: (1) retrieving patients with similar cardiac phenotypes and (2) predicting CMR-derived cardiac function parameters, such as ventricular volumes and ejection fraction. Our results highlight the potential of PTACL to enhance non-invasive cardiac diagnostics using ECG. The code is available at: this https URL
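
The patient-level global loss described in the abstract (pull ECG and CMR embeddings of the same patient together, push different patients apart) could look roughly like the sketch below. The abstract does not give the exact formula, so the symmetric InfoNCE form, the cosine similarity, the `temperature` value, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_contrastive_loss(ecg, cmr, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form); row i of each list is patient i."""
    n = len(ecg)
    logits = [[cosine(ecg[i], cmr[j]) / temperature for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        # ECG -> CMR direction: the same patient's CMR embedding is the positive.
        row = logits[i]
        loss += math.log(sum(math.exp(x) for x in row)) - row[i]
        # CMR -> ECG direction: the same patient's ECG embedding is the positive.
        col = [logits[k][i] for k in range(n)]
        loss += math.log(sum(math.exp(x) for x in col)) - col[i]
    return loss / (2 * n)


# Toy embeddings for two patients (hypothetical values).
ecg_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = global_contrastive_loss(ecg_emb, [[1.0, 0.0], [0.0, 1.0]])
swapped = global_contrastive_loss(ecg_emb, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each patient's two embeddings agree and large when they are paired with the wrong patient. The local temporal loss in the abstract would apply the same idea to ECG segments and CMR frames within one patient.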

Artificial Intelligence

[AI-0] mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

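[Quick read]: This paper addresses the challenges of multivariate time series anomaly detection (MTS-AD), including complex inter-variable dependencies, temporal dynamics, and sparse anomaly labels. The key to its solution is mTSBench, the largest benchmark to date for MTS-AD and unsupervised model selection. It covers 344 labeled time series, evaluates 24 anomaly detection methods, and systematically compares unsupervised model selection techniques to advance adaptive anomaly detection and robust model selection.

Link: https://arxiv.org/abs/2506.21550
Authors: Xiaona Zhou,Constantin Brif,Ismini Lourentzou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: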

Abstract:Multivariate time series anomaly detection (MTS-AD) is critical in domains like healthcare, cybersecurity, and industrial monitoring, yet remains challenging due to complex inter-variable dependencies, temporal dynamics, and sparse anomaly labels. We introduce mTSBench, the largest benchmark to date for MTS-AD and unsupervised model selection, spanning 344 labeled time series across 19 datasets and 12 diverse application domains. mTSBench evaluates 24 anomaly detection methods, including large language model (LLM)-based detectors for multivariate time series, and systematically benchmarks unsupervised model selection techniques under standardized conditions. Consistent with prior findings, our results confirm that no single detector excels across datasets, underscoring the importance of model selection. However, even state-of-the-art selection methods remain far from optimal, revealing critical gaps. mTSBench provides a unified evaluation suite to enable rigorous, reproducible comparisons and catalyze future advances in adaptive anomaly detection and robust model selection.

[AI-1] WorldVLA: Towards Autoregressive Action World Model

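[Quick read]: This paper attempts to overcome the performance limits of conventional action models and world models when they are used separately, integrating a Vision-Language-Action (VLA) model with a world model for more effective action generation and image understanding and generation. The key to its solution is a unified framework in which the world model predicts future images from action and image understanding, learning the underlying physics of the environment to improve action generation, while the action model generates subsequent actions from image observations, aiding visual understanding and in turn benefiting the world model's visual generation.

Link: https://arxiv.org/abs/2506.21539
Authors: Jun Cen,Chaohui Yu,Hangjie Yuan,Yuming Jiang,Siteng Huang,Jiayan Guo,Xin Li,Yibing Song,Hao Luo,Fan Wang,Deli Zhao,Hao Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Code: this https URL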

Abstract:We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding visual understanding and in turn helping the visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model’s limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.
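
One way to picture the attention mask strategy mentioned at the end of the abstract is the sketch below. It assumes a flat token sequence in which each token is tagged as an observation or as part of a numbered action, and it hides the tokens of earlier actions from the action currently being generated. The token layout, the `(kind, group)` tagging, and the function name are illustrative assumptions; the abstract does not specify the actual masking rule.

```python
def build_attention_mask(tokens):
    """Build a causal mask that hides earlier actions from later action tokens.

    tokens: list of (kind, group) pairs, where kind is "obs" or "action"
    (a hypothetical tagging scheme). mask[i][j] is True when query position i
    may attend to key position j.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (kind_i, group_i) in enumerate(tokens):
        for j in range(i + 1):  # causal: only the current and earlier positions
            kind_j, group_j = tokens[j]
            earlier_action = kind_j == "action" and group_j != group_i
            if kind_i == "action" and earlier_action:
                continue  # tokens of previously generated actions stay hidden
            mask[i][j] = True
    return mask


# Two observation tokens, a two-token action 1, then the first token of action 2.
tokens = [("obs", 0), ("obs", 0), ("action", 1), ("action", 1), ("action", 2)]
mask = build_attention_mask(tokens)
```

With this mask, action 2 still sees the observation tokens but not action 1, so an error in action 1 cannot be copied forward through attention.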

[AI-2] PsyLite Technical Report

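[Quick read]: This paper addresses the shortcomings of current AI-driven psychological counseling models in dialogue safety, detailed scenario handling, and lightweight deployment. The key to its solution is PsyLite, a lightweight psychological counseling large language model agent built on the InternLM2.5-7B-chat base model. A two-stage training strategy (hybrid distillation data fine-tuning followed by ORPO preference optimization) improves the model's deep-reasoning ability, counseling ability, and dialogue safety. Quantization (GGUF q4_k_m) lowers the hardware requirement so that 5GB of memory is sufficient, offering a feasible option for counseling applications in resource-constrained environments.

Link: https://arxiv.org/abs/2506.21536
Authors: Fangjun Ding,Renyu Zhang,Xinyu Feng,Chengye Xie,Zheng Zhang,Yanting Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: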

Abstract:With the rapid development of digital technology, AI-driven psychological counseling has gradually become an important research direction in the field of mental health. However, existing models still have deficiencies in dialogue safety, detailed scenario handling, and lightweight deployment. To address these issues, this study proposes PsyLite, a lightweight psychological counseling large language model agent developed based on the base model InternLM2.5-7B-chat. Through a two-stage training strategy (hybrid distillation data fine-tuning and ORPO preference optimization), PsyLite enhances the model’s deep-reasoning ability, psychological counseling ability, and safe dialogue ability. After deployment using Ollama and Open WebUI, a custom workflow is created with Pipelines. An innovative conditional RAG is designed to introduce crosstalk humor elements at appropriate times during psychological counseling to enhance user experience and decline dangerous requests to strengthen dialogue safety. Evaluations show that PsyLite outperforms the baseline models in the Chinese general evaluation (CEval), psychological counseling professional evaluation (CPsyCounE), and dialogue safety evaluation (SafeDialBench), particularly in psychological counseling professionalism (CPsyCounE score improvement of 47.6%) and dialogue safety (safe score improvement of 2.4%). Additionally, the model uses quantization technology (GGUF q4_k_m) to achieve low hardware deployment (5GB memory is sufficient for operation), providing a feasible solution for psychological counseling applications in resource-constrained environments.

[AI-3] Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems

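[Quick read]: This paper addresses the problem that, in Cyber-Physical Systems (CPSs), manually modeling faulty behaviors demands extensive domain expertise and yields models that are complex, error-prone, and hard to interpret. The key to its solution is a novel unsupervised fault diagnosis methodology that combines collective anomaly detection in multivariate time series, process mining, and stochastic simulation. Anomalies in sensor data are converted into structured event logs, process mining discovers interpretable process models from them, and timing distributions added to the extracted Petri nets support stochastic simulation of faulty behaviors, improving root cause analysis and behavioral understanding.

Link: https://arxiv.org/abs/2506.21502
Authors: Francesco Vitale,Nicola Dall’Ora,Sebastiano Gaiardelli,Enrico Fraccaroli,Nicola Mazzocca,Franco Fummi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: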

Abstract:Fault diagnosis in Cyber-Physical Systems (CPSs) is essential for ensuring system dependability and operational efficiency by accurately detecting anomalies and identifying their root causes. However, the manual modeling of faulty behaviors often demands extensive domain expertise and produces models that are complex, error-prone, and difficult to interpret. To address this challenge, we present a novel unsupervised fault diagnosis methodology that integrates collective anomaly detection in multivariate time series, process mining, and stochastic simulation. Initially, collective anomalies are detected from low-level sensor data using multivariate time-series analysis. These anomalies are then transformed into structured event logs, enabling the discovery of interpretable process models through process mining. By incorporating timing distributions into the extracted Petri nets, the approach supports stochastic simulation of faulty behaviors, thereby enhancing root cause analysis and behavioral understanding. The methodology is validated using the Robotic Arm Dataset (RoAD), a widely recognized benchmark in smart manufacturing. Experimental results demonstrate its effectiveness in modeling, simulating, and classifying faulty behaviors in CPSs. This enables the creation of comprehensive fault dictionaries that support predictive maintenance and the development of digital twins for industrial environments.

[AI-4] Ad-Hoc Human-AI Coordination Challenge ICML2025

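[Quick read]: This paper addresses seamless coordination between humans and AI agents, particularly in complex settings with imperfect information, constrained communication, and theory-of-mind requirements, such as the cooperative card game Hanabi. Its key contribution is the Ad-Hoc Human-AI Coordination Challenge (AH2AC2), which builds human proxy agents from a large-scale human dataset to provide cheap, reproducible, human-like evaluation partners, overcoming the high cost and poor reproducibility of conventional human evaluation.

Link: https://arxiv.org/abs/2506.21490
Authors: Tin Dizdarević,Ravi Hammond,Tobias Gessler,Anisoara Calinescu,Jonathan Cook,Matteo Gallici,Andrei Lupu,Jakob Nicolaus Foerster
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Published at ICML 2025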

Abstract:Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action, making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop human proxy agents on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at this https URL.

[AI-5] SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture

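[Quick read]: This paper addresses the loss of naturalness and the artifacts in singing voice synthesis (SVS) caused by the complex acoustic and musical characteristics of singing. Existing methods rely on a vocoder as the final stage, which often introduces distortion, and applying diffusion models to SVS remains challenging. The key to its solution is SmoothSinger, a conditional diffusion model that refines low-quality synthesized audio directly within a unified framework, avoiding the degradation of two-stage pipelines. The model adopts a reference-guided dual-branch architecture that uses low-quality audio from any baseline system as a reference to guide denoising, and it adds a parallel low-frequency upsampling path to better capture pitch contours and long-term spectral dependencies, improving the quality and naturalness of the synthesized voice.

Link: https://arxiv.org/abs/2506.21478
Authors: Kehan Sui,Jinxu Xiang,Fang Jin
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: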

Abstract:Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high quality and natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any baseline system as a reference to guide the denoising process, enabling more expressive and context-aware synthesis. Furthermore, it enhances the conventional U-Net with a parallel low-frequency upsampling path, allowing the model to better capture pitch contours and long term spectral dependencies. To improve alignment during training, we replace reference audio with degraded ground truth audio, addressing temporal mismatch between reference and target signals. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results in both objective and subjective evaluations. Extensive ablation studies confirm its effectiveness in reducing artifacts and improving the naturalness of synthesized voices.

[AI-6] Optimising 4th-Order Runge-Kutta Methods: A Dynamic Heuristic Approach for Efficiency and Low Storage

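[Quick read]: This paper addresses how to balance accuracy, stability, and computational efficiency for high-order, low-storage Extended Stability Runge-Kutta (ESRK) methods in large-scale scientific and engineering computation. Traditional approaches rely on manually designed heuristics or exhaustive numerical search and struggle to optimize low-storage ESRK schemes effectively. The key to the proposed solution is a hybrid of a Genetic Algorithm (GA) and Reinforcement Learning (RL): GA-driven mutations explore the search space, and an RL-inspired state transition mechanism dynamically refines heuristic selection. This enables systematic parameter reduction that preserves fourth-order accuracy while significantly improving computational efficiency.

Link: https://arxiv.org/abs/2506.21465
Authors: Gavin Lee Goodship,Luis Miralles-Pechuan,Stephen O’Sullivan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: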

Abstract:Extended Stability Runge-Kutta (ESRK) methods are crucial for solving large-scale computational problems in science and engineering, including weather forecasting, aerodynamic analysis, and complex biological modelling. However, balancing accuracy, stability, and computational efficiency remains challenging, particularly for high-order, low-storage schemes. This study introduces a hybrid Genetic Algorithm (GA) and Reinforcement Learning (RL) approach for automated heuristic discovery, optimising low-storage ESRK methods. Unlike traditional approaches that rely on manually designed heuristics or exhaustive numerical searches, our method leverages GA-driven mutations for search-space exploration and an RL-inspired state transition mechanism to refine heuristic selection dynamically. This enables systematic parameter reduction, preserving fourth-order accuracy while significantly improving computational efficiency. The proposed GA-RL heuristic optimisation framework is validated through rigorous testing on benchmark problems, including the 1D and 2D Brusselator systems and the steady-state Navier-Stokes equations. The best-performing heuristic achieves a 25% reduction in IPOPT runtime compared to traditional ESRK optimisation processes while maintaining numerical stability and accuracy. These findings demonstrate the potential of adaptive heuristic discovery to improve resource efficiency in high-fidelity simulations and broaden the applicability of low-storage Runge-Kutta methods in real-world computational fluid dynamics, physics simulations, and other demanding fields. This work establishes a new paradigm in heuristic optimisation for numerical methods, opening pathways for further exploration using Deep RL and AutoML-based heuristic search.

[AI-7] TableMoE: Neuro-Symbolic Routing for Structured Expert Reasoning in Multimodal Table Understanding

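[Quick read]: This paper addresses the challenges of multimodal table understanding in real-world settings, in particular the limited performance and poor generalization of existing multimodal large language models (MLLMs) under WildStruct conditions, which stem from structural complexity, high symbolic density, and visual degradation (blur, skew, watermarking, incomplete structures or fonts, multi-span or hierarchically nested layouts). The key to its solution is TableMoE, a neuro-symbolic Mixture-of-Connector-Experts (MoCE) architecture whose Neuro-Symbolic Routing mechanism predicts latent semantic token roles and dynamically routes table elements to specialized experts (Table-to-HTML, Table-to-JSON, Table-to-Code), enabling robust structured reasoning over multimodal table data.

Link: https://arxiv.org/abs/2506.21393
Authors: Junwen Zhang,Pu Chen,Yin Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 43 pages and 11 figures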

Abstract:Multimodal understanding of tables in real-world contexts is challenging due to the complexity of structure, symbolic density, and visual degradation (blur, skew, watermarking, incomplete structures or fonts, multi-span or hierarchically nested layouts). Existing multimodal large language models (MLLMs) struggle with such WildStruct conditions, resulting in limited performance and poor generalization. To address these challenges, we propose TableMoE, a neuro-symbolic Mixture-of-Connector-Experts (MoCE) architecture specifically designed for robust, structured reasoning over multimodal table data. TableMoE features an innovative Neuro-Symbolic Routing mechanism, which predicts latent semantic token roles (e.g., header, data cell, axis, formula) and dynamically routes table elements to specialized experts (Table-to-HTML, Table-to-JSON, Table-to-Code) using a confidence-aware gating strategy informed by symbolic reasoning graphs. To facilitate effective alignment-driven pretraining, we introduce the large-scale TableMoE-Align dataset, consisting of 1.2M table-HTML-JSON-code quadruples across finance, science, biomedicine and industry, utilized exclusively for model pretraining. For evaluation, we curate and release four challenging WildStruct benchmarks: WMMFinQA, WMMTatQA, WMMTabDialog, and WMMFinanceMath, designed specifically to stress-test models under real-world multimodal degradation and structural complexity. Experimental results demonstrate that TableMoE significantly surpasses existing state-of-the-art models. Extensive ablation studies validate each core component, emphasizing the critical role of Neuro-Symbolic Routing and structured expert alignment. Through qualitative analyses, we further showcase TableMoE’s interpretability and enhanced robustness, underscoring the effectiveness of integrating neuro-symbolic reasoning for multimodal table understanding.

[AI-8] Temporal-Aware Graph Attention Network for Cryptocurrency Transaction Fraud Detection

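[Quick read]: This paper addresses the dual challenges facing cryptocurrency transaction fraud detection: increasingly complex transaction patterns and severe class imbalance. Traditional methods rely on manual feature engineering and struggle to capture temporal and structural dependencies in transaction networks. The key to its solution is an Augmented Temporal-aware Graph Attention Network (ATGAT) built from three modules: an advanced temporal embedding module that fuses multi-scale time-difference features with periodic position encoding; a temporal-aware triple attention mechanism that jointly optimizes structural, temporal, and global context attention; and a weighted binary cross-entropy loss that mitigates class imbalance. Together these improve fraud detection performance.

Link: https://arxiv.org/abs/2506.21382
Authors: Zhi Zheng,Bochuan Zhou,Yuping Song
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: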

Abstract:Cryptocurrency transaction fraud detection faces the dual challenges of increasingly complex transaction patterns and severe class imbalance. Traditional methods rely on manual feature engineering and struggle to capture temporal and structural dependencies in transaction networks. This paper proposes an Augmented Temporal-aware Graph Attention Network (ATGAT) that enhances detection performance through three modules: (1) designing an advanced temporal embedding module that fuses multi-scale time difference features with periodic position encoding; (2) constructing a temporal-aware triple attention mechanism that jointly optimizes structural, temporal, and global context attention; (3) employing weighted BCE loss to address class imbalance. Experiments on the Elliptic++ cryptocurrency dataset demonstrate that ATGAT achieves an AUC of 0.9130, representing a 9.2% improvement over the best traditional method XGBoost, 12.0% over GCN, and 10.0% over standard GAT. This method not only validates the enhancement effect of temporal awareness and triple attention mechanisms on graph neural networks, but also provides financial institutions with more reliable fraud detection tools, with its design principles generalizable to other temporal graph anomaly detection tasks.
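
The third module, weighted BCE for class imbalance, is simple enough to sketch directly. The version below is a generic weighted binary cross-entropy, not the paper's implementation. Setting `pos_weight` to the negative-to-positive ratio is a common heuristic assumed here, and the function name and toy labels are hypothetical.

```python
import math


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive (fraud) term scaled by pos_weight."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy batch: 4 licit transactions and 1 fraudulent one (hypothetical labels).
labels = [0, 0, 0, 0, 1]
pos_weight = labels.count(0) / labels.count(1)  # assumed heuristic: negatives / positives
uninformed = [0.5] * len(labels)
plain_loss = weighted_bce(uninformed, labels, 1.0)
weighted_loss = weighted_bce(uninformed, labels, pos_weight)
```

With the weight applied, the single fraud example contributes as much to the loss as the four licit ones combined, so the model is not rewarded for always predicting the majority class.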

[AI-9] Pay Attention to Small Weights

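[Quick read]: This paper addresses the heavy memory and compute cost of finetuning large pretrained neural networks. The key to its solution is NANOADAM, which dynamically updates only the small-magnitude weights, reducing the number of trained parameters and hence the computational burden. The method rests on an observed relationship between gradients and weights during finetuning: large gradients are often associated with small-magnitude weights, and this pattern is more pronounced in finetuning than in training from scratch. NANOADAM determines the parameter subset without gradient computation, preserves large-magnitude weights to avoid catastrophic forgetting, and permits larger learning rates that improve generalization.

Link: https://arxiv.org/abs/2506.21374
Authors: Chao Zhou,Tom Jacobs,Advait Gadhikar,Rebekka Burkholz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: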

Abstract:Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages: first, this criterion is gradient-free: the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting; third, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
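
The gradient-free selection rule can be sketched in a few lines. The example below picks the fraction of weights with the smallest absolute value and updates only those. For brevity it uses a plain SGD step in place of Adam, and the `fraction` parameter, the function names, and the toy values are assumptions rather than details from the paper.

```python
def small_weight_mask(weights, fraction):
    """Mark the `fraction` of weights with the smallest absolute value (no gradients used)."""
    k = max(1, int(len(weights) * fraction))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    chosen = set(order[:k])
    return [i in chosen for i in range(len(weights))]


def masked_step(weights, grads, mask, lr):
    """Plain SGD step (a stand-in for Adam) applied only where mask is True."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


# Toy parameter vector and gradients (hypothetical values).
weights = [0.9, -0.01, 0.5, 0.02, -1.2, 0.03]
grads = [0.1, 0.4, 0.1, -0.3, 0.2, 0.5]
mask = small_weight_mask(weights, fraction=0.5)
updated = masked_step(weights, grads, mask, lr=0.1)
```

Recomputing the mask every so often during training would give the dynamic behaviour the abstract describes, since weights can move in and out of the small-magnitude set.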

[AI-10] Real-time and personalized product recommendations for large e-commerce platforms ICANN

[Quick read]: This paper addresses real-time, personalized product recommendation on large e-commerce platforms, with a focus on fashion retail. The key to the solution is combining Graph Neural Networks with parsimonious learning methods to deliver accurate, scalable recommendations with minimal response time.

Link: https://arxiv.org/abs/2506.21368
Authors: Matteo Tolloso, Davide Bacciu, Shahab Mokarizadeh, Marco Varesi
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for publication at the International Conference on Artificial Neural Networks (ICANN) 2025. The final authenticated version will be available for purchase through the publisher’s website. The conference proceedings will be published by Springer in the Lecture Notes in Computer Science (LNCS) series

Abstract:We present a methodology to provide real-time and personalized product recommendations for large e-commerce platforms, specifically focusing on fashion retail. Our approach aims to achieve accurate and scalable recommendations with minimal response times, ensuring user satisfaction, leveraging Graph Neural Networks and parsimonious learning methodologies. Extensive experimentation with datasets from one of the largest e-commerce platforms demonstrates the effectiveness of our approach in forecasting purchase sequences and handling multi-interaction scenarios, achieving efficient personalized recommendations under real-world constraints.

[AI-11] rQdia: Regularizing Q-Value Distributions With Image Augmentation

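[Quick read]: This paper targets limited sample efficiency and long-term training performance in pixel-based deep reinforcement learning. The key to the solution is rQdia, which regularizes Q-value distributions across augmented images with a simple auxiliary loss that equalizes those distributions via mean squared error. This improves DrQ and SAC on the MuJoCo Continuous Control Suite and Data-Efficient Rainbow on Atari Arcade environments.

Link: https://arxiv.org/abs/2506.21367
Authors: Sam Lerman, Jing Bi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: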

Abstract:rQdia regularizes Q-value distributions with augmented images in pixel-based deep reinforcement learning. With a simple auxiliary loss that equalizes these distributions via MSE, rQdia boosts DrQ and SAC on 9/12 and 10/12 tasks respectively in the MuJoCo Continuous Control Suite from pixels, and Data-Efficient Rainbow on 18/26 Atari Arcade environments. Gains are measured in both sample efficiency and longer-term training. Moreover, the addition of rQdia finally propels model-free continuous control from pixels over the state encoding baseline.
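
To make the auxiliary loss concrete, here is a rough sketch of an MSE penalty between Q-values computed on an observation and on an augmented copy of it. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the toy linear Q-function, the cyclic-shift augmentation, and all names.

```python
def mse(xs, ys):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def q_values(obs, actions, weights):
    """Toy linear Q-function: Q(obs, a) = sum_i weights[a][i] * obs[i]."""
    return [sum(w * o for w, o in zip(weights[a], obs)) for a in actions]

def shift_augment(obs, shift):
    """Hypothetical augmentation: cyclically shift a flattened observation."""
    return obs[-shift:] + obs[:-shift]

obs = [0.0, 1.0, 0.5, 0.2]
weights = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.4, 0.3, 0.2, 0.1]}
actions = [0, 1]
q_orig = q_values(obs, actions, weights)
q_aug = q_values(shift_augment(obs, 1), actions, weights)
aux_loss = mse(q_orig, q_aug)  # would be added to the usual RL loss with some coefficient
```

Minimizing `aux_loss` pushes the Q-function to assign similar values to an image and its augmented version.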

[AI-12] A Systematic Review of Human-AI Co-Creativity

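[Quick read]: This paper asks how to design co-creative systems that support human creativity more effectively, centering on the key dimensions and considerations of system design. Through a systematic literature review of 62 studies, it extracts the core dimensions: phase of the creative process, creative task, system proactivity, user control, system embodiment, and AI model type. It also proposes 24 design considerations, and finds that high user control increases satisfaction, trust, and the sense of ownership over creative outcomes, while proactive systems that adapt to context can enhance collaboration.

Link: https://arxiv.org/abs/2506.21333
Authors: Saloni Singh, Koen Hindriks, Dirk Heylen, Kim Baraka
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: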

Abstract:The co-creativity community is making significant progress in developing more sophisticated and tailored systems to support and enhance human creativity. Design considerations from prior work can serve as a valuable and efficient foundation for future systems. To support this effort, we conducted a systematic literature review of 62 papers on co-creative systems. These papers cover a diverse range of applications, including visual arts, design, and writing, where the AI acts not just as a tool but as an active collaborator in the creative process. From this review, we identified several key dimensions relevant to system design: phase of the creative process, creative task, proactive behavior of the system, user control, system embodiment, and AI model type. Our findings suggest that systems offering high user control lead to greater satisfaction, trust, and a stronger sense of ownership over creative outcomes. Furthermore, proactive systems, when adaptive and context-sensitive, can enhance collaboration. We also extracted 24 design considerations, highlighting the value of encouraging users to externalize their thoughts and of increasing the system’s social presence and transparency to foster trust. Despite recent advancements, important gaps remain, such as limited support for early creative phases like problem clarification, and challenges related to user adaptation to AI systems.

[AI-13] Active Inference AI Systems for Scientific Discovery

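[Quick read]: This paper addresses three core gaps in current AI-driven scientific discovery: the abstraction gap, the reasoning gap, and the reality gap. The key to the solution is an active inference AI system that combines long-lived research memory, symbolic or neuro-symbolic planners, a continually growing knowledge graph, and closed-loop interaction with high-fidelity simulators and automated laboratories, so that internal models and external validation inform each other. The architecture emphasizes causal structure, continuous calibration, and human judgment as an indispensable component, to keep scientific reasoning valid and reliable.

Link: https://arxiv.org/abs/2506.21329
Authors: Karthik Duraisamy
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments: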

Abstract:The rapid evolution of artificial intelligence has led to expectations of transformative scientific discovery, yet current systems remain fundamentally limited by their operational architectures, brittle reasoning mechanisms, and their separation from experimental reality. Building on earlier work, we contend that progress in AI-driven science now depends on closing three fundamental gaps – the abstraction gap, the reasoning gap, and the reality gap – rather than on model size/data/test time compute. Scientific reasoning demands internal representations that support simulation of actions and response, causal structures that distinguish correlation from mechanism, and continuous calibration. We define active inference AI systems for scientific discovery as those that (i) maintain long-lived research memories grounded in causal self-supervised foundation models, (ii) employ symbolic or neuro-symbolic planners equipped with Bayesian guardrails, (iii) grow persistent knowledge graphs where thinking generates novel conceptual nodes, reasoning establishes causal edges, and real-world interaction prunes false connections while strengthening verified pathways, and (iv) refine their internal representations through closed-loop interaction with both high-fidelity simulators and automated laboratories: an operational loop where mental simulation guides action and empirical surprise reshapes understanding. In essence, we outline an architecture where discovery arises from the interplay between internal models that enable counterfactual reasoning and external validation that grounds hypotheses in reality. It is also argued that the inherent ambiguity in feedback from simulations and experiments, together with the underlying uncertainties, makes human judgment indispensable, not as a temporary scaffold but as a permanent architectural component.

[AI-14] IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems

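[Quick read]: This paper addresses the fact that most existing post-hoc explainable AI (XAI) methods are static and neglect the user perspective, which limits their effectiveness for the target audience. The key to the solution is IXAII, an interactive explainable intelligent system that integrates four XAI methods (LIME, SHAP, Anchors, and DiCE) and offers tailored views for five user groups, giving users control over the content and format of explanations and thereby improving transparency and usefulness.

Link: https://arxiv.org/abs/2506.21310
Authors: Pauline Speckmann, Mario Nadj, Christian Janiesch
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 9 pages, 2 figures, accepted to DESRIST 2025 Prototype Track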

Abstract:Although several post-hoc methods for explainable AI have been developed, most are static and neglect the user perspective, limiting their effectiveness for the target audience. In response, we developed the interactive explainable intelligent system called IXAII that offers explanations from four explainable AI methods: LIME, SHAP, Anchors, and DiCE. Our prototype provides tailored views for five user groups and gives users agency over the explanations’ content and their format. We evaluated IXAII through interviews with experts and lay users. Our results indicate that IXAII, which provides different explanations with multiple visualization options, is perceived as helpful to increase transparency. By bridging the gaps between explainable AI methods, interactivity, and practical implementation, we provide a novel perspective on AI explanation practices and human-AI interaction.

[AI-15] On Uniform Weighted Deep Polynomial approximation

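[Quick read]: This paper addresses the efficient approximation of functions with asymmetric behavior (growing unbounded on one side and decaying on the other), for which classical polynomial approximation struggles to reach exponential convergence. The key to the solution is a weighted deep polynomial approximation framework that multiplies a learnable deep polynomial by a one-sided weight, capturing both local non-smoothness and global growth. In numerical experiments it outperforms Taylor, Chebyshev, and standard deep polynomial approximants at the same parameter count.

Link: https://arxiv.org/abs/2506.21306
Authors: Kingsley Yeon, Steven B. Damelin
Affiliation: unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: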

Abstract:It is a classical result in rational approximation theory that certain non-smooth or singular functions, such as |x| and x^{1/p}, can be efficiently approximated using rational functions with root-exponential convergence in terms of degrees of freedom \cite{Sta, GN}. In contrast, polynomial approximations admit only algebraic convergence by Jackson’s theorem \cite{Lub2}. Recent work shows that composite polynomial architectures can recover exponential approximation rates even without smoothness \cite{KY}. In this work, we introduce and analyze a class of weighted deep polynomial approximants tailored for functions with asymmetric behavior: growing unbounded on one side and decaying on the other. By multiplying a learnable deep polynomial with a one-sided weight, we capture both local non-smoothness and global growth. We show numerically that this framework outperforms Taylor, Chebyshev, and standard deep polynomial approximants, even when all use the same number of parameters. To optimize these approximants in practice, we propose a stable graph-based parameterization strategy building on \cite{Jar}.

[AI-16] Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou

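[Quick read]: This paper addresses vehicle speed classification from acoustic data in urban road environments, in support of real-time noise monitoring and speed estimation in intelligent traffic management systems. The key to the solution is a bimodal-feature-fusion deep convolutional neural network (BMCNN): parallel branches extract Mel-frequency cepstral coefficients (MFCCs) and wavelet-packet energy features, which are fused in an intermediate feature space by a cross-modal attention mechanism to exploit time-frequency information and improve classification performance.

Link: https://arxiv.org/abs/2506.21269
Authors: Pengfei Fan, Yuli Zhang, Xinheng Wang, Ruiyuan Jiang, Hankang Gu, Dongyao Jia, Shangbo Wang
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: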

Abstract:This study presents and publicly releases the Suzhou Urban Road Acoustic Dataset (SZUR-Acoustic Dataset), which is accompanied by comprehensive data-acquisition protocols and annotation guidelines to ensure transparency and reproducibility of the experimental workflow. To model the coupling between vehicular noise and driving speed, we propose a bimodal-feature-fusion deep convolutional neural network (BMCNN). During preprocessing, an adaptive denoising and normalization strategy is applied to suppress environmental background interference; in the network architecture, parallel branches extract Mel-frequency cepstral coefficients (MFCCs) and wavelet-packet energy features, which are subsequently fused via a cross-modal attention mechanism in the intermediate feature space to fully exploit time-frequency information. Experimental results demonstrate that BMCNN achieves a classification accuracy of 87.56% on the SZUR-Acoustic Dataset and 96.28% on the public IDMT-Traffic dataset. Ablation studies and robustness tests on the Suzhou dataset further validate the contributions of each module to performance improvement and overfitting mitigation. The proposed acoustics-based speed classification method can be integrated into smart-city traffic management systems for real-time noise monitoring and speed estimation, thereby optimizing traffic flow control, reducing roadside noise pollution, and supporting sustainable urban planning.

[AI-17] World-aware Planning Narratives Enhance Large Vision-Language Model Planner

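[Quick read]: This paper addresses the poor performance of Large Vision-Language Models (LVLMs) on embodied planning in complex scenarios involving unfamiliar environments and multi-step goals. Existing approaches rely on environment-agnostic imitation learning, which disconnects instructions from environmental context; models then struggle with context-sensitive instructions and lean on supplementary cues rather than visual reasoning during long-horizon interaction. The key to the solution is the World-Aware Planning Narrative Enhancement (WAP) framework, which gives LVLMs comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding), while training and evaluating models on raw visual observations only, using curriculum learning.

Link: https://arxiv.org/abs/2506.21230
Authors: Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, Xipeng Qiu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: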

Abstract:Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that disconnects instructions from environmental contexts, causing models to struggle with context-sensitive instructions and rely on supplementary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding) while developing and evaluating models using only raw visual observations through curriculum learning. Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5-VL achieving a 60.7 absolute improvement in task success rates, particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, our enhanced open-source models outperform proprietary systems like GPT-4o and Claude-3.5-Sonnet by a large margin.

[AI-18] T3: Multi-level Tree-based Automatic Program Repair with Large Language Models

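[Quick read]: This paper addresses the underuse of Chain-of-Thought (CoT) techniques in Automatic Program Repair (APR), which stems from the complex logic and multi-step reasoning that APR requires. The key to the solution is T^3, a framework that combines the reasoning capabilities of Large Language Models (LLMs) with tree search, improving the precision of generated candidate repairs and offering guidance for optimizing sample selection and repair strategies.

Link: https://arxiv.org/abs/2506.21211
Authors: Quanming Liu, Xupeng Bu, Zhichao Yan, Ru Li
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: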

Abstract:Automatic Program Repair (APR) is a core technology in software development and maintenance, which aims to enable automated defect repair with minimal human intervention. In recent years, substantial advancements in Large Language Models (LLMs) and Chain-of-Thought (CoT) techniques have significantly enhanced the reasoning capabilities of these models. However, due to the complex logic and multi-step reasoning required, the application of CoT techniques in the APR domain remains insufficient. This study systematically evaluates the performance of several common CoT techniques in APR tasks and proposes an innovative framework, T^3, which integrates the powerful reasoning capabilities of LLMs with tree search, effectively improving the precision of generating candidate repair solutions. Furthermore, T^3 provides valuable guidance for optimizing sample selection and repair strategies in APR tasks, establishing a robust framework for achieving efficient automated debugging.

[AI-19] A Hierarchical Deep Learning Approach for Minority Instrument Detection

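[Quick read]: This paper addresses accurate identification of instrument activity in audio excerpts for music information retrieval, particularly for instrument classes with limited data. The key to the solution is hierarchical classification based on the Hornbostel-Sachs taxonomy, which makes coarse-level instrument detection more reliable and bridges the gap between detailed instrument identification and group-level recognition.

Link: https://arxiv.org/abs/2506.21167
Authors: Dylan Sechet, Francesca Bugiotti, Matthieu Kowalski, Edouard d’Hérouville, Filip Langiewicz
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: International Conference on Digital Audio Effects (DAFx)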

Abstract:Identifying instrument activities within audio excerpts is vital in music information retrieval, with significant implications for music cataloging and discovery. Prior deep learning endeavors in musical instrument recognition have predominantly emphasized instrument classes with ample data availability. Recent studies have demonstrated the applicability of hierarchical classification in detecting instrument activities in orchestral music, even with limited fine-grained annotations at the instrument level. Based on the Hornbostel-Sachs classification, such a hierarchical classification system is evaluated using the MedleyDB dataset, renowned for its diversity and richness concerning various instruments and music genres. This work presents various strategies to integrate hierarchical structures into models and tests a new class of models for hierarchical music prediction. This study showcases more reliable coarse-level instrument detection by bridging the gap between detailed instrument identification and group-level recognition, paving the way for further advancements in this domain.

[AI-20] Linearity-based neural network compression

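[Quick read]: This paper asks how to reduce the parameters of a neural network further, in particular on top of models that are already highly optimized. The key to the solution is linearity-based compression: with ReLU-like activation functions, neurons that are almost always activated behave linearly, so subsequent layers can be merged, yielding lossless compression. Experiments show compression down to 1/4 of the original model size for the majority of tested models, with little interference when combined with importance-based pruning.

Link: https://arxiv.org/abs/2506.21146
Authors: Silas Dobler, Florian Lemmerich
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: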

Abstract:In neural network compression, most current methods reduce unnecessary parameters by measuring importance and redundancy. To augment already highly optimized existing solutions, we propose linearity-based compression as a novel way to reduce weights in a neural network. It is based on the intuition that with ReLU-like activation functions, neurons that are almost always activated behave linearly, allowing for merging of subsequent layers. We introduce the theory underlying this compression and evaluate our approach experimentally. Our novel method achieves a lossless compression down to 1/4 of the original model size in the majority of tested models. Applying our method on already importance-based pruned models shows very little interference between different types of compression, demonstrating the option of successful combination of techniques. Overall, our work lays the foundation for a new type of compression method that enables smaller and ultimately more efficient neural network models.
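
The intuition in the abstract can be checked on a toy case. If every hidden ReLU unit is active for the inputs of interest, the ReLU acts as the identity and two layers collapse into one: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The sketch below covers only this fully active case and is not the paper's method for deciding which neurons to merge; the weights and helper names are invented for illustration.

```python
def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    return [max(0, x) for x in v]

def two_layer(x, w1, b1, w2, b2):
    """Original network: linear layer, ReLU, linear layer."""
    h = relu([s + t for s, t in zip(matvec(w1, x), b1)])
    return [s + t for s, t in zip(matvec(w2, h), b2)]

def merge_layers(w1, b1, w2, b2):
    """Collapse both layers into one, valid where all hidden units are active."""
    w = matmul(w2, w1)
    b = [s + t for s, t in zip(matvec(w2, b1), b2)]
    return w, b

w1, b1 = [[1, 2], [0, 1]], [1, 1]
w2, b2 = [[2, -1]], [3]
x = [1, 2]  # hidden pre-activations are [6, 3], so both ReLUs are active
w, b = merge_layers(w1, b1, w2, b2)
merged_out = [s + t for s, t in zip(matvec(w, x), b)]
```

The merged layer has 2 weights and 1 bias where the original had 6 weights and 3 biases, and it gives the same output for any input that keeps both hidden units active.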

[AI-21] DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding

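[Quick read]: This paper addresses a limitation of convolutional neural networks (CNNs) in electroencephalography (EEG)-based brain-computer interfaces (BCIs): their limited receptive field makes it hard to capture long-range temporal dependencies and global inter-channel relationships. Existing hybrid models partially mitigate this, but most use a serial design that integrates local and global features suboptimally and often omits explicit channel-wise modeling. The proposed DBConformer is a dual-branch convolutional Transformer network: a temporal Conformer models long-range temporal dependencies, a spatial Conformer extracts inter-channel interactions, and a lightweight channel attention module refines spatial representations in a data-driven way, capturing the spatio-temporal features of EEG signals.

Link: https://arxiv.org/abs/2506.21140
Authors: Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures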

Abstract:Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer (Conformers) hybrids partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments on five motor imagery (MI) datasets and two seizure detection datasets under three evaluation settings demonstrate that DBConformer consistently outperforms 10 competitive baseline models, with over eight times fewer parameters than the high-capacity EEG Conformer baseline. Further, the visualization results confirm that the features extracted by DBConformer are physiologically interpretable and aligned with sensorimotor priors in MI. The superior performance and interpretability of DBConformer make it reliable for robust and explainable EEG decoding. Code is publicized at this https URL.

[AI-22] How Good Are Synthetic Requirements? Evaluating LLM-Generated Datasets for AI4RE

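[Quick read]: This paper addresses the shortage of publicly available, labeled requirements datasets, a major barrier to progress in Artificial Intelligence for Requirements Engineering (AI4RE). The key to the solution is Synthline v1, an enhanced Product Line approach that generates synthetic requirements data using advanced generation strategies and curation techniques. By systematically controlling and optimizing the quality of generated requirements, it improves the utility and diversity of the data, and on specific classification tasks it matches or outperforms human-authored requirements.

Link: https://arxiv.org/abs/2506.21138
Authors: Abdelkarim El-Hajjami, Camille Salinesi
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: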

Abstract:The shortage of publicly available, labeled requirements datasets remains a major barrier to advancing Artificial Intelligence for Requirements Engineering (AI4RE). While Large Language Models offer promising capabilities for synthetic data generation, systematic approaches to control and optimize the quality of generated requirements remain underexplored. This paper presents Synthline v1, an enhanced Product Line approach for generating synthetic requirements data that extends our earlier v0 version with advanced generation strategies and curation techniques. We investigate four research questions assessing how prompting strategies, automated prompt optimization, and post-generation curation affect data quality across four classification tasks: defect detection, functional vs. non-functional, quality vs. non-quality, and security vs. non-security. Our evaluation shows that multi-sample prompting significantly boosts both utility and diversity over single-sample generation, with F1-score gains from 6 to 44 points. The use of PACE (Prompt Actor-Critic Editing) for automated prompt optimization yields task-dependent results, greatly improving functional classification (+32.5 points) but reducing performance on others. Interestingly, similarity-based curation improves diversity but often harms classification performance, indicating that some redundancy may help ML models. Most importantly, our results show that synthetic requirements can match or outperform human-authored ones for specific tasks, with synthetic data surpassing human data for security (+7.8 points) and defect classification (+15.4 points). These findings offer practical insights for AI4RE and chart a viable path to mitigating dataset scarcity through systematic synthetic generation.

[AI-23] Curriculum-Guided Antifragile Reinforcement Learning for Secure UAV Deconfliction under Observation-Space Attacks

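[Quick read]: This paper addresses the fragility of reinforcement learning (RL) policies in safety-critical systems under out-of-distribution (OOD) adversarial attacks in the observation space; such attacks significantly degrade value estimation and lead to unsafe or suboptimal decisions. The key to the solution is an antifragile RL framework in which a simulated attacker incrementally increases the strength of observation-space perturbations, so that the RL agent adapts and generalizes to a wider range of OOD observations and anticipates previously unseen attacks. The framework enforces bounds on value shifts through iterative expert-guided critic alignment using Wasserstein distance minimization, which stabilizes forgetting and improves policy robustness.

Link: https://arxiv.org/abs/2506.21129
Authors: Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: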

Abstract:Reinforcement learning (RL) policies deployed in safety-critical systems, such as unmanned aerial vehicle (UAV) navigation in dynamic airspace, are vulnerable to out-of-distribution (OOD) adversarial attacks in the observation space. These attacks induce distributional shifts that significantly degrade value estimation, leading to unsafe or suboptimal decision making and rendering the existing policy fragile. To address this vulnerability, we propose an antifragile RL framework designed to adapt against a curriculum of incremental adversarial perturbations. The framework introduces a simulated attacker that incrementally increases the strength of observation-space perturbations, which enables the RL agent to adapt and generalize across a wider range of OOD observations and anticipate previously unseen attacks. We begin with a theoretical characterization of fragility, formally defining catastrophic forgetting as a monotonic divergence in value function distributions with increasing perturbation strength. Building on this, we define antifragility as the boundedness of such value shifts and derive adaptation conditions under which forgetting is stabilized. Our method enforces these bounds through iterative expert-guided critic alignment using Wasserstein distance minimization across incrementally perturbed observations. We empirically evaluate the approach in a UAV deconfliction scenario involving dynamic 3D obstacles. Results show that the antifragile policy consistently outperforms standard and robust RL baselines when subjected to both projected gradient descent (PGD) and GPS spoofing attacks, achieving up to 15% higher cumulative reward and over 30% fewer conflict events. These findings demonstrate the practical and theoretical viability of antifragile reinforcement learning for secure and resilient decision-making in environments with evolving threat scenarios.
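
The abstract measures value shifts with a Wasserstein distance. For one-dimensional samples of equal size, the 1-Wasserstein distance reduces to the mean absolute difference between sorted values. The sketch below shows only that calculation; how the paper applies it in critic alignment is not specified here, and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical samples: mean |x_(i) - y_(i)| over sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

clean_values = [1.0, 2.0, 3.0, 4.0]      # hypothetical critic outputs on clean observations
perturbed_values = [0.5, 1.5, 2.5, 3.5]  # hypothetical outputs under a perturbation
shift = wasserstein_1d(clean_values, perturbed_values)
```

A bounded `shift` as perturbation strength grows corresponds to the boundedness property that the abstract calls antifragility.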

[AI-24] Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments

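[Quick read]: This paper addresses a security problem in reinforcement learning (RL)-based navigation for unmanned aerial vehicles (UAVs): adversarial attacks exploit RL vulnerabilities through sensor manipulation, shifting the state-action-value distribution. Existing robust RL methods generalize poorly to out-of-distribution shifts because they are designed mainly for fixed perturbations. The key to the solution is an antifragile RL framework built around a switching mechanism based on discounted Thompson sampling (DTS), which dynamically selects among multiple robust policies to minimize adversarially induced distribution shift. The mechanism derives a diverse ensemble of action-robust policies in policy space and treats their selection as a multi-armed bandit (MAB) problem, adapting to evolving adversarial strategies and to previously unseen attacks.

Link: https://arxiv.org/abs/2506.21127
Authors: Deepak Kumar Panda, Weisi Guo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: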

Abstract:The increasing automation of navigation for unmanned aerial vehicles (UAVs) has exposed them to adversarial attacks that exploit vulnerabilities in reinforcement learning (RL) through sensor manipulation. Although existing robust RL methods aim to mitigate such threats, they generalize poorly to out-of-distribution shifts from the optimal value distribution, as they are primarily designed to handle fixed perturbations. To address this limitation, this paper introduces an antifragile RL framework that enhances adaptability to broader distributional shifts by incorporating a switching mechanism based on discounted Thompson sampling (DTS). This mechanism dynamically selects among multiple robust policies to minimize adversarially induced state-action-value distribution shifts. The proposed approach first derives a diverse ensemble of action-robust policies by accounting for a range of perturbations in the policy space. The choice among these policies is then modeled as a multi-armed bandit (MAB) problem, where DTS optimally selects policies in response to nonstationary Bernoulli rewards, effectively adapting to evolving adversarial strategies. A theoretical framework is also provided, showing that optimizing DTS to minimize the overall regret due to distributional shift results in effective adaptation against unseen adversarial attacks, thus inducing antifragility. Extensive numerical simulations validate the effectiveness of the proposed framework in complex navigation environments with multiple dynamic three-dimensional obstacles and with stronger projected gradient descent (PGD) and spoofing attacks. Compared to conventional robust, non-adaptive RL methods, the antifragile approach achieves superior performance, demonstrating shorter navigation path lengths and a higher rate of conflict-free navigation trajectories.
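
As a rough illustration of the switching idea, the sketch below runs discounted Thompson sampling on a Bernoulli bandit whose arms stand in for candidate policies. It follows the generic DTS recipe (discount all success and failure counts, then credit the chosen arm) and is not the paper's implementation. The discount `gamma`, the Beta(1, 1) prior, and the reward probabilities are assumptions.

```python
import random

def discounted_thompson(reward_probs, steps, gamma=0.95, seed=0):
    """Return how often each arm was pulled under discounted Thompson sampling."""
    rng = random.Random(seed)
    n = len(reward_probs)
    succ = [0.0] * n   # discounted success counts
    fail = [0.0] * n   # discounted failure counts
    pulls = [0] * n
    for _ in range(steps):
        # Sample a plausible success rate per arm from its Beta posterior.
        samples = [rng.betavariate(succ[k] + 1.0, fail[k] + 1.0) for k in range(n)]
        arm = max(range(n), key=lambda k: samples[k])
        reward = 1 if rng.random() < reward_probs[arm] else 0
        # Discount every arm so that old evidence fades, then credit the chosen arm.
        for k in range(n):
            succ[k] *= gamma
            fail[k] *= gamma
        succ[arm] += reward
        fail[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical setting: policy 0 succeeds 90% of the time, policy 1 only 10%.
pulls = discounted_thompson([0.9, 0.1], steps=2000)
```

Because the counts decay, an arm that has not been tried recently drifts back toward its prior and gets re-explored, which is what lets the selector track nonstationary rewards.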

[AI-25] PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction

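[Quick read]: This paper addresses the poor adaptability, limited robustness, and low efficiency of phishing attack detection. The key to the solution is PhishKey, a detection method that extracts features automatically from hybrid sources: it combines character-level processing with a Convolutional Neural Network (CNN) for URL classification, and uses a Centroid-Based Key Component Phishing Extractor (CAPE) to process HTML content at the word level, reducing noise while processing each sample in full. A soft-voting ensemble then merges the predictions of the two modules for more accurate and reliable classification.

Link: https://arxiv.org/abs/2506.21106
Authors: Felipe Castaño, Eduardo Fidalgo, Enrique Alegre, Rocio Alaiz-Rodríguez, Raul Orduna, Francesco Zola
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: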

Abstract:Phishing attacks pose a significant cybersecurity threat, evolving rapidly to bypass detection mechanisms and exploit human vulnerabilities. This paper introduces PhishKey to address the challenges of adaptability, robustness, and efficiency. PhishKey is a novel phishing detection method using automatic feature extraction from hybrid sources. PhishKey combines character-level processing with Convolutional Neural Networks (CNN) for URL classification, and a Centroid-Based Key Component Phishing Extractor (CAPE) for HTML content at the word level. CAPE reduces noise and ensures complete sample processing avoiding crop operations on the input data. The predictions from both modules are integrated using a soft-voting ensemble to achieve more accurate and reliable classifications. Experimental evaluations on four state-of-the-art datasets demonstrate the effectiveness of PhishKey. It achieves up to 98.70% F1 Score and shows strong resistance to adversarial manipulations such as injection attacks with minimal performance degradation.
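
Soft voting, as mentioned in the abstract, averages the class probabilities of the individual models before thresholding. The sketch below shows the idea for two modules. The equal weighting, the 0.5 threshold, and the function names are assumptions, not details taken from PhishKey.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two phishing probabilities."""
    return weight_a * prob_a + (1.0 - weight_a) * prob_b

def classify(url_prob, html_prob, threshold=0.5):
    """Label a sample from the URL-module and HTML-module probabilities."""
    return "phishing" if soft_vote(url_prob, html_prob) >= threshold else "benign"

# The URL module is confident, the HTML module is not; the average decides.
label = classify(url_prob=0.9, html_prob=0.3)
```

Unlike hard voting on predicted labels, soft voting lets a confident module outweigh an uncertain one.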

[AI-26] Interpretable Hierarchical Concept Reasoning through Attention-Guided Graph Learning

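[Quick read]: This paper addresses an interpretability limitation of Concept-Based Models (CBMs): they explain only the final task prediction, while the concept predictions themselves are typically produced by black-box neural networks. The key to the solution is the Hierarchical Concept Memory Reasoner (H-CMR), which models relationships between concepts with a learned directed acyclic graph and uses a neural attention mechanism to select logic rules at inference time, making both concept and task predictions interpretable.

Link: https://arxiv.org/abs/2506.21102
Authors: David Debot, Pietro Barbiero, Gabriele Dominici, Giuseppe Marra
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: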

Abstract:Concept-Based Models (CBMs) are a class of deep learning models that provide interpretability by explaining predictions through high-level concepts. These models first predict concepts and then use them to perform a downstream task. However, current CBMs offer interpretability only for the final task prediction, while the concept predictions themselves are typically made via black-box neural networks. To address this limitation, we propose Hierarchical Concept Memory Reasoner (H-CMR), a new CBM that provides interpretability for both concept and task predictions. H-CMR models relationships between concepts using a learned directed acyclic graph, where edges represent logic rules that define concepts in terms of other concepts. During inference, H-CMR employs a neural attention mechanism to select a subset of these rules, which are then applied hierarchically to predict all concepts and the final task. Experimental results demonstrate that H-CMR matches state-of-the-art performance while enabling strong human interaction through concept and model interventions. The former can significantly improve accuracy at inference time, while the latter can enhance data efficiency during training when background knowledge is available.

[AI-27] FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

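[Quick read]: This paper addresses fairness problems in Federated Learning (FL) caused by heterogeneous data distributions across clients; in particular, most existing methods mitigate bias for a single sensitive attribute and overlook the diverse, sometimes conflicting fairness needs of different clients. The key to the solution is a consistent benchmarking framework: the FeDa4Fair library generates tabular datasets tailored to heterogeneous client bias, and the authors release four bias-heterogeneous datasets with corresponding benchmarks, plus ready-to-use functions for evaluating fairness outcomes, to support more robust and reproducible fairness research.

Link: https://arxiv.org/abs/2506.21095
Authors: Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: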

Abstract:Federated Learning (FL) enables collaborative model training across multiple clients without sharing clients’ private data. However, fairness remains a key concern, as biases in local clients’ datasets can impact the entire federated system. Heterogeneous data distributions across clients may lead to models that are fairer for some clients than others. Although several fairness-enhancing solutions are present in the literature, most focus on mitigating bias for a single sensitive attribute, typically binary, overlooking the diverse and sometimes conflicting fairness needs of different clients. This limited perspective can limit the effectiveness of fairness interventions for the different clients. To support more robust and reproducible fairness research in FL, we aim to enable a consistent benchmarking of fairness-aware FL methods at both the global and client levels. In this paper, we contribute in three ways: (1) We introduce FeDa4Fair, a library to generate tabular datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release four bias-heterogeneous datasets and corresponding benchmarks to compare fairness mitigation methods in a controlled environment; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.
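
To give a sense of what client-level fairness evaluation can look like, the sketch below computes a demographic parity difference separately for each client. This is not the FeDa4Fair API. The choice of metric, the function name, and the toy client data are all assumptions made for illustration.

```python
def demographic_parity_difference(preds, groups):
    """|P(pred=1 | group=0) - P(pred=1 | group=1)| for binary predictions and a binary attribute."""
    rates = []
    for g in (0, 1):
        selected = [p for p, s in zip(preds, groups) if s == g]
        rates.append(sum(selected) / len(selected))
    return abs(rates[0] - rates[1])

# Hypothetical clients: (predictions, sensitive-attribute values) per client.
clients = {
    "client_a": ([1, 1, 0, 0], [0, 0, 1, 1]),
    "client_b": ([1, 0, 1, 0], [0, 0, 1, 1]),
}
per_client = {name: demographic_parity_difference(p, g) for name, (p, g) in clients.items()}
```

In this toy data the same model is maximally unfair on one client and perfectly fair on the other, which is the kind of heterogeneity a client-level benchmark is meant to expose.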

[AI-28] Efficient Skill Discovery via Regret-Aware Optimization

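[Quick Read]: This paper addresses the limited efficiency of unsupervised skill discovery in open-ended reinforcement learning, especially in high-dimensional environments. The key to the solution is to model skill discovery as a min-max game between skill generation and policy learning, and to propose a regret-aware method built on temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength, improving both efficiency and diversity. The core idea is that skill discovery is adversarial to policy learning: skills with weak strength should be explored further, while converged, strong skills receive less exploration.

Link: https://arxiv.org/abs/2506.21044
Authors: He Zhang,Ming Zhou,Shaopeng Zhai,Ying Sun,Hui Xiong
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: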

Abstract:Unsupervised skill discovery aims to learn diverse and distinguishable behaviors in open-ended reinforcement learning. Existing methods focus on improving diversity through pure exploration, mutual information optimization, and temporal representation learning. Although they perform well on exploration, they remain limited in efficiency, especially in high-dimensional settings. In this work, we frame skill discovery as a min-max game of skill generation and policy learning, proposing a regret-aware method on top of temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength. The key insight behind the proposed method is that skill discovery is adversarial to policy learning, i.e., skills with weak strength should be explored further, while skills with converged strength need less exploration. As an implementation, we score the degree of strength convergence with regret, and guide the skill discovery with a learnable skill generator. To avoid degeneration, skill generation comes from an upgradable population of skill generators. We conduct experiments on environments with varying complexities and dimension sizes. Empirical results show that our method outperforms baselines in both efficiency and diversity. Moreover, our method achieves a 15% zero-shot improvement in high-dimensional environments, compared to existing methods.

[AI-29] Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

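[Quick Read]: This paper addresses the reinforcement learning (RL) challenges posed by long-horizon goal-conditioned tasks, where goals are distant and rewards are sparse. The key to the solution is Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that structurally constrains high-level decision-making to ensure single-step subgoal reachability, and combines a decoupled exploration policy with failure-aware path refinement to improve exploration efficiency and subgoal reliability.

Link: https://arxiv.org/abs/2506.21039
Authors: Jaebak Hwang,Sanghyeon Lee,Jeongmo Kim,Seungyul Han
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 technical pages followed by references and appendix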

Abstract:Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, they often suffer from subgoal infeasibility and inefficient planning. We introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that enforces single-step subgoal reachability by structurally constraining high-level decision-making. To enhance exploration, SSE employs a decoupled exploration policy that systematically traverses underexplored regions of the goal space. Furthermore, a failure-aware path refinement step refines graph-based planning by dynamically adjusting edge costs according to observed low-level success rates, thereby improving subgoal reliability. Experimental results across diverse long-horizon benchmarks demonstrate that SSE consistently outperforms existing goal-conditioned RL and hierarchical RL approaches in both efficiency and success rate.
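
The idea of adjusting edge costs according to observed low-level success rates can be sketched as follows. This is not the SSE implementation: the cost formula, the `penalty` value, the Laplace smoothing, and the toy graph are assumptions chosen only to show how failure statistics can reroute a shortest-path planner.

```python
import heapq
import math


def failure_aware_cost(distance, successes, attempts, penalty=5.0):
    """Inflate an edge's cost when the low-level policy often fails to traverse it."""
    # Laplace-smoothed success rate, so untried edges start at 0.5 (an assumption).
    success_rate = (successes + 1) / (attempts + 2)
    return distance + penalty * (1.0 - success_rate)


def shortest_path(graph, start, goal):
    """Plain Dijkstra over a dict of the form {node: {neighbor: cost}}."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


def build_graph(edges, use_failures):
    """Turn {(u, v): (distance, successes, attempts)} into a cost graph."""
    graph = {}
    for (u, v), (dist, succ, att) in edges.items():
        cost = failure_aware_cost(dist, succ, att) if use_failures else dist
        graph.setdefault(u, {})[v] = cost
    return graph


# Hypothetical subgoal graph: the short route via A is rarely executed successfully.
edges = {
    ("S", "A"): (1.0, 1, 10),
    ("A", "G"): (1.0, 9, 10),
    ("S", "B"): (2.0, 9, 10),
    ("B", "G"): (2.0, 9, 10),
}

naive_path = shortest_path(build_graph(edges, use_failures=False), "S", "G")
refined_path = shortest_path(build_graph(edges, use_failures=True), "S", "G")
```

With distances alone the planner picks the route through `A`; once the low success rate of the `S -> A` edge is priced in, it switches to the longer but more reliable route through `B`.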

[AI-30] Enhancing Homophily-Heterophily Separation: Relation-Aware Learning in Heterogeneous Graphs KDD2025

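[Quick Read]: This paper addresses the modeling challenge posed by the coexistence of node heterogeneity and node heterophily in heterogeneous graphs. In particular, existing methods that convert heterogeneous graphs into homogeneous ones to learn node heterophily inevitably lose the potential heterophily information conveyed by heterogeneous relations. The key to the solution is Relation-Aware Separation of Homophily and Heterophily (RASH), a contrastive learning framework that introduces dual heterogeneous hypergraphs to encode multi-relational bipartite subgraphs, dynamically constructs homophilic and heterophilic graphs based on relation importance, and uses a multi-relation contrastive loss to maximize mutual information, thereby explicitly modeling the high-order semantics of heterogeneous interactions and adaptively separating homophilic and heterophilic patterns.

Link: https://arxiv.org/abs/2506.20980
Authors: Ziyu Zheng,Yaming Yang,Ziyu Guan,Wei Zhao,Weigang Lu
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: accepted by KDD 2025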

Abstract:Real-world networks usually have a property of node heterophily, that is, the connected nodes usually have different features or different labels. This heterophily issue has been extensively studied in homogeneous graphs but remains under-explored in heterogeneous graphs, where there are multiple types of nodes and edges. Capturing node heterophily in heterogeneous graphs is very challenging since both node/edge heterogeneity and node heterophily should be carefully taken into consideration. Existing methods typically convert heterogeneous graphs into homogeneous ones to learn node heterophily, which will inevitably lose the potential heterophily conveyed by heterogeneous relations. To bridge this gap, we propose Relation-Aware Separation of Homophily and Heterophily (RASH), a novel contrastive learning framework that explicitly models high-order semantics of heterogeneous interactions and adaptively separates homophilic and heterophilic patterns. Particularly, RASH introduces dual heterogeneous hypergraphs to encode multi-relational bipartite subgraphs and dynamically constructs homophilic graphs and heterophilic graphs based on relation importance. A multi-relation contrastive loss is designed to align heterogeneous and homophilic/heterophilic views by maximizing mutual information. In this way, RASH simultaneously resolves the challenges of heterogeneity and heterophily in heterogeneous graphs. Extensive experiments on benchmark datasets demonstrate the effectiveness of RASH across various downstream tasks. The code is available at: this https URL.

[AI-31] Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends

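[Quick Read]: This paper aims to address the insufficient performance of Vision-Language-Action (VLA) models in applications that demand high precision and accuracy; the core challenge is improving how well the model adapts to specific tasks and environments. The key to the solution is post-training strategies that draw on human motor learning mechanisms and optimize VLA models along four dimensions (environmental perception, embodiment awareness, task comprehension, and multi-component integration) to strengthen their ability to interact with the environment.

Link: https://arxiv.org/abs/2506.20966
Authors: Tian-Yu Xiang,Ao-Qun Jin,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Sheng-Bin Duan,Fu-Chao Xie,Wen-Kai Wang,Si-Cheng Wang,Ling-Yun Li,Tian Tu,Zeng-Guang Hou
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: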

Abstract:Vision-language-action (VLA) models extend vision-language models (VLMs) by integrating action generation modules for robotic manipulation. Leveraging the strengths of VLMs in visual perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training to align foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to address the challenge of improving an embodiment’s ability to interact with the environment for the given tasks, analogous to the process of human motor skill acquisition. Accordingly, this paper reviews post-training strategies for VLA models through the lens of human motor learning, focusing on three dimensions: environments, embodiments, and tasks. A structured taxonomy is introduced aligned with human learning mechanisms: (1) enhancing environmental perception, (2) improving embodiment awareness, (3) deepening task comprehension, and (4) multi-component integration. Finally, key challenges and trends in post-training VLA models are identified, establishing a conceptual framework to guide future research. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. (Project website: this https URL)

[AI-32] Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding IJCAI2025

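[Quick Read]: This paper aims to address key challenges in antibody design, in particular the difficulty of accurately capturing molecular interactions and preserving structural integrity when handling complex antigens. The key to the solution is AbMEGD, an end-to-end framework that integrates Multi-scale Equivariant Graph Diffusion and combines atomic-level geometric features with residue-level embeddings for antibody sequence and structure co-design. AbMEGD uses an E(3)-equivariant diffusion method to provide geometric precision, computational efficiency, and good generalization to complex antigen interfaces.

Link: https://arxiv.org/abs/2506.20957
Authors: Jiameng Chen,Xiantao Cai,Jia Wu,Wenbin Hu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, accepted at IJCAI 2025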

Abstract:Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing to novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose AbMEGD, an end-to-end framework integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13% increase in amino acid recovery, 3.32% rise in improvement percentage, and a 0.062 Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD’s ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: this https URL.

[AI-33] Interpretable Representation Learning for Additive Rule Ensembles

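[Quick Read]: This paper addresses the difficulty that traditional symbolic rule ensembles have in balancing accuracy and interpretability when rich, independent input features are unavailable. Traditional methods rely on simple threshold conditions on single variables, which yields axis-parallel polytopes as decision regions; when the features are not expressive enough, the number and complexity of rules must grow, reducing interpretability. The key to the solution is to introduce logical propositions with learnable sparse linear transformations, i.e., propositions of the form $\mathbf{x}^{\mathrm{T}}\mathbf{w} \geq t$, where $\mathbf{w}$ is a learnable sparse weight vector, so that decision regions can be general polytopes with oblique faces, improving model performance while retaining interpretability.

Link: https://arxiv.org/abs/2506.20927
Authors: Shahrzad Behzadimanesh,Pierre Le Bodic,Geoffrey I. Webb,Mario Boley
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: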

Abstract:Small additive ensembles of symbolic rules offer interpretable prediction models. Traditionally, these ensembles use rule conditions based on conjunctions of simple threshold propositions $x \geq t$ on a single input variable $x$ and threshold $t$, resulting geometrically in axis-parallel polytopes as decision regions. While this form ensures a high degree of interpretability for individual rules and can be learned efficiently using the gradient boosting approach, it relies on having access to a curated set of expressive and ideally independent input features so that a small ensemble of axis-parallel regions can describe the target variable well. Absent such features, reaching sufficient accuracy requires increasing the number and complexity of individual rules, which diminishes the interpretability of the model. Here, we extend classical rule ensembles by introducing logical propositions with learnable sparse linear transformations of input variables, i.e., propositions of the form $\mathbf{x}^{\mathrm{T}}\mathbf{w} \geq t$, where $\mathbf{w}$ is a learnable sparse weight vector, enabling decision regions as general polytopes with oblique faces. We propose a learning method using sequential greedy optimization based on an iteratively reweighted formulation of logistic regression. Experimental results demonstrate that the proposed method efficiently constructs rule ensembles with the same test risk as state-of-the-art methods while significantly reducing model complexity across ten benchmark datasets.
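
To make the difference between axis-parallel and oblique propositions concrete, here is a minimal sketch of how such a rule ensemble could be evaluated. It only covers evaluation, not the paper's learning method; the helper names (`proposition`, `rule`, `ensemble_score`), the weights, and the sample points are hypothetical.

```python
def proposition(weights, threshold):
    """Return a test for x^T w >= t, with w given sparsely as {feature_index: weight}."""
    def holds(x):
        return sum(w * x[i] for i, w in weights.items()) >= threshold
    return holds


def rule(propositions, output):
    """A rule contributes `output` only if all of its propositions hold (a conjunction)."""
    def evaluate(x):
        return output if all(p(x) for p in propositions) else 0.0
    return evaluate


def ensemble_score(rules, x):
    """Additive ensemble: sum the contributions of all rules that fire on x."""
    return sum(r(x) for r in rules)


# Classical axis-parallel proposition: x0 >= 0.5 (a single non-zero weight).
axis_parallel = rule([proposition({0: 1.0}, 0.5)], 1.0)

# Oblique proposition: x0 - x1 >= 0 (two non-zero weights, still sparse).
oblique = rule([proposition({0: 1.0, 1: -1.0}, 0.0)], 1.0)

points = [(0.6, 0.9), (0.4, 0.1)]
axis_outputs = [axis_parallel(p) for p in points]
oblique_outputs = [oblique(p) for p in points]
total_first_point = ensemble_score([axis_parallel, oblique], points[0])
```

The two rules disagree on both sample points: the boundary `x0 = x1` is a diagonal line, which a single axis-parallel threshold cannot express and which would otherwise need many small axis-parallel rules to approximate.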

[AI-34] LLM-guided Chemical Process Optimization with a Multi-Agent Approach

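[Quick Read]: This paper addresses the failure of traditional optimization methods (such as gradient-based solvers, evolutionary algorithms, and parameter grid searches) in chemical process optimization when operating constraints are ill-defined or unavailable, a situation that typically forces engineers to rely on subjective heuristics to estimate feasible parameter ranges. The key to the solution is a multi-agent framework based on large language models (LLMs) that autonomously infers operating constraints from minimal process descriptions and then guides optimization collaboratively, removing the need for predefined operational bounds. The framework uses the AutoGen architecture and OpenAI's o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance, yielding an efficient, domain-informed optimization process.

Link: https://arxiv.org/abs/2506.20921
Authors: Tong Zeng,Srivathsan Badrinarayanan,Janghoon Ock,Cheng-Kai Lai,Amir Barati Farimani
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 16 pages (main manuscript without references), 2 figures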

Abstract:Chemical process optimization is crucial to maximize production efficiency and economic performance. Traditional methods, including gradient-based solvers, evolutionary algorithms, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable, requiring engineers to rely on subjective heuristics to estimate feasible parameter ranges. To address this constraint definition bottleneck, we present a multi-agent framework of large language model (LLM) agents that autonomously infer operating constraints from minimal process descriptions, then collaboratively guide optimization using the inferred constraints. Our AutoGen-based agentic framework employs OpenAI’s o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance. Through two phases - autonomous constraint generation using embedded domain knowledge, followed by iterative multi-agent optimization - the framework eliminates the need for predefined operational bounds. Validated on the hydrodealkylation process across cost, yield, and yield-to-cost ratio metrics, the framework demonstrated competitive performance with conventional optimization methods while achieving better computational efficiency, requiring fewer iterations to converge. Our approach converged in under 20 minutes, achieving a 31-fold speedup over grid search. Beyond computational efficiency, the framework’s reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs, and applying domain-informed heuristics. This approach shows significant potential for optimization scenarios where operational constraints are poorly characterized or unavailable, particularly for emerging processes and retrofit applications.

[AI-35] ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models

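[Quick Read]: This paper aims to ensure the integrity of the computational provenance of Large Language Models (LLMs) in sensitive domains, particularly in regulated sectors such as healthcare that impose strict requirements on dataset usage. The key to the solution is ZKPROV, a cryptographic framework based on zero-knowledge proofs that verifies a model was trained on a trusted dataset without revealing sensitive information about the dataset or the model parameters. Using dataset-signed metadata and compact model parameter commitments, ZKPROV cryptographically binds a trained model to its authorized dataset(s), providing sound and privacy-preserving assurances.

Link: https://arxiv.org/abs/2506.20915
Authors: Mina Namazi,Alexander Nemecek,Erman Ayday
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 1 figure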

Abstract:As the deployment of large language models (LLMs) grows in sensitive domains, ensuring the integrity of their computational provenance becomes a critical challenge, particularly in regulated sectors such as healthcare, where strict requirements are applied in dataset usage. We introduce ZKPROV, a novel cryptographic framework that enables zero-knowledge proofs of LLM provenance. It allows users to verify that a model is trained on a reliable dataset without revealing sensitive information about it or its parameters. Unlike prior approaches that focus on complete verification of the training process (incurring significant computational cost) or depend on trusted execution environments, ZKPROV offers a distinct balance. Our method cryptographically binds a trained model to its authorized training dataset(s) through zero-knowledge proofs while avoiding proof of every training step. By leveraging dataset-signed metadata and compact model parameter commitments, ZKPROV provides sound and privacy-preserving assurances that the result of the LLM is derived from a model trained on the claimed authorized and relevant dataset. Experimental results demonstrate the efficiency and scalability of the ZKPROV in generating this proof and verifying it, achieving a practical solution for real-world deployments. We also provide formal security guarantees, proving that our approach preserves dataset confidentiality while ensuring trustworthy dataset provenance.

[AI-36] Omniwise: Predicting GPU Kernels Performance with LLMs

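[Quick Read]: This paper aims to address GPU kernel performance prediction, i.e., using large language models (LLMs) to predict key performance metrics directly from kernel code, without executing the code or relying on traditional profiling tools. The key to the solution is Omniwise, an end-to-end self-supervised fine-tuning pipeline that is model-agnostic and lightweight. Without external tools, it predicts memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, with over 90% of predictions falling within 10% relative error on AMD MI250 and MI300X architectures.

Link: https://arxiv.org/abs/2506.20886
Authors: Zixian Wang,Cole Ramos,Muhammad A. Awad,Keith Lowery
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: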

Abstract:In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction, a novel use case in performance profiling. Omniwise is model-agnostic and lightweight, achieving strong results even with a small 3B-parameter model. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures. In addition to the pipeline, we develop an online inference server and a Visual Studio Code plugin that seamlessly integrate LLM-based performance prediction into developers’ workflows.
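
The headline figure, the share of predictions within 10% relative error, is straightforward to compute once predicted and measured values are paired up. The sketch below shows one plausible way to do it, together with the usual definition of arithmetic intensity as floating-point operations per byte moved. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def fraction_within_tolerance(predicted, measured, tolerance=0.10):
    """Share of predictions whose relative error against the measurement is <= tolerance."""
    hits = sum(
        abs(p - m) <= tolerance * abs(m)
        for p, m in zip(predicted, measured)
    )
    return hits / len(measured)


def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


# Hypothetical memory-bandwidth numbers (GB/s) for four kernels.
measured = [100.0, 200.0, 50.0, 80.0]
predicted = [105.0, 170.0, 52.0, 80.0]

accuracy = fraction_within_tolerance(predicted, measured)
intensity = arithmetic_intensity(flops=2048, bytes_moved=1024)
```

Here three of the four predictions fall within the 10% band (the second is off by 15%), so `accuracy` is 0.75. A reported value above 0.9 means fewer than one prediction in ten misses the band.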

[AI-37] Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance

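[Quick Read]: This paper addresses the development of complex model transformation (MT) sequences in model-driven engineering. Such problems usually involve MTs chained over many steps, and developing them manually is error-prone and often infeasible. The key to the solution is to combine Reinforcement Learning (RL) with human guidance: user-defined MTs are mapped onto RL primitives and executed as RL programs to find optimal MT sequences. The approach exploits human advice, even when it is uncertain, to substantially improve RL performance and make the development of complex MTs more efficient.

Link: https://arxiv.org/abs/2506.20883
Authors: Kyanna Dagenais,Istvan David
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for ACM/IEEE MODELS’25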

Abstract:Model-driven engineering problems often require complex model transformations (MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of such problems include model synchronization, automated model repair, and design space exploration. Manually developing complex MTs is an error-prone and often infeasible process. Reinforcement learning (RL) is an apt way to alleviate these issues. In RL, an autonomous agent explores the state space through trial and error to identify beneficial sequences of actions, such as MTs. However, RL methods exhibit performance issues in complex problems. In these situations, human guidance can be of high utility. In this paper, we present an approach and technical framework for developing complex MT sequences through RL, guided by potentially uncertain human advice. Our framework allows user-defined MTs to be mapped onto RL primitives, and executes them as RL programs to find optimal MT sequences. Our evaluation shows that human guidance, even if uncertain, substantially improves RL performance, and results in more efficient development of complex MTs. Through a trade-off between the certainty and timeliness of human advice, our method takes a step towards RL-driven human-in-the-loop engineering methods.

[AI-38] Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation MICRO

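[Quick Read]: This paper addresses the limitations of Large Language Models (LLMs) in factual accuracy and contextual relevance by using Retrieval-Augmented Generation (RAG) systems, which ground LLMs in external knowledge, to improve their reliability in real-world applications. The key to the solution is the construction of five domain-specific RAG applications (governance, cybersecurity, agriculture, industrial research, and medical diagnostics). These systems integrate multilingual optical character recognition (OCR), semantic retrieval via vector embeddings, and domain-adapted LLMs, and are deployed through local servers or cloud APIs to meet distinct user needs.

Link: https://arxiv.org/abs/2506.20869
Authors: Md Toufique Hasan,Muhammad Waseem,Kai-Kristian Kemell,Ayman Asad Khan,Mika Saari,Pekka Abrahamsson
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted as a full paper to the 51st Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2025). 9 pages, 4 figures. This is the preprint version and not the final camera ready version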

Abstract:Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.

[AI-39] Generating Reliable Adverse event Profiles for Health through Automated Integrated Data (GRAPH-AID): A Semi-Automated Ontology Building Approach

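[Quick Read]: This paper addresses the problem of seamlessly integrating Neo4j databases with the Web Ontology Language (OWL), in particular how to generate knowledge graphs efficiently from rapidly growing adverse drug event datasets. The key to the solution is a user-friendly approach that uses Python and its rdflib library to automatically generate the required classes and axioms, reducing reliance on description logics (DL) syntax and making ontology development more accessible and efficient.

Link: https://arxiv.org/abs/2506.20851
Authors: Srikar Reddy Gadusu,Larry Callahan,Samir Lababidi,Arunasri Nishtala,Sophia Healey,Hande McGinty
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: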

Abstract:As data and knowledge expand rapidly, adopting systematic methodologies for ontology generation has become crucial. With the daily increases in data volumes and frequent content changes, the demand for databases to store and retrieve information for the creation of knowledge graphs has become increasingly urgent. The previously established Knowledge Acquisition and Representation Methodology (KNARM) outlines a systematic approach to address these challenges and create knowledge graphs. However, following this methodology highlights the existing challenge of seamlessly integrating Neo4j databases with the Web Ontology Language (OWL). Previous attempts to integrate data from Neo4j into an ontology have been discussed, but these approaches often require an understanding of description logics (DL) syntax, which may not be familiar to many users. Thus, a more accessible method is necessary to bridge this gap. This paper presents a user-friendly approach that utilizes Python and its rdflib library to support ontology development. We showcase our novel approach through a Neo4j database we created by integrating data from the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) database. Using this dataset, we developed a Python script that automatically generates the required classes and their axioms, facilitating a smoother integration process. This approach offers a practical solution to the challenges of ontology generation in the context of rapidly growing adverse drug event datasets, supporting improved drug safety monitoring and public health decision-making.

[AI-40] Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications

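[Quick Read]: This paper aims to address the strong effect that user prompt quality has on model performance in domain-specific AI applications, and in particular the difficulty of crafting high-quality prompts. The key to the solution is a dynamic context-aware prompt recommendation system that combines contextual query analysis, retrieval-augmented knowledge grounding, hierarchical skill organization, and adaptive skill ranking. It uses behavioral telemetry and a two-stage hierarchical reasoning process to dynamically select and rank relevant skills, and synthesizes prompts from predefined and adaptive templates enhanced with few-shot learning, producing relevant and actionable prompt suggestions.

Link: https://arxiv.org/abs/2506.20815
Authors: Xinye Tang,Haijun Zhai,Chaitanya Belwal,Vineeth Thayanithi,Philip Baumann,Yogesh K Roy
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: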

Abstract:LLM-powered applications are highly susceptible to the quality of user prompts, and crafting high-quality prompts can often be challenging especially for domain-specific applications. This paper presents a novel dynamic context-aware prompt recommendation system for domain-specific AI applications. Our solution combines contextual query analysis, retrieval-augmented knowledge grounding, hierarchical skill organization, and adaptive skill ranking to generate relevant and actionable prompt suggestions. The system leverages behavioral telemetry and a two-stage hierarchical reasoning process to dynamically select and rank relevant skills, and synthesizes prompts using both predefined and adaptive templates enhanced with few-shot learning. Experiments on real-world datasets demonstrate that our approach achieves high usefulness and relevance, as validated by both automated and expert evaluations.

[AI-41] FINN-GL: Generalized Mixed-Precision Extensions for FPGA-Accelerated LSTMs

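[Quick Read]: This paper aims to address the high computational complexity and limited real-time capability encountered when deploying Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, in resource-constrained environments. Existing tools mainly target feed-forward networks, and LSTM acceleration typically requires a fully custom implementation that lacks generality. The key to the solution is to use the open-source, extensible FINN framework: the Scan operator from the ONNX specification models the recurrent nature of LSTM computation, which enables mixed quantization and functional verification, and custom transformations map the quantized ONNX computation graph to hardware blocks from the HLS kernel library of the FINN compiler and Vitis HLS, enabling generalized deployment of LSTMs on FPGAs.

Link: https://arxiv.org/abs/2506.20810
Authors: Shashwat Khandelwal,Jakoba Petri-Koenig,Thomas B. Preußer,Michaela Blott,Shreejith Shanker
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
Comments: 9 pages, 6 figures, 5 tables, Accepted for publication in IEEE FPL-2025 ( this https URL )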

Abstract:Recurrent neural networks (RNNs), particularly LSTMs, are effective for time-series tasks like sentiment analysis and short-term stock prediction. However, their computational complexity poses challenges for real-time deployment in resource constrained environments. While FPGAs offer a promising platform for energy-efficient AI acceleration, existing tools mainly target feed-forward networks, and LSTM acceleration typically requires full custom implementation. In this paper, we address this gap by leveraging the open-source and extensible FINN framework to enable the generalized deployment of LSTMs on FPGAs. Specifically, we leverage the Scan operator from the Open Neural Network Exchange (ONNX) specification to model the recurrent nature of LSTM computations, enabling support for mixed quantisation within them and functional verification of LSTM-based models. Furthermore, we introduce custom transformations within the FINN compiler to map the quantised ONNX computation graph to hardware blocks from the HLS kernel library of the FINN compiler and Vitis HLS. We validate the proposed tool-flow by training a quantised ConvLSTM model for a mid-price stock prediction task using the widely used dataset and generating a corresponding hardware IP of the model using our flow, targeting the XCZU7EV device. We show that the generated quantised ConvLSTM accelerator through our flow achieves a balance between performance (latency) and resource consumption, while matching (or bettering) inference accuracy of state-of-the-art models with reduced precision. We believe that the generalisable nature of the proposed flow will pave the way for resource-efficient RNN accelerator designs on FPGAs.

[AI-42] GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization ICML2025

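[Quick Read]: This paper addresses the complexity of optimizing high-performance GPU kernels, especially the shortcomings of traditional development tools and methods when targeting new or poorly documented GPU architectures. The key to the solution is an automated, large language model (LLM)-based "GPU Kernel Scientist" that iteratively refines accelerator kernels through a multi-stage, evolutionary process: selecting starting points from existing code versions, generating hypotheses for optimization experiments, and autonomously carrying out the experiments and adjusting based on observed performance data.

Link: https://arxiv.org/abs/2506.20807
Authors: Martin Andrews,Sam Witteveen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
Comments: 4 page paper plus Appendices. Accepted to the ES-FoMo “Efficient Systems for Foundation Models” workshop at ICML 2025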

Abstract:Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU architectures where traditional development aids are scarce. This paper introduces an LLM-powered “GPU Kernel Scientist,” an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process: (a) strategically selecting promising prior code versions as a basis for new iterations; (b) generating hypotheses for optimization experiments, based on existing code and assimilated knowledge from general GPU literature; and (c) autonomously implementing these experiments through code modification and subsequent submission to an external evaluation system, using only observed timing data as performance feedback. We detail how this approach navigates the challenges of the AMD MI300 target architecture and leverages LLMs to compensate for limited domain-specific human expertise. Since quantitative results from an ongoing performance competition were embargoed on paper submission date, we present the architectural design, operational workflow, and qualitative insights, highlighting the potential of LLM-driven agents to democratise and accelerate GPU kernel optimization, especially in resource-constrained or rapidly evolving hardware environments.

[AI-43] Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

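[Quick Read]: This paper addresses the performance degradation of Graph Neural Networks (GNNs) in Network Intrusion Detection Systems (NIDS) for Internet of Things (IoT) environments, caused by distribution drift and a lack of robustness against realistic adversarial attacks. The key to the solution is to use Large Language Models (LLMs) as simulated cybersecurity expert agents in an agentic pipeline that scrutinizes the graph structures extracted from network flow data before GNN processing, identifying and potentially mitigating suspicious or adversarially perturbed elements, thereby improving the robustness and generalization of the GNN.

Link: https://arxiv.org/abs/2506.20806
Authors: Zhonghao Zhan,Huichi Zhou,Hamed Haddadi
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Poster accepted at the 10th IEEE European Symposium on Security and Privacy (EuroS&P 2025)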

Abstract:Graph Neural Networks (GNNs) show great promise for Network Intrusion Detection Systems (NIDS), particularly in IoT environments, but suffer performance degradation due to distribution drift and lack robustness against realistic adversarial attacks. Current robustness evaluations often rely on unrealistic synthetic perturbations and lack demonstrations on systematic analysis of different kinds of adversarial attack, which encompass both black-box and white-box scenarios. This work proposes a novel approach to enhance GNN robustness and generalization by employing Large Language Models (LLMs) in an agentic pipeline as simulated cybersecurity expert agents. These agents scrutinize graph structures derived from network flow data, identifying and potentially mitigating suspicious or adversarially perturbed elements before GNN processing. Our experiments, using a framework designed for realistic evaluation and testing with a variety of adversarial attacks including a dataset collected from physical testbed experiments, demonstrate that integrating LLM analysis can significantly improve the resilience of GNN-based NIDS against challenges, showcasing the potential of LLM agent as a complementary layer in intrusion detection architectures.

[AI-44] Stochastic Parameter Decomposition

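Quick read: This paper addresses the difficulty of parameter decomposition in reverse engineering neural networks, in particular the high computational cost, hyperparameter sensitivity, and parameter shrinkage of existing decomposition methods. The key to the solution is Stochastic Parameter Decomposition (SPD), a method that is more scalable and more robust to hyperparameters than the existing Attribution-based Parameter Decomposition (APD), and that better identifies ground-truth mechanisms in toy models.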

Link: https://arxiv.org/abs/2506.20790
Authors: Lucius Bushnaq,Dan Braun,Lee Sharkey
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition – a framework that has been proposed to resolve several issues with current decomposition methods – decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce Stochastic Parameter Decomposition (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at this https URL.

[AI-45] Agile Management for Machine Learning: A Systematic Mapping Study

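Quick read: This paper addresses the challenge of applying agile management methods to machine learning (ML)-enabled systems, where the dynamic and experimental nature of ML development makes it hard for traditional project management to keep up with rapidly changing data and iterative requirements. Through a systematic mapping study, it identifies eight frameworks and eight key themes, such as Iteration Flexibility, Innovative ML-specific Artifacts, and the Minimal Viable Model, to guide agile management of ML-enabled systems, and it finds that accurate effort estimation for ML-related tasks is the main challenge reported across studies.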

Link: https://arxiv.org/abs/2506.20759
Authors: Lucas Romao,Hugo Villamizar,Romeu Oliveira,Silvio Alonso,Marcos Kalinowski
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA) 2025

Abstract:[Context] Machine learning (ML)-enabled systems are present in our society, driving significant digital transformations. The dynamic nature of ML development, characterized by experimental cycles and rapid changes in data, poses challenges to traditional project management. Agile methods, with their flexibility and incremental delivery, seem well-suited to address this dynamism. However, it is unclear how to effectively apply these methods in the context of ML-enabled systems, where challenges require tailored approaches. [Goal] Our goal is to outline the state of the art in agile management for ML-enabled systems. [Method] We conducted a systematic mapping study using a hybrid search strategy that combines database searches with backward and forward snowballing iterations. [Results] Our study identified 27 papers published between 2008 and 2024. From these, we identified eight frameworks and categorized recommendations and practices into eight key themes, such as Iteration Flexibility, Innovative ML-specific Artifacts, and the Minimal Viable Model. The main challenge identified across studies was accurate effort estimation for ML-related tasks. [Conclusion] This study contributes by mapping the state of the art and identifying open gaps in the field. While relevant work exists, more robust empirical evaluation is still needed to validate these contributions.

[AI-46] Exploring the Effects of Chatbot Anthropomorphism and Human Empathy on Human Prosocial Behavior Toward Chatbots

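Quick read: This paper asks why humans help chatbots. Although chatbots are increasingly integrated into people's lives, little research has examined the factors that drive such helping behavior. Drawing on the Computers Are Social Actors (CASA) framework, the paper examines how chatbot anthropomorphism (human-like identity, emotional expression, and non-verbal expression) affects prosocial behaviors and intentions by increasing human empathy toward chatbots. An online experiment shows that human identity and emotional expression increase prosocial behavior, and reveals two main motivations for helping: empathy for the chatbot and perceiving the chatbot as human-like.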

Link: https://arxiv.org/abs/2506.20748
Authors: Jingshu Li,Zicheng Zhu,Renwen Zhang,Yi-Chieh Lee
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chatbots are increasingly integrated into people’s lives and are widely used to help people. Recently, there has also been growing interest in the reverse direction, in which humans help chatbots, due to a wide range of benefits including better chatbot performance, human well-being, and collaborative outcomes. However, little research has explored the factors that motivate people to help chatbots. To address this gap, we draw on the Computers Are Social Actors (CASA) framework to examine how chatbot anthropomorphism (including human-like identity, emotional expression, and non-verbal expression) influences human empathy toward chatbots and their subsequent prosocial behaviors and intentions. We also explore people’s own interpretations of their prosocial behaviors toward chatbots. We conducted an online experiment (N = 244) in which chatbots made mistakes in a collaborative image labeling task and explained the reasons to participants. We then measured participants’ prosocial behaviors and intentions toward the chatbots. Our findings revealed that human identity and emotional expression of chatbots increased participants’ prosocial behavior and intention toward chatbots, with empathy mediating these effects. Qualitative analysis further identified two motivations for participants’ prosocial behaviors: empathy for the chatbot and perceiving the chatbot as human-like. We discuss the implications of these results for understanding and promoting human prosocial behaviors toward chatbots.

[AI-47] Test-time Scaling Techniques in Theoretical Physics – A Comparison of Methods on the TPBench Dataset

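Quick read: This paper investigates whether test-time scaling techniques developed on mathematical reasoning benchmarks such as AIME carry over to advanced theoretical physics, testing how well these methods generalize to more complex scientific problems. The key to the solution is a novel symbolic weak-verifier framework that better exploits the structure of physics problems and significantly improves parallel scaling results.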

Link: https://arxiv.org/abs/2506.20729
Authors: Zhiqi Gao,Tianyi Li,Yurii Kvasiuk,Sai Chaitanya Tadepalli,Maja Rudolph,Daniel J.H. Chung,Frederic Sala,Moritz Münchmeyer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
Comments: 23 pages, 6 figures

Abstract:Large language models (LLMs) have shown strong capabilities in complex reasoning, and test-time scaling techniques can enhance their performance with comparably low cost. Many of these methods have been developed and evaluated on mathematical reasoning benchmarks such as AIME. This paper investigates whether the lessons learned from these benchmarks generalize to the domain of advanced theoretical physics. We evaluate a range of common test-time scaling methods on the TPBench physics dataset and compare their effectiveness with results on AIME. To better leverage the structure of physics problems, we develop a novel, symbolic weak-verifier framework to improve parallel scaling results. Our empirical results demonstrate that this method significantly outperforms existing test-time scaling approaches on TPBench. We also evaluate our method on AIME, confirming its effectiveness in solving advanced mathematical problems. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.
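To make "parallel scaling with a weak verifier" concrete, here is a minimal sketch under stated assumptions: several answers are sampled independently, a cheap check discards the ones it can refute, and the survivors are majority-voted. The function name `verified_majority_vote` and the `weak_verifier` predicate are hypothetical; the paper's verifier checks solution steps symbolically, which this generic predicate only stands in for.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def verified_majority_vote(
    candidates: Sequence[Hashable],
    weak_verifier: Callable[[Hashable], bool],
) -> Optional[Hashable]:
    """Majority vote over the sampled answers that the weak verifier accepts.

    If the verifier rejects every candidate, fall back to a plain majority
    vote so that a final answer is still returned.
    """
    passed = [c for c in candidates if weak_verifier(c)]
    pool = passed if passed else list(candidates)
    if not pool:
        return None
    # most_common(1) returns the answer with the highest count.
    return Counter(pool).most_common(1)[0][0]
```

The verifier is "weak" in the sense that it only needs to remove some wrong answers; voting over the filtered pool absorbs the errors it misses.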

[AI-48] On Convolutions, Intrinsic Dimension, and Diffusion Models

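Quick read: This paper addresses the incomplete theoretical foundation of local intrinsic dimension (LID) estimation with diffusion models, specifically that the FLIPD estimator had only been proven correct for affine submanifolds. The key contribution is a formal proof of FLIPD's correctness under realistic assumptions, together with a proof that an analogous result holds when Gaussian convolutions are replaced with uniform ones.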

Link: https://arxiv.org/abs/2506.20705
Authors: Kin Kwan Leung,Rasa Hosseinzadeh,Gabriel Loaiza-Ganem
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) – which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process – have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses, among others they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.
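The link between LID and the rate of change of the log density with respect to the noise level can be checked by hand in the simplest (affine) case. The sketch below assumes data spread uniformly on the segment [-1, 1] x {0} in the plane, so the manifold dimension is d = 1 and the ambient dimension is D = 2. Convolving with Gaussian noise of scale sigma gives a density at the origin that scales like sigma^(d - D), so D plus the slope of the log density against log sigma recovers d. The function names are hypothetical, and this uses a closed-form density rather than a trained diffusion model, so it illustrates the scaling argument only, not the FLIPD estimator itself.

```python
import math
from typing import Callable


def log_density_line_segment(sigma: float) -> float:
    """Log of the Gaussian-smoothed density at the origin for data uniform on
    the segment [-1, 1] x {0} in the plane."""
    # Along the segment: uniform density 1/2 times the Gaussian mass inside [-1, 1].
    along = 0.5 * math.erf(1.0 / (sigma * math.sqrt(2.0)))
    # Across the segment: the 1-D Gaussian density evaluated at distance 0.
    across = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return math.log(along) + math.log(across)


def lid_estimate(
    log_density: Callable[[float], float],
    sigma: float,
    ambient_dim: int,
    ratio: float = 2.0,
) -> float:
    """Ambient dimension plus the finite-difference slope of log density vs. log sigma."""
    slope = (log_density(sigma * ratio) - log_density(sigma)) / math.log(ratio)
    return ambient_dim + slope
```

For small sigma, `lid_estimate(log_density_line_segment, 0.01, ambient_dim=2)` is close to 1, the dimension of the segment. The paper's contribution is proving that this kind of relationship still holds for non-affine submanifolds.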

[AI-49] The Singapore Consensus on Global AI Safety Research Priorities

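Quick read: This paper sets out how to build a trusted AI ecosystem that ensures AI is safe, i.e., trustworthy, reliable, and secure. The key is a defence-in-depth model that organises AI safety research into three areas: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control). This structure is intended to give AI safety research clear priorities and direction.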

Link: https://arxiv.org/abs/2506.20702
Authors: Yoshua Bengio,Tegan Maharaj,Luke Ong,Stuart Russell,Dawn Song,Max Tegmark,Lan Xue,Ya-Qin Zhang,Stephen Casper,Wan Sie Lee,Sören Mindermann,Vanessa Wilfred,Vidhisha Balachandran,Fazl Barez,Michael Belinsky,Imane Bello,Malo Bourgon,Mark Brakel,Siméon Campos,Duncan Cass-Beggs,Jiahao Chen,Rumman Chowdhury,Kuan Chua Seah,Jeff Clune,Juntao Dai,Agnes Delaborde,Nouha Dziri,Francisco Eiras,Joshua Engels,Jinyu Fan,Adam Gleave,Noah Goodman,Fynn Heide,Dan Hendrycks,Cyrus Hodes,Bryan Low Kian Hsiang,Minlie Huang,Sami Jawhar,Wang Jingyu,Adam Tauman Kalai,Meindert Kamphuis,Mohan Kankanhalli,Subhash Kantamneni,Mathias Bonde Kirk,Thomas Kwa,Jeffrey Ladish,Kwok-Yan Lam,Wan Lee Sie,Taewhi Lee,Xiaojian Li,Jiajun Liu,Chaochao Lu,Yifan Mai,Richard Mallah,Julian Michael,Nick Moës,Simon Möller,Kihyuk Nam,Kwan Yee Ng,Mark Nitzberg,Besmira Nushi,Seán O hÉigeartaigh,Alejandro Ortega,Pierre Peigné,James Petrie,Benjamin Prud’Homme,Reihaneh Rabbany,Nayat Sanchez-Pi,Sarah Schwettmann,Buck Shlegeris,Saad Siddiqui,Aradhana Sinha,Martín Soto,Cheston Tan,Dong Ting,Robert Trager,Brian Tse,Anthony Tung K. H.,Vanessa Wilfred,John Willes,Denise Wong,Wei Xu,Rongwu Xu,Yi Zeng,HongJiang Zhang,Djordje Žikelić
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Final report from the “2025 Singapore Conference on AI (SCAI)” held April 26: this https URL

Abstract:Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential – it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The “2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety” aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).

[AI-50] Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models

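Quick read: This paper addresses adapting a pretrained diffusion model to new objectives at inference time. Existing steering methods estimate values inaccurately at high noise levels, which biases guidance, and they do not reuse information from past runs to improve sample quality, which wastes compute. Inspired by the success of Monte Carlo Tree Search, the solution casts inference-time alignment as a search problem that reuses past computation: a tree-based sampling method propagates terminal rewards back through the diffusion chain and refines value estimates with each additional generation, yielding asymptotically exact samples from the target distribution.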

Link: https://arxiv.org/abs/2506.20701
Authors: Vineet Jain,Kusha Sareen,Mohammad Pedramfar,Siamak Ravanbakhsh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, resulting in inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS*), performs a global search for high reward samples. On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to 10x less compute. In text-to-image generation and language completion tasks, DTS* effectively searches for high reward samples that match best-of-N with up to 5x less compute. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples, providing a scalable approach for inference-time alignment of diffusion models.
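The idea of propagating terminal rewards back through a tree and then sampling by value can be illustrated on a tiny fixed tree. The sketch below is a toy, not the DTS algorithm: it assumes a uniform base distribution over children, a target proportional to exp(reward), and a "soft" backup in which an inner node's value is the log of the mean of exp(child value). The names `soft_value` and `leaf_probabilities` are hypothetical. Under these assumptions, choosing children with probability proportional to exp(value) reaches each leaf with probability proportional to its base probability times exp(reward).

```python
import math


def soft_value(node):
    """Soft backup: a leaf's value is its terminal reward; an inner node's value
    is log-mean-exp of its children's values (uniform base over children)."""
    if isinstance(node, (int, float)):
        return float(node)
    values = [soft_value(child) for child in node]
    top = max(values)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def leaf_probabilities(node, prob=1.0, out=None):
    """Probability of reaching each leaf when every choice picks a child with
    probability proportional to exp(soft value). Returns (reward, probability) pairs."""
    if out is None:
        out = []
    if isinstance(node, (int, float)):
        out.append((float(node), prob))
        return out
    values = [soft_value(child) for child in node]
    top = max(values)
    weights = [math.exp(v - top) for v in values]
    total = sum(weights)
    for child, weight in zip(node, weights):
        leaf_probabilities(child, prob * weight / total, out)
    return out
```

For the balanced tree `[[0.0, 1.0], [2.0, 3.0]]`, the resulting leaf probabilities equal the softmax of the four rewards. In a real sampler the values are not known in advance and must be estimated from rollouts, which is where the iterative refinement described in the abstract comes in.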

[AI-51] Progressive Size-Adaptive Federated Learning: A Comprehensive Framework for Heterogeneous Multi-Modal Data Systems

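Quick read: This paper addresses the overlooked impact of dataset size characteristics on training dynamics in Federated Learning (FL), particularly for heterogeneous multi-modal data, where existing methods focus mainly on model heterogeneity and aggregation techniques. The key to the solution is Size-Based Adaptive Federated Learning (SAFL), a progressive training framework that systematically organizes federated learning according to dataset size to improve training efficiency and performance.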

Link: https://arxiv.org/abs/2506.20685
Authors: Sajid Hussain,Muhammad Sohail,Nauman Ali Khan,Naima Iltaf,Ihtesham ul Islam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated Learning (FL) has emerged as a transformative paradigm for distributed machine learning while preserving data privacy. However, existing approaches predominantly focus on model heterogeneity and aggregation techniques, largely overlooking the fundamental impact of dataset size characteristics on federated training dynamics. This paper introduces Size-Based Adaptive Federated Learning (SAFL), a novel progressive training framework that systematically organizes federated learning based on dataset size characteristics across heterogeneous multi-modal data. Our comprehensive experimental evaluation across 13 diverse datasets spanning 7 modalities (vision, text, time series, audio, sensor, medical vision, and multimodal) reveals critical insights: 1) an optimal dataset size range of 1000-1500 samples for federated learning effectiveness; 2) a clear modality performance hierarchy with structured data (time series, sensor) significantly outperforming unstructured data (text, multimodal); and 3) systematic performance degradation for large datasets exceeding 2000 samples. SAFL achieves an average accuracy of 87.68% across all datasets, with structured data modalities reaching 99%+ accuracy. The framework demonstrates superior communication efficiency, reducing total data transfer to 7.38 GB across 558 communications while maintaining high performance. Our real-time monitoring framework provides unprecedented insights into system resource utilization, network efficiency, and training dynamics. This work fills critical gaps in understanding how data characteristics should drive federated learning strategies, providing both theoretical insights and practical guidance for real-world FL deployments in neural network and learning systems.

[AI-52] Utility-Driven Speculative Decoding for Mixture-of-Experts

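Quick read: This paper addresses low-latency Large Language Model (LLM) inference, which is bottlenecked by GPU memory bandwidth, and in particular the finding that speculative decoding is ineffective for Mixture of Experts (MoE) models and can even slow them down. The key to the solution is Cascade, a utility-driven framework that uses a speculation utility metric to decide whether to enable speculative decoding and dynamically tunes the number of speculated tokens K to maximize utility, avoiding slowdowns and improving MoE throughput.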

Link: https://arxiv.org/abs/2506.20675
Authors: Anish Saxena,Po-An Tsai,Hritvik Taneja,Aamer Jaleel,Moinuddin Qureshi
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:GPU memory bandwidth is the main bottleneck for low-latency Large Language Model (LLM) inference. Speculative decoding leverages idle GPU compute by using a lightweight drafter to propose K tokens, which the LLM verifies in parallel, boosting token throughput. In conventional dense LLMs, all model weights are fetched each iteration, so speculation adds no latency overhead. Emerging Mixture of Experts (MoE) models activate only a subset of weights per token, greatly reducing data movement. However, we show that speculation is ineffective for MoEs: draft tokens collectively activate more weights, increasing data movement and verification time by 2-3x. When token throughput gains fail to offset this overhead, speculation causes slowdowns up to 1.5x, making it infeasible. Even when useful, the optimal K varies by task, model, and even between requests and iterations. Thus, despite widespread use in dense LLMs, speculation remains impractical in leading MoEs. We present Cascade, a utility-driven framework that selectively enables speculation to avoid slowdowns and dynamically tunes K to accelerate MoE serving. Cascade uses a lightweight metric, speculation utility, the ratio of token gains to verification cost, which shows iteration-level locality, enabling periodic decisions via short test and longer set phases. For each request, Cascade disables speculation if utility drops below one during testing, and when utility exceeds one, tests multiple K-values to choose the utility-maximizing K for the set phase. We implement Cascade in vLLM and evaluate it on five popular MoEs with workloads spanning code, math, extraction, and mixed tasks. Cascade limits slowdown to 5% (vs. 1.5x) and improves throughput by 7-14% over static K, making speculative decoding practical for MoEs. 
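The decision rule in this abstract (disable speculation when utility falls below one, otherwise pick the K with the highest utility) can be sketched as follows. The abstract defines speculation utility only as "the ratio of token gains to verification cost"; the normalisation below is an assumption: token gain is measured relative to the one token per iteration of non-speculative decoding, and verification cost is the iteration time relative to the non-speculative iteration time. The function names and the measurement format are hypothetical, and this is not the vLLM implementation.

```python
from typing import Dict, Tuple


def speculation_utility(
    tokens_per_iter: float,  # mean tokens emitted per iteration with speculation
    iter_time: float,        # mean iteration time with speculation
    base_iter_time: float,   # mean iteration time without speculation
) -> float:
    token_gain = tokens_per_iter / 1.0            # baseline emits one token per iteration
    verification_cost = iter_time / base_iter_time
    return token_gain / verification_cost


def choose_k(
    measurements: Dict[int, Tuple[float, float]],  # K -> (tokens per iteration, iteration time)
    base_iter_time: float,
) -> int:
    """Return the utility-maximising K from the test phase, or 0 to disable
    speculation when no K reaches a utility above one."""
    best_k, best_utility = 0, 1.0
    for k, (tokens_per_iter, iter_time) in measurements.items():
        utility = speculation_utility(tokens_per_iter, iter_time, base_iter_time)
        if utility > best_utility:
            best_k, best_utility = k, utility
    return best_k
```

With a 10 ms baseline iteration and test-phase measurements `{1: (1.5, 14.0), 2: (2.2, 18.0), 4: (2.6, 30.0)}`, K = 2 wins: K = 4 accepts more tokens per iteration, but its verification time grows faster than its token gain.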

[AI-53] ClusterRCA: Network Failure Diagnosis in HPC Systems Using Multimodal Data

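Quick read: This paper addresses network failure diagnosis in high-performance computing (HPC) systems, where existing methods cannot be applied directly because of data heterogeneity and insufficient accuracy. The key to the solution is ClusterRCA, a framework that leverages multimodal data and combines classifier-based and graph-based approaches to localize culprit nodes and determine failure types. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs, builds a failure graph, and performs a customized random walk on that graph to localize the root cause, achieving high diagnostic accuracy.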

Link: https://arxiv.org/abs/2506.20673
Authors: Yongqian Sun,Xijie Pan,Xiao Xiong,Lei Tao,Jiaju Wang,Shenglin Zhang,Yuan Yuan,Yuqi Li,Kunlin Jian
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. A failure graph is constructed based on the output of the state classifier, and then it performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
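As a rough picture of root-cause localization by random walk, the sketch below ranks the nodes of a small failure graph with a generic random walk with uniform restart, computed by power iteration. This is a stand-in, not ClusterRCA's customized walk: the abstract does not give the transition rule, so the edge semantics (an edge points from a symptomatic node toward a suspected cause), the restart probability, and the function name are all assumptions.

```python
from typing import Dict, List, Sequence


def random_walk_scores(
    edges: Dict[str, List[str]],  # node -> nodes it points to (suspected causes)
    nodes: Sequence[str],
    restart: float = 0.15,
    iters: int = 100,
) -> Dict[str, float]:
    """Approximate stationary visit probabilities of a random walk with uniform restart."""
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: restart / n for v in nodes}
        for v in nodes:
            successors = edges.get(v, [])
            if successors:
                share = (1.0 - restart) * score[v] / len(successors)
                for w in successors:
                    nxt[w] += share
            else:
                # A node without outgoing edges spreads its mass uniformly.
                share = (1.0 - restart) * score[v] / n
                for w in nodes:
                    nxt[w] += share
        score = nxt
    return score
```

If three NICs all point to the same switch, the switch collects most of the walk's mass and is ranked first as the likely culprit.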

[AI-54] DRAGON: Distributional Rewards Optimize Diffusion Generative Models

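Quick read: This paper addresses how to fine-tune media generation models toward a desired outcome, where traditional reinforcement learning with human feedback (RLHF) and pairwise preference methods such as direct preference optimization (DPO) lack flexibility. The key to the solution is Distributional RewArds for Generative OptimizatioN (DRAGON), a framework that can optimize reward functions evaluating either individual examples or distributions of them, making it compatible with instance-wise, instance-to-distribution, and distribution-to-distribution rewards. DRAGON builds exemplar distributions from an encoder and a set of reference examples, and maximizes the reward by contrasting a positive set of generations with a negative set.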

Link: https://arxiv.org/abs/2504.15217
Authors: Yatong Bai,Jonah Casebeer,Somayeh Sojoudi,Nicholas J. Bryan
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at this https URL.

[AI-55] Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution

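Quick read: This paper addresses limitations of the latent distributions used in conventional variational autoencoders (VAEs): latent representations that are not natural for directional data, limited flexibility, and numerical instability. The key to the solution is a spherical Cauchy (spCauchy) latent distribution, which represents high-dimensional directional data more naturally, has heavy tails that prevent over-regularization, and admits a differentiable and efficient reparameterization trick via Möbius transformations, improving training stability and computational efficiency. In addition, the KL divergence can be computed through a rapidly converging power series, avoiding the underflow or overflow that arises when evaluating ratios of hypergeometric functions.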

Link: https://arxiv.org/abs/2506.21278
Authors: Lukas Sablica,Kurt Hornik
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:We propose a novel variational autoencoder (VAE) architecture that employs a spherical Cauchy (spCauchy) latent distribution. Unlike traditional Gaussian latent spaces or the widely used von Mises-Fisher (vMF) distribution, spCauchy provides a more natural hyperspherical representation of latent variables, better capturing directional data while maintaining flexibility. Its heavy-tailed nature prevents over-regularization, ensuring efficient latent space utilization while offering a more expressive representation. Additionally, spCauchy circumvents the numerical instabilities inherent to vMF, which arise from computing normalization constants involving Bessel functions. Instead, it enables a fully differentiable and efficient reparameterization trick via Möbius transformations, allowing for stable and scalable training. The KL divergence can be computed through a rapidly converging power series, eliminating concerns of underflow or overflow associated with evaluation of ratios of hypergeometric functions. These properties make spCauchy a compelling alternative for VAEs, offering both theoretical advantages and practical efficiency in high-dimensional generative modeling.
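A reparameterization via a Möbius transformation can be sketched like this: draw parameter-free noise u uniformly on the unit sphere, then push it through a smooth map that depends on a location mu (a unit vector) and a concentration rho in [0, 1). The specific map used below, M(u) = (1 - rho^2) / |u + rho*mu|^2 * (u + rho*mu) + rho*mu, is an assumption here (it is a standard Möbius map of the sphere onto itself); the abstract does not spell out the paper's exact formula, and the function names are hypothetical.

```python
import math
import random
from typing import List


def sample_uniform_sphere(dim: int, rng: random.Random) -> List[float]:
    """Uniform point on the unit sphere in R^dim, via a normalised Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mobius_to_sphere(u: List[float], mu: List[float], rho: float) -> List[float]:
    """Map a unit vector u to another unit vector, pulled toward mu as rho -> 1.

    Requires |u| = |mu| = 1 and 0 <= rho < 1, so the denominator is never zero.
    """
    shifted = [ui + rho * mi for ui, mi in zip(u, mu)]
    squared_norm = sum(s * s for s in shifted)
    scale = (1.0 - rho * rho) / squared_norm
    return [scale * s + rho * m for s, m in zip(shifted, mu)]
```

Because the randomness lives entirely in u, the output is a differentiable function of mu and rho, which is what a reparameterization trick needs. With rho = 0 the map is the identity (the uniform distribution), and as rho approaches 1 the samples concentrate around mu.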

[AI-56] From On-chain to Macro: Assessing the Importance of Data Source Diversity in Cryptocurrency Market Forecasting

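Quick read: This paper examines how data source diversity affects the performance of cryptocurrency forecasting models, integrating multiple data categories (technical indicators, on-chain metrics, sentiment and interest metrics, traditional market indices, and macroeconomic indicators) to improve predictive accuracy. The key to the solution is the Crypto100 index together with a novel feature reduction algorithm that identifies the most impactful and resilient features across the diverse data sources.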

Link: https://arxiv.org/abs/2506.21246
Authors: Giorgos Demosthenous,Chryssis Georgiou,Eliada Polydorou
Affiliation: Unknown
Categories: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments:

Abstract:This study investigates the impact of data source diversity on the performance of cryptocurrency forecasting models by integrating various data categories, including technical indicators, on-chain metrics, sentiment and interest metrics, traditional market indices, and macroeconomic indicators. We introduce the Crypto100 index, representing the top 100 cryptocurrencies by market capitalization, and propose a novel feature reduction algorithm to identify the most impactful and resilient features from diverse data sources. Our comprehensive experiments demonstrate that data source diversity significantly enhances the predictive performance of forecasting models across different time horizons. Key findings include the paramount importance of on-chain metrics for both short-term and long-term predictions, the growing relevance of traditional market indices and macroeconomic indicators for longer-term forecasts, and substantial improvements in model accuracy when diverse data sources are utilized. These insights help demystify the short-term and long-term driving factors of the cryptocurrency market and lay the groundwork for developing more accurate and resilient forecasting models.

[AI-57] A Novel Framework for Integrating 3D Ultrasound into Percutaneous Liver Tumour Ablation

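Quick read: This paper addresses the difficulty of identifying tumours in ultrasound (US) images during percutaneous liver tumour ablation, with the aim of moving 3D ultrasound (3D US) into clinical therapeutic use. The key to the solution is a clinically viable registration approach between 2D US and computed tomography (CT) or magnetic resonance imaging (MRI) that uses 3D US as an intermediary to reduce registration complexity, together with an intuitive multimodal image visualization technique that makes verification of the registration workflow more efficient.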

Link: https://arxiv.org/abs/2506.21162
Authors: Shuwei Xing,Derek W. Cool,David Tessier,Elvis C.S. Chen,Terry M. Peters,Aaron Fenster
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:3D ultrasound (US) imaging has shown significant benefits in enhancing the outcomes of percutaneous liver tumour ablation. Its clinical integration is crucial for transitioning 3D US into the therapeutic domain. However, challenges of tumour identification in US images continue to hinder its broader adoption. In this work, we propose a novel framework for integrating 3D US into the standard ablation workflow. We present a key component, a clinically viable 2D US-CT/MRI registration approach, leveraging 3D US as an intermediary to reduce registration complexity. To facilitate efficient verification of the registration workflow, we also propose an intuitive multimodal image visualization technique. In our study, 2D US-CT/MRI registration achieved a landmark distance error of approximately 2-4 mm with a runtime of 0.22s per image pair. Additionally, non-rigid registration reduced the mean alignment error by approximately 40% compared to rigid registration. Results demonstrated the efficacy of the proposed 2D US-CT/MRI registration workflow. Our integration framework advanced the capabilities of 3D US imaging in improving percutaneous tumour ablation, demonstrating the potential to expand the therapeutic role of 3D US in clinical interventions.
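One way to read "leveraging 3D US as an intermediary" is that the hard 2D US to CT/MRI registration is split into two easier ones whose transforms are then chained. The sketch below shows that chaining with 4x4 homogeneous matrices and computes a mean landmark distance error of the kind reported in the abstract. It assumes rigid transforms only (translations in the example), and every function name is hypothetical; the abstract does not describe the paper's registration algorithm, and its non-rigid step is not modelled here.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float, float]


def translation(dx: float, dy: float, dz: float) -> Matrix:
    """4x4 homogeneous matrix for a pure translation."""
    return [[1.0, 0.0, 0.0, dx], [0.0, 1.0, 0.0, dy], [0.0, 0.0, 1.0, dz], [0.0, 0.0, 0.0, 1.0]]


def compose(outer: Matrix, inner: Matrix) -> Matrix:
    """Matrix product outer @ inner: apply `inner` first, then `outer`."""
    return [[sum(outer[i][k] * inner[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def apply_transform(t: Matrix, p: Point) -> Point:
    v = (p[0], p[1], p[2], 1.0)
    x, y, z = (sum(t[i][k] * v[k] for k in range(4)) for i in range(3))
    return (x, y, z)


def mean_landmark_error(t: Matrix, source: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between transformed source landmarks and their targets."""
    distances = [math.dist(apply_transform(t, s), d) for s, d in zip(source, target)]
    return sum(distances) / len(distances)
```

Usage: with `t_2d_to_3d` from the 2D US to 3D US step and `t_3d_to_ct` from the 3D US to CT/MRI step, `compose(t_3d_to_ct, t_2d_to_3d)` maps 2D US coordinates directly into CT/MRI space, and `mean_landmark_error` on paired landmarks gives the error in the same units as the coordinates (mm in the abstract).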

[AI-58] Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation ICML2025

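Quick read: This paper addresses the estimation of counterfactual outcomes with spatial-temporal attributes, a crucial problem in real-world settings. Previous methods rely on classical statistical models, which are limited in performance and generalization. The paper proposes a Transformer-based framework whose stronger modeling capacity improves estimation accuracy; under mild assumptions, the proposed estimator is consistent and asymptotically normal. Simulation and real-data experiments show that it outperforms baseline methods and yield a valuable conclusion about the causal effect of conflicts on forest loss in Colombia.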

Link: https://arxiv.org/abs/2506.21154
Authors: He Li,Haoang Chi,Mingyu Liu,Wanrong Huang,Liyang Xu,Wenjing Yang
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, accepted at ICML 2025

Abstract:The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia. The source code is available at this https URL.

[AI-59] CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions KDD2025

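Quick read: This paper addresses the shortcomings of existing molecular docking and deep learning methods in predicting covalent interactions between ligands and target proteins, in particular their limited treatment of covalent bond formation and the associated structural changes. The key to the solution is CovDocker, a comprehensive benchmark for covalent docking that decomposes the process into three main tasks: reactive location prediction, covalent reaction prediction, and covalent docking. By adapting state-of-the-art models such as Uni-Mol and Chemformer, the authors establish baseline performance and show that the benchmark is effective for predicting interaction sites and modeling the molecular transformations involved in covalent binding.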

Link: https://arxiv.org/abs/2506.21085
Authors: Yangzhe Peng,Kaiyuan Gao,Liang He,Yuheng Cong,Haiguang Liu,Kun He,Lijun Wu
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to KDD 2025 Research Track

Abstract:Molecular docking plays a crucial role in predicting the binding mode of ligands to target proteins, and covalent interactions, which involve the formation of a covalent bond between the ligand and the target, are particularly valuable due to their strong, enduring binding nature. However, most existing docking methods and deep learning approaches hardly account for the formation of covalent bonds and the associated structural changes. To address this gap, we introduce a comprehensive benchmark for covalent docking, CovDocker, which is designed to better capture the complexities of covalent binding. We decompose the covalent docking process into three main tasks: reactive location prediction, covalent reaction prediction, and covalent docking. By adapting state-of-the-art models, such as Uni-Mol and Chemformer, we establish baseline performances and demonstrate the effectiveness of the benchmark in accurately predicting interaction sites and modeling the molecular transformations involved in covalent binding. These results confirm the role of the benchmark as a rigorous framework for advancing research in covalent drug design. It underscores the potential of data-driven approaches to accelerate the discovery of selective covalent inhibitors and addresses critical challenges in therapeutic development.

[AI-60] IMC-PINN-FE: A Physics-Informed Neural Network for Patient-Specific Left Ventricular Finite Element Modeling with Image Motion Consistency and Biomechanical Parameter Estimation

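[Quick read]: This paper targets the high computational cost of modeling myocardial biomechanics and the difficulty of accurately reproducing cardiac motion. Conventional finite element (FE) methods can simulate cardiac mechanics, but they are slow and hard to reconcile with clinical imaging data. The key to the solution is the IMC-PINN-FE framework, which combines imaged motion consistency (IMC) with FE modeling, uses a physics-informed neural network (PINN) to rapidly estimate myocardial stiffness and active tension, and applies motion constraints to match imaged displacements more accurately, yielding efficient, personalized, and image-consistent cardiac biomechanical modeling.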

Link: https://arxiv.org/abs/2506.20696
Authors: Siyu Mu, Wei Xuan Chan, Choon Hwai Yap
Affiliation: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Elucidating the biomechanical behavior of the myocardium is crucial for understanding cardiac physiology, but cannot be directly inferred from clinical imaging and typically requires finite element (FE) simulations. However, conventional FE methods are computationally expensive and often fail to reproduce observed cardiac motions. We propose IMC-PINN-FE, a physics-informed neural network (PINN) framework that integrates imaged motion consistency (IMC) with FE modeling for patient-specific left ventricular (LV) biomechanics. Cardiac motion is first estimated from MRI or echocardiography using either a pre-trained attention-based network or an unsupervised cyclic-regularized network, followed by extraction of motion modes. IMC-PINN-FE then rapidly estimates myocardial stiffness and active tension by fitting clinical pressure measurements, accelerating computation from hours to seconds compared to traditional inverse FE. Based on these parameters, it performs FE modeling across the cardiac cycle at 75x speedup. Through motion constraints, it matches imaged displacements more accurately, improving average Dice from 0.849 to 0.927, while preserving realistic pressure-volume behavior. IMC-PINN-FE advances previous PINN-FE models by introducing back-computation of material properties and better motion fidelity. Using motion from a single subject to reconstruct shape modes also avoids the need for large datasets and improves patient specificity. IMC-PINN-FE offers a robust and efficient approach for rapid, personalized, and image-consistent cardiac biomechanical modeling.

[AI-61] Evaluating PDE discovery methods for multiscale modeling of biological signals

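[Quick read]: This paper addresses the challenge of characterizing multiscale behavior in biological systems, in particular how to capture macroscopic dynamics from microscopic data. The key to the solution is partial differential equation (PDE) discovery: a framework combining particle-based simulations with PDE discovery derives meso-scale dynamical characteristics from micro-scale data.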

Link: https://arxiv.org/abs/2506.20694
Authors: Andréa Ducos (AISTROSIGHT), Audrey Denizot (AISTROSIGHT), Thomas Guyet (AISTROSIGHT), Hugues Berry (AISTROSIGHT)
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)

Abstract:Biological systems are non-linear, include unobserved variables and the physical principles that govern their dynamics are partly unknown. This makes the characterization of their behavior very challenging. Notably, their activity occurs on multiple interdependent spatial and temporal scales that require linking mechanisms across scales. To address the challenge of bridging gaps between scales, we leverage partial differential equations (PDE) discovery. PDE discovery suggests meso-scale dynamics characteristics from micro-scale data. In this article, we present our framework combining particle-based simulations and PDE discovery and conduct preliminary experiments to assess equation discovery in controlled settings. We evaluate five state-of-the-art PDE discovery methods on particle-based simulations of calcium diffusion in astrocytes. The performances of the methods are evaluated on both the form of the discovered equation and the forecasted temporal variations of calcium concentration. Our results show that several methods accurately recover the diffusion term, highlighting the potential of PDE discovery for capturing macroscopic dynamics in biological systems from microscopic data.

Machine Learning

[LG-0] Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Link: https://arxiv.org/abs/2506.21551
Authors: Ziyue Li, Chenrui Fan, Tianyi Zhou
Categories: Machine Learning (cs.LG)

Abstract:Grokking, i.e., test performance keeps improving long after training loss converged, has been recently witnessed in neural network training, making the mechanism of generalization and other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly-specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval tasks. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking’s “emergence of generalization” by investigating LLM internal dynamics. Specifically, we find that training samples’ pathways (i.e., expert choices across layers) evolve from random, instance-specific to more structured and shareable between samples during grokking. Also, the complexity of a sample’s pathway reduces despite the converged loss. These indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. In the study, we develop two novel metrics to quantify pathway distance and the complexity of a single pathway. We show their ability to predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute and solely dependent on training data. Hence, they have practical value for pretraining, enabling us to monitor the generalization performance without finetuning and test. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound. 

[LG-1] Devising a solution to the problems of Cancer awareness in Telangana

Link: https://arxiv.org/abs/2506.21500
Authors: Priyanka Avhad, Vedanti Kshirsagar, Urvi Ranjan, Mahek Nakhua
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Quantitative Methods (q-bio.QM)

Abstract:According to the data, the percentages of women in Telangana who underwent screening for cervical, breast, and oral cancer in 2020 were 3.3 percent, 0.3 percent, and 2.3 percent, respectively. Although early detection is the only way to reduce morbidity and mortality, people have very low awareness of the signs, symptoms, and screening practices for cervical and breast cancer. We developed an ML classification model to predict whether a person is susceptible to breast or cervical cancer based on demographic factors. We devised a system that suggests the nearest hospitals or cancer treatment centres based on the user's location or address. In addition, we can integrate the health card to maintain medical records of all individuals and conduct awareness drives and campaigns. For the ML classification models, we used decision tree classification and support vector classification algorithms for cervical cancer susceptibility and breast cancer susceptibility, respectively. Thus, by devising this solution we come one step closer to our goal of spreading cancer awareness, thereby decreasing cancer mortality and increasing cancer literacy among the people of Telangana.

[LG-2] A Keyword-Based Technique to Evaluate Broad Question Answer Script

Link: https://arxiv.org/abs/2506.21461
Authors: Tamim Al Mahmud, Md Gulzar Hussain, Sumaiya Kabir, Hasnain Ahmad, Mahmudus Sobhan
Categories: Machine Learning (cs.LG)
Comments: ACM Conference Proceedings (9 Pages)

Abstract:Evaluation is the method of assessing the educational system through various techniques, such as verbal (viva voce) tests and subjective or objective written tests. This paper presents an efficient solution for evaluating subjective answer scripts electronically. We propose and implement an integrated system that examines and evaluates written answer scripts. The system extracts keywords from the answer script and compares them with keywords parsed from both open and closed domains. It also checks the answer script for grammatical and spelling errors. The proposed system was tested on the answer scripts of 100 students and achieved a precision score of 0.91.
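
The keyword-comparison step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the stopword list, the sample texts, and the scoring rule (the fraction of reference keywords found in the answer) do not come from the paper, and the grammar and spelling checks it mentions are not covered.

```python
import re

def extract_keywords(text, stopwords):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in stopwords}

def keyword_score(answer, reference_keywords, stopwords=frozenset()):
    """Fraction of reference keywords that appear in the answer (hypothetical rule)."""
    if not reference_keywords:
        return 0.0
    found = extract_keywords(answer, stopwords) & set(reference_keywords)
    return len(found) / len(reference_keywords)

# Hypothetical data: a tiny stopword list, a reference keyword set, one answer.
STOPWORDS = frozenset({"the", "a", "is", "of", "and", "in", "to"})
reference = {"photosynthesis", "chlorophyll", "sunlight", "glucose"}
answer = "Photosynthesis uses sunlight and chlorophyll to make sugar."

score = keyword_score(answer, reference, STOPWORDS)
print(score)  # 3 of the 4 reference keywords are present
```

A real system would likely also need stemming or synonym handling, since here "sugar" does not match "glucose".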

[LG-3] Towards an Optimal Control Perspective of ResNet Training ICML2025

Link: https://arxiv.org/abs/2506.21453
Authors: Jens Püttschneider, Simon Heilig, Asja Fischer, Timm Faulwasser
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Accepted for presentation at the High-dimensional Learning Dynamics (HiLD) workshop at ICML 2025

Abstract:We propose a training formulation for ResNets reflecting an optimal control problem that is applicable for standard architectures and general loss functions. We suggest bridging both worlds via penalizing intermediate outputs of hidden states corresponding to stage cost terms in optimal control. For standard ResNets, we obtain intermediate outputs by propagating the state through the subsequent skip connections and the output layer. We demonstrate that our training dynamic biases the weights of the unnecessary deeper residual layers to vanish. This indicates the potential for a theory-grounded layer pruning strategy.

[LG-4] Learnable Adaptive Time-Frequency Representation via Differentiable Short-Time Fourier Transform

Link: https://arxiv.org/abs/2506.21440
Authors: Maxime Leiber, Yosra Marnissi, Axel Barrau, Sylvain Meignen, Laurent Massoulié
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: DSTFT, STFT, spectrogram, time-frequency, IEEE Transactions on Signal Processing, 10 pages

Abstract:The short-time Fourier transform (STFT) is widely used for analyzing non-stationary signals. However, its performance is highly sensitive to its parameters, and manual or heuristic tuning often yields suboptimal results. To overcome this limitation, we propose a unified differentiable formulation of the STFT that enables gradient-based optimization of its parameters. This approach addresses the limitations of traditional STFT parameter tuning methods, which often rely on computationally intensive discrete searches. It enables fine-tuning of the time-frequency representation (TFR) based on any desired criterion. Moreover, our approach integrates seamlessly with neural networks, allowing joint optimization of the STFT parameters and network weights. The efficacy of the proposed differentiable STFT in enhancing TFRs and improving performance in downstream tasks is demonstrated through experiments on both simulated and real-world data.

[LG-5] Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort

Link: https://arxiv.org/abs/2506.21429
Authors: Franco Rugolon, Thomas Jack Samuels, Stephan Hau, Lennart Högman
Categories: Machine Learning (cs.LG)
Comments: 40 pages, 2 figures, 2 tables. To be submitted in Behavior Research Methods

Abstract:This study investigates the efficacy of using multimodal machine learning techniques to detect deception in dyadic interactions, focusing on the integration of data from both the deceiver and the deceived. We compare early and late fusion approaches, utilizing audio and video data - specifically, Action Units and gaze information - across all possible combinations of modalities and participants. Our dataset, newly collected from Swedish native speakers engaged in truth or lie scenarios on emotionally relevant topics, serves as the basis for our analysis. The results demonstrate that incorporating both speech and facial information yields superior performance compared to single-modality approaches. Moreover, including data from both participants significantly enhances deception detection accuracy, with the best performance (71%) achieved using a late fusion strategy applied to both modalities and participants. These findings align with psychological theories suggesting differential control of facial and vocal expressions during initial interactions. As the first study of its kind on a Scandinavian cohort, this research lays the groundwork for future investigations into dyadic interactions, particularly within psychotherapy settings.

[LG-6] Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Link: https://arxiv.org/abs/2506.21427
Authors: Prajwal Koirala, Cody Fleming
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

[LG-7] Distributed Cross-Channel Hierarchical Aggregation for Foundation Models

Link: https://arxiv.org/abs/2506.21411
Authors: Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang
Categories: Machine Learning (cs.LG)

Abstract:Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources such as varying physical groundings or data acquisition systems and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-intensive, a challenge not fully addressed by current distributed methods. In this work, we introduce the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets with a large number of channels across image modalities. Our method is compatible with any model-parallel strategy and any type of vision transformer architecture, significantly improving computational efficiency. We evaluated D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated with tensor parallelism and model sharding, our approach achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 AMD GPUs on the Frontier Supercomputer.

[LG-8] Early Stopping Tabular In-Context Learning ICML

Link: https://arxiv.org/abs/2506.21387
Authors: Jaris Küken, Lennart Purucker, Frank Hutter
Categories: Machine Learning (cs.LG)
Comments: ICML Workshop Paper

Abstract:Tabular foundation models have shown strong performance across various tabular learning tasks via in-context learning, offering robust generalization without any downstream finetuning. However, their inference-time costs remain high, particularly for larger datasets. To address this, we propose early-stopping the in-context learning process. We achieve this by dynamically evaluating whether to stop in-context learning after each Transformer encoder layer. Once stopped, we decode the embedding using a pre-trained layer-wise decoder. Experiments across 34 small classification tasks show that early stopping in-context learning accelerates inference by up to 1.3x with negligible degradation in predictive performance. To assess scalability, we further evaluate our method on five larger classification tasks, achieving speedups of up to 2.2x. Our results demonstrate the potential of early exiting as an effective and practical strategy for improving the efficiency of tabular in-context learning.
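
The control flow of per-layer early exiting can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the stopping decision is made, so the confidence threshold on the softmax output is a stand-in, and the toy "layers" and "decoders" are hypothetical placeholders for Transformer encoder layers and layer-wise decoders.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_early_exit(x, layers, decoders, threshold=0.9):
    """Run layers in order; after each one, decode and stop once confident.

    layers:   functions embedding -> embedding (stand-ins for encoder layers)
    decoders: functions embedding -> logits, one per layer
    Returns (class probabilities, number of layers actually executed).
    The max-probability threshold is an assumed stopping rule.
    """
    h, probs, depth = x, None, 0
    for depth, (layer, decoder) in enumerate(zip(layers, decoders), start=1):
        h = layer(h)
        probs = softmax(decoder(h))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs, depth

# Toy stand-ins: each "layer" doubles the embedding, each "decoder" is the identity.
double = lambda h: [2 * v for v in h]
identity = lambda h: h
probs, used = predict_with_early_exit([0.5, -0.5], [double] * 4, [identity] * 4)
print(used, probs)
```

In this toy run the threshold is reached after the second of four layers, so the last two are never evaluated, which is where the inference savings come from.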

[LG-9] MAx-DNN: Multi-Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators

Link: https://arxiv.org/abs/2506.21371
Authors: Vasileios Leon, Georgios Makris, Sotirios Xydis, Kiamal Pekmestzi, Dimitrios Soudris
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: Presented at the 13th IEEE LASCAS Conference

Abstract:Nowadays, the rapid growth of Deep Neural Network (DNN) architectures has established them as the de facto approach for providing advanced Machine Learning tasks with excellent accuracy. Targeting low-power DNN computing, this paper examines the interplay of fine-grained error resilience of DNN workloads in collaboration with hardware approximation techniques, to achieve higher levels of energy efficiency. Utilizing the state-of-the-art ROUP approximate multipliers, we systematically explore their fine-grained distribution across the network according to our layer-, filter-, and kernel-level approaches, and examine their impact on accuracy and energy. We use the ResNet-8 model on the CIFAR-10 dataset to evaluate our approximations. The proposed solution delivers up to 54% energy gains in exchange for up to 4% accuracy loss, compared to the baseline quantized model, while it provides 2x energy gains with better accuracy versus the state-of-the-art DNN approximations.

[LG-10] SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Link: https://arxiv.org/abs/2506.21355
Authors: Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
Categories: Machine Learning (cs.LG)

Abstract:Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, example ordering exhibits a recency bias, i.e., placing the most relevant example last can lead to substantial performance improvements by up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.

[LG-11] Lipschitz Bounds for Persistent Laplacian Eigenvalues under One-Simplex Insertions

Link: https://arxiv.org/abs/2506.21352
Authors: Le Vu Anh, Mehmet Dik, Nguyen Viet Anh
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Metric Geometry (math.MG)
Comments: 16 pages, 4 figures

Abstract:Persistent Laplacians are matrix operators that track how the shape and structure of data transform across scales and are popularly adopted in biology, physics, and machine learning. Their eigenvalues are concise descriptors of geometric and topological features in a filtration. Although earlier work established global algebraic stability for these operators, the precise change in a single eigenvalue when one simplex, such as a vertex, edge, or triangle, is added has remained unknown. This is important because downstream tools, including heat-kernel signatures and spectral neural networks, depend directly on these eigenvalues. We close this gap by proving a uniform Lipschitz bound: after inserting one simplex, every up-persistent Laplacian eigenvalue can vary by at most twice the Euclidean norm of that simplex’s boundary, independent of filtration scale and complex size. This result delivers the first eigenvalue-level robustness guarantee for spectral topological data analysis. It guarantees that spectral features remain stable under local updates and enables reliable error control in dynamic data settings.
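
Written in symbols, the bound stated in this abstract would read roughly as below. The notation is assumed here rather than taken from the paper: $\Delta$ and $\Delta'$ denote the up-persistent Laplacian before and after inserting one simplex $\sigma$, $\lambda_i(\cdot)$ is the $i$-th eigenvalue in a fixed ordering, and $\partial\sigma$ is the boundary of $\sigma$ viewed as a coefficient vector.

```latex
\[
  \bigl|\lambda_i(\Delta') - \lambda_i(\Delta)\bigr|
  \;\le\; 2\,\lVert \partial\sigma \rVert_2
  \qquad \text{for every eigenvalue index } i .
\]
```

The right-hand side depends only on the inserted simplex, which matches the abstract's claim that the bound is independent of filtration scale and complex size.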

[LG-12] DynamicBench: Evaluating Real-Time Report Generation in Large Language Models

Link: https://arxiv.org/abs/2506.21343
Authors: Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
Categories: Machine Learning (cs.LG)

Abstract:Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It requires domain-specific knowledge, ensuring accurate report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, with our method achieving state-of-the-art performance, surpassing GPT4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.

[LG-13] AGTCNet: A Graph-Temporal Approach for Principled Motor Imagery EEG Classification

Link: https://arxiv.org/abs/2506.21338
Authors: Galvin Brice S. Lim, Brian Godwin S. Lim, Argel A. Bandala, John Anthony C. Jose, Timothy Scott C. Chu, Edwin Sybingco
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Brain-computer interface (BCI) technology utilizing electroencephalography (EEG) marks a transformative innovation, empowering motor-impaired individuals to engage with their environment on equal footing. Despite its promising potential, developing subject-invariant and session-invariant BCI systems remains a significant challenge due to the inherent complexity and variability of neural activity across individuals and over time, compounded by EEG hardware constraints. While prior studies have sought to develop robust BCI systems, existing approaches remain ineffective in capturing the intricate spatiotemporal dependencies within multichannel EEG signals. This study addresses this gap by introducing the attentive graph-temporal convolutional network (AGTCNet), a novel graph-temporal model for motor imagery EEG (MI-EEG) classification. Specifically, AGTCNet leverages the topographic configuration of EEG electrodes as an inductive bias and integrates graph convolutional attention network (GCAT) to jointly learn expressive spatiotemporal EEG representations. The proposed model significantly outperformed existing MI-EEG classifiers, achieving state-of-the-art performance while utilizing a compact architecture, underscoring its effectiveness and practicality for BCI deployment. With a 49.87% reduction in model size, 64.65% faster inference time, and shorter input EEG signal, AGTCNet achieved a moving average accuracy of 66.82% for subject-independent classification on the BCI Competition IV Dataset 2a, which further improved to 82.88% when fine-tuned for subject-specific classification. On the EEG Motor Movement/Imagery Dataset, AGTCNet achieved moving average accuracies of 64.14% and 85.22% for 4-class and 2-class subject-independent classifications, respectively, with further improvements to 72.13% and 90.54% for subject-specific classifications.

[LG-14] Stochastic Quantum Spiking Neural Networks with Quantum Memory and Local Learning

Link: https://arxiv.org/abs/2506.21324
Authors: Jiechen Chen, Bipin Rajendran, Osvaldo Simeone
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:Neuromorphic and quantum computing have recently emerged as promising paradigms for advancing artificial intelligence, each offering complementary strengths. Neuromorphic systems built on spiking neurons excel at processing time-series data efficiently through sparse, event-driven computation, consuming energy only upon input events. Quantum computing, on the other hand, leverages superposition and entanglement to explore feature spaces that are exponentially large in the number of qubits. Hybrid approaches combining these paradigms have begun to show potential, but existing quantum spiking models have important limitations. Notably, prior quantum spiking neuron implementations rely on classical memory mechanisms on single qubits, requiring repeated measurements to estimate firing probabilities, and they use conventional backpropagation on classical simulators for training. Here we propose a stochastic quantum spiking (SQS) neuron model that addresses these challenges. The SQS neuron uses multi-qubit quantum circuits to realize a spiking unit with internal quantum memory, enabling event-driven probabilistic spike generation in a single shot. Furthermore, we outline how networks of SQS neurons – dubbed SQS neural networks (SQSNNs) – can be trained via a hardware-friendly local learning rule, eliminating the need for global classical backpropagation. The proposed SQSNN model fuses the time-series efficiency of neuromorphic computing with the exponentially large inner state space of quantum computing, paving the way for quantum spiking neural networks that are modular, scalable, and trainable on quantum hardware.

[LG-15] Improved seeding strategies for k-means and k-GMM

Link: https://arxiv.org/abs/2506.21291
Authors: Guillaume Carrière, Frédéric Cazals
Categories: Machine Learning (cs.LG)
Comments: 13 pages

Abstract:We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian Mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle (conditioning the seed selection to an enhanced coherence with the final metric used to assess the algorithm) and a multipass strategy to tame down the effect of randomization. Experiments show a consistent constant factor improvement over classical contenders in terms of the final metric (SSE for k-means, log-likelihood for k-GMM), at a modest overhead. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle properties of k-means often overlooked, including the (lack of) correlations between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. Practically, our most effective seeding methods are strong candidates to become one of the standard techniques, if not the standard one. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
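
To make the three ingredients concrete, here is a minimal sketch of the classical greedy k-means++ seeding that the abstract uses as a reference point. It is not the paper's lookahead or multipass method. The function names, the default pool size of 3, and the sample points are assumptions for illustration, and the sketch assumes the data contain at least k distinct points.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def greedy_kmeanspp(points, k, n_candidates=3, rng=None):
    """Greedy k-means++ seeding; returns (centers, SSE of the seeding)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its closest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Ingredient 1 (sampling metric): draw with probability proportional to d2.
        # Ingredient 2 (pool size): draw n_candidates candidate seeds.
        candidates = rng.choices(points, weights=d2, k=n_candidates)
        # Ingredient 3 (selection metric): keep the candidate with the lowest SSE.
        best, best_d2 = None, None
        for c in candidates:
            new_d2 = [min(d, sq_dist(p, c)) for d, p in zip(d2, points)]
            if best_d2 is None or sum(new_d2) < sum(best_d2):
                best, best_d2 = c, new_d2
        centers.append(best)
        d2 = best_d2
    return centers, sum(d2)

# Hypothetical data: two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, sse = greedy_kmeanspp(pts, k=2)
print(centers, sse)
```

Setting `n_candidates=1` recovers plain k-means++, so the pool size is the knob that trades extra distance computations for a better starting SSE.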

[LG-16] Zero-Shot Learning for Obsolescence Risk Forecasting

Link: https://arxiv.org/abs/2506.21240
Authors: Elie Saad, Aya Mrabah, Mariem Besbes, Marc Zolghadri, Victor Czmil, Claude Baron, Vincent Bourgeois
Categories: Machine Learning (cs.LG)

Abstract:Component obsolescence poses significant challenges in industries reliant on electronic components, causing increased costs and disruptions in the security and availability of systems. Accurate obsolescence risk prediction is essential but hindered by a lack of reliable data. This paper proposes a novel approach to forecasting obsolescence risk using zero-shot learning (ZSL) with large language models (LLMs) to address data limitations by leveraging domain-specific knowledge from tabular datasets. Applied to two real-world datasets, the method demonstrates effective risk prediction. A comparative evaluation of four LLMs underscores the importance of selecting the right model for specific forecasting tasks.

[LG-17] Artificial Delegates Resolve Fairness Issues in Perpetual Voting with Partial Turnout

Link: https://arxiv.org/abs/2506.21186
Authors: Apurva Shah, Axel Abels, Ann Nowé, Tom Lenaerts
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: The paper has been accepted at the ACM Collective Intelligence Conference (CI 2025), August 4 to 6, 2025, San Diego, CA, USA

Abstract:Perpetual voting addresses fairness in sequential collective decision-making by evaluating representational equity over time. However, existing perpetual voting rules rely on full participation and complete approval information, assumptions that rarely hold in practice, where partial turnout is the norm. In this work, we study the integration of Artificial Delegates, preference-learning agents trained to represent absent voters, into perpetual voting systems. We examine how absenteeism affects fairness and representativeness under various voting methods and evaluate the extent to which Artificial Delegates can compensate for missing participation. Our findings indicate that while absenteeism significantly affects fairness, Artificial Delegates reliably mitigate these effects and enhance robustness across diverse scenarios.

[LG-18] Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design

Link: https://arxiv.org/abs/2506.21158
Authors: Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
Categories: Machine Learning (cs.LG)

Abstract:In many real-world applications, evaluating the goodness of instances is often costly and time-consuming, e.g., via human feedback or physics simulations, in contrast to proposing new instances. This is even more critical in reinforcement learning, as new interactions with the environment (i.e., new instances) need to be evaluated to provide a reward signal to learn from. As sufficient exploration is crucial, learning from a diverse mini-batch can have a large impact and help mitigate mode collapse. In this paper, we introduce diverse mini-batch selection for reinforcement learning and propose to use determinantal point processes for this task. We study this framework in the context of a real-world problem, namely drug discovery. We experimentally study how our proposed framework can improve the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is essential. We conduct a comprehensive evaluation with three well-established molecular generation oracles over numerous generative steps. Our experiments conclude that our diverse mini-batch selection framework can substantially improve the diversity of the solutions, while still obtaining solutions of high quality. In drug discovery, such an outcome can potentially lead to fulfilling unmet medication needs faster.
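
One common way to use a determinantal point process (DPP) for subset selection is greedy determinant maximization over a quality-weighted similarity kernel, sketched below. Whether the paper samples from the DPP or uses a greedy approximation like this one is not stated in the abstract, so treat the selection rule, the RBF kernel, the function names, and the toy data as assumptions.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def quality_rbf_kernel(items, quality, length_scale=1.0):
    """L[i][j] = q_i * exp(-||x_i - x_j||^2 / (2 * length_scale^2)) * q_j."""
    def sim(p, q):
        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
        return math.exp(-d2 / (2 * length_scale ** 2))
    n = len(items)
    return [[quality[i] * sim(items[i], items[j]) * quality[j]
             for j in range(n)] for i in range(n)]

def greedy_dpp_select(kernel, batch_size):
    """Greedily add the item that maximizes det of the selected sub-kernel."""
    selected, remaining = [], list(range(len(kernel)))
    while remaining and len(selected) < batch_size:
        def score(i):
            idx = selected + [i]
            return det([[kernel[r][c] for c in idx] for r in idx])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidates: two near-duplicate pairs with different quality scores.
items = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
quality = [1.0, 0.9, 0.8, 0.7]
kernel = quality_rbf_kernel(items, quality)
batch = greedy_dpp_select(kernel, batch_size=2)
print(batch)
```

Picking purely by quality would return the two near-duplicates 0 and 1. The determinant penalizes similar items, so the second pick comes from the other cluster instead.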

[LG-19] Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks

Link: https://arxiv.org/abs/2506.21142
Authors: Deepak Kumar Panda, Weisi Guo
Subjects: Machine Learning (cs.LG)

Abstract:The growing integration of UAVs into civilian airspace underscores the need for resilient and intelligent intrusion detection systems (IDS), as traditional anomaly detection methods often fail to identify novel threats. A common approach treats unfamiliar attacks as out-of-distribution (OOD) samples; however, this leaves systems vulnerable when mitigation is inadequate. Moreover, conventional OOD detectors struggle to distinguish stealthy adversarial attacks from genuine OOD events. This paper introduces a conditional generative adversarial network (cGAN)-based framework for crafting stealthy adversarial attacks that evade IDS mechanisms. We first design a robust multi-class IDS classifier trained on benign UAV telemetry and known cyber-attacks, including Denial of Service (DoS), false data injection (FDI), man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN perturbs known attacks to generate adversarial samples that are misclassified as benign while retaining statistical resemblance to OOD distributions. These adversarial samples are iteratively refined to achieve high stealth and success rates. To detect such perturbations, we implement a conditional variational autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based regret scores significantly outperform traditional Mahalanobis distance-based detectors in identifying stealthy adversarial threats. Our findings emphasize the importance of advanced probabilistic modeling to strengthen IDS capabilities against adaptive, generative-model-based cyber intrusions.

[LG-20] NaLaFormer: Norm-Aware Linear Attention for Transformer Models

Link: https://arxiv.org/abs/2506.21137
Authors: Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang
Subjects: Machine Learning (cs.LG)

Abstract:Linear attention has emerged as a viable alternative to softmax attention by reducing complexity from quadratic to linear in sequence length. To preserve two fundamental properties of softmax, non-negativity and entropy reduction, current works employ various linearly separable kernel functions with L1 normalization instead of the softmax operator. However, the normalization operation in linear attention neglects query norms, and this degradation leads to a substantial entropy gap. Meanwhile, existing works inhibit negative values of query and key vectors, resulting in missing inner-product interactions after mapping. To address these dual challenges, we propose a novel Norm-Aware Linear Attention mechanism serving to restore norm-guided dynamic spikiness and recover kernel-perturbed norm distributions. Specifically, we first decouple query and key matrices into two components: norm and direction, to achieve norm-aware spikiness control and norm consistency, respectively. We mathematically reveal that the extent of entropy reduction varies with the query norm in softmax normalization, motivating a query-norm aware kernel function for dynamic control over entropy reduction. Furthermore, to ensure norm consistency and enforce non-negativity constraints, we employ a norm-preserving mapping to project all elements of the angular matrix into positive values, leveraging cosine similarity to inhibit dimensions with opposite directions. We conduct extensive experiments demonstrating that the NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2%.
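
As background for the complexity claim and for the point that L1 normalization discards the query norm, here is a small sketch of generic kernelized linear attention. It is not the NaLaFormer kernel: the ReLU feature map and the function names are assumptions chosen for illustration, and inputs are assumed to give strictly positive normalizers.

```python
import numpy as np

def feature_map(x):
    # Non-negative, positively homogeneous map (ReLU); an illustrative choice.
    return np.maximum(x, 0.0)

def quadratic_attention(q, k, v):
    """Reference form: builds the full (n_q, n_k) score matrix."""
    scores = feature_map(q) @ feature_map(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)  # L1 normalization
    return weights @ v

def linear_attention(q, k, v):
    """Same result, but keys and values are summarized once in (d, d_v) space."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # (d, d_v), independent of the number of queries
    z = phi_k.sum(axis=0)      # (d,), shared normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]
```

Because the ReLU map satisfies phi(c q) = c phi(q) for c > 0, scaling a query by c multiplies both the numerator and the normalizer by c, so the output is unchanged. That cancellation is one way to see how the query norm drops out of L1-normalized linear attention.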

[LG-21] Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges

Link: https://arxiv.org/abs/2506.21107
Authors: Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)

Abstract:Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell’s phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during the modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data. We further interpret this framework as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. To capture this, we introduce a more suitable evaluation metric. We propose Unlasting, dual conditional diffusion models that overcome the problem of unpaired single-cell perturbation data and strengthen the model’s insight into perturbations under the guidance of the GRN, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.

[LG-22] Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection

Link: https://arxiv.org/abs/2506.21093
Authors: Li Fan, Peng Wang, Jing Yang, Cong Shen
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)

Abstract:Transformers have shown potential in solving wireless communication problems, particularly via in-context learning (ICL), where models adapt to new tasks through prompts without requiring model updates. However, prior ICL-based Transformer models rely on deep architectures with many layers to achieve satisfactory performance, resulting in substantial storage and computational costs. In this work, we propose CHain Of thOught Symbol dEtection (CHOOSE), a CoT-enhanced shallow Transformer framework for wireless symbol detection. By introducing autoregressive latent reasoning steps within the hidden space, CHOOSE significantly improves the reasoning capacity of shallow models (1-2 layers) without increasing model depth. This design enables lightweight Transformers to achieve detection performance comparable to much deeper models, making them well-suited for deployment on resource-constrained mobile devices. Experimental results demonstrate that our approach outperforms conventional shallow Transformers and achieves performance comparable to that of deep Transformers, while maintaining storage and computational efficiency. This represents a promising direction for implementing Transformer-based algorithms in wireless receivers with limited computational resources.

[LG-23] FedDAA: Dynamic Client Clustering for Concept Drift Adaptation in Federated Learning

Link: https://arxiv.org/abs/2506.21054
Authors: Fu Peng, Ming Tang
Subjects: Machine Learning (cs.LG)

Abstract:In federated learning (FL), the data distribution of each client may change over time, introducing both temporal and spatial data heterogeneity, known as concept drift. Data heterogeneity arises from three drift sources: real drift (a shift in the conditional distribution P(y|x)), virtual drift (a shift in the input distribution P(x)), and label drift (a shift in the label distribution P(y)). However, most existing FL methods addressing concept drift primarily focus on real drift. When clients experience virtual or label drift, these methods often fail to selectively retain useful historical knowledge, leading to catastrophic forgetting. A key challenge lies in distinguishing different sources of drift, as they require distinct adaptation strategies: real drift calls for discarding outdated data, while virtual or label drift benefits from retaining historical data. Without explicitly identifying the drift sources, a general adaptation strategy is suboptimal and may harm generalization. To address this challenge, we propose FedDAA, a dynamic clustered FL framework designed to adapt to multi-source concept drift while preserving valuable historical knowledge. Specifically, FedDAA integrates three modules: a cluster number determination module to find the optimal number of clusters; a real drift detection module to distinguish real drift from virtual/label drift; and a concept drift adaptation module to adapt to new data while retaining useful historical information. We provide theoretical convergence guarantees, and experiments show that FedDAA achieves 7.84% to 8.52% accuracy improvements over state-of-the-art methods on Fashion-MNIST, CIFAR-10, and CIFAR-100.

[LG-24] An Information-Theoretic Analysis for Federated Learning under Concept Drift

Link: https://arxiv.org/abs/2506.21036
Authors: Fu Peng, Meng Zhang, Ming Tang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Recent studies in federated learning (FL) commonly train models on static datasets. However, real-world data often arrives as streams with shifting distributions, causing performance degradation known as concept drift. This paper analyzes FL performance under concept drift using information theory and proposes an algorithm to mitigate the performance degradation. We model concept drift as a Markov chain and introduce the Stationary Generalization Error to assess a model’s capability to capture characteristics of future unseen data. Its upper bound is derived using KL divergence and mutual information. We study three drift patterns (periodic, gradual, and random) and their impact on FL performance. Inspired by this, we propose an algorithm that regularizes the empirical risk minimization approach with KL divergence and mutual information, thereby enhancing long-term performance. We also explore the performance-cost tradeoff by identifying a Pareto front. To validate our approach, we build an FL testbed using Raspberry Pi 4 devices. Experimental results corroborate the theoretical findings, confirming that drift patterns significantly affect performance. Our method consistently outperforms existing approaches for these three patterns, demonstrating its effectiveness in adapting to concept drift in FL.

[LG-25] Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning

Link: https://arxiv.org/abs/2506.21035
Authors: Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Subjects: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Continual learning (CL) with large pre-trained models (PTMs) is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but suffer from interference, redundancy, and ambiguous routing due to coarse adapter-level selection. However, this design introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, earlier task routing degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL. Unlike mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components, each treated as an independent expert, enabling fine-grained mixture of rank-1 expert utilization while mitigating interference and redundancy. To avoid ambiguous routing, we propose that each rank-1 expert can infer its own relevance via intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning tasks with CLIP and large language models (LLMs), analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA is effective at enhancing CL with PTMs, improving generalization while mitigating forgetting.
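
The decomposition of a rank-r update into r rank-1 components is a standard linear-algebra identity, sketched below with NumPy. The names `rank1_components`, `sparse_update`, and `active` are hypothetical. The abstract does not specify how MoRA selects ranks, so the set of active indices is simply passed in here.

```python
import numpy as np

def rank1_components(B, A):
    """Split the low-rank update B @ A (B: d_out x r, A: r x d_in) into r rank-1 matrices."""
    return [np.outer(B[:, i], A[i, :]) for i in range(B.shape[1])]

def sparse_update(B, A, active):
    """Sum only the rank-1 components whose indices appear in `active`."""
    update = np.zeros((B.shape[0], A.shape[1]))
    for i in active:
        update += np.outer(B[:, i], A[i, :])
    return update
```

Summing all r components recovers the full update B @ A. Activating a subset gives a sparser update that uses only some of the rank-1 "experts".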

[LG-26] TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence

Link: https://arxiv.org/abs/2506.21028
Authors: Feng Jiang, Mangal Prakash, Hehuan Ma, Jianyuan Deng, Yuzhi Guo, Amina Mollaysa, Tommaso Mansi, Rui Liao, Junzhou Huang
Subjects: Machine Learning (cs.LG)

Abstract:Molecular property prediction aims to learn representations that map chemical structures to functional properties. While multimodal learning has emerged as a powerful paradigm to learn molecular representations, prior works have largely overlooked textual and taxonomic information of molecules for representation learning. We introduce TRIDENT, a novel framework that integrates molecular SMILES, textual descriptions, and taxonomic functional annotations to learn rich molecular representations. To achieve this, we curate a comprehensive dataset of molecule-text pairs with structured, multi-level functional annotations. Instead of relying on conventional contrastive loss, TRIDENT employs a volume-based alignment objective to jointly align tri-modal features at the global level, enabling soft, geometry-aware alignment across modalities. Additionally, TRIDENT introduces a novel local alignment objective that captures detailed relationships between molecular substructures and their corresponding sub-textual descriptions. A momentum-based mechanism dynamically balances global and local alignment, enabling the model to learn both broad functional semantics and fine-grained structure-function mappings. TRIDENT achieves state-of-the-art performance on 11 downstream tasks, demonstrating the value of combining SMILES, textual, and taxonomic functional annotations for molecular property prediction.

[LG-27] Distilling Normalizing Flows

Link: https://arxiv.org/abs/2506.21003
Authors: Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi
Subjects: Machine Learning (cs.LG)
Comments: Published in eLVM @ CVPR 2025 (this https URL)

Abstract:Explicit density learners are becoming an increasingly popular technique for generative models because of their ability to better model probability distributions. They have advantages over Generative Adversarial Networks due to their ability to perform density estimation and their exact latent-variable inference. This brings many benefits, including the ability to simply interpolate, calculate sample likelihood, and analyze the probability distribution. The downside of these models is that they are often more difficult to train and have lower sampling quality. Normalizing flows are explicit density models that use composable bijective functions to turn an intractable probability function into a tractable one. In this work, we present novel knowledge distillation techniques to increase sampling quality and density estimation of smaller student normalizing flows. We seek to study the capacity of knowledge distillation in Compositional Normalizing Flows to understand the benefits and weaknesses provided by these architectures. Normalizing flows have unique properties that allow for non-traditional forms of knowledge transfer, where we can transfer that knowledge within intermediate layers. We find that through this distillation, we can make students significantly smaller while making substantial performance gains over a non-distilled student. With smaller models there is a proportionally increased throughput as this is dependent upon the number of bijectors, and thus parameters, in the network.

[LG-28] EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora

Link: https://arxiv.org/abs/2506.20963
Authors: Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Under review

Abstract:Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures, enabling efficient and localized insertions of new data without disrupting the existing topology. The design eliminates the need for retraining or costly recomputation while preserving high retrieval accuracy and low latency. Experiments on large-scale benchmarks demonstrate that EraRAG achieves up to an order of magnitude reduction in update time and token consumption compared to existing Graph-RAG systems, while providing superior accuracy. This work offers a practical path forward for RAG systems that must operate over continually growing corpora, bridging the gap between retrieval efficiency and adaptability. Our code and data are available at this https URL.
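
For readers unfamiliar with hyperplane-based LSH, the sketch below shows the generic technique: each embedding is hashed by the sign pattern of its projections onto a fixed set of random hyperplanes. The function name `lsh_codes` and the integer bucket encoding are illustrative assumptions; the abstract does not describe EraRAG's actual partitioning or graph construction.

```python
import numpy as np

def lsh_codes(embeddings, hyperplanes):
    """Return one integer bucket id per row from the sign pattern of its projections.

    embeddings:  (n, d) array of vectors to hash
    hyperplanes: (num_planes, d) array of fixed random normal vectors
    """
    bits = (embeddings @ hyperplanes.T) >= 0                      # (n, num_planes)
    weights = 1 << np.arange(hyperplanes.shape[0], dtype=np.int64)  # 1, 2, 4, ...
    return bits.astype(np.int64) @ weights
```

Because the hyperplanes stay fixed, a newly arriving document can be hashed on its own and placed into an existing bucket without rehashing the rest of the corpus, which is the property that makes incremental insertion cheap in this kind of scheme.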

[LG-29] Model State Arithmetic for Machine Unlearning

Link: https://arxiv.org/abs/2506.20941
Authors: Keivan Rezaei, Mehrdad Saberi, Abhilasha Ravichander, Soheil Feizi
Subjects: Machine Learning (cs.LG)
Comments: Preprint. Work in progress

Abstract:Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints through complete retraining – by repeatedly pretraining the model on datasets that exclude these specific instances – is computationally prohibitive. For this reason, unlearning algorithms have emerged that aim to eliminate the influence of particular datapoints, while otherwise preserving the model – at a low computational cost. However, precisely estimating and undoing the influence of individual datapoints has proved to be challenging. In this work, we propose a new algorithm, MSA, for estimating and undoing the influence of datapoints – by leveraging model checkpoints i.e. artifacts capturing model states at different stages of pretraining. Our experimental results demonstrate that MSA consistently outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.

[LG-30] Explainable AI for Radar Resource Management: Modified LIME in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2506.20916
Authors: Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
Subjects: Machine Learning (cs.LG)

Abstract:Deep reinforcement learning has been extensively studied in decision-making processes and has demonstrated superior performance over conventional approaches in various fields, including radar resource management (RRM). However, a notable limitation of neural networks is their "black box" nature, and recent research has increasingly focused on explainable AI (XAI) techniques to describe the rationale behind neural network decisions. One promising XAI method is local interpretable model-agnostic explanations (LIME). However, the sampling process in LIME ignores the correlations between features. In this paper, we propose a modified LIME approach that integrates deep learning (DL) into the sampling process, which we refer to as DL-LIME. We employ DL-LIME within deep reinforcement learning for radar resource management. Numerical results show that DL-LIME outperforms conventional LIME in terms of both fidelity and task performance. DL-LIME also provides insights into which factors are more important in decision making for radar resource management.

[LG-31] Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

Link: https://arxiv.org/abs/2506.20904
Authors: Matthew Zurek, Guy Zamir, Yudong Chen
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empirical-span-based penalty function. Our algorithm also does not require any prior parameter knowledge for its implementation. Remarkably, we show via hard examples that learning under our conditions requires coverage assumptions beyond the stationary distribution of the target policy, distinguishing single-policy complexity measures from previously examined cases. We also develop lower bounds nearly matching our main result.

[LG-32] Graph-Structured Feedback Multimodel Ensemble Online Conformal Prediction

Link: https://arxiv.org/abs/2506.20898
Authors: Erfan Hajihashemi, Yanning Shen
Subjects: Machine Learning (cs.LG)

Abstract:Online conformal prediction has demonstrated its capability to construct a prediction set for each incoming data point that covers the true label with a predetermined probability. To cope with potential distribution shift, multi-model online conformal prediction has been introduced to select and leverage different models from a preselected candidate set. Along with the improved flexibility, the choice of the preselected set also brings challenges. A candidate set that includes a large number of models may increase the computational complexity. In addition, the inclusion of irrelevant models with poor performance may negatively impact the performance and lead to unnecessarily large prediction sets. To address these challenges, we propose a novel multi-model online conformal prediction algorithm that identifies a subset of effective models at each time step by collecting feedback from a bipartite graph, which is refined upon receiving new data. A model is then selected from this subset to construct the prediction set, resulting in reduced computational complexity and smaller prediction sets. Additionally, we demonstrate that using prediction set size as feedback, alongside model loss, can significantly improve efficiency by constructing smaller prediction sets while still satisfying the required coverage guarantee. The proposed algorithms are proven to ensure valid coverage and achieve sublinear regret. Experiments on real and synthetic datasets validate that the proposed methods construct smaller prediction sets and outperform existing multi-model online conformal prediction approaches.

[LG-33] On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

Link: https://arxiv.org/abs/2506.20893
Authors: Yian Wang, Ali Ebrahimpour-Boroojeny, Hari Sundaram
Subjects: Machine Learning (cs.LG)

Abstract:In this work, we introduce an output-reweighting unlearning method, RWFT, a lightweight technique that erases an entire class from a trained classifier without full retraining. Forgetting specific classes from trained models is essential for enforcing user deletion rights and mitigating harmful or biased predictions. Full retraining is costly, and existing unlearning methods fail to replicate the behavior of the retrained models when predicting samples from the unlearned class. We prove this failure by designing a variant of membership inference attacks, MIA-NN, that successfully reveals the unlearned class for any of these methods. We propose a simple redistribution of the probability mass for the prediction on the samples in the forgotten class, which is robust to MIA-NN. We also introduce a new metric based on the total variation (TV) distance of the prediction probabilities to quantify residual leakage and to keep future methods from being susceptible to the new attack. Through extensive experiments with state-of-the-art baselines in machine unlearning, we show that our approach matches the results of full retraining in both the metrics used for evaluation by prior work and the new metric we propose in this work. Compared to the best existing method, we gain 2.79% in previously used metrics and 111.45% in our new TV-based metric.

[LG-34] Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research

Link: https://arxiv.org/abs/2506.20872
Authors: Osama Zafar, Rosemarie Santa González, Mina Namazi, Alfonso Morales, Erman Ayday
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2409.06069

Abstract:Data-driven agriculture, which integrates technology and data into agricultural practices, has the potential to improve crop yield, disease resilience, and long-term soil health. However, privacy concerns, such as adverse pricing, discrimination, and resource manipulation, deter farmers from sharing data, as it can be used against them. To address this barrier, we propose a privacy-preserving framework that enables secure data sharing and collaboration for research and development while mitigating privacy risks. The framework combines dimensionality reduction techniques (like Principal Component Analysis (PCA)) and differential privacy by introducing Laplacian noise to protect sensitive information. The proposed framework allows researchers to identify potential collaborators for a target farmer and train personalized machine learning models either on the data of identified collaborators via federated learning or directly on the aggregated privacy-protected data. It also allows farmers to identify potential collaborators based on similarities. We have validated this on real-life datasets, demonstrating robust privacy protection against adversarial attacks and utility performance comparable to a centralized system. We demonstrate how this framework can facilitate collaboration among farmers and help researchers pursue broader research objectives. The adoption of the framework can empower researchers and policymakers to leverage agricultural data responsibly, paving the way for transformative advances in data-driven agriculture. By addressing critical privacy challenges, this work supports secure data integration, fostering innovation and sustainability in agricultural systems.
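
The combination of dimensionality reduction and Laplacian noise mentioned in the abstract can be sketched roughly as follows. This is a generic illustration, not the paper's pipeline: the function names, the `sensitivity` value, and the choice to add noise after projection are all assumptions. A real differential-privacy guarantee would also need a bounded sensitivity (for example via clipping) and would have to account for the fact that the PCA fit itself depends on the data.

```python
import numpy as np

def pca_project(X, n_components):
    """Project centered data onto its top principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def add_laplace_noise(Z, sensitivity, epsilon, rng):
    """Laplace mechanism: noise scale = sensitivity / epsilon (sensitivity is assumed known)."""
    scale = sensitivity / epsilon
    return Z + rng.laplace(loc=0.0, scale=scale, size=Z.shape)
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of utility. This is the trade-off that any comparison against a centralized baseline has to weigh.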

[LG-35] Multi-Objective Reinforcement Learning for Cognitive Radar Resource Management

Link: https://arxiv.org/abs/2506.20853
Authors: Ziyang Lu, Subodh Kalia, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:The time allocation problem in multi-function cognitive radar systems focuses on the trade-off between scanning for newly emerging targets and tracking the previously detected targets. We formulate this as a multi-objective optimization problem and employ deep reinforcement learning to find Pareto-optimal solutions and compare deep deterministic policy gradient (DDPG) and soft actor-critic (SAC) algorithms. Our results demonstrate the effectiveness of both algorithms in adapting to various scenarios, with SAC showing improved stability and sample efficiency compared to DDPG. We further employ the NSGA-II algorithm to estimate an upper bound on the Pareto front of the considered problem. This work contributes to the development of more efficient and adaptive cognitive radar systems capable of balancing multiple competing objectives in dynamic environments.

[LG-36] Learning-Based Resource Management in Integrated Sensing and Communication Systems

Link: https://arxiv.org/abs/2506.20849
Authors: Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
Subjects: Machine Learning (cs.LG)

Abstract:In this paper, we tackle the task of adaptive time allocation in integrated sensing and communication systems equipped with radar and communication units. The dual-functional radar-communication system’s task involves allocating dwell times for tracking multiple targets and utilizing the remaining time for data transmission towards estimated target locations. We introduce a novel constrained deep reinforcement learning (CDRL) approach, designed to optimize resource allocation between tracking and communication under time budget constraints, thereby enhancing target communication quality. Our numerical results demonstrate the efficiency of our proposed CDRL framework, confirming its ability to maximize communication quality in highly dynamic environments while adhering to time constraints.

[LG-37] Demystifying Distributed Training of Graph Neural Networks for Link Prediction

Link: https://arxiv.org/abs/2506.20818
Authors: Xin Huang, Chul-Ho Lee
Subjects: Machine Learning (cs.LG)
Comments: Accepted by IEEE ICDCS 2025

Abstract:Graph neural networks (GNNs) are powerful tools for solving graph-related problems. Distributed GNN frameworks and systems enhance the scalability of GNNs and accelerate model training, yet most are optimized for node classification. Their performance on link prediction remains underexplored. This paper demystifies distributed training of GNNs for link prediction by investigating the issue of performance degradation when each worker trains a GNN on its assigned partitioned subgraph without having access to the entire graph. We discover that the issue stems not only from the information loss caused by graph partitioning but also from the way negative samples are drawn during model training. While sharing the complete graph information with each worker resolves the issue and preserves link prediction accuracy, it incurs a high communication cost. We propose SpLPG, which effectively leverages graph sparsification to mitigate the issue of performance degradation at a reduced communication cost. Experimental results on several public real-world datasets demonstrate the effectiveness of SpLPG, which reduces the communication overhead by up to about 80% while mostly preserving link prediction accuracy.

[LG-38] Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning

Link: https://arxiv.org/abs/2506.20814
Authors: Jakub Piwko, Jędrzej Ruciński, Dawid Płudowski, Antoni Zajko, Patryzja Żak, Mateusz Zacharecki, Anna Kozak, Katarzyna Woźnica
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:Ensemble learning has proven effective in boosting predictive performance, but traditional methods such as bagging, boosting, and dynamic ensemble selection (DES) suffer from high computational cost and limited adaptability to heterogeneous data distributions. To address these limitations, we propose Hellsemble, a novel and interpretable ensemble framework for binary classification that leverages dataset complexity during both training and inference. Hellsemble incrementally partitions the dataset into circles of difficulty by iteratively passing misclassified instances from simpler models to subsequent ones, forming a committee of specialised base learners. Each model is trained on increasingly challenging subsets, while a separate router model learns to assign new instances to the most suitable base model based on inferred difficulty. Hellsemble achieves strong classification accuracy while maintaining computational efficiency and interpretability. Experimental results on OpenML-CC18 and Tabzilla benchmarks demonstrate that Hellsemble often outperforms classical ensemble methods. Our findings suggest that embracing instance-level difficulty offers a promising direction for constructing efficient and robust ensemble systems.
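
The sequential partitioning and routing described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decision-stump base learner (`fit_stump`), the 1-nearest-neighbor router (`route`), and all function names are hypothetical stand-ins, since the abstract does not specify the base models or the router.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Model = Callable[[Point], int]


def fit_stump(X: Sequence[Point], y: Sequence[int]) -> Model:
    """Fit the best single-feature threshold rule (hypothetical base learner)."""
    best = None  # (errors, feature, threshold, label_at_or_above_threshold)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for hi in (0, 1):
                errors = sum(
                    (hi if x[f] >= t else 1 - hi) != label
                    for x, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, hi)
    _, f, t, hi = best
    return lambda x: hi if x[f] >= t else 1 - hi


def fit_sequential_committee(
    X: Sequence[Point], y: Sequence[int], max_models: int = 5
) -> Tuple[List[Model], List[Optional[int]]]:
    """Train models one after another; each sees only the points the previous one got wrong."""
    models: List[Model] = []
    assignment: List[Optional[int]] = [None] * len(X)
    remaining = list(range(len(X)))
    while remaining and len(models) < max_models:
        model = fit_stump([X[i] for i in remaining], [y[i] for i in remaining])
        models.append(model)
        wrong = []
        for i in remaining:
            if model(X[i]) == y[i]:
                assignment[i] = len(models) - 1  # this model "owns" the point
            else:
                wrong.append(i)  # passed on to the next, more specialised model
        remaining = wrong
    for i in remaining:  # points still unresolved when the model cap is reached
        assignment[i] = len(models) - 1
    return models, assignment


def route(x: Point, X: Sequence[Point], assignment: Sequence[Optional[int]]) -> int:
    """Assumed router: pick the model that owns the nearest training point."""
    nearest = min(
        range(len(X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, X[i])),
    )
    return assignment[nearest]


def predict(x: Point, models: Sequence[Model], X: Sequence[Point],
            assignment: Sequence[Optional[int]]) -> int:
    return models[route(x, X, assignment)](x)


X_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [0, 1, 0, 1]
models, assignment = fit_sequential_committee(X_train, y_train)
```

On this toy data a single stump cannot separate the alternating labels, so the one point it misclassifies becomes the training set of a second stump, and the router sends queries near that point to the second model.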

[LG-39] Spiking Neural Networks for SAR Interferometric Phase Unwrapping: A Theoretical Framework for Energy-Efficient Processing

Link: https://arxiv.org/abs/2506.20782
Authors: Marc Bara
Categories: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 8 pages, 2 figures, patent pending

Abstract:We present the first theoretical framework for applying spiking neural networks (SNNs) to synthetic aperture radar (SAR) interferometric phase unwrapping. Despite extensive research in both domains, our comprehensive literature review confirms that SNNs have never been applied to phase unwrapping, representing a significant gap in current methodologies. As Earth observation data volumes continue to grow exponentially (with missions like NISAR expected to generate 100 PB in two years), energy-efficient processing becomes critical for sustainable data center operations. SNNs, with their event-driven computation model, offer potential energy savings of 30-100x compared to conventional approaches while maintaining comparable accuracy. We develop spike encoding schemes specifically designed for wrapped phase data, propose SNN architectures that leverage the spatial propagation nature of phase unwrapping, and provide theoretical analysis of computational complexity and convergence properties. Our framework demonstrates how the temporal dynamics inherent in SNNs can naturally model the spatial continuity constraints fundamental to phase unwrapping. This work opens a new research direction at the intersection of neuromorphic computing and SAR interferometry, offering a complementary approach to existing algorithms that could enable more sustainable large-scale InSAR processing.

[LG-40] Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models

Link: https://arxiv.org/abs/2506.20771
Authors: Xinghao Dong, Huchen Yang, Jin-Long Wu
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)

Abstract:We propose a latent score-based generative AI framework for learning stochastic, non-local closure models and constitutive laws in nonlinear dynamical systems of computational mechanics. This work addresses a key challenge of modeling complex multiscale dynamical systems without a clear scale separation, for which numerically resolving all scales is prohibitively expensive, e.g., for engineering turbulent flows. While classical closure modeling methods leverage domain knowledge to approximate subgrid-scale phenomena, their deterministic and local assumptions can be too restrictive in regimes lacking a clear scale separation. Recent developments of diffusion-based stochastic models have shown promise in the context of closure modeling, but their prohibitive inference cost limits their use in many real-world applications. This work addresses this limitation by jointly training convolutional autoencoders with conditional diffusion models in the latent spaces, significantly reducing the dimensionality of the sampling process while preserving essential physical characteristics. Numerical results demonstrate that the joint training approach helps discover a proper latent space that not only guarantees small reconstruction errors but also ensures good performance of the diffusion model in the latent space. When integrated into numerical simulations, the proposed stochastic modeling framework via latent conditional diffusion models achieves significant computational acceleration while maintaining comparable predictive accuracy to standard diffusion models in physical spaces.

[LG-41] Characterization and Mitigation of Training Instabilities in Microscaling Formats

Link: https://arxiv.org/abs/2506.20752
Authors: Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 14 pages + appendices

Abstract:Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA’s Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch – spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations – we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through in situ intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at this https URL.
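
The idea of a shared scale per block can be illustrated with a small sketch. It is a simplification under stated assumptions, not the MX specification or the authors' code: elements are rounded to a signed integer grid rather than a low-precision floating-point element type, the shared scale is a power of two, and the names `quantize_block` and `quantize_blockwise` as well as the default block size are hypothetical.

```python
import math
from typing import List, Sequence


def quantize_block(block: Sequence[float], element_bits: int = 8) -> List[float]:
    """Quantize one block with a single shared power-of-two scale (simplified)."""
    qmax = 2 ** (element_bits - 1) - 1  # largest representable integer magnitude
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Smallest power-of-two scale that keeps the largest element within range.
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    # Round each element to the shared grid, clip, and map back to real values.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in block]


def quantize_blockwise(
    values: Sequence[float], block_size: int = 32, element_bits: int = 8
) -> List[float]:
    """Apply quantize_block independently to consecutive blocks of values."""
    out: List[float] = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size], element_bits))
    return out


# One outlier sets the scale for its whole block, so small neighbours round to zero.
with_outlier = quantize_blockwise([1000.0, 0.01, 0.01, 0.01], block_size=4)
# Quantized alone, the same small value keeps most of its precision.
alone = quantize_blockwise([0.01], block_size=1)
```

The contrast between `with_outlier` and `alone` shows why the block contents matter: the shared scale extends the range to cover the largest element, at the cost of resolution for the smaller elements in the same block.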

[LG-42] Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers

Link: https://arxiv.org/abs/2506.20746
Authors: Todd Nief, David Reber, Sean Richardson, Ari Holtzman
Categories: Machine Learning (cs.LG)

Abstract:When an LLM learns a relation during finetuning (e.g., new movie releases, corporate mergers, etc.), where does this information go? Is it extracted when the model processes an entity, recalled just-in-time before a prediction, or are there multiple separate heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they tend to replace parts of the residual stream, potentially deleting information. To fill this gap, we propose dynamic weight-grafting between fine-tuned and pre-trained language models to show that fine-tuned language models both (1) extract relation information learned during finetuning while processing entities and (2) "recall" this information in later layers while generating predictions. In some cases, models need both of these pathways to correctly generate finetuned information while, in other cases, a single "enrichment" or "recall" pathway alone is sufficient. We examine the necessity and sufficiency of these information pathways, examining what layers they occur at, how much redundancy they exhibit, and which model components are involved, finding that the "recall" pathway occurs via both task-specific attention mechanisms and a relation extraction step in the output of the attention and the feedforward networks at the final layers before next token prediction.

[LG-43] A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools

Link: https://arxiv.org/abs/2506.20743
Authors: Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Foundation models (FMs) are catalyzing a transformative shift in materials science (MatSci) by enabling scalable, general-purpose, and multimodal AI systems for scientific discovery. Unlike traditional machine learning models, which are typically narrow in scope and require task-specific engineering, FMs offer cross-domain generalization and exhibit emergent capabilities. Their versatility is especially well-suited to materials science, where research challenges span diverse data types and scales. This survey provides a comprehensive overview of foundation models, agentic systems, datasets, and computational tools supporting this growing field. We introduce a task-driven taxonomy encompassing six broad application areas: data extraction, interpretation and Q&A; atomistic simulation; property prediction; materials structure, design and discovery; process planning, discovery, and optimization; and multiscale modeling. We discuss recent advances in both unimodal and multimodal FMs, as well as emerging large language model (LLM) agents. Furthermore, we review standardized datasets, open-source tools, and autonomous experimental platforms that collectively fuel the development and integration of FMs into research workflows. We assess the early successes of foundation models and identify persistent limitations, including challenges in generalizability, interpretability, data imbalance, safety concerns, and limited multimodal fusion. Finally, we articulate future research directions centered on scalable pretraining, continual learning, data governance, and trustworthiness.

[LG-44] On Context-Content Uncertainty Principle

Link: https://arxiv.org/abs/2506.20699
Authors: Xin Li
Categories: Machine Learning (cs.LG)

Abstract:The Context-Content Uncertainty Principle (CCUP) proposes that inference under uncertainty is governed by an entropy asymmetry between context and content: high-entropy contexts must be interpreted through alignment with low-entropy, structured content. In this paper, we develop a layered computational framework that derives operational principles from this foundational asymmetry. At the base level, CCUP formalizes inference as directional entropy minimization, establishing a variational gradient that favors content-first structuring. Building upon this, we identify four hierarchical layers of operational principles: (L1) Core Inference Constraints, including structure-before-specificity, asymmetric inference flow, cycle-consistent bootstrapping, and conditional compression, all shown to be mutually reducible; (L2) Resource Allocation Principles, such as precision-weighted attention, asymmetric learning rates, and attractor-based memory encoding; (L3) Temporal Bootstrapping Dynamics, which organize learning over time via structure-guided curricula; and (L4) Spatial Hierarchical Composition, which integrates these mechanisms into self-organizing cycles of memory, inference, and planning. We present formal equivalence theorems, a dependency lattice among principles, and computational simulations demonstrating the efficiency gains of CCUP-aligned inference. This work provides a unified theoretical foundation for understanding how brains and machines minimize uncertainty through recursive structure-specificity alignment. The brain is not just an inference machine. It is a cycle-consistent entropy gradient resolver, aligning structure and specificity via path-dependent, content-seeded simulation.

[LG-45] E-ABIN: an Explainable module for Anomaly detection in BIological Networks

Link: https://arxiv.org/abs/2506.20693
Authors: Ugo Lomoio, Tommaso Mazza, Pierangelo Veltri, Pietro Hiram Guzzi
Categories: Machine Learning (cs.LG)

Abstract:The increasing availability of large-scale omics data calls for robust analytical frameworks capable of handling complex gene expression datasets while offering interpretable results. Recent advances in artificial intelligence have enabled the identification of aberrant molecular patterns distinguishing disease states from healthy controls. Coupled with improvements in model interpretability, these tools now support the identification of genes potentially driving disease phenotypes. However, current approaches to gene anomaly detection often remain limited to single datasets and lack accessible graphical interfaces. Here, we introduce E-ABIN, a general-purpose, explainable framework for Anomaly detection in Biological Networks. E-ABIN combines classical machine learning and graph-based deep learning techniques within a unified, user-friendly platform, enabling the detection and interpretation of anomalies from gene expression or methylation-derived networks. By integrating algorithms such as Support Vector Machines, Random Forests, Graph Autoencoders (GAEs), and Graph Adversarial Attributed Networks (GAANs), E-ABIN ensures a high predictive accuracy while maintaining interpretability. We demonstrate the utility of E-ABIN through case studies of bladder cancer and coeliac disease, where it effectively uncovers biologically relevant anomalies and offers insights into disease mechanisms.

[LG-46] Gaussian Invariant Markov Chain Monte Carlo

Link: https://arxiv.org/abs/2506.21511
Authors: Michalis K. Titsias, Angelos Alexopoulos, Siran Liu, Petros Dellaportas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 29, 2 figures

Abstract:We develop sampling methods, which consist of Gaussian invariant versions of random walk Metropolis (RWM), Metropolis adjusted Langevin algorithm (MALA) and second order Hessian or Manifold MALA. Unlike standard RWM and MALA, we show that Gaussian invariant sampling can lead to ergodic estimators with improved statistical efficiency. This is due to a remarkable property of Gaussian invariance that allows us to obtain exact analytical solutions to the Poisson equation for Gaussian targets. These solutions can be used to construct efficient and easy-to-use control variates for variance reduction of estimators under any intractable target. We demonstrate the new samplers and estimators in several examples, including high dimensional targets in latent Gaussian models where we compare against several advanced methods and obtain state-of-the-art results. We also provide theoretical results regarding geometric ergodicity, and an optimal scaling analysis that shows the dependence of the optimal acceptance rate on the Gaussianity of the target.

[LG-47] Wild refitting for black box prediction

Link: https://arxiv.org/abs/2506.21460
Authors: Martin J. Wainwright
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We describe and analyze a computationally efficient refitting procedure for computing high-probability upper bounds on the instance-wise mean-squared prediction error of penalized nonparametric estimates based on least-squares minimization. Requiring only a single dataset and black box access to the prediction method, it consists of three steps: computing suitable residuals, symmetrizing and scaling them with a pre-factor $\rho$, and using them to define and solve a modified prediction problem recentered at the current estimate. We refer to it as wild refitting, since it uses Rademacher residual symmetrization as in a wild bootstrap variant. Under relatively mild conditions allowing for noise heterogeneity, we establish a high probability guarantee on its performance, showing that the wild refit with a suitably chosen wild noise scale $\rho$ gives an upper bound on prediction error. This theoretical analysis provides guidance into the design of such procedures, including how the residuals should be formed, the amount of noise rescaling in the wild sub-problem needed for upper bounds, and the local stability properties of the black-box procedure. We illustrate the applicability of this procedure to various problems, including non-rigid structure-from-motion recovery with structured matrix penalties; plug-and-play image restoration with deep neural network priors; and randomized sketching with kernel methods.
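
The three steps named in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: plain residuals stand in for the paper's "suitable residuals", a moving-average smoother (`moving_average`) plays the role of the black-box predictor on a fixed design, and the function names are hypothetical. The paper's actual bound and its recommended choice of the scale `rho` are not reproduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A black-box procedure on a fixed design: responses in, fitted values out.
Fitter = Callable[[Sequence[float]], List[float]]


def moving_average(y: Sequence[float], half_width: int = 1) -> List[float]:
    """Hypothetical stand-in for the black-box prediction method."""
    n = len(y)
    fitted = []
    for i in range(n):
        window = y[max(0, i - half_width):min(n, i + half_width + 1)]
        fitted.append(sum(window) / len(window))
    return fitted


def wild_refit(
    y: Sequence[float], fit: Fitter, rho: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Return (original fit, wild refit) for responses y."""
    fitted = fit(y)
    # Step 1: residuals of the original fit (assumed plain residuals).
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Step 2: symmetrize with Rademacher signs and scale by rho.
    signs = [rng.choice((-1.0, 1.0)) for _ in y]
    # Step 3: responses recentered at the current estimate, then refit.
    wild_y = [fi + rho * s * r for fi, s, r in zip(fitted, signs, residuals)]
    refitted = fit(wild_y)
    return fitted, refitted


y_obs = [0.0, 1.2, 1.9, 3.1, 3.9, 5.2]
fitted, refitted = wild_refit(y_obs, moving_average, rho=1.0, rng=random.Random(0))
# How far the refit moves from the original fit is the kind of quantity
# such a procedure works with; here it is only computed, not interpreted.
mean_sq_shift = sum((a - b) ** 2 for a, b in zip(refitted, fitted)) / len(y_obs)
```

Because the procedure only calls `fit`, any estimator with the same interface could be swapped in, which is the sense in which access is black box.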

[LG-48] Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

Link: https://arxiv.org/abs/2506.21174
Authors: Jongyeon Park, Joonhee Lee, Do-Hyeon Lim, Hong Kook Kim, Hyeongcheol Geum, Jeong Eun Lim
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: DCASE 2025 challenge Task4, 5 pages

Abstract:This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.

[LG-49] Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games

Link: https://arxiv.org/abs/2506.21079
Authors: Yann Kerzreho (ENS Paris Saclay)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite-state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent’s parameters as a slow-evolving variable influenced by the fast-mixing game state. Under mild assumptions (ergodicity of the state process and continuity of the updates), we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent’s learning dynamics. An implementation of the framework is available at: this https URL

[LG-50] Forecasting Geopolitical Events with a Sparse Temporal Fusion Transformer and Gaussian Process Hybrid: A Case Study in Middle Eastern and U.S. Conflict Dynamics

Link: https://arxiv.org/abs/2506.20935
Authors: Hsin-Hsiung Huang, Hayden Hampton
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)

Abstract:Forecasting geopolitical conflict from data sources like the Global Database of Events, Language, and Tone (GDELT) is a critical challenge for national security. The inherent sparsity, burstiness, and overdispersion of such data cause standard deep learning models, including the Temporal Fusion Transformer (TFT), to produce unreliable long-horizon predictions. We introduce STFT-VNNGP, a hybrid architecture that won the 2023 Algorithms for Threat Detection (ATD) competition by overcoming these limitations. Designed to bridge this gap, our model employs a two-stage process: first, a TFT captures complex temporal dynamics to generate multi-quantile forecasts. These quantiles then serve as informed inputs for a Variational Nearest Neighbor Gaussian Process (VNNGP), which performs principled spatiotemporal smoothing and uncertainty quantification. In a case study forecasting conflict dynamics in the Middle East and the U.S., STFT-VNNGP consistently outperforms a standalone TFT, showing a superior ability to predict the timing and magnitude of bursty event periods, particularly at long-range horizons. This work offers a robust framework for generating more reliable and actionable intelligence from challenging event data, with all code and workflows made publicly available to ensure reproducibility.
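
Multi-quantile forecasts like those mentioned in this abstract are commonly trained and scored with the quantile (pinball) loss. The sketch below shows that standard loss only; it is not taken from the paper, and the function names and the example numbers are hypothetical.

```python
from typing import Dict, Sequence


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile loss for one observation at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    y_true: Sequence[float], quantile_preds: Dict[float, Sequence[float]]
) -> float:
    """Average the pinball loss over all time steps and quantile levels."""
    total, count = 0.0, 0
    for q, preds in quantile_preds.items():
        for y, p in zip(y_true, preds):
            total += pinball_loss(y, p, q)
            count += 1
    return total / count


# Hypothetical event counts with a burst at the third step.
observed = [2.0, 3.0, 15.0, 4.0]
forecasts = {
    0.1: [1.0, 1.0, 2.0, 1.0],
    0.5: [2.0, 3.0, 5.0, 4.0],
    0.9: [4.0, 6.0, 12.0, 7.0],
}
score = mean_pinball_loss(observed, forecasts)
```

For a high quantile such as 0.9, missing a burst from below costs nine times as much as overshooting by the same amount, which is why upper quantiles tend to track bursty periods more closely than the median.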

[LG-51] Lower Bounds on the Size of Markov Equivalence Classes

Link: https://arxiv.org/abs/2506.20933
Authors: Erik Jahn, Frederick Eberhardt, Leonard J. Schulman
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Causal discovery algorithms typically recover causal graphs only up to their Markov equivalence classes unless additional parametric assumptions are made. The sizes of these equivalence classes reflect the limits of what can be learned about the underlying causal graph from purely observational data. Under the assumptions of acyclicity, causal sufficiency, and a uniform model prior, Markov equivalence classes are known to be small on average. In this paper, we show that this is no longer the case when any of these assumptions is relaxed. Specifically, we prove exponentially large lower bounds for the expected size of Markov equivalence classes in three settings: sparse random directed acyclic graphs, uniformly random acyclic directed mixed graphs, and uniformly random directed cyclic graphs.

[LG-52] Quantum Reinforcement Learning Trading Agent for Sector Rotation in the Taiwan Stock Market

Link: https://arxiv.org/abs/2506.20930
Authors: Chi-Sheng Chen, Xinyu Zhang, Ya-Chuan Chen
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract:We propose a hybrid quantum-classical reinforcement learning framework for sector rotation in the Taiwan stock market. Our system employs Proximal Policy Optimization (PPO) as the backbone algorithm and integrates both classical architectures (LSTM, Transformer) and quantum-enhanced models (QNN, QRWKV, QASA) as policy and value networks. An automated feature engineering pipeline extracts financial indicators from capital share data to ensure consistent model input across all configurations. Empirical backtesting reveals a key finding: although quantum-enhanced models consistently achieve higher training rewards, they underperform classical models in real-world investment metrics such as cumulative return and Sharpe ratio. This discrepancy highlights a core challenge in applying reinforcement learning to financial domains – namely, the mismatch between proxy reward signals and true investment objectives. Our analysis suggests that current reward designs may incentivize overfitting to short-term volatility rather than optimizing risk-adjusted returns. This issue is compounded by the inherent expressiveness and optimization instability of quantum circuits under Noisy Intermediate-Scale Quantum (NISQ) constraints. We discuss the implications of this reward-performance gap and propose directions for future improvement, including reward shaping, model regularization, and validation-based early stopping. Our work offers a reproducible benchmark and critical insights into the practical challenges of deploying quantum reinforcement learning in real-world finance.

[LG-53] Active Learning for Manifold Gaussian Process Regression

Link: https://arxiv.org/abs/2506.20928
Authors: Yuanxing Cheng, Lulu Kang, Yiwei Wang, Chun Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures

Abstract:This paper introduces an active learning framework for manifold Gaussian Process (GP) regression, combining manifold learning with strategic data selection to improve accuracy in high-dimensional spaces. Our method jointly optimizes a neural network for dimensionality reduction and a Gaussian process regressor in the latent space, supervised by an active learning criterion that minimizes global prediction error. Experiments on synthetic data demonstrate superior performance over randomly sequential learning. The framework efficiently handles complex, discontinuous functions while preserving computational tractability, offering practical value for scientific and engineering applications. Future work will focus on scalability and uncertainty-aware manifold learning.

[LG-54] Faster Fixed-Point Methods for Multichain MDPs

Link: https://arxiv.org/abs/2506.20910
Authors: Matthew Zurek, Yudong Chen
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study value-iteration (VI) algorithms for solving general (a.k.a. multichain) Markov decision processes (MDPs) under the average-reward criterion, a fundamental but theoretically challenging setting. Beyond the difficulties inherent to all average-reward problems posed by the lack of contractivity and non-uniqueness of solutions to the Bellman operator, in the multichain setting an optimal policy must solve the navigation subproblem of steering towards the best connected component, in addition to optimizing long-run performance within each component. We develop algorithms which better solve this navigational subproblem in order to achieve faster convergence for multichain MDPs, obtaining improved rates of convergence and sharper measures of complexity relative to prior work. Many key components of our results are of potential independent interest, including novel connections between average-reward and discounted problems, optimal fixed-point methods for discounted VI which extend to general Banach spaces, new sublinear convergence rates for the discounted value error, and refined suboptimality decompositions for multichain MDPs. Overall our results yield faster convergence rates for discounted and average-reward problems and expand the theoretical foundations of VI approaches.

[LG-55] Uncertainty-Aware Machine-Learning Framework for Predicting Dislocation Plasticity and Stress-Strain Response in FCC Alloys

Link: https://arxiv.org/abs/2506.20839
Authors: Jing Luo, Yejun Gu, Yanfei Wang, Xiaolong Ma, Jaafar A. El-Awady
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Machine learning has significantly advanced the understanding and application of structural materials, with an increasing emphasis on integrating existing data and quantifying uncertainties in predictive modeling. This study presents a comprehensive methodology utilizing a mixed density network (MDN) model, trained on extensive experimental data from literature. This approach uniquely predicts the distribution of dislocation density, inferred as a latent variable, and the resulting stress distribution at the grain level. The incorporation of statistical parameters of those predicted distributions into a dislocation-mediated plasticity model allows for accurate stress-strain predictions with explicit uncertainty quantification. This strategy not only improves the accuracy and reliability of mechanical property predictions but also plays a vital role in optimizing alloy design, thereby facilitating the development of new materials in a rapidly evolving industry.

[LG-56] Efficacy of Temporal Fusion Transformers for Runoff Simulation

Link: https://arxiv.org/abs/2506.20831
Authors: Sinan Rasiya Koya, Tirthankar Roy
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Combining attention with recurrence has been shown to be valuable in sequence modeling, including hydrological predictions. Here, we explore the strength of Temporal Fusion Transformers (TFTs) over Long Short-Term Memory (LSTM) networks in rainfall-runoff modeling. We train ten randomly initialized models, TFT and LSTM, for 531 CAMELS catchments in the US. We repeat the experiment with five subsets of the Caravan dataset, each representing catchments in the US, Australia, Brazil, Great Britain, and Chile. Then, the performance of the models, their variability regarding the catchment attributes, and the difference according to the datasets are assessed. Our findings show that TFT slightly outperforms LSTM, especially in simulating the midsection and peak of hydrographs. Furthermore, we show the ability of TFT to handle longer sequences and why it can be a better candidate for higher or larger catchments. Being an explainable AI technique, TFT identifies the key dynamic and static variables, providing valuable scientific insights. However, both TFT and LSTM exhibit a considerable drop in performance with the Caravan dataset, indicating possible data quality issues. Overall, the study highlights the potential of TFT in improving hydrological modeling and understanding.

[LG-57] Structural System Identification via Validation and Adaptation

Link: https://arxiv.org/abs/2506.20799
Authors: Cristian López, Keegan J. Moore
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Estimating the governing equation parameter values is essential for integrating experimental data with scientific theory to understand, validate, and predict the dynamics of complex systems. In this work, we propose a new method for structural system identification (SI), uncertainty quantification, and validation directly from data. Inspired by generative modeling frameworks, a neural network maps random noise to physically meaningful parameters. These parameters are then used in the known equation of motion to obtain fake accelerations, which are compared to real training data via a mean square error loss. To simultaneously validate the learned parameters, we use independent validation datasets. The generated accelerations from these datasets are evaluated by a discriminator network, which determines whether the output is real or fake, and guides the parameter-generator network. Analytical and real experiments show the parameter estimation accuracy and model validation for different nonlinear structural systems.

[LG-58] Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

Link: https://arxiv.org/abs/2506.20779
Authors: Tongtong Liang, Dan Qiao, Yu-Xiang Wang, Rahul Parhi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Comments Welcome!

Abstract:We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs – a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions vis-à-vis low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.

[LG-59] Control and optimization for Neural Partial Differential Equations in Supervised Learning

Link: https://arxiv.org/abs/2506.20764
Authors: Alain Bensoussan,Minh-Binh Tran,Bangjie Wang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Although there is a substantial body of literature on control and optimization problems for parabolic and hyperbolic systems, the specific problem of controlling and optimizing the coefficients of the associated operators within such systems has not yet been thoroughly explored. In this work, we aim to initiate a line of research in control theory focused on optimizing and controlling the coefficients of these operators, a problem that naturally arises in the context of neural networks and supervised learning. In supervised learning, the primary objective is to transport initial data toward target data through the layers of a neural network. We propose a novel perspective: neural networks can be interpreted as partial differential equations (PDEs). From this viewpoint, the control problem traditionally studied in the context of ordinary differential equations (ODEs) is reformulated as a control problem for PDEs, specifically targeting the optimization and control of coefficients in parabolic and hyperbolic operators. To the best of our knowledge, this specific problem has not yet been systematically addressed in the control theory of PDEs. To this end, we propose a dual system formulation for the control and optimization problem associated with parabolic PDEs, laying the groundwork for the development of efficient numerical schemes in future research. We also provide a theoretical proof showing that the control and optimization problem for parabolic PDEs admits minimizers. Finally, we investigate the control problem associated with hyperbolic PDEs and prove the existence of solutions for a corresponding approximated control problem.

[LG-60] scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection

Link: https://arxiv.org/abs/2506.20697
Authors: Zhen Yuan,Shaoqing Jiao,Yihang Xiao,Jiajie Peng
Categories: Cell Behavior (q-bio.CB); Machine Learning (cs.LG)

Abstract:The advent of single-cell multi-omics technologies has enabled the simultaneous profiling of diverse omics layers within individual cells. Integrating such multimodal data provides unprecedented insights into cellular identity, regulatory processes, and disease mechanisms. However, it remains challenging, as current methods often rely on selecting highly variable genes or peaks during preprocessing, which may inadvertently discard crucial biological information. Here, we present scMamba, a foundation model designed to integrate single-cell multi-omics data without the need for prior feature selection while preserving genomic positional information. scMamba introduces a patch-based cell tokenization strategy that treats genomics regions as words (tokens) and cells as sentences. Building upon the concept of state space duality, scMamba distills rich biological insights from high-dimensional, sparse single-cell multi-omics data. Additionally, our novel contrastive learning approach, enhanced with cosine similarity regularization, enables superior alignment across omics layers compared to traditional methods. Systematic benchmarking across multiple datasets demonstrates that scMamba significantly outperforms state-of-the-art methods in preserving biological variation, aligning omics layers, and enhancing key downstream tasks such as clustering, cell type annotation, and trajectory inference. Our findings position scMamba as a powerful tool for large-scale single-cell multi-omics integration, capable of handling large-scale atlases and advancing biological discovery.
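
The cross-omics alignment mentioned in this abstract can be illustrated with a generic contrastive (InfoNCE-style) loss over cosine similarities. This is a sketch under stated assumptions, not scMamba's actual objective. The abstract does not specify the form of the loss or of the cosine similarity regularization. The names `rna` and `atac`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(rna, atac, temperature=0.1):
    """Average cross-entropy of matching cell i in one modality to cell i in the other.

    rna[i] and atac[i] are assumed to be embeddings of the same cell.
    """
    n = len(rna)
    loss = 0.0
    for i in range(n):
        logits = [cosine(rna[i], atac[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n

# Toy embeddings: two cells, two dimensions.
rna = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(rna, [[1.0, 0.0], [0.0, 1.0]])     # same cells line up
misaligned = info_nce(rna, [[0.0, 1.0], [1.0, 0.0]])  # cells swapped
```

The loss is small when each cell's embeddings agree across modalities and large when they are swapped, which is the behaviour a contrastive alignment objective is meant to reward.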

[LG-61] MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models

Link: https://arxiv.org/abs/2506.20686
Authors: Hoa La,Ahan Gupta,Alex Morehead,Jianlin Cheng,Minjia Zhang
Categories: Biomolecules (q-bio.BM); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 13 pages, 12 figures

Abstract:Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science-informed architectural changes to the transformer architecture. However, these advances come at a steep system cost, introducing: compute- and memory-intensive operators, 2D attention mechanisms, and retrieval-augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and deep fusion for common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by up to 1.23× and improves per-iteration training time by up to 1.73× and 1.62×, respectively. More importantly, MegaFold enables training on 1.35× longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models. We open source our code at this https URL.

[LG-62] The final solution of the Hitchhiker's problem #5

Link: https://arxiv.org/abs/2506.20672
Authors: Matjaž Omladič,Martin Vuk,Aljaž Zalar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: 20 pages

Abstract:A recent survey, nicknamed “Hitchhiker’s Guide”, J.J. Arias-García, R. Mesiar, and B. De Baets, A hitchhiker’s guide to quasi-copulas, Fuzzy Sets and Systems 393 (2020) 1-28, has raised the rating of quasi-copula problems in the dependence modeling community in spite of the lack of statistical interpretation of quasi-copulas. In our previous work (arXiv:2410.19339, accepted in Fuzzy Sets and Systems), we addressed the question of extreme values of the mass distribution associated with multivariate quasi-copulas. Using a linear programming approach, we were able to solve Open Problem 5 of the “Guide” up to dimension d = 17 and disprove a recent conjecture on the solution to that problem. In this paper, we use an analytical approach to provide a complete answer to the original question.

Information Retrieval

[IR-0] PeakNetFP: Peak-based Neural Audio Fingerprinting Robust to Extreme Time Stretching

Link: https://arxiv.org/abs/2506.21086
Authors: Guillem Cortès-Sebastià,Benjamin Martin,Emilio Molina,Xavier Serra,Romain Hennequin
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025

Abstract:This work introduces PeakNetFP, the first neural audio fingerprinting (AFP) system designed specifically around spectral peaks. This novel system is designed to leverage the sparse spectral coordinates typically computed by traditional peak-based AFP methods. PeakNetFP performs hierarchical point feature extraction techniques similar to the computer vision model PointNet++, and is trained using contrastive learning like in the state-of-the-art deep learning AFP, NeuralFP. This combination allows PeakNetFP to outperform conventional AFP systems and achieves comparable performance to NeuralFP when handling challenging time-stretched audio data. In extensive evaluation, PeakNetFP maintains a Top-1 hit rate of over 90% for stretching factors ranging from 50% to 200%. Moreover, PeakNetFP offers significant efficiency advantages: compared to NeuralFP, it has 100 times fewer parameters and uses 11 times smaller input data. These features make PeakNetFP a lightweight and efficient solution for AFP tasks where time stretching is involved. Overall, this system represents a promising direction for future AFP technologies, as it successfully merges the lightweight nature of peak-based AFP with the adaptability and pattern recognition capabilities of neural network-based approaches, paving the way for more scalable and efficient solutions in the field.

[IR-1] RecCoT: Enhancing Recommendation via Chain-of-Thought

Link: https://arxiv.org/abs/2506.21032
Authors: Shuo Yang,Jiangxia Cao,Haipeng Li,Yuqi Mao,Shuchao Pang
Categories: Information Retrieval (cs.IR)
Comments: Work in progress

Abstract:In real-world applications, users always interact with items in multiple aspects, such as through implicit binary feedback (e.g., clicks, dislikes, long views) and explicit feedback (e.g., comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from these implicit feedback signals as a large-scale binary data-streaming, subsequently recommending other highly similar items based on users’ personalized historical interactions. However, from this collaborative-connection perspective, the RecSys does not focus on the actual content of the items themselves but instead prioritizes higher-probability signals of behavioral co-occurrence among items. Consequently, under this binary learning paradigm, the RecSys struggles to understand why a user likes or dislikes certain items. To alleviate it, some works attempt to utilize the content-based reviews to capture the semantic knowledge to enhance recommender models. However, most of these methods focus on predicting the ratings of reviews, but do not provide a human-understandable explanation.

[IR-2] Response Quality Assessment for Retrieval-Augmented Generation via Conditional Conformal Factuality SIGIR2025

Link: https://arxiv.org/abs/2506.20978
Authors: Naihe Feng,Yi Sui,Shiyi Hou,Jesse C. Cresswell,Ga Wu
Categories: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025 short paper, 5 pages, Code is available at this https URL

Abstract:Existing research on Retrieval-Augmented Generation (RAG) primarily focuses on improving overall question-answering accuracy, often overlooking the quality of sub-claims within generated responses. Recent methods that attempt to improve RAG trustworthiness, such as through auto-evaluation metrics, lack probabilistic guarantees or require ground truth answers. To address these limitations, we propose Conformal-RAG, a novel framework inspired by recent applications of conformal prediction (CP) on large language models (LLMs). Conformal-RAG leverages CP and internal information from the RAG mechanism to offer statistical guarantees on response quality. It ensures group-conditional coverage spanning multiple sub-domains without requiring manual labelling of conformal sets, making it suitable for complex RAG applications. Compared to existing RAG auto-evaluation methods, Conformal-RAG offers statistical guarantees on the quality of refined sub-claims, ensuring response reliability without the need for ground truth answers. Additionally, our experiments demonstrate that by leveraging information from the RAG system, Conformal-RAG retains up to 60% more high-quality sub-claims from the response compared to direct applications of CP to LLMs, while maintaining the same reliability guarantee.
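
The statistical guarantee on retained sub-claims can be illustrated with a generic split-conformal filter calibrated separately per group. This is a sketch of the general technique, not Conformal-RAG itself. The abstract does not give the scoring function, so the per-claim scores here are hypothetical stand-ins, and the calibration set is assumed to carry correctness labels. The group name and all numbers are illustrative.

```python
import math

def example_threshold(claims):
    """Highest score among the incorrect sub-claims of one response.

    claims: list of (score, is_correct). Keeping only claims scored strictly
    above this value removes every incorrect claim of that response.
    """
    wrong = [score for score, ok in claims if not ok]
    return max(wrong) if wrong else float("-inf")

def conformal_quantile(values, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest value, or +inf if n is too small."""
    n = len(values)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(values)[k - 1] if k <= n else float("inf")

def calibrate(calibration, alpha):
    """calibration: list of (group, claims). Returns one threshold per group."""
    by_group = {}
    for group, claims in calibration:
        by_group.setdefault(group, []).append(example_threshold(claims))
    return {g: conformal_quantile(v, alpha) for g, v in by_group.items()}

def filter_claims(scored_claims, threshold):
    """Keep the sub-claims whose score is strictly above the group threshold."""
    return [claim for claim, score in scored_claims if score > threshold]

calibration = [
    ("medical", [(0.9, True), (0.2, False)]),
    ("medical", [(0.6, False), (0.7, True)]),
    ("medical", [(0.4, False)]),
]
thresholds = calibrate(calibration, alpha=0.5)
kept = filter_claims([("claim A", 0.5), ("claim B", 0.4), ("claim C", 0.1)],
                     thresholds["medical"])
```

Under exchangeability within a group, a new response in that group has all of its retained sub-claims correct with probability at least 1 - alpha. Calibrating per group is what makes the coverage group-conditional in this sketch.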

[IR-3] Metadata Enrichment of Long Text Documents using Large Language Models

Link: https://arxiv.org/abs/2506.20918
Authors: Manika Lamba,You Peng,Sophie Nikolov,Glen Layne-Worthey,J. Stephen Downie
Categories: Digital Libraries (cs.DL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)

Abstract:In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020 through a combination of manual efforts and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories by introducing additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.

[IR-4] Towards Two-Stage Counterfactual Learning to Rank SIGIR2025 ICTIR2025

Link: https://arxiv.org/abs/2506.20854
Authors: Shashank Gupta,Yiming Liao,Maarten de Rijke
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ICTIR 2025 (co-located with SIGIR 2025)

Abstract:Counterfactual learning to rank (CLTR) aims to learn a ranking policy from user interactions while correcting for the inherent biases in interaction data, such as position bias. Existing CLTR methods assume a single ranking policy that selects top-K ranking from the entire document candidate set. In real-world applications, the candidate document set is on the order of millions, making a single-stage ranking policy impractical. In order to scale to millions of documents, real-world ranking systems are designed in a two-stage fashion, with a candidate generator followed by a ranker. The existing CLTR method for a two-stage offline ranking system only considers the top-1 ranking set-up and only focuses on training the candidate generator, with the ranker fixed. A CLTR method for training both the ranker and candidate generator jointly is missing from the existing literature. In this paper, we propose a two-stage CLTR estimator that considers the interaction between the two stages and estimates the joint value of the two policies offline. In addition, we propose a novel joint optimization method to train the candidate and ranker policies, respectively. To the best of our knowledge, we are the first to propose a CLTR estimator and learning method for two-stage ranking. Experimental results on a semi-synthetic benchmark demonstrate the effectiveness of the proposed joint CLTR method over baselines.
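
As background for the position-bias correction mentioned in this abstract, the standard single-stage inverse propensity scoring (IPS) idea can be sketched as below. This is not the two-stage estimator proposed in the paper, whose form is not given in the abstract. The position-based examination model 1 / rank^eta and the toy click log are assumptions made for illustration.

```python
def examination_propensity(rank, eta=1.0):
    """Assumed probability that a user examines the item shown at this rank."""
    return 1.0 / (rank ** eta)

def ips_value(logged, new_ranking, eta=1.0):
    """IPS estimate of the position-weighted relevance of new_ranking.

    logged: list of (doc, logged_rank, clicked) from the logging policy.
    Each click is reweighted by the inverse of its logged examination
    propensity, which corrects for documents shown at low ranks being
    examined less often.
    """
    new_rank = {doc: i + 1 for i, doc in enumerate(new_ranking)}
    total = 0.0
    for doc, rank, clicked in logged:
        if clicked and doc in new_rank:
            metric_weight = examination_propensity(new_rank[doc], eta)
            total += metric_weight / examination_propensity(rank, eta)
    return total

logged = [("a", 1, 1), ("b", 2, 0), ("c", 3, 1)]
value_same = ips_value(logged, ["a", "b", "c"])      # keep the logged order
value_promoted = ips_value(logged, ["c", "a", "b"])  # promote the low-rank click
```

A click at logged rank 3 counts three times as much as a click at rank 1 under this model, so promoting document "c" raises the estimated value. A two-stage estimator would additionally have to account for whether the candidate generator selects a document at all.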

[IR-5] The Next Phase of Scientific Fact-Checking: Advanced Evidence Retrieval from Complex Structured Academic Papers ICTIR’25 SIGIR

Link: https://arxiv.org/abs/2506.20844
Authors: Xingyu Deng,Xi Wang,Mark Stevenson
Categories: Information Retrieval (cs.IR)
Comments: Accepted for ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR’25)

Abstract:Scientific fact-checking aims to determine the veracity of scientific claims by retrieving and analysing evidence from research literature. The problem is inherently more complex than general fact-checking since it must accommodate the evolving nature of scientific knowledge, the structural complexity of academic literature and the challenges posed by long-form, multimodal scientific expression. However, existing approaches focus on simplified versions of the problem based on small-scale datasets consisting of abstracts rather than full papers, thereby avoiding the distinct challenges associated with processing complete documents. This paper examines the limitations of current scientific fact-checking systems and reveals the many potential features and resources that could be exploited to advance their performance. It identifies key research challenges within evidence retrieval, including (1) evidence-driven retrieval that addresses semantic limitations and topic imbalance, (2) time-aware evidence retrieval with citation tracking to mitigate outdated information, (3) structured document parsing to leverage long-range context, (4) handling complex scientific expressions, including tables, figures, and domain-specific terminology, and (5) assessing the credibility of scientific literature. Preliminary experiments were conducted to substantiate these challenges and identify potential solutions. This perspective paper aims to advance scientific fact-checking with a specialised IR system tailored for real-world applications.

[IR-6] RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation

Link: https://arxiv.org/abs/2506.20817
Authors: Ali Tourani,Fatemeh Nazary,Yashar Deldjoo
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 20 pages, 6 figures, 5 tables

Abstract:This paper addresses the challenge of developing multimodal recommender systems for the movie domain, where limited metadata (e.g., title, genre) often hinders the generation of robust recommendations. We introduce a resource that combines LLM-generated plot descriptions with trailer-derived visual embeddings in a unified pipeline supporting both Retrieval-Augmented Generation (RAG) and collaborative filtering. Central to our approach is a data augmentation step that transforms sparse metadata into richer textual signals, alongside fusion strategies (e.g., PCA, CCA) that integrate visual cues. Experimental evaluations demonstrate that CCA-based fusion significantly boosts recall compared to unimodal baselines, while an LLM-driven re-ranking step further improves NDCG, particularly in scenarios with limited textual data. By releasing this framework, we invite further exploration of multi-modal recommendation techniques tailored to cold-start, novelty-focused, and domain-specific settings. All code, data, and detailed documentation are publicly available at: this https URL
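
The two metrics named in this abstract, recall and NDCG, can be computed as in the sketch below. It assumes binary relevance and the common log2 discount. The abstract does not state the exact metric variants or cutoffs used in the paper, and the movie identifiers are hypothetical.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"movie_a"}
before = ndcg_at_k(["movie_b", "movie_a", "movie_c"], relevant, k=3)
after = ndcg_at_k(["movie_a", "movie_b", "movie_c"], relevant, k=3)  # re-ranked
```

Recall at a fixed cutoff ignores order inside the top k, so a re-ranking step that only reorders those items leaves recall unchanged while it can still raise NDCG, as the `before` and `after` values show.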
