This post contains the latest papers retrieved from Arxiv.org on 2025-08-14. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-08-14)

A total of 492 papers were added today (the counts below overlap because papers can be cross-listed in multiple categories):

  • Natural Language Processing: 60 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 174 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 149 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 146 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

[Quick Read]: This paper targets two weaknesses of current open-source image generation models: poor coverage of real-world data blind spots (e.g., fantasy scenes and multi-reference image generation) and imprecise text-image alignment. The key idea is to use GPT-4o-generated synthetic images to compensate for the limitations of real-world datasets: on one hand, synthetic images can supplement rare scenarios (such as surreal fantasy content) that occur frequently in user queries; on the other hand, they provide clean backgrounds and controllable supervision signals, reducing the background noise and text-image misalignment common in real data and thereby improving text-to-image alignment. Building on this, the authors construct Echo-4o-Image, a 180K-scale synthetic dataset, fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o, and propose two new benchmarks, GenEval++ and Imagine-Bench, for a more accurate assessment of generation capabilities. Experiments show strong transferability and generalization of the dataset across multiple models.

Link: https://arxiv.org/abs/2508.09987
Authors: Junyan Ye,Dongzhi Jiang,Zihao Wang,Leqi Zhu,Zhenghao Hu,Zilong Huang,Jun He,Zhiyuan Yan,Jinghua Yu,Hongsheng Li,Conghui He,Weijia Li
Affiliations: Shanghai Artificial Intelligence Laboratory; Sun Yat-sen University; CUHK MMLab; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 8 figures

Abstract:Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.

[NLP-1] Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks AAAI2026

[Quick Read]: This paper studies how to select a sequence of large language models (LLMs) for a multi-step task so as to balance cost and success rate. Traditional LLM selection methods choose a single model per query, whereas here a complex task is decomposed into subtasks and an LLM must be chosen for each step; the quality of each model's output directly affects the input and performance of downstream models, creating complex dependencies. The key to the solution is an online learning algorithm based on neural contextual bandits, which trains neural networks in real time to model each LLM's success probability on each subtask and adaptively guides the sequential LLM selection, optimizing overall task performance even without historical performance data.

Link: https://arxiv.org/abs/2508.09958
Authors: Baran Atalar,Eddie Zhang,Carlee Joe-Wong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to AAAI 2026

Abstract:With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM “assistants” specialized to particular types of queries. However, some tasks (i.e., queries) may be too specialized and difficult for a single LLM to handle alone. These applications often benefit from breaking down the task into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, in extracting a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select another, possibly different, LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires that we select a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask’s output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned and accounted for during selection. We propose a neural contextual bandit-based algorithm that trains neural networks that model LLM success on each subtask in an online manner, thus learning to guide the LLM selections for the different subtasks, even in the absence of historical LLM performance data. Experiments on telecommunications question answering and medical diagnosis prediction datasets illustrate the effectiveness of our proposed approach compared to other LLM selection algorithms.
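
To make the selection mechanism concrete, here is a minimal Python sketch of a neural contextual bandit over a pipeline of subtasks. It illustrates the general technique rather than the paper's implementation; the context features, reward signal, and epsilon-greedy exploration rule are illustrative assumptions.

```python
# Sketch: one small network per subtask predicts each candidate LLM's
# success probability from a context vector; an epsilon-greedy rule
# trades exploration against exploitation; parameters are updated online.
import random
import torch
import torch.nn as nn

class SuccessModel(nn.Module):
    """Predicts P(success) for every candidate LLM given a context vector."""
    def __init__(self, ctx_dim: int, n_llms: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim, 64), nn.ReLU(), nn.Linear(64, n_llms)
        )

    def forward(self, ctx):
        return torch.sigmoid(self.net(ctx))

def select_pipeline(models, contexts, eps=0.1):
    """Pick one LLM index per subtask, epsilon-greedy on predicted success."""
    choices = []
    for model, ctx in zip(models, contexts):
        probs = model(ctx)
        if random.random() < eps:
            choices.append(random.randrange(probs.numel()))  # explore
        else:
            choices.append(int(probs.argmax()))              # exploit
    return choices

def update(model, ctx, arm, reward, opt):
    """One online gradient step on the observed (context, arm, reward)."""
    pred = model(ctx)[arm]
    loss = nn.functional.binary_cross_entropy(pred, torch.tensor(float(reward)))
    opt.zero_grad(); loss.backward(); opt.step()

# Usage sketch: one (model, optimizer) pair per subtask, e.g.
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```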

[NLP-2] Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT)

[Quick Read]: This paper addresses the lack of a high-quality speech recognition system for Badini (a Kurdish dialect), filling a gap in NLP resources for the language. The key to the solution is building a speech-to-text (STT) language model from Badini children's stories: roughly 17 hours of audio recorded by six narrators from 8 books containing 78 stories are cleaned, segmented, and tokenized, after which two pretrained models, Wav2Vec2-Large-XLSR-53 and Whisper-small, are compared. Results show that Wav2Vec2-Large-XLSR-53 clearly outperforms Whisper-small in readability and accuracy (90.38% vs. 65.45% and 82.67% vs. 53.17%, respectively), making it the preferred choice for Badini STT.

Link: https://arxiv.org/abs/2508.09957
Authors: Renas Adnan,Hossein Hassani
Affiliations: University of Kurdistan Hewlêr
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 20 figures, 7 tables

Abstract:Speech-to-text (STT) systems have a wide range of applications. They are available in many languages, albeit at different quality levels. Although Kurdish is considered a less-resourced language from a processing perspective, STT is available for some of the Kurdish dialects, for instance, Sorani (Central Kurdish). However, that is not applied to other Kurdish dialects, Badini and Hawrami, for example. This research is an attempt to address this gap. Badini, approximately, has two million speakers, and STT systems can help their community use mobile and computer-based technologies while giving their dialect more global visibility. We aim to create a language model based on Badini’s speech and evaluate its performance. To cover a conversational aspect, have a proper confidence level of grammatical accuracy, and ready transcriptions, we chose Badini kids’ stories, eight books including 78 stories, as the textual input. Six narrators narrated the books, which resulted in approximately 17 hours of recording. We cleaned, segmented, and tokenized the input. The preprocessing produced nearly 15 hours of speech, including 19193 segments and 25221 words. We used Wav2Vec2-Large-XLSR-53 and Whisper-small to develop the language models. The experiments indicate that the transcription process based on the Wav2Vec2-Large-XLSR-53 model provides a significantly more accurate and readable output than the Whisper-small model, with 90.38% and 65.45% readability, and 82.67% and 53.17% accuracy, respectively.
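
The comparison above rests on word error rate (WER), the standard Levenshtein edit distance over words. Below is a minimal self-contained sketch of the metric; the example strings are invented for illustration.

```python
# WER = (substitutions + insertions + deletions) / reference word count,
# computed by dynamic programming over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```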

[NLP-3] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

[Quick Read]: This paper asks how to balance accuracy and cost-efficiency when applying large language models (LLMs) to complex medical question answering. For GPT-5, the latest generation of reasoning models, the study systematically evaluates 12 configurations (three model tiers by four reasoning-effort settings), introduces a reference-anchored LLM-as-a-judge framework to grade the quality of generated rationales, and analyzes accuracy-cost trade-offs via token-level cost estimates. The key findings are that GPT-5-high significantly outperforms comparison models such as o1-high and GPT-4o in both accuracy and rationale quality, and that GPT-5-mini-low sits on the Pareto frontier as the most cost-effective low-resource configuration, providing quantitative performance-cost guidance for deploying LLMs in clinical settings.

Link: https://arxiv.org/abs/2508.09956
Authors: Fares Antaki,David Mikhail,Daniel Milad,Danny A Mammo,Sumit Sharma,Sunil K Srivastava,Bing Yu Chen,Samir Touma,Mertcan Sevgi,Jonathan El-Khoury,Pearse A Keane,Qingyu Chen,Yih Chung Tham,Renaud Duval
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI’s GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
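
The head-to-head ranking uses a Bradley-Terry model, which fits a latent strength per model from pairwise win counts. Below is a minimal sketch using the classic Zermelo/Hunter MM update; the toy win matrix is invented for illustration.

```python
# Bradley-Terry: P(i beats j) = s_i / (s_i + s_j); strengths s are fitted
# by the MM iteration s_i <- W_i / sum_j n_ij / (s_i + s_j).
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j."""
    n = wins.shape[0]
    strength = np.ones(n)
    for _ in range(iters):
        new = np.empty(n)
        for i in range(n):
            num = wins[i].sum()                      # total wins of model i
            den = sum((wins[i, j] + wins[j, i]) / (strength[i] + strength[j])
                      for j in range(n) if j != i)   # comparisons, strength-weighted
            new[i] = num / den
        strength = new / new.sum()                   # normalize for identifiability
    return strength

# Toy example: 3 models, model 0 wins most often.
wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]])
print(bradley_terry(wins))  # higher value = stronger model
```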

[NLP-4] Shaping Event Backstories to Estimate Potential Emotion Contexts

[Quick Read]: This paper tackles annotation inconsistency in emotion analysis caused by missing contextual information. Prior work mostly attributes disagreement to annotator characteristics, overlooking the ambiguity that arises when event descriptions lack sufficient background. The key to the solution is to automatically generate multiple event chains conditioned on different emotions, injecting plausible context into a target event description to produce coherent contextual narratives. This not only helps human annotators interpret specific emotions more consistently, but also yields a specialized, high-quality dataset for the first comprehensive and systematic study of contextualized emotion analysis.

Link: https://arxiv.org/abs/2508.09954
Authors: Johannes Schäfer,Roman Klinger
Affiliations: University of Bamberg
Subjects: Computation and Language (cs.CL)
Comments: May 2025 version

Abstract:Emotion analysis is an inherently ambiguous task. Previous work studied annotator properties to explain disagreement, but this overlooks the possibility that ambiguity may stem from missing information about the context of events. In this paper, we propose a novel approach that adds reasonable contexts to event descriptions, which may better explain a particular situation. Our goal is to understand whether these enriched contexts enable human annotators to annotate emotions more reliably. We disambiguate a target event description by automatically generating multiple event chains conditioned on differing emotions. By combining techniques from short story generation in various settings, we achieve coherent narratives that result in a specialized dataset for the first comprehensive and systematic examination of contextualized emotion analysis. Through automatic and human evaluation, we find that contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations.

[NLP-5] Specialised or Generic? Tokenization Choices for Radiology Language Models MICCAI2025

[Quick Read]: This paper examines an open question in radiology report summarisation: how a language model's (LM's) vocabulary, defined by its tokenizer, affects generation quality. The key to the solution is a systematic comparison of general, medical, and domain-specific tokenizers across three imaging modalities, with and without pre-training. The findings: medical and domain-specific vocabularies clearly outperform widely used natural-language tokenizers when models are trained from scratch; pre-training partially mitigates the differences between tokenizers; and domain-specific tokenizers achieve the best results in both performance and computational efficiency (smaller vocabularies and shorter sequences), improving the practicality and deployability of such models in clinical settings.

Link: https://arxiv.org/abs/2508.09952
Authors: Hermione Warr,Wentian Xu,Harry Anthony,Yasin Ibrahim,Daniel McGowan,Konstantinos Kamnitsas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ELAMI@MICCAI2025

Abstract:The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings.
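
The sequence-length effect is easy to observe directly. Below is a minimal sketch comparing how a general-purpose and a biomedical tokenizer split the same clinical sentence; the checkpoint names are illustrative choices, not necessarily the ones used in the paper.

```python
# Compare tokenized sequence lengths: domain vocabularies keep clinical
# terms in fewer pieces, shortening sequences and reducing memory needs.
from transformers import AutoTokenizer

text = "Bilateral pleural effusions with bibasilar atelectasis."

general = AutoTokenizer.from_pretrained("gpt2")
medical = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

for name, tok in [("general", general), ("medical", medical)]:
    ids = tok.encode(text)
    print(f"{name}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")
```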

[NLP-6] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

[Quick Read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to generate code from combined visual and textual inputs. The key to the solution is the VisCodex framework, which uses task-vector-based model merging to integrate a state-of-the-art coding language model into a strong vision-language backbone, significantly strengthening multimodal code generation while preserving visual understanding. The authors also build the large-scale, diverse Multimodal Coding Dataset (MCD) and the InfiBench-V benchmark, designed to assess programming comprehension on visually rich questions, to support training and evaluation. VisCodex reaches state-of-the-art performance among open-source MLLMs and approaches closed-source commercial models such as GPT-4o.

Link: https://arxiv.org/abs/2508.09945
Authors: Lingjie Jiang,Shaohan Huang,Xun Wu,Yixia Li,Dongdong Zhang,Furu Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
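
Task-vector merging ("task arithmetic") is the general technique the framework builds on: the coding model's parameter delta from a shared base is added, scaled, into the target backbone. A minimal sketch follows; state-dict keys are assumed to line up, and the checkpoint paths are placeholders, not the paper's artifacts.

```python
# Merge rule per parameter: target + lam * (expert - base).
import torch

def merge_task_vector(base_sd, expert_sd, target_sd, lam=0.5):
    """Apply the expert's task vector (expert - base) to the target model."""
    merged = {}
    for k, w in target_sd.items():
        if k in base_sd and k in expert_sd:
            merged[k] = w + lam * (expert_sd[k] - base_sd[k])
        else:
            merged[k] = w  # parameters unique to the target stay untouched
    return merged

# Usage sketch (paths are placeholders):
# base = torch.load("base_llm.pt"); coder = torch.load("coding_llm.pt")
# vlm = torch.load("vision_language_backbone.pt")
# vlm_merged = merge_task_vector(base, coder, vlm, lam=0.5)
```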

[NLP-7] A Comprehensive Evaluation framework of Alignment Techniques for LLMs

[Quick Read]: This paper addresses the lack of a unified evaluation framework for alignment methods in large language models (LLMs), which makes it hard to systematically compare paradigms such as RLHF-based fine-tuning, instruction tuning, post-hoc correction systems, and inference-time interventions, or to support deployment decisions. The key to the solution is a four-dimensional evaluation framework that quantifies mainstream alignment techniques along alignment detection, alignment quality, computational efficiency, and robustness, revealing the strengths and limitations of current state-of-the-art models on each dimension and providing actionable guidance for future research.

Link: https://arxiv.org/abs/2508.09937
Authors: Muneeza Azmat,Momin Abbas,Maysa Malfiza Garcia de Macedo,Marcelo Carpinette Grave,Luan Soares de Souza,Tiago Machado,Rogerio A de Paula,Raya Horesh,Yixin Chen,Heloisa Caroline de Souza Pereira Candello,Rebecka Nordenlow,Aminat Adebiyi
Affiliations: IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In submission

Abstract:As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces a multi-dimensional evaluation of alignment techniques for LLMs, a comprehensive evaluation framework that provides a systematic comparison across all major alignment paradigms. Our framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments across diverse base models and alignment strategies, we demonstrate the utility of our framework in identifying strengths and limitations of current state-of-the-art models, providing valuable insights for future research directions.

[NLP-8] Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach

[Quick Read]: This paper addresses the growing subtlety of deceptive language (misrepresentation) in digitised business communication, notably in financial reporting, sustainability discourse, and digital marketing, where classical theories of rhetoric and communication psychology map poorly onto practical detection of deception in text. The key to the solution is combining classical rhetoric, communication psychology, and linguistic theory with computational textual analysis and personalised transformer models to build an automatic detection system based on a persuasive lexicon, achieving over 99% detection accuracy in controlled settings. This offers a scalable technical path for safeguarding the authenticity of human interaction in increasingly AI-mediated contexts.

Link: https://arxiv.org/abs/2508.09935
Authors: Sayem Hossen,Monalisa Moon Joti,Md. Golam Rashed
Affiliations: Sonargaon University; University of Rajshahi
Subjects: Computation and Language (cs.CL); Computational Finance (q-fin.CP); General Finance (q-fin.GN)
Comments: 21

Abstract:Business communication digitisation has reorganised the process of persuasive discourse, which allows not only greater transparency but also advanced deception. This inquiry synthesises classical rhetoric and communication psychology with linguistic theory and empirical studies in the financial reporting, sustainability discourse, and digital marketing to explain how deceptive language can be systematically detected using persuasive lexicon. In controlled settings, detection accuracies of greater than 99% were achieved by using computational textual analysis as well as personalised transformer models. However, reproducing this performance in multilingual settings is also problematic and, to a large extent, this is because it is not easy to find sufficient data, and because few multilingual text-processing infrastructures are in place. This evidence shows that there has been an increasing gap between the theoretical representations of communication and those empirically approximated, and therefore, there is a need to have strong automatic text-identification systems where AI-based discourse is becoming more realistic in communicating with humans.

[NLP-9] COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets ICCV2025

[Quick Read]: This paper addresses the poor generalization of ultrasound (US) image analysis models across data distributions, especially on small-sample or unseen data. Conventional single-dataset training struggles to adapt to multiple heterogeneous US datasets, and existing approaches such as single source-specific decoders or domain adaptation strategies degrade noticeably when transferred across domains. The key to the solution is the Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME): dual structure-semantic shared experts build a universal representation space and collaborate with source-specific experts to extract discriminative features, preserving dataset-specific information while effectively mitigating inter-dataset interference and achieving robust generalization over heterogeneous US data.

Link: https://arxiv.org/abs/2508.09886
Authors: Lingyu Chen,Yawen Zeng,Yue Wang,Peng Wan,Guo-chen Ning,Hongen Liao,Daoqiang Zhang,Fang Chen
Affiliations: Nanjing University of Aeronautics and Astronautics; ByteDance Inc.; Tsinghua University; Shanghai Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICCV 2025

Abstract:Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream task? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experienced a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features through providing complementary features. This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME’s superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: this https URL.

[NLP-10] A Survey of Cognitive Distortion Detection and Classification in NLP ACL EMNLP2025

[Quick Read]: This paper addresses the fragmentation of research on automatic detection and classification of cognitive distortions (CDs) in NLP for mental health: inconsistent CD taxonomies, vague task formulations, and varying evaluation practices hinder comparability and reproducibility. The key to the solution is a systematic review of 38 studies from the past two decades, yielding a consolidated CD taxonomy reference, a summary of common task setups and modelling strategies, and a set of open challenges, providing structured, standardized guidance for more coherent and reproducible research in this emerging area.

Link: https://arxiv.org/abs/2508.09878
Authors: Archie Sage,Jeroen Keppens,Helen Yannakoudakis
Affiliations: King’s College London
Subjects: Computation and Language (cs.CL)
Comments: Under review via ACL Rolling Review and committed to EMNLP 2025. Camera-ready updates to follow

Abstract:As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of datasets, modelling approaches, and evaluation strategies. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area.

[NLP-11] Memory Decoder: A Pretrained Plug-and-Play Memory for Large Language Models

[Quick Read]: This paper addresses two problems in domain adaptation of large language models (LLMs): traditional Domain Adaptive Pretraining (DAPT) requires costly full-parameter fine-tuning and risks catastrophic forgetting, while Retrieval-Augmented Generation (RAG) adds substantial inference latency from expensive nearest-neighbor search and longer contexts. The key to the solution is Memory Decoder, a plug-and-play pretrained memory: a small transformer decoder learns to imitate the behavior of an external non-parametric retriever and can then be paired with any pretrained language model sharing the same tokenizer, enabling efficient domain adaptation without modifying the original model's parameters while markedly reducing inference overhead and improving domain performance.

Link: https://arxiv.org/abs/2508.09874
Authors: Jiaqi Cao,Jiarui Wang,Rubin Wei,Qipeng Guo,Kai Chen,Bowen Zhou,Zhouhan Lin
Affiliations: LUMIA Lab, Shanghai Jiao Tong University; Shanghai AI Laboratory; Department of Electronic Engineering, Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current method like Domain Adaptive Pretraining (DAPT) requires costly full-parameter training and suffers from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
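
The abstract does not specify how the memory component's output is combined with the base LM; a kNN-LM-style interpolation of next-token distributions is one natural reading. Below is a minimal sketch under that assumption, with an HF-style causal-LM interface (`.logits`) assumed for both models.

```python
# Assumption: the plug-in memory and the frozen base LM share a tokenizer,
# so their next-token distributions can be interpolated directly.
import torch

@torch.no_grad()
def combined_next_token_probs(base_lm, memory_decoder, input_ids, lam=0.3):
    """Interpolate next-token distributions; `lam` is an illustrative weight."""
    p_base = torch.softmax(base_lm(input_ids).logits[:, -1, :], dim=-1)
    p_mem = torch.softmax(memory_decoder(input_ids).logits[:, -1, :], dim=-1)
    return (1 - lam) * p_base + lam * p_mem  # still a valid distribution

# Greedy decoding step under the combined distribution:
# next_id = combined_next_token_probs(base, memory, ids).argmax(dim=-1)
```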

[NLP-12] Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription

[Quick Read]: This paper evaluates automatic speech recognition (ASR) for Urdu in low-resource settings, where dialectal diversity, code-switching, and sparse training data make the language hard to model despite its more than 230 million speakers. The key to the solution is assessing the direct applicability of lightweight Whisper models (Tiny, Base, Small) without fine-tuning, benchmarked by word error rate (WER). Whisper-Small performs best without fine-tuning (WER = 33.68%), supporting its potential as a low-cost deployment option, while notable challenges remain in phonetic accuracy and lexical coherence, laying the groundwork for future research on efficient, low-resource ASR systems.

Link: https://arxiv.org/abs/2508.09865
Authors: Abdul Rehman Antall,Naveed Akhtar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 1 table, including references and appendix

Abstract:This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Whisper-Small achieves the lowest error rates (33.68% WER), outperforming Tiny (67.08% WER) and Base (53.67% WER). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems.

[NLP-13] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

[Quick Read]: This paper targets the shortcomings of large language models (LLMs) in long-context understanding and deep reasoning, particularly in integrating and judging the consistency of information that is only indirectly related. The key to the solution is the PRELUDE benchmark: given a character's prequel story, the model must decide whether it is consistent with the canonical narrative of the original book, forcing cross-passage retrieval, linking, and reasoning; 88% of instances require combining evidence from multiple parts of the narrative. Experiments show that mainstream approaches (in-context learning, retrieval-augmented generation (RAG), in-domain fine-tuning, and commercial DeepResearch services) trail humans by 15% in accuracy, and models often reach correct answers through flawed reasoning, leaving a gap of over 30% in reasoning accuracy, underscoring that long-context reasoning remains a core unsolved challenge for LLMs.

Link: https://arxiv.org/abs/2508.09848
Authors: Mo Yu,Tsz Ting Chung,Chulun Zhou,Tong Li,Rui Lu,Jiangnan Li,Liyan Xu,Haoshu Lu,Ning Zhang,Jing Li,Jie Zhou
Affiliations: WeChat AI; Tencent; HKUST; CUHK; NJIT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: First 7 authors contributed equally. Project page: this https URL

Abstract:We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks – as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

[NLP-14] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

[Quick Read]: This survey addresses the efficiency bottleneck that the traditional Transformer architecture's computational complexity imposes on training and deploying large language models (LLMs). Its key contribution is a systematic review of efficient architecture innovations: linear and sparse sequence modeling methods, efficient full-attention variants, sparse Mixture-of-Experts, hybrid architectures combining these techniques, and emerging diffusion LLMs, which cut computational cost while preserving performance and push toward scalable, resource-aware foundation models.

Link: https://arxiv.org/abs/2508.09834
Authors: Weigao Sun,Jiaxi Hu,Yucheng Zhou,Jusen Du,Disen Lan,Kexin Wang,Tong Zhu,Xiaoye Qu,Yu Zhang,Xiaoyu Mo,Daizong Liu,Yuxuan Liang,Wenliang Chen,Guoqi Li,Yu Cheng
Affiliations: Shanghai AI Laboratory; HKUST (GZ); University of Macau; Institute of Automation, Chinese Academy of Sciences; Soochow University; KTH Royal Institute of Technology; Peking University; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Survey, 82 pages, GitHub: this https URL

Abstract:Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost the efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above category, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

[NLP-15] A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

[Quick Read]: This paper addresses how the scarcity of high-quality, structured, accessible datasets hampers the reproducibility, generalizability, and fairness of clinical mental health AI systems. The key to the solution is the first systematic survey of clinical mental health datasets used to train AI clinical assistants, organized by mental disorder, data modality, task type, accessibility, and sociocultural context, together with an assessment of synthetic datasets. By identifying major gaps, including missing longitudinal data, limited cultural and linguistic representation, inconsistent annotation standards, and the narrow modality range of synthetic data, the paper offers concrete recommendations, such as standardizing collection and annotation pipelines, broadening cross-cultural coverage, and promoting open sharing, to support the development of more robust, generalizable, and equitable mental health AI systems.

Link: https://arxiv.org/abs/2508.09809
Authors: Aishik Mandal,Prottay Kumar Adhikary,Hiba Arnaout,Iryna Gurevych,Tanmoy Chakraborty
Affiliations: Technische Universität Darmstadt; Indian Institute of Technology Delhi; National Research Center for Applied Cybersecurity ATHENE
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures

Abstract:Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.

[NLP-16] BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

[Quick Read]: This paper addresses the weak chart comprehension of current vision-language models (VLMs), rooted in training data that lack diversity and real-world authenticity and in supervised fine-tuning on low-quality, automatically extracted data tables. The key lies in two parts. First, the BigCharts pipeline conditions chart rendering on real-world charts collected from multiple online platforms and uses a replotting process to retain accurate underlying data, producing training data that are both visually diverse and authentic. Second, a training framework combines supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning and novel reward signals designed for chart reasoning, improving robustness and generalization across chart styles and domains and yielding the state-of-the-art chart reasoning model BigCharts-R1.

Link: https://arxiv.org/abs/2508.09804
Authors: Ahmed Masry,Abhay Puri,Masoud Hashemi,Juan A. Rodriguez,Megh Thakkar,Khyati Mahajan,Vikas Yadav,Sathwik Tejaswi Madhusudhan,Alexandre Piché,Dzmitry Bahdanau,Christopher Pal,David Vazquez,Enamul Hoque,Perouz Taslakian,Sai Rajeswar,Spandana Gella
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.

[NLP-17] Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges AAAI

[Quick Read]: This paper fills a gap in explainable natural language processing (XNLP): the lack of a systematic understanding of practitioners' motivations for adopting explanation methods, their technique choices, satisfaction, and the challenges they face in real applications. The key to the solution is a qualitative interview study that combines the perspectives of industry practitioners and academic researchers and systematically compares their experiences with XNLP methods. The findings reveal conceptual gaps, low satisfaction with the practicality and effectiveness of current explainability methods, and evaluation challenges, and they stress the need for clear definitions and user-centric evaluation frameworks to drive effective adoption of explainable NLP in practice.

Link: https://arxiv.org/abs/2508.09786
Authors: Mahdi Dhaini,Tobias Müller,Roksoliana Rabets,Gjergji Kasneci
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)

Abstract:The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners’ perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.

[NLP-18] Can LLM -Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study ICANN2025

[Quick Read]: This paper addresses the high cost and poor scalability of producing textual explanations for explainable NLP, where traditional human annotation cannot keep up with the need to build large datasets and improve model performance. The key to the solution is an automated framework that uses multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations, evaluates their quality with a comprehensive suite of Natural Language Generation (NLG) metrics, and verifies their downstream effect on the performance of pre-trained language models (PLMs) and LLMs on natural language inference tasks. Experiments show that automatically generated explanations are highly competitive with human-annotated ones for improving model performance, pointing to an efficient, scalable route for extending NLP datasets and enhancing models.

Link: https://arxiv.org/abs/2508.09776
Authors: Mahdi Dhaini,Juraj Vladika,Ege Erdogan,Zineb Attaoui,Gjergji Kasneci
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 34th International Conference on Artificial Neural Networks (ICANN 2025)

Abstract:In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance.

[NLP-19] UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech

[Quick Read]: This paper addresses insufficient pronunciation control for a target language in multilingual text-to-speech (TTS), particularly when the system has no explicit grapheme-to-phoneme (G2P) module and consumes minimally encoded text (e.g., byte-pair encoding), making segmental pronunciation and pitch accent hard to control for the target language (Japanese in this paper) without hurting naturalness and speaker similarity in other languages. The key to the solution is UtterTune, a lightweight adaptation method based on low-rank adaptation that enables phoneme-level control of segmental pronunciation and pitch accent with few added parameters, validated by objective and subjective evaluations in a zero-shot setting.

Link: https://arxiv.org/abs/2508.09767
Authors: Shuhei Kato
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
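
Low-rank adaptation (LoRA) is the general technique UtterTune builds on: a frozen weight W is augmented with a trainable low-rank update, so only r*(d_in + d_out) parameters are learned per adapted layer. A minimal sketch of the idea, not UtterTune's specific placement of adapters:

```python
# LoRA: y = W x + (alpha / r) * B A x, with W frozen and A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))  # same shape as the base layer's output
```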

[NLP-20] Echoes of Agreement: Argument Driven Opinion Shifts in Large Language Models

[Quick Read]: This paper examines how supporting or refuting arguments contained in a prompt affect large language models' stated positions on political topics, a sensitivity that existing bias evaluations largely ignore even though it directly affects the robustness of bias measurement and our understanding of model behaviour. The key to the solution is a set of experiments that introduce persuasive supporting or refuting arguments in both single-turn and multi-turn settings and systematically measure shifts in the direction of model responses. Results show that such arguments substantially move model stances toward the provided argument, with stronger arguments producing higher directional agreement, revealing a sycophantic tendency in which models adapt their stance to match the presented arguments, with downstream implications for measuring political bias and designing effective mitigation strategies.

Link: https://arxiv.org/abs/2508.09759
Authors: Avneet Kaur
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:There have been numerous studies evaluating bias of LLMs towards political topics. However, the positions towards these topics expressed in model outputs are highly sensitive to the prompt. What happens when the prompt itself is suggestive of certain arguments towards those positions remains underexplored. This is crucial for understanding how robust these bias evaluations are and for understanding model behaviour, as these models frequently interact with opinionated text. To that end, we conduct experiments for political bias evaluation in presence of supporting and refuting arguments. Our experiments show that such arguments substantially alter model responses towards the direction of the provided argument in both single-turn and multi-turn settings. Moreover, we find that the strength of these arguments influences the directional agreement rate of model responses. These effects point to a sycophantic tendency in LLMs adapting their stance to align with the presented arguments which has downstream implications for measuring political bias and developing effective mitigation strategies.

[NLP-21] Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

[Quick Read]: This paper addresses inaccurate retrieval in multihop question answering caused by semantically ambiguous queries, which conventional retrieval-augmented generation (RAG) handles poorly when multiple pieces of knowledge must be linked. The key lies in two innovations: first, a large language model (LLM) decomposes a complex multihop question into a sequence of single-hop subquestions that each target a distinct knowledge facet, reducing ambiguity; second, instead of embedding raw document chunks directly, Qwen3-8B generates answerable questions from each chunk, and retrieval is performed via question-question embedding similarity, improving the recall of relevant chunks. This markedly improves RAG performance in multihop scenarios.

Link: https://arxiv.org/abs/2508.09755
Authors: Seokgi Lee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multi-hop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultiHopQa, HotpotQA) from LongBench. Our method improves RAG performance compared to baseline systems. Our contributions highlight the benefits of using answerable-question embeddings for RAG, and the effectiveness of LLM-based query decomposition for multihop scenarios.
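
A minimal sketch of the question-question retrieval step: embed the questions generated from each chunk, then match subquestions against them by cosine similarity. The embedding model name and the toy data are illustrative assumptions; the paper generates the chunk questions with Qwen3-8B.

```python
# Retrieve chunks whose generated questions are most similar to a subquestion.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# chunk_questions[i] was generated from chunks[i] (toy data).
chunks = ["Marie Curie won the Nobel Prize in Physics in 1903.",
          "Pierre Curie was Marie Curie's husband."]
chunk_questions = ["Which prize did Marie Curie win in 1903?",
                   "Who was Marie Curie's husband?"]

q_emb = encoder.encode(chunk_questions, normalize_embeddings=True)

def retrieve(subquestion: str, k: int = 1):
    s_emb = encoder.encode([subquestion], normalize_embeddings=True)
    sims = (q_emb @ s_emb.T).squeeze(-1)      # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

print(retrieve("Who was married to Marie Curie?"))
```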

[NLP-22] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

[Quick Read]: This paper addresses length inflation in large language models trained with reinforcement learning from verifiable rewards: models pad responses with repetitive, verbose "filler" text to gain accuracy, hurting reasoning efficiency. The key is GFPO (Group Filtered Policy Optimization): during training, a larger group of responses is sampled per problem and filtered on two core metrics, response length and token efficiency (reward per token), so that the model learns to think more efficiently at training time and wastes less computation at inference time. An Adaptive Difficulty GFPO variant further allocates training resources dynamically according to real-time difficulty estimates, improving the balance between computational efficiency and accuracy on hard problems. Experiments show GFPO cuts length inflation by 46-85% on STEM and coding benchmarks while preserving accuracy, evidence that training-time compute converts directly into test-time efficiency.

Link: https://arxiv.org/abs/2508.09726
Authors: Vaishnavi Shrivastava,Ahmed Awadallah,Vidhisha Balachandran,Shivam Garg,Harkirat Behl,Dimitris Papailiopoulos
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length–inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely “filler”: repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO’s length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute–a simple yet effective trade-off for efficient reasoning.
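
A minimal sketch of the GFPO filtering step described above: sample a group of responses per problem and keep only the most token-efficient ones for the policy update. Sampling and the policy-gradient update are stubbed out, and the `keep` budget is an illustrative choice.

```python
# Filter a sampled group by reward-per-token ratio before training on it.
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    n_tokens: int
    reward: float          # verifiable reward, e.g. 1.0 if the answer checks out

def gfpo_filter(group: list[Response], keep: int) -> list[Response]:
    """Keep the `keep` responses with the best reward-per-token ratio."""
    ranked = sorted(group, key=lambda r: r.reward / max(r.n_tokens, 1),
                    reverse=True)
    return ranked[:keep]

# Toy group: correct-but-verbose vs. correct-and-concise responses.
group = [Response("long correct", 900, 1.0),
         Response("short correct", 250, 1.0),
         Response("short wrong", 200, 0.0)]
for r in gfpo_filter(group, keep=1):
    print(r.text)  # "short correct": concise correct answers survive the filter
```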

[NLP-23] he Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models IEEE-VIS2025

[Quick Read]: This paper asks whether vision-language models (VLMs) can be fooled by deceptively designed visualizations into misreading charts and thereby spreading visual misinformation. The key to the solution is a systematic evaluation of ten different VLMs on eight common types of misleading chart design (e.g., truncated axes, unjustified 3D effects), analyzing over 16,000 responses. The finding: most VLMs misinterpret the data under these designs even though the underlying data are unchanged, exposing the models' lack of robustness when processing visualizations and underscoring the need for safeguards against visual misinformation to keep their decisions reliable.

Link: https://arxiv.org/abs/2508.09716
Authors: Ridwan Mahbub,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Mizanur Rahman,Mir Tafseer Nayeem,Enamul Hoque
Affiliations: York University; University of Alberta
Subjects: Computation and Language (cs.CL)
Comments: Accepted to IEEE VIS 2025

Abstract:Information visualizations are powerful tools that help users quickly identify patterns, trends, and outliers, facilitating informed decision-making. However, when visualizations incorporate deceptive design elements-such as truncated or inverted axes, unjustified 3D effects, or violations of best practices-they can mislead viewers and distort understanding, spreading misinformation. While some deceptive tactics are obvious, others subtly manipulate perception while maintaining a facade of legitimacy. As Vision-Language Models (VLMs) are increasingly used to interpret visualizations, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs. In this study, we conduct an in-depth evaluation of VLMs’ ability to interpret misleading visualizations. By analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs, we demonstrate that most VLMs are deceived by them. This leads to altered interpretations of charts, despite the underlying data remaining the same. Our findings highlight the need for robust safeguards in VLMs against visual misinformation.

[NLP-24] Evaluating the Role of Large Language Models in Legal Practice in India

[Quick Read]: This paper asks whether large language models (LLMs) can perform key legal tasks in the Indian context (issue spotting, legal drafting, advice, research, and reasoning) well enough to replace or assist human legal practitioners. The key to the solution is a survey experiment comparing the outputs of LLMs (such as GPT, Claude, and Llama) with the work of a junior lawyer, rated by advanced law students on helpfulness, accuracy, and comprehensiveness across tasks. The study finds that LLMs excel at drafting and issue spotting, sometimes matching or surpassing human work, but hallucinate in specialised legal research, frequently producing factually incorrect or fabricated outputs, so human expertise remains indispensable for nuanced legal reasoning and the precise application of law.

Link: https://arxiv.org/abs/2508.09713
Authors: Rahul Hemrajani (National Law School of India University, Bengaluru)
Affiliations: National Law School of India
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of Artificial Intelligence(AI) into the legal profession raises significant questions about the capacity of Large Language Models(LLM) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations, factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.

[NLP-25] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

[Quick Read]: This paper addresses the harm that chain-of-thought (CoT) distillation can do to the safety of small language models (SLMs). Existing methods improve SLM reasoning but can introduce safety risks, and current safety-alignment techniques often require extra compute or annotated data and may degrade reasoning performance. The key is SLowED (Slow Tuning and Low-Entropy Masking Distillation), a safe distillation method with two modules: Slow Tuning scales down the magnitude of weight changes so that optimization stays in a neighborhood of the initial weight distribution, preserving safety early in training; Low-Entropy Masking masks low-entropy tokens, treated as unnecessary learning targets, excluding them from fine-tuning and reducing the propagation of harmful signals while prolonging the safe training epochs. Experiments on three SLMs show that SLowED keeps the models safe while improving their reasoning comparably to existing distillation methods.

Link: https://arxiv.org/abs/2508.09666
Authors: Ziyang Ma,Qingyue Yuan,Linhai Zhang,Deyu Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model’s safety in the early stage and the latter prolonging the safe training epochs.
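
A minimal sketch of the Low-Entropy Masking idea: compute the per-token entropy of the model's predictive distribution and drop low-entropy tokens (already "easy" targets) from the distillation loss. The entropy threshold is an illustrative hyperparameter, not a value from the paper.

```python
# Mask low-entropy tokens out of the token-level distillation loss.
import torch
import torch.nn.functional as F

def masked_distill_loss(logits, targets, entropy_threshold=1.0):
    """logits: (batch, seq, vocab); targets: (batch, seq) teacher token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)          # (batch, seq)
    keep = entropy > entropy_threshold                   # mask low-entropy tokens
    nll = F.nll_loss(log_probs.flatten(0, 1), targets.flatten(), reduction="none")
    nll = nll.view_as(targets) * keep                    # zero out masked positions
    return nll.sum() / keep.sum().clamp(min=1)           # mean over kept tokens

loss = masked_distill_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)))
```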

[NLP-26] EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization

[Quick Read]: This paper addresses the heavy compute cost and severe data redundancy of evaluating large language models (LLMs) on ever larger benchmark datasets: the challenge is to cut computational cost significantly while keeping evaluation reliable, representative, fair, and generalizable. The key is EffiEval, a training-free method that efficiently selects a high-quality representative subset. Its core mechanism selects samples adaptively based on the Model Utility Index (MUI), avoiding reliance on absolute model performance or large amounts of annotated evaluation data and enabling flexible transfer across datasets and model families for efficient evaluation.

Link: https://arxiv.org/abs/2508.09662
Authors: Yaoning Wang,Jiahao Ying,Yixin Cao,Yubo Ma,Yugang Jiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs.
zh

[NLP-27] Improving Diversity in Language Models: When Temperature Fails, Change the Loss ICML2025

【速读】: 该论文旨在解决语言模型在多样性提升与生成质量之间难以平衡的问题,特别是针对通过升高解码温度(decoding temperature)来增加多样性的方法效果有限的现象。研究表明,单纯依赖温度调整无法有效提升覆盖率(Recall),而降低温度虽可提高精确度(Precision),却可能牺牲多样性。其解决方案的关键在于重新设计语言模型的损失函数,引入基于精确度-召回率(Precision-Recall)框架的优化思路,使模型在训练阶段即具备良好的可调性(tunability)——从而实现更优的Precision-Recall权衡,相较于传统的负对数似然(negative log-likelihood)训练结合温度缩放的方法具有显著优势。

链接: https://arxiv.org/abs/2508.09654
作者: Alexandre Verine,Florian Le Bronnec,Kunhao Zheng,Alexandre Allauzen,Yann Chevaleyre,Benjamin Negrevergne
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Forty-Second International Conference on Machine Learning, ICML2025

点击查看摘要

Abstract:Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
zh

[NLP-28] A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories

【速读】: 该论文试图解决生成式 AI(Generative AI)在叙事内容中可能存在的性别偏见问题,特别是隐性偏见如何通过角色设定、行为描写和情节发展等层面体现。解决方案的关键在于采用基于普罗普(Propp)角色分类和弗莱塔格(Freytag)叙事结构的提示设计,并结合细读法(close reading approach)对生成故事进行多维度分析,包括角色性别分布、身心描述、行动逻辑及情节推进与人物关系等,从而揭示并评估这些偏见的存在及其表现形式。

链接: https://arxiv.org/abs/2508.09651
作者: Daniel Raffini,Agnese Macori,Marco Angelini,Tiziana Catarci
机构: Sapienza University of Rome (罗马大学); Link University of Rome (Link大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 8-pages

点击查看摘要

Abstract:The paper explores the study of gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp’s character classifications and Freytag’s narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases - especially implicit ones - in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach.
zh

[NLP-29] AINL-Eval 2025 Shared Task: Detection of AI-Generated Scientific Abstracts in Russian

【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成的科学摘要在俄语语境下难以被准确检测的问题,尤其关注学术诚信与多语言环境下检测资源匮乏的挑战。其解决方案的关键在于构建了一个大规模、多样化的数据集(共52,305条样本),涵盖12个不同科学领域的真人撰写摘要和来自五种前沿大语言模型(GPT-4-Turbo、Gemma2-27B、Llama3.3-70B、Deepseek-V3 和 GigaChat-Lite)生成的对应内容,并设计了 AINL-Eval 2025 共享任务,要求参赛系统具备对未见领域及训练中未包含模型的泛化能力。这一机制推动了鲁棒检测方法的发展,并通过持续运行的共享平台促进长期研究进展。

链接: https://arxiv.org/abs/2508.09622
作者: Tatiana Batura,Elena Bruches,Milana Shvenk,Valentin Malykh
机构: A.P. Ershov Institute of Informatics Systems (A.P. Ershov信息系统研究所); Novosibirsk State University (新西伯利亚国立大学); ITMO University (ITMO大学)
类目: Computation and Language (cs.CL)
备注: AINL 2025 Conference

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has revolutionized text generation, making it increasingly difficult to distinguish between human- and AI-generated content. This poses a significant challenge to academic integrity, particularly in scientific publishing and multilingual contexts where detection resources are often limited. To address this critical gap, we introduce the AINL-Eval 2025 Shared Task, specifically focused on the detection of AI-generated scientific abstracts in Russian. We present a novel, large-scale dataset comprising 52,305 samples, including human-written abstracts across 12 diverse scientific domains and AI-generated counterparts from five state-of-the-art LLMs (GPT-4-Turbo, Gemma2-27B, Llama3.3-70B, Deepseek-V3, and GigaChat-Lite). A core objective of the task is to challenge participants to develop robust solutions capable of generalizing to both (i) previously unseen scientific domains and (ii) models not included in the training data. The task was organized in two phases, attracting 10 teams and 159 submissions, with top systems demonstrating strong performance in identifying AI-generated content. We also establish a continuous shared task platform to foster ongoing research and long-term progress in this important area. The dataset and platform are publicly available at this https URL.
zh

[NLP-30] How Persuasive Could LLMs Be? A First Study Combining Linguistic-Rhetorical Analysis and User Experiments

【速读】: 该论文旨在解决生成式 AI(Generative AI)在伦理敏感领域中生成论辩文本的说服力及其对人类观点影响的问题。其解决方案的关键在于通过一项包含62名参与者的用户研究及前后测交互问卷,系统分析了ChatGPT生成的论辩文本的语言与修辞特征,并评估其对用户态度改变和感知的影响。研究发现,尽管ChatGPT能构建结构清晰、逻辑连贯的论证,但其说服效果受限,尤其在涉及伦理议题时,用户虽认可其中部分益处,伦理关切反而可能增强或持续存在,且这种效应具有话题依赖性。这一发现揭示了AI生成内容在伦理敏感场景下的局限性,为未来相关研究提供了重要基础。

链接: https://arxiv.org/abs/2508.09614
作者: Daniel Raffini,Agnese Macori,Lorenzo Porcaro,Tiziana Catarci,Marco Angelini
机构: Sapienza University of Rome (罗马大学); ISTC-CNR (意大利国家研究委员会认知科学与技术研究所); Link University of Rome (罗马Link大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 9-pages

点击查看摘要

Abstract:This study examines the rhetorical and linguistic features of argumentative texts generated by ChatGPT on ethically nuanced topics and investigates their persuasive impact on human opinions. Through a user study involving 62 participants and pre-post interaction surveys, the paper analyzes how exposure to AI-generated arguments affects opinion change and user perception. A linguistic and rhetorical analysis of the generated texts reveals a consistent argumentative macrostructure, reliance on formulaic expressions, and limited stylistic richness. While ChatGPT demonstrates proficiency in constructing coherent argumentative texts, its persuasive efficacy appears constrained, particularly on topics involving ethical considerations. The study finds that while participants often acknowledge the benefits highlighted by ChatGPT, ethical concerns tend to persist or even intensify post-interaction. The results also demonstrate a variation depending on the topic. These findings highlight new insights on AI-generated persuasion in ethically sensitive domains and are a basis for future research.
zh

[NLP-31] The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

【速读】: 该论文旨在解决当前会员推理攻击(Membership Inference Attacks)在实际应用中受限于对模型隐藏状态或概率分布的访问需求,从而无法有效应用于仅通过API调用的黑盒语言模型(如GPT-4)的问题。其解决方案的关键在于提出N-Gram Coverage Attack方法,该方法仅依赖目标模型生成的文本输出,利用训练数据中常见n-gram模式在模型输出中的高覆盖率特性进行推理:具体而言,通过在候选样本前缀条件下多次生成文本,并基于n-gram重叠度计算生成结果与真实后缀的相似性,若相似度较高则判定该样本为成员。该方法在多个基准测试中表现优于现有黑盒攻击手段,且达到甚至超越白盒攻击水平,同时展现出攻击性能随计算预算增加而提升的特性,为评估闭源大模型隐私泄露风险提供了可操作的新路径。
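其核心打分逻辑可以用几行 Python 示意(generate_fn 是对目标模型 API 的假设封装,以 max 聚合也仅是多种可选方式之一):

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_coverage(generation: str, true_suffix: str, n: int = 3) -> float:
    """真实后缀的 n-gram 被一次生成覆盖的比例。"""
    gen, ref = generation.split(), true_suffix.split()
    ref_ngrams = ngrams(ref, n)
    if not ref_ngrams:
        return 0.0
    return len(ref_ngrams & ngrams(gen, n)) / len(ref_ngrams)

def membership_score(generate_fn, prefix: str, true_suffix: str, k: int = 8) -> float:
    """以候选样本前缀为条件采样 k 次生成,聚合与真实后缀的 n-gram 重叠;
    分数高则判为训练成员。增大 k 对应论文中“攻击性能随计算预算提升”。"""
    scores = [ngram_coverage(generate_fn(prefix), true_suffix) for _ in range(k)]
    return max(scores)
```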

链接: https://arxiv.org/abs/2508.09603
作者: Skyler Hallinan,Jaehun Jung,Melanie Sclar,Ximing Lu,Abhilasha Ravichander,Sahana Ramnath,Yejin Choi,Sai Praneeth Karimireddy,Niloofar Mireshghallah,Xiang Ren
机构: University of Southern California (南加州大学); University of Washington (华盛顿大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: CoLM 2025

点击查看摘要

Abstract:Membership inference attacks serve as a useful tool for fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models’ hidden states or probability distribution, which prevents investigation into more widely-used, API-access only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving comparable or even better performance to state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
zh

[NLP-32] AI Blob! LLM -Driven Recontextualization of Italian Television Archives

【速读】: 该论文旨在解决传统档案电视影像检索依赖静态元数据标签所导致的语义局限性问题,即难以实现基于内容语义的高效查询与再语境化(recontextualization)。其解决方案的关键在于构建一个融合自动语音识别(ASR)、语义嵌入(semantic embeddings)与检索增强生成(RAG)技术的实验系统 AI Blob!,通过将视频音频转录为句级单元并存储于向量数据库中,实现基于主题提示的动态语义检索,并由大语言模型(LLM)生成相关查询以驱动片段选择与叙事重组,从而生成具有讽刺性对比和主题一致性的蒙太奇式内容,推动档案利用从静态标注向内容感知的智能交互演进。
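其“主题扩展查询 + 语义检索 + 片段重组”的骨架可用如下 numpy 草图示意(llm_expand_fn、embed_fn 均为假设的接口封装,非该系统的实际代码):

```python
import numpy as np

def cosine_top_k(query_vec, segment_vecs, k=20):
    """在句级片段向量库中做余弦相似度检索,返回 top-k 片段下标。"""
    q = query_vec / np.linalg.norm(query_vec)
    m = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]

def build_montage(llm_expand_fn, embed_fn, theme, segment_vecs, segments, k=5):
    """LLM 先把主题提示扩展为多条语义相关的查询,再逐条检索并拼接候选片段序列,
    得到可供后续剪辑/排序的蒙太奇素材。"""
    queries = llm_expand_fn(theme)       # 例如返回若干条概念相关的查询
    picked = []
    for q in queries:
        idx = cosine_top_k(embed_fn(q), segment_vecs, k)
        picked.extend(segments[i] for i in idx)
    return picked
```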

链接: https://arxiv.org/abs/2508.09535
作者: Roberto Balestri
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Preprint

点击查看摘要

Abstract:This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation.
zh

[NLP-33] COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

【速读】: 该论文旨在解决当前情感支持对话模型缺乏基于心理学原理的深度共情推理的问题。其解决方案的关键在于提出可控共情推理(controllable empathetic reasoning),该方法将自然语言推理与结构化的心理步骤相结合,并构建了一个细粒度标注的数据集,包含推理正确性和响应偏好信息,以支持该能力的训练;同时引入统一的过程-结果奖励模型进行强化学习优化,并通过基于个性的对话重写和冗余感知的奖励重加权策略缓解熵崩溃导致的回复重复性问题,从而显著提升模型的情感支持能力。
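其中“冗余感知的奖励重加权”思路可用如下极简 Python 草图说明(以 distinct-2 度量重复程度与折减系数 lam 均为本文为说明而做的假设):

```python
def distinct_n(tokens, n=2):
    """distinct-n:不同 n-gram 的占比,越低说明回复越重复。"""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 1.0
    return len(set(grams)) / len(grams)

def redundancy_aware_reward(base_reward, response_tokens, lam=0.5):
    """冗余感知的奖励重加权:回复重复程度越高,奖励被折减得越多,
    以缓解熵崩溃带来的回复重复问题。"""
    redundancy = 1.0 - distinct_n(response_tokens)
    return base_reward * (1.0 - lam * redundancy)
```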

链接: https://arxiv.org/abs/2508.09521
作者: Yunxiao Wang,Meng Liu,Wenqi Liu,Kaiyu Jiang,Bin Wen,Fan Yang,Tingting Gao,Guorui Zhou,Liqiang Nie
机构: Shandong University (山东大学); Shandong Jianzhu University (山东建筑大学); Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model’s emotional support ability, advancing the development of empathetic, human-like support systems.
zh

[NLP-34] UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval SEMEVAL-1 ACL SEMEVAL-2025

【速读】: 该论文旨在解决事实核查中主张(claim)检索的问题,即从大规模语料库中快速准确地找到与给定社交媒体帖子相关的已验证主张。其解决方案的关键在于利用多种先进的大语言模型(Large Language Models, LLMs)生成文本嵌入(text embeddings),并通过计算余弦相似度(cosine similarity)来识别最相关的主张。研究发现,仅使用英文翻译作为输入可获得更优效果,且最佳性能由 NVIDIA NV-Embed-v2 模型实现;对于部分语言,通过组合模型(如 NV-Embed + GPT 或 Mistral)进一步提升了检索精度。
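“嵌入 + 余弦相似度检索”及一种可能的模型组合方式可用如下 numpy 草图示意(embedders 为若干“文本到向量”的假设函数;论文未给出组合细节,取平均仅为示意):

```python
import numpy as np

def cos_sim_matrix(claim_vecs, post_vec):
    return claim_vecs @ post_vec / (
        np.linalg.norm(claim_vecs, axis=1) * np.linalg.norm(post_vec))

def combined_retrieval(post, claims, embedders, k=10):
    """多嵌入模型组合检索:把各模型给出的余弦相似度求和(等价于取平均)后排序,
    返回与社交媒体帖子最相关的 k 条已核查主张的下标。"""
    agg = np.zeros(len(claims))
    for embed in embedders:
        pv = embed(post)
        cv = np.stack([embed(c) for c in claims])
        agg += cos_sim_matrix(cv, pv)
    return np.argsort(-agg)[:k]
```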

链接: https://arxiv.org/abs/2508.09517
作者: Ladislav Lenc,Daniel Cífka,Jiří Martínek,Jakub Šmíd,Pavel Král
机构: University of West Bohemia in Pilsen (西波希米亚大学皮尔森分校)
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025). Official version: this https URL

点击查看摘要

Abstract:This paper presents a zero-shot system for fact-checked claim retrieval. We employed several state-of-the-art large language models to obtain text embeddings. The models were then combined to obtain the best possible result. Our approach achieved 7th place in monolingual and 9th in cross-lingual subtasks. We used only English translations as an input to the text embedding models since multilingual models did not achieve satisfactory results. We identified the most relevant claims for each post by leveraging the embeddings and measuring cosine similarity. Overall, the best results were obtained by the NVIDIA NV-Embed-v2 model. For some languages, we benefited from model combinations (NV-Embed + GPT or Mistral).
zh

[NLP-35] Cross-lingual Aspect-Based Sentiment Analysis: A Survey on Tasks, Approaches and Challenges WWW DATE

【速读】: 该论文旨在解决跨语言方面情感分析(Cross-lingual Aspect-Based Sentiment Analysis, Cross-lingual ABSA)领域缺乏系统性综述的问题,其核心挑战在于如何将资源丰富语言(如英语)中的知识有效迁移至低资源语言,以提升跨语言场景下的细粒度情感分析性能。解决方案的关键在于对现有方法进行全面梳理,包括关键任务(如方面词提取、方面情感分类及复合任务)、数据集、建模范式以及跨语言迁移技术,并深入分析单语和多语ABS A研究以及大语言模型(LLMs)在该方向上的贡献,从而为未来研究提供明确的方向与理论支撑。

链接: https://arxiv.org/abs/2508.09516
作者: Jakub Šmíd,Pavel Král
机构: University of West Bohemia (西波希米亚大学)
类目: Computation and Language (cs.CL)
备注: Submitted version prior to peer review. Updated version accepted in Information Fusion. Official version: this https URL

点击查看摘要

Abstract:Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that focuses on understanding opinions at the aspect level, including sentiment towards specific aspect terms, categories, and opinions. While ABSA research has seen significant progress, much of the focus has been on monolingual settings. Cross-lingual ABSA, which aims to transfer knowledge from resource-rich languages (such as English) to low-resource languages, remains an under-explored area, with no systematic review of the field. This paper aims to fill that gap by providing a comprehensive survey of cross-lingual ABSA. We summarize key ABSA tasks, including aspect term extraction, aspect sentiment classification, and compound tasks involving multiple sentiment elements. Additionally, we review the datasets, modelling paradigms, and cross-lingual transfer methods used to solve these tasks. We also examine how existing work in monolingual and multilingual ABSA, as well as ABSA with LLMs, contributes to the development of cross-lingual ABSA. Finally, we highlight the main challenges and suggest directions for future research to advance cross-lingual ABSA systems.
zh

[NLP-36] LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation ACL ACL2025

【速读】: 该论文旨在解决跨语言方面情感分析(Cross-lingual Aspect-Based Sentiment Analysis, ABSA)中依赖不可靠翻译工具的问题,这类工具常导致知识迁移效果不佳。其解决方案的关键在于利用大语言模型(Large Language Model, LLM)生成高质量伪标签数据,无需依赖翻译工具:首先训练一个ABSA模型对目标语言的无标注数据进行预测,随后通过LLM提示(prompting)生成更自然、能更好体现这些噪声预测结果的句子,从而构建伪标签数据集,并进一步微调ABSA模型。该方法在六种语言和五种骨干模型上验证有效,显著优于基于翻译的传统方法,且支持生成式模型,表明微调后的LLM在性能上超越较小的多语言模型。

链接: https://arxiv.org/abs/2508.09515
作者: Jakub Šmíd,Pavel Přibáň,Pavel Král
机构: University of West Bohemia in Pilsen (西波希米亚大学); Department of Computer Science and Engineering (计算机科学与工程系); NTIS – New Technologies for the Information Society (信息社会新技术中心)
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics; Volume 1: Long Papers (ACL 2025). Official version: this https URL

点击查看摘要

Abstract:Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models.
zh

[NLP-37] From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中重排序模块的性能瓶颈问题,尤其是传统方法在处理复杂多跳查询时因固定Top-K选择策略导致的信息遗漏或噪声引入问题。其解决方案的关键在于提出一种动态段落选择器(Dynamic Passage Selector, DPS),将段落选择建模为一个监督学习任务,并通过微调捕捉段落之间的依赖关系,从而动态地为生成模块选择最相关的段落集合,而非采用静态的 Top-K 截断策略。DPS 可无缝集成至标准 RAG 流程中,无需修改原有架构,在多个基准测试上显著优于现有最优重排序方法,尤其在 MuSiQue 数据集上 F1 分数提升达 30.06%。
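推理阶段“按阈值动态确定入选段落数”的效果可用如下 PyTorch 草图示意(阈值与数量上下限均为示意假设,且不含论文中捕捉段落间依赖的监督训练部分):

```python
import torch

def dynamic_passage_select(scores: torch.Tensor, threshold: float = 0.5,
                           min_k: int = 1, max_k: int = 10):
    """scores: 模型为每个候选段落给出的相关性 logit(一维张量)。
    按概率阈值动态确定入选集合大小,替代固定 Top-K 截断。"""
    probs = scores.sigmoid()
    order = torch.argsort(probs, descending=True)
    keep = [i for i in order.tolist() if probs[i] >= threshold]
    if len(keep) < min_k:                  # 至少保留 min_k 个段落兜底
        keep = order[:min_k].tolist()
    return keep[:max_k]
```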

链接: https://arxiv.org/abs/2508.09497
作者: Siyuan Meng,Junming Liu,Yirong Chen,Song Mao,Pinlong Cai,Guohang Yan,Botian Shi,Ding Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 tables

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
zh

[NLP-38] Learning Facts at Scale with Active Reading

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在参数记忆中存储知识时存在的不可靠性问题,即模型对事实的获取与回忆能力高度依赖训练数据中特定事实的出现频率及其他尚不明确的因素,导致知识学习不稳定且难以控制。为应对这一挑战,作者提出“主动阅读”(Active Reading)框架,其核心在于让模型通过自生成的学习策略来系统性地研读指定材料,从而显著提升知识吸收效率。关键创新在于将学习过程从被动的数据增强转向由模型自主设计和执行的学习机制,实验证明该方法在专家领域微调中能大幅提升模型性能(如在SimpleQA上相对基线提升313%,在FinanceBench上提升160%),并可在预训练规模下构建更具备事实准确性的模型(如Meta WikiExpert-8B在1万亿token上训练后超越数百亿参数模型)。

链接: https://arxiv.org/abs/2508.09494
作者: Jessy Lin,Vincent-Pierre Berges,Xilun Chen,Wen-Tau Yih,Gargi Ghosh,Barlas Oğuz
机构: Berkeley (伯克利); Meta
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
zh

[NLP-39] NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)与功能实用性(utility)之间难以平衡的问题,现有方法普遍存在对恶意攻击鲁棒性不足、对良性请求频繁拒绝以及生成文本质量下降和通用任务性能退化等缺陷。其核心解决方案是提出一种细粒度的神经元调优框架 NeuronTune,通过基于归因分析识别各层中的安全关键神经元和效用保持神经元,并利用元学习动态增强安全神经元激活、抑制效用神经元激活,从而实现安全与实用性的协同优化;该方法还支持通过神经元数量阈值灵活调节干预范围,以适应不同场景下对安全性或实用性的优先需求。
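激活调制部分可以借助 PyTorch 的 forward hook 作一个静态示意(真实方法由元学习自适应确定调制强度;此处的神经元下标与缩放系数均为假设):

```python
import torch

def make_modulation_hook(safety_idx, utility_idx, up=1.5, down=0.7):
    """返回一个 forward hook:放大安全关键神经元的激活,抑制效用神经元的激活。
    神经元下标假定已通过归因分析得到。"""
    def hook(module, inputs, output):
        scale = torch.ones(output.shape[-1], device=output.device)
        scale[safety_idx] = up
        scale[utility_idx] = down
        return output * scale   # hook 返回值会替换该层输出
    return hook

# 用法示意:为某一层 MLP 的输出挂载调制 hook(层与下标均为假设)
# handle = model.layers[10].mlp.register_forward_hook(
#     make_modulation_hook(safety_idx=[3, 17], utility_idx=[42]))
```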

链接: https://arxiv.org/abs/2508.09473
作者: Birong Pan,Mayi Xu,Qiankun Pi,Jianhao Chen,Yuanyuan Zhu,Ming Zhong,Tieyun Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, degradation in generated text quality and general task performance–the former two reflecting deficits in robust safety and the latter constituting utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility.
zh

[NLP-40] User-centric Subjective Leaderboard by Customizable Reward Modeling

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)评估基准主要依赖可验证任务所带来的局限性,这些静态、客观的评测方式难以反映用户在实际应用中的主观偏好,从而导致模型选择困难。为应对这一问题,作者提出了首个以用户为中心的主观排行榜(User-Centric Subjective Leaderboard, USL),其核心创新在于基于超过10,000条真实人类偏好数据构建动态排名体系,并引入轻量级可定制奖励模型(Customizable Reward Models, CRMs)。CRMs仅含40亿参数,却在多项指标上超越GPT-4.1和Gemini-2.5-pro,展现出卓越的跨主题与跨标准泛化能力,且USL与人类偏好矛盾程度呈强负相关,显著提升了模型选型的实用性与准确性。

链接: https://arxiv.org/abs/2508.09463
作者: Qi Jia,Xiujie Song,Zicheng Zhang,Yijin Guo,Kaiwei Zhang,Zijian Chen,Guangtao Zhai
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.
zh

[NLP-41] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在视觉定位任务中面临的后门攻击安全问题,特别是如何隐蔽地操控模型将特定目标对象错误地定位到输入图像中的任意位置,而与用户查询无关。解决方案的关键在于提出一种输入感知的后门攻击方法(Input-aware Backdoor Attack, IAG),其核心创新包括:1)设计了一个文本条件U-Net架构的自适应触发器生成器,能够将攻击目标的语义信息嵌入原始图像,从而应对开放词汇场景下的攻击挑战;2)引入重建损失以最小化污染图像与干净图像之间的视觉差异,提升攻击的隐蔽性;3)构建统一的数据生成策略,增强攻击的可迁移性和有效性。实验表明,IAG在多个基准测试集上实现了超过65%的攻击成功率(ASR@0.5),且对Clean样本性能影响极小,验证了其可行性与鲁棒性。

链接: https://arxiv.org/abs/2508.09456
作者: Junxian Li,Beining Xu,Di Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 13 pages, 13 Figures

点击查看摘要

Abstract:Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user’s query. We propose an adaptive trigger generator that embeds the semantic information of the attack target’s description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack’s stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential on manipulating Ferret-7B and LlaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.
zh

[NLP-42] From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在图表到文本(chart-to-text)任务中可能放大地缘经济偏见的问题,即模型在生成不同收入水平国家的图表摘要时,倾向于对高收入国家给出更积极的描述,从而可能引发社会危害。解决方案的关键在于通过大规模评估六种主流视觉语言模型(Vision-Language Models, VLMs)在6000个图表-国家配对上的输出,量化其对低收入和中等收入国家的系统性负面倾向,并探索基于推理时提示(inference-time prompt-based)的去偏技术,例如引入正向干扰项(positive distractors),尽管效果有限,仍揭示了当前方法在应对复杂偏见问题上的不足,强调需发展更鲁棒的去偏策略。

链接: https://arxiv.org/abs/2508.09450
作者: Ridwan Mahbub,Mohammed Saidul Islam,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Mizanur Rahman,Shafiq Joty,Enamul Hoque
机构: York University, Canada; University of Alberta, Canada; Dialpad Inc., Canada; RBC, Canada; Nanyang Technological University, Singapore; Salesforce AI, USA
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While with the rapid advancement of large Vision-Language Models (VLMs), we have witnessed great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country’s economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available here.
zh

[NLP-43] Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)推理过程中Key-Value (KV)缓存所引入的隐私泄露问题。KV缓存通过存储中间注意力计算结果(即Key和Value对)来加速推理,但本文首次系统性地揭示了其潜在的隐私风险:攻击者可直接从KV缓存中重建出敏感用户输入。为此,作者设计并实现了三种攻击向量——直接反演攻击(Inversion Attack)、更具普适性和破坏力的碰撞攻击(Collision Attack)以及基于语义的注入攻击(Injection Attack),验证了该漏洞的实际危害性。解决方案的关键在于提出KV-Cloak,一种轻量级、高效的防御机制,其核心是基于可逆矩阵的混淆方案,并结合算子融合技术,在几乎不降低模型准确率且仅带来极小性能开销的前提下,有效将KV缓存内容转化为随机噪声,从而彻底阻断攻击路径,为可信LLM部署提供了实用保障。
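KV-Cloak 所依赖的“可逆矩阵混淆”可用如下 numpy 草图验证可逆性(矩阵维度与生成方式均为示意假设;论文实际还将恢复步骤与注意力算子融合以降低开销):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
A = rng.standard_normal((d, d))     # 随机矩阵以极高概率可逆
A_inv = np.linalg.inv(A)

def cloak(kv: np.ndarray) -> np.ndarray:
    """写入缓存前混淆 K/V:KV' = KV @ A,缓存中只留下近似随机噪声。"""
    return kv @ A

def uncloak(kv_cloaked: np.ndarray) -> np.ndarray:
    """参与注意力计算前恢复:KV = KV' @ A^{-1}。"""
    return kv_cloaked @ A_inv

kv = rng.standard_normal((8, d))    # 8 个 token 的 64 维 Key(或 Value)
assert np.allclose(uncloak(cloak(kv)), kv, atol=1e-6)
```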

链接: https://arxiv.org/abs/2508.09442
作者: Zhifan Luo,Shuo Shao,Su Zhang,Lijing Zhou,Yuke Hu,Chenxu Zhao,Zhihao Liu,Zhan Qin
机构: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江区)区块链与数据安全研究院); Huawei Technology (华为技术)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
zh

[NLP-44] Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech

【速读】: 该论文旨在解决儿童导向场景中双语环境下的代码转换(Code-switching)与语言识别(Language Identification)难题,尤其针对中文与英文在单个话语中存在显著不平衡分布的情况。其解决方案的关键在于利用Zipformer模型内部层提取的嵌入表示(embeddings),通过合理选择中间层特征并结合不同后端(back-ends)进行对比分析,发现Zipformer对多种后端均表现出鲁棒性,从而有效提升了语言识别性能,在不平衡数据下实现了81.89%的平衡准确率(Balanced Accuracy),相较基线提升15.47%。
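“取中间层嵌入 + 训练轻量后端”的流程可用 scikit-learn 作如下示意(文件名等均为假设;具体选择哪一内部层见论文的选层方法):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_lid_backend(layer_embeddings: np.ndarray, labels: np.ndarray):
    """用 Zipformer 某一中间层导出的嵌入训练语言识别后端;
    class_weight='balanced' 用于缓解中英文数据的不平衡。"""
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(layer_embeddings, labels)
    return clf

# 用法示意(文件为假设):
# X, y = np.load("layer12_emb.npy"), np.load("labels.npy")
# backend = train_lid_backend(X, y)
```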

链接: https://arxiv.org/abs/2508.09430
作者: Lavanya Shankar,Leibny Paola Garcia Perera
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Code-switching and language identification in child-directed scenarios present significant challenges, particularly in bilingual environments. This paper addresses this challenge by using Zipformer to handle the nuances of speech, which contains two imbalanced languages, Mandarin and English, in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the embeddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these backends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline. These findings highlight the potential of the transformer encoder architecture model in real scenarios.
zh

[NLP-45] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

【速读】: 该论文旨在解决表格列名缩写扩展(Abbreviation Expansion)问题,即如何将如“esal”这类简写准确映射为完整语义表达(如“employee salary”),这一任务在企业、科学领域和政府机构的数据处理中至关重要。解决方案的关键在于提出Columbo——一个基于大语言模型(Large Language Model, LLM)的系统,其创新性地融合了上下文信息、规则约束、链式思维推理(chain-of-thought reasoning)以及token级分析能力,从而显著提升扩展准确性。此外,作者还构建了4个真实世界场景下的新数据集,并设计了更精准的同义词感知评估指标,有效克服了现有方法在数据质量和度量标准上的局限性。

链接: https://arxiv.org/abs/2508.09403
作者: Ting Cai,Stephen Sheen,AnHai Doan
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advances the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences.
zh

[NLP-46] APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification

【速读】: 该论文旨在解决如何在无需人工指定初始提示(seed prompt)的情况下,自动优化大语言模型(Large Language Models, LLMs)的提示以提升其在特定自然语言处理任务中的性能问题。针对语法错误纠正(Grammatical Error Correction, GEC)和文本简化(Text Simplification)任务,作者提出了一种名为APIO的提示诱导与优化方法,其关键在于通过自动化机制直接从任务目标出发诱导并优化提示结构,从而在不依赖人工设计种子提示的前提下实现纯基于LLM提示方法的最先进性能表现。

链接: https://arxiv.org/abs/2508.09378
作者: Artem Chernodub,Aman Saini,Yejin Huh,Vivek Kulkarni,Vipul Raheja
机构: Zendesk; Grammarly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication at Recent Advances in Natural Language Processing conference (RANLP 2025)

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks to be performed through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, automatic prompt optimization (APO) methods have been developed to refine a seed prompt. Advancing this line of research, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available.
zh

[NLP-47] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

【速读】: 该论文旨在解决当前无文本监督的语音语言模型(Textless Spoken Language Models, SLMs)在生成语音时缺乏声学上下文信息和对声学细节控制能力不足的问题。现有SLMs通常仅预测语义标记(semantic tokens),依赖外部声码器(vocoder)添加声学信息,导致无法直接建模声学特征并限制生成质量。其解决方案的关键在于提出一种联合建模框架,同时生成语义标记与连续实值声学帧表示(continuous real-valued representation of the acoustic frame),并通过流匹配(flow-matching)目标函数实现以语义标记为条件的连续向量预测,从而在保持语义保真度的同时显著提升生成语音的声学细节表现力。
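其流匹配目标可用如下 PyTorch 草图表达(vel_net 的接口与直线插值路径为常见设定下的假设写法,非论文原实现):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vel_net, x1, cond):
    """条件流匹配:在噪声 x0 与目标声学帧 x1 的线性插值点上,
    让网络以语义 token 表示 cond 为条件预测恒定速度场 (x1 - x0)。
    x1: [B, D] 的目标连续声学帧表示。"""
    x0 = torch.randn_like(x1)                      # 先验噪声
    t = torch.rand(x1.size(0), 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                     # 直线概率路径
    v_target = x1 - x0
    v_pred = vel_net(xt, t, cond)                  # 假设的网络接口
    return F.mse_loss(v_pred, v_target)
```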

链接: https://arxiv.org/abs/2508.09350
作者: Ju-Chieh Chou,Jiawei Zhou,Karen Livescu
机构: Toyota Technological Institute at Chicago (丰田工业大学芝加哥分校); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ASRU 2025

点击查看摘要

Abstract:Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
zh

[NLP-48] The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich Expert Consensus in Complex Domains

【速读】: 该论文旨在解决传统专家共识制定方法(如德尔菲法、共识会议和系统性指南合成)在面对复杂、冲突或证据不足的情境时所面临的局限性,包括专家小组负担过重、解释过度简化以及对条件性细微差别的压制等问题。此外,信息过载、证据碎片化及对缺乏专家过滤的公开来源的依赖进一步加剧了这些挑战。其解决方案的关键在于提出并验证了一种“人机混合德尔菲”(Human-AI Hybrid Delphi, HAH-Delphi)框架,该框架通过整合生成式AI模型(Gemini 2.5 Pro)、少量资深人类专家小组与结构化引导机制,实现高效且高质量的共识生成。实证研究表明,该框架不仅在回顾性复制中准确再现了95%的既有共识结论,在前瞻性比较中与人类专家达成95%方向一致性,并显著加速了主题饱和过程,从而为可扩展、情境敏感的个性化指导和大规模共识框架构建提供了可靠的方法论基础。

链接: https://arxiv.org/abs/2508.09349
作者: Cathy Speed,Ahmed A. Metwally
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved 90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale.
zh

[NLP-49] Decoding Neural Emotion Patterns through Natural Language Processing Embeddings

【速读】: 该论文旨在解决情感表达的语言特征与大脑功能之间关联的计算建模问题,尤其是在传统神经影像学方法成本高、场景受限的情况下,如何利用大规模数字文本实现情绪-脑区映射。其核心解决方案是提出一种无需神经影像数据即可将文本情感内容映射到解剖定义脑区的计算框架:首先使用OpenAI的text-embedding-ada-002生成高维语义表征,继而通过降维与聚类识别情绪类别,并将其对应至18个与情绪处理相关的脑区;实验验证了该方法在健康人群与抑郁患者间的差异识别能力、对离散情绪的有效区分性,以及大语言模型(LLM)与人类文本在脑激活模式上的异同,展现出高空间特异性与临床区分价值。
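“嵌入 → 降维 → 聚类”的主干流程可用 scikit-learn 作如下示意(降维维数、簇数等均为示意假设;此处把簇数取 18 仅是为了与文中 18 个脑区对应,实际的“情绪簇到脑区”的映射规则见论文):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_emotions(embeddings: np.ndarray, n_components=50, n_clusters=18):
    """对 text-embedding-ada-002 产出的高维语义向量先 PCA 降维,再聚类成情绪组。
    返回降维后的表示与每条文本的簇标签。"""
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(reduced)
    return reduced, km.labels_
```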

链接: https://arxiv.org/abs/2508.09337
作者: Gideon Vos,Maryam Ebrahimpour,Liza van Eijk,Zoltan Sarnyai,Mostafa Rahimi Azghadi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 9 figures

点击查看摘要

Abstract:Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DIAC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.
zh

[NLP-50] TEN: Table Explicitization Neurosymbolically

【速读】: 该论文旨在解决从半结构化文本中提取表格数据的难题,尤其针对那些未使用一致分隔符区分列和行的输入文本。传统纯神经方法因幻觉(hallucination)现象及无法强制执行硬约束而导致性能不佳。解决方案的关键在于提出一种神经符号混合方法TEN:首先利用结构分解提示(Structural Decomposition prompting)引导大语言模型(LLM)生成初始表格,随后通过符号检查器(symbolic checker)验证表格的格式正确性并识别幻觉或遗漏;检查结果由另一个LLM生成修正建议,并以自调试循环方式反馈给原始模型进行迭代优化。该设计有效结合了神经模型的泛化能力与符号推理的精确性,显著提升了表格提取的准确性与鲁棒性。
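“生成—符号检查—修复建议—再生成”的自调试循环可用如下 Python 草图示意(检查规则与 llm_fn、critique_fn 的接口均为本文的简化假设):

```python
def check_table(table, source_text):
    """符号检查器示意:校验各行列数是否一致,并检测疑似幻觉单元格
    (出现了原文中不存在的值)。返回问题列表,空列表代表通过。"""
    issues = []
    widths = {len(row) for row in table}
    if len(widths) > 1:
        issues.append("rows have inconsistent number of columns")
    for row in table:
        for cell in row:
            if cell and cell not in source_text:
                issues.append(f"possible hallucination: {cell!r}")
    return issues

def ten_loop(llm_fn, critique_fn, text, max_rounds=3):
    """自调试循环:LLM 生成初始表格,符号检查器给出问题,
    critique-LLM 把问题转成修复指引,再交回原 LLM 重新生成。"""
    table = llm_fn(text, guidance=None)
    for _ in range(max_rounds):
        issues = check_table(table, text)
        if not issues:
            break
        table = llm_fn(text, guidance=critique_fn(issues))
    return table
```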

链接: https://arxiv.org/abs/2508.09324
作者: Nikita Mehrotra,Aayush Kumar,Sumit Gulwani,Arjun Radhakrishna,Ashish Tiwari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker to evaluate not only the well-formedness of that table, but also detect cases of hallucinations or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN’s tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases.
zh

[NLP-51] Leveraging Large Language Models for Rare Disease Named Entity Recognition

【速读】: 该论文旨在解决罕见疾病领域命名实体识别(Named Entity Recognition, NER)在低资源条件下面临的挑战,包括标注数据稀缺、实体类型间语义歧义以及长尾分布问题。其关键解决方案在于设计一种结构化提示框架,融合领域特定知识与实体类型消歧规则,并引入两种语义引导的少样本示例选择方法,在降低标注成本的同时提升上下文学习性能;此外,通过任务级微调(task-level fine-tuning)实现了新的SOTA结果,表明优化提示策略的大语言模型(LLM)可作为传统监督模型的有效且可扩展替代方案,尤其适用于标注数据匮乏的生物医学NER场景。

链接: https://arxiv.org/abs/2508.09323
作者: Nan Miles Xi,Yu Deng,Lin Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.
zh

[NLP-52] ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

【速读】: 该论文旨在解决现有推理增强型搜索代理(reasoning-augmented search agents)在处理多步信息检索任务时存在的序列化执行瓶颈问题,即即使面对逻辑上独立且可并行化的查询结构,仍采用严格的顺序处理方式,导致计算效率低下。解决方案的关键在于提出ParallelSearch框架,该框架通过引入专门设计的奖励函数,使大语言模型(LLMs)能够识别查询中的可并行组件,并在保证答案准确性的前提下,实现多个搜索操作的并发执行;其核心创新在于联合优化正确性、查询分解质量与并行执行收益三个维度的奖励信号,从而显著提升性能并减少资源消耗。
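并发执行子查询的骨架可以用 asyncio 作如下示意(查询分解与检索接口均为假设封装,不含论文中基于可验证奖励的强化学习训练):

```python
import asyncio

async def search(query: str) -> str:
    """假设的检索接口封装,实际应调用搜索引擎或知识库。"""
    await asyncio.sleep(0.1)              # 模拟网络延迟
    return f"results for {query!r}"

async def parallel_search(llm_decompose, question: str):
    """LLM 先判断问题能否分解为相互独立的子查询,再并发执行检索,
    避免对可并行的比较类问题做串行处理。"""
    sub_queries = llm_decompose(question)  # 例如 ["A 的人口", "B 的人口"]
    results = await asyncio.gather(*(search(q) for q in sub_queries))
    return dict(zip(sub_queries, results))

# 用法示意:
# answers = asyncio.run(parallel_search(my_decomposer, "A 和 B 哪个人口更多?"))
```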

链接: https://arxiv.org/abs/2508.09303
作者: Shu Zhao,Tan Yu,Anbang Xu,Japinder Singh,Aaditya Shukla,Rama Akkiraju
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.
zh

[NLP-53] Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段对提示注入(prompt injection)和相关越狱攻击(jailbreak attacks)的脆弱性问题,此类攻击常通过绕过启发式防护机制(如规则、过滤器或LLM判别器)实现。其解决方案的核心是提出一种名为上下文完整性验证(Contextual Integrity Verification, CIV)的推理时安全架构:通过为每个token附加密码学签名的溯源标签,并在Transformer内部利用预softmax硬注意力掩码(hard attention mask)构建源信任层级(source-trust lattice),从而在冻结模型上实现逐token的非干扰性保证——即低信任度token无法影响高信任度表示。该方法无需微调即可实现0%攻击成功率,同时保持93.1%的token级相似度和无退化的困惑度表现。
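其 pre-softmax 信任掩码施加的核心约束可用如下 PyTorch 草图表达(单头、无批维的简化假设,且省略签名溯源标签等工程细节):

```python
import torch

def trust_masked_attention(q, k, v, trust):
    """按信任等级构造 pre-softmax 硬掩码:查询 token 只能注意
    信任等级不低于自身的 token,因此低信任 token 无法影响高信任表示。
    q/k/v: [T, d];trust: [T] 的整数等级,数值越大越可信。"""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # [T, T]
    allowed = trust.unsqueeze(1) <= trust.unsqueeze(0)       # allowed[i, j]: trust[i] <= trust[j]
    scores = scores.masked_fill(~allowed, float("-inf"))     # 对角线恒为 True,不会出现全 -inf 行
    return torch.softmax(scores, dim=-1) @ v
```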

链接: https://arxiv.org/abs/2508.09288
作者: Aayush Gupta
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 2 figures, 3 tables; code and certification harness: this https URL ; Elite-Attack dataset: this https URL

点击查看摘要

Abstract:Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch – no fine-tuning required – we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
zh

[NLP-54] NEFMind: Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation

【速读】: 该论文旨在解决现代电信网络中基于服务的架构(Service-Based Architecture, SBA)因网络功能(Network Functions, NFs)和应用编程接口(Application Programming Interfaces, APIs)数量激增而导致的服务发现与管理复杂性问题。解决方案的关键在于提出一个名为NEFMind的框架,其核心是利用开源大语言模型(Large Language Models, LLMs)的参数高效微调技术,集成三个关键组件:从网络暴露功能(Network Exposure Function, NEF)API规范生成合成数据集、通过量化低秩适配(Quantized-Low-Rank Adaptation)优化模型性能,并采用GPT-4 Ref Score与BertScore进行评估。该方法在5G SBA API场景下实现了较人工发现方法85%的通信开销降低,且使用Phi-2模型微调后达到98–100%的API调用识别准确率,性能媲美GPT-4的同时保持计算效率,验证了领域特定、参数高效的LLM策略在下一代电信网络API治理中的有效性。
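其中的量化低秩适配微调可用 transformers + peft 作如下示意(需安装 bitsandbytes;秩、目标模块等超参数均为示意假设,非论文配置):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# 4-bit 量化加载 Phi-2 基座模型
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", quantization_config=bnb)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # 目标模块与秩为示意假设
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # 仅训练少量 LoRA 适配器参数
```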

链接: https://arxiv.org/abs/2508.09240
作者: Zainab Khan,Ahmed Hussain,Mukesh Thakur,Arto Hellas,Panos Papadimitratos
机构: Aalto University (阿尔托大学); KTH Royal Institute of Technology (皇家理工学院); Ericsson (爱立信)
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:The use of Service-Based Architecture in modern telecommunications has exponentially increased Network Functions (NFs) and Application Programming Interfaces (APIs), creating substantial operational complexities in service discovery and management. We introduce \textitNEFMind, a framework leveraging parameter-efficient fine-tuning of open-source Large Language Models (LLMs) to address these challenges. It integrates three core components: synthetic dataset generation from Network Exposure Function (NEF) API specifications, model optimization through Quantized-Low-Rank Adaptation, and performance evaluation via GPT-4 Ref Score and BertScore metrics. Targeting 5G Service-Based Architecture APIs, our approach achieves 85% reduction in communication overhead compared to manual discovery methods. Experimental validation using the open-source Phi-2 model demonstrates exceptional API call identification performance at 98-100% accuracy. The fine-tuned Phi-2 model delivers performance comparable to significantly larger models like GPT-4 while maintaining computational efficiency for telecommunications infrastructure deployment. These findings validate domain-specific, parameter-efficient LLM strategies for managing complex API ecosystems in next-generation telecommunications networks.
zh

[NLP-55] From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

【速读】: 该论文旨在解决传统大型语言模型(Large Language Models)在安全训练中依赖二元拒绝边界(binary refusal boundaries)所带来的局限性问题,即模型仅根据用户意图判断是否完全遵守或直接拒绝请求,这种机制在面对意图模糊或具有双重用途(dual-use)的提示时容易失效,尤其在生物学或网络安全等场景中,可能因过度细化而引发恶意用途。解决方案的关键在于提出“安全完成”(safe-completions)方法:将安全训练的核心从对用户意图的二分类转向对模型输出本身的安全性约束,通过最大化在安全策略框架内的有用性(helpfulness),使模型能够在不违反安全规范的前提下提供更细致、更有价值的回答,从而提升安全性与实用性之间的平衡。

链接: https://arxiv.org/abs/2508.09224
作者: Yuan Yuan,Tina Sriskandarajah,Anna-Luisa Brakman,Alec Helyar,Alex Beutel,Andrea Vallone,Saachi Jain
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user’s intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant’s output, rather than a binary classification of the user’s intent. Safe-completions seek to maximize helpfulness within the safety policy’s constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
zh

[NLP-56] Δ-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

【速读】: 该论文旨在解决视觉指令微调(Visual Instruction Finetuning, VIF)中因多模态数据需求量大而带来的数据选择难题,即如何在保证图像与文本内容质量及其对齐的前提下,高效筛选高质量样本以降低数据依赖并提升训练效率。解决方案的关键在于提出一种名为 Δ-AttnMask 的数据高效框架,其通过模型隐藏状态的注意力引导掩码机制,计算原始状态与基于高注意力区域掩码后状态之间的损失差异(Δ),从而无需领域标签、辅助模型或额外训练即可内在地评估图像-文本对的质量,实现对样本质量的联合判别性量化。

链接: https://arxiv.org/abs/2508.09199
作者: Jucheng Hu,Suorong Yang,Dongzhan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose Δ-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model’s hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences (Δ) between the original states and states masked using high-attention regions, Δ-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that Δ-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.
zh
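
按摘要思路,可以用一个玩具级的 token 掩蔽近似来说明 Δ-AttnMask 的打分逻辑(实际方法作用于 VLM 的隐藏状态,此处的 token 级近似与掩码比例均为假设):

```python
import torch

@torch.no_grad()
def delta_attnmask_score(model, input_ids, attention_mask, mask_ratio=0.5):
    """比较"原始输入"与"掩蔽高注意力位置后的输入"两次前向的损失差 delta,
    以此为样本打分; 仅为示意, 非论文官方实现。"""
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                labels=input_ids, output_attentions=True)
    base_loss = out.loss
    # 每个位置"被关注"的平均注意力 (取最后一层, 对注意力头取均值)
    attn = out.attentions[-1].mean(dim=1)        # (B, T, T)
    received = attn.mean(dim=1)                  # (B, T)
    k = max(1, int(mask_ratio * received.shape[1]))
    top = received.topk(k, dim=1).indices        # 高注意力位置
    masked = input_ids.clone()
    masked.scatter_(1, top, 0)                   # 假设 0 为占位/pad token
    masked_loss = model(input_ids=masked, attention_mask=attention_mask,
                        labels=input_ids).loss
    return (masked_loss - base_loss).item()      # delta 越大, 样本信息量越高
```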

[NLP-57] MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis

【速读】: 该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis)中因视觉和听觉信息冗余或误导而导致的性能下降问题。现有方法通常将整个模态(如图像、音频片段或文本段落)视为独立单元进行特征增强或去噪,容易在抑制噪声的同时丢失关键信息。其解决方案的关键在于提出一种统一且灵活的模态感知噪声动态编辑框架(MoLAN),通过模态感知分块(modality-aware blocking)将各模态特征划分为多个块,并基于每个块的噪声水平与语义相关性动态分配不同的去噪强度,从而实现细粒度噪声抑制并保留重要多模态信息。

链接: https://arxiv.org/abs/2508.09145
作者: Xingle Xu,Yongkang Liu,Dexian Cai,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress the redundant and noise information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves the state-of-the-art performance. The code is publicly available at this https URL.
zh
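
下面是对 MoLAN “分块 + 逐块动态去噪强度”思路的最小示意(模块结构与维度均为我们的假设,非官方代码):

```python
import torch
import torch.nn as nn

class ModalityAwareBlockDenoise(nn.Module):
    """将某一模态的特征序列切成若干块, 由门控网络根据块内容预测 [0,1]
    的保留强度, 噪声大的块被更强抑制。"""
    def __init__(self, dim, num_blocks=4):
        super().__init__()
        self.num_blocks = num_blocks
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, T, D), 假设 T 可被块数整除
        B, T, D = x.shape
        blocks = x.view(B, self.num_blocks, T // self.num_blocks, D)
        strength = self.gate(blocks.mean(dim=2))       # (B, num_blocks, 1)
        out = blocks * strength.unsqueeze(2)           # 逐块缩放特征
        return out.view(B, T, D)
```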

[NLP-58] ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs INTERSPEECH2025

【速读】: 该论文旨在解决语音合成中韵律建模不足的问题,即如何从文本中准确生成具有自然情感和语义信息的韵律特征(如基频 F0 和能量),以提升语音合成系统的表现。解决方案的关键在于提出一种独立的端到端模型 ProMode,其核心机制是通过编码器将部分遮蔽的声学特征与时间对齐的文本内容融合,生成固定长度的韵律嵌入(latent prosodic embedding),再由解码器利用该嵌入及未遮蔽文本内容预测被遮蔽区域的声学特征,从而实现高精度的韵律特征重建。该方法在 GigaSpeech 数据集上训练并验证,相较于现有最优风格编码器,在不同粒度下均实现了 F0 和能量预测性能的稳定提升,并在语音合成任务中通过感知测试证明了其在韵律表达上的优势。

链接: https://arxiv.org/abs/2508.09389
作者: Eray Eren,Qingju Liu,Hyeongwoo Kim,Pablo Garrido,Abeer Alwan
机构: University of California, Los Angeles (加州大学洛杉矶分校); Imperial College London (帝国理工学院); FlawlessAI (FlawlessAI)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Interspeech 2025; demo page at this https URL

点击查看摘要

Abstract:Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both partially masked, and obtains a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model’s potential in tasks where prosody modeling is important.
zh

[NLP-59] Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention’s Alternative

【速读】: 该论文旨在解决生成式语音合成(Generative AI)技术发展带来的深度伪造语音(Deepfake Speech)安全威胁问题,特别是如何在实时场景下高效检测合成语音。其解决方案的关键在于提出一种基于双向Mamba架构的新型检测框架Fake-Mamba,通过结合XLSR(Cross-Lingual Speech Representation)前端与三种高效的编码器(TransBiMamba、ConBiMamba和PN-BiMamba),有效捕捉合成语音中的局部与全局异常特征,其中PN-BiMamba利用XLSR丰富的语言表征能力精准识别细微的合成痕迹,从而在ASVspoof 21 LA、21 DF及In-The-Wild等多个基准上实现显著优于当前最优模型(如XLSR-Conformer和XLSR-Mamba)的等错误率(EER),且保持实时推理性能,展现出良好的泛化能力和实际部署潜力。

链接: https://arxiv.org/abs/2508.09294
作者: Xi Xuan,Zimo Zhu,Wenxin Zhang,Yi-Cheng Lin,Tomi Kinnunen
机构: University of Eastern Finland (东芬兰大学); University of California Santa Barbara (加州大学圣巴巴拉分校); University of Chinese Academy of Sciences (中国科学院大学); University of Toronto (多伦多大学); National Taiwan University (台湾大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Accepted at IEEE ASRU 2025

点击查看摘要

Abstract:Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR’s rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at this https URL.
zh
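
双向 Mamba 编码块可以按下述方式示意(这是我们对论文 BiMamba 变体的一种理解;依赖 mamba-ssm 包,实际运行需 CUDA 环境):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba   # 假设依赖: pip install mamba-ssm

class BiMambaBlock(nn.Module):
    """一个 SSM 正向扫描序列, 另一个对反转序列扫描, 拼接投影后残差相加。"""
    def __init__(self, dim):
        super().__init__()
        self.fwd = Mamba(d_model=dim)
        self.bwd = Mamba(d_model=dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                               # x: (B, T, D)
        h_fwd = self.fwd(x)
        h_bwd = self.bwd(x.flip(dims=[1])).flip(dims=[1])
        return x + self.proj(torch.cat([h_fwd, h_bwd], dim=-1))
```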

计算机视觉

[CV-0] Story2Board: A Training-Free Approach for Expressive Storyboard Generation

【速读】:该论文旨在解决现有生成式AI (Generative AI) 方法在视觉叙事生成中忽视空间构图、背景演变和叙事节奏等关键要素的问题,导致生成的分镜(storyboard)缺乏连贯性和表现力。解决方案的核心在于提出一个无需训练的轻量级一致性框架——Story2Board,其关键创新包括:1)Latent Panel Anchoring(潜空间面板锚定),用于跨画面保持角色特征的一致性;2)Reciprocal Attention Value Mixing(互注意值混合),通过软融合具有强互注意力的token对之间的视觉特征,增强画面间语义关联。这两个机制无需修改模型架构或微调即可提升扩散模型生成分镜的视觉多样性与叙事一致性,从而实现更动态、连贯且富有叙事吸引力的分镜生成。

链接: https://arxiv.org/abs/2508.09983
作者: David Dinkevich,Matan Levy,Omri Avrahami,Dvir Samuel,Dani Lischinski
机构: Hebrew University of Jerusalem (希伯来大学); Bar-Ilan University (巴伊兰大学); OriginAI (OriginAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page is available at this https URL

点击查看摘要

Abstract:We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.
zh
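
摘要中的 Reciprocal Attention Value Mixing 可以用如下单头示意来理解(混合系数 alpha 与归一化方式均为假设):

```python
import torch

def reciprocal_value_mixing(attn, values, alpha=0.3):
    """仅当 token 对 (i, j) 互相之间的注意力都强时才混合二者的 value 特征。
    attn: (T, T) 注意力概率; values: (T, D)。"""
    reciprocal = torch.minimum(attn, attn.transpose(0, 1))  # 互注意强度取两向较小值
    weights = reciprocal / (reciprocal.sum(dim=-1, keepdim=True) + 1e-8)
    mixed = weights @ values
    return (1 - alpha) * values + alpha * mixed             # 软混合, 以原特征为主
```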

[CV-1] LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)因长视觉标记序列和庞大参数规模导致的计算与内存开销过高的问题。现有训练-free 压缩方法存在三大局限:缺乏模块化分解以公平比较空间与时间冗余压缩技术、评估局限于简单单轮任务而无法反映真实场景表现、以及孤立使用单一压缩技术未探索其协同潜力。解决方案的关键在于提出 LLMC+,一个全面的 VLM 压缩基准测试平台,配备可插拔式工具包,支持超过 20 种算法在五个代表性 VLM 家族中的系统性研究,涵盖标记级与模型级压缩,并通过实证发现:空间与时间冗余需差异化策略;标记压缩在多轮对话和细节敏感任务中性能显著下降;联合使用标记压缩与模型压缩可实现极致压缩且性能损失最小。

链接: https://arxiv.org/abs/2508.09981
作者: Chengtao Lv,Bilang Zhang,Yang Yong,Ruihao Gong,Yushi Huang,Shiqiao Gu,Jiajun Wu,Yumeng Shi,Jinyang Guo,Wenya Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at this https URL.
zh

[CV-2] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing and Generation

【速读】:该论文旨在解决3D场景表示中如何高效支持下游任务(如分割、编辑、生成等)的问题,特别是在传统Neural Radiance Fields (NeRF)方法难以满足实时性和几何语义理解需求的背景下。其解决方案的关键在于系统梳理和归纳基于3D Gaussian Splatting (3DGS) 的各类应用进展,强调3DGS因其显式、紧凑的表示特性,能够有效支撑多种需要几何与语义理解的任务,并通过总结2D基础模型、监督策略、学习范式及基准评估体系,提炼出共性设计原则与发展趋势,从而为后续研究提供结构化参考和资源支持。

链接: https://arxiv.org/abs/2508.09977
作者: Shuting He,Peilin Ji,Yitong Yang,Changshuo Wang,Jiayi Ji,Yinglin Wang,Henghui Ding
机构: Shanghai University of Finance and Economics (上海财经大学); University College London (伦敦大学学院); National University of Singapore (新加坡国立大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub Repo: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at this https URL.
zh

[CV-3] PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image ICCV2025

【速读】:该论文试图解决现有生成可动画化人类虚拟形象方法中存在的两大难题:一是基于3D的方法(如NeRF或3DGS)虽然能实现个性化身份表征,但需要大量姿态丰富的视频来建模非刚性形变(如衣物褶皱),而这类数据在日常生活中难以获取;二是基于扩散模型的方法虽能从大规模野生视频中学习姿态驱动的形变,却面临身份保持困难和姿态依赖的身份混淆问题。解决方案的关键在于提出PERSONA框架,该框架结合了两种方法的优势:首先利用扩散模型从单张输入图像生成姿态丰富的视频作为训练数据,再基于这些视频优化一个3D人体虚拟形象。为提升渲染真实感与细节保真度,引入平衡采样策略以减少扩散生成视频中的身份漂移,并采用几何加权优化策略,在损失函数中优先考虑几何约束而非图像重建误差,从而在多样化姿态下保持高质量的渲染效果。

链接: https://arxiv.org/abs/2508.09973
作者: Geonhee Sim,Gyeongsik Moon
机构: Korea University (韩国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025. this https URL

点击查看摘要

Abstract:Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
zh

[CV-4] Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models

【速读】:该论文旨在解决测试时扩展(test-time scaling)方法在推理阶段计算开销过大、导致实际应用受限的问题。尽管测试时扩展能显著提升大型语言模型和生成式视觉模型的性能,但其增加的计算时间使其难以部署于对效率敏感的场景。解决方案的关键在于将测试时扩展的知识在后训练阶段融入模型本身,具体通过引入一个噪声超网络(Noise Hypernetwork)替代扩散模型中基于奖励引导的测试时噪声优化机制,该超网络可调节初始输入噪声,从而在训练阶段学习到一种受奖励偏置的噪声分布;这一过程构建了一个可解析的噪声空间目标函数,在保持与基础模型一致性的前提下优化生成质量,最终实现以极低的计算成本恢复大部分显式测试时优化带来的性能增益。

链接: https://arxiv.org/abs/2508.09968
作者: Luca Eyring,Shyamgopal Karthik,Alexey Dosovitskiy,Nataniel Ruiz,Zeynep Akata
机构: Technical University of Munich (慕尼黑技术大学); Munich Center of Machine Learning (慕尼黑机器学习中心); Helmholtz Munich (赫尔姆霍兹慕尼黑研究中心); University of Tübingen (图宾根大学); Inceptive; Google(谷歌)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at this https URL
zh
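
噪声超网络的基本训练形式可以按如下最小示意理解(网络结构、奖励与正则项均为我们的假设):

```python
import torch
import torch.nn as nn

class NoiseHypernetwork(nn.Module):
    """学习对初始噪声 z 的残差式调制, 使冻结的蒸馏生成器输出更高奖励,
    同时用正则项约束调制后的噪声不偏离高斯先验。"""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z):
        return z + self.net(z)               # 残差调制, 保持接近先验

def hypernet_loss(hypernet, generator, reward_fn, z, lambda_reg=0.1):
    # generator 为冻结的单步/蒸馏生成器, reward_fn 为任意可微奖励, 均为占位假设
    z_mod = hypernet(z)
    x = generator(z_mod)
    return -reward_fn(x).mean() + lambda_reg * (z_mod - z).pow(2).mean()
```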

[CV-5] MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification MICCAI2025

【速读】:该论文旨在解决当前基于视觉-语言基础模型(Vision-Language Foundation Models, VLFMs)的全切片图像(Whole Slide Image, WSI)分类方法在数据稀缺场景下性能不足的问题,尤其是现有少量样本学习(few-shot learning)方法因依赖传统分类器设计而对标注数据极度敏感、泛化能力弱。其解决方案的关键在于提出一种元优化分类器(Meta-Optimized Classifier, MOC),包含两个核心组件:一是元学习器(meta-learner),用于从候选分类器集合中自动优化出最优分类器配置;二是分类器库(classifier bank),集成多种多样化的候选分类器以实现病理图像的多维度理解与综合判读。该设计显著提升了模型在极端低样本量条件下的诊断准确性,尤其在TCGA-NSCLC基准上相较最先进方法提升AUC达10.4%,1样本条件下最高提升26.25%。

链接: https://arxiv.org/abs/2508.09967
作者: Tianqi Xiang,Yi Li,Qixiang Zhang,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in MICCAI 2025

点击查看摘要

Abstract:Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior art on multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at this https URL.
zh
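
按摘要描述,“分类器库 + 元学习混合”的 MOC 思路可用如下示意实现(候选分类头的具体形式与混合方式均为假设):

```python
import torch
import torch.nn as nn

class MetaOptimizedClassifier(nn.Module):
    """分类器库存放多种候选分类头, 元学习参数给出 softmax 混合权重。"""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.bank = nn.ModuleList([
            nn.Linear(dim, num_classes),                             # 线性探针
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, num_classes)),              # MLP 头
        ])
        self.mix_logits = nn.Parameter(torch.zeros(len(self.bank)))  # 元学习参数

    def forward(self, x):                    # x: (B, dim) 的切片级特征
        w = torch.softmax(self.mix_logits, dim=0)
        return sum(wi * head(x) for wi, head in zip(w, self.bank))
```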

[CV-6] January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis

【速读】:该论文旨在解决自动化营养分析领域中缺乏标准化评估方法和高质量真实世界基准数据集的问题。其关键解决方案包括:构建公开可用的January Food Benchmark(JFB)数据集,包含1,000张经人工验证标注的食物图像;提出一套全面的评估框架,包含稳健的指标和一种面向应用的整体评分(Overall Score)以实现模型性能的综合评估;并基于此框架提供了通用视觉-语言模型(Vision-Language Models, VLMs)与专为食物视觉任务设计的january/food-vision-v1模型的基线结果,表明专用模型在整体评分上比最优通用配置提升12.1分,显著推动了该领域的研究进展。

链接: https://arxiv.org/abs/2508.09966
作者: Amir Hosseinian,Ashkan Dehghani Zahedani,Umer Mansoor,Noosheen Hashemi,Mark Woodward
机构: January AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.
zh

[CV-7] LIA-X: Interpretable Latent Portrait Animator

【速读】:该论文旨在解决人脸动画生成中缺乏细粒度控制与可解释性的问题,即如何在保持源肖像身份不变的前提下,精确迁移驱动视频中的面部动态(如表情、姿态),同时支持用户对关键面部语义进行可控编辑。解决方案的关键在于提出LIA-X模型——一种基于稀疏运动字典(Sparse Motion Dictionary)的可解释自编码器架构,它将运动转移建模为潜在空间中运动码的线性导航,并通过稀疏运动字典实现面部动态的解耦表示,从而支持“编辑-变形-重渲染”(edit-warp-render)的可控流程,显著提升动画结果的准确性和可控性。

链接: https://arxiv.org/abs/2508.09959
作者: Yaohui Wang,Di Yang,Xinyuan Chen,Francois Bremond,Yu Qiao,Antitza Dantcheva
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Inria, Université Côte d’Azur (法国国家信息与自动化研究院,蔚蓝海岸大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous ‘warp-render’ approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable ‘edit-warp-render’ strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
zh
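
“潜空间线性导航”这一核心操作可以用几行代码示意(字典维度与步长来源均为假设):

```python
import torch

def navigate_motion_code(z_src, motion_dict, magnitudes):
    """运动编辑被表示为沿稀疏运动字典方向的线性移动。
    motion_dict: (K, D) 的 K 个可解释方向;
    magnitudes: (K,) 由驱动视频推断或用户手动指定的步长。"""
    return z_src + magnitudes @ motion_dict   # 编辑后的运动码, 交给解码器渲染

# 用法示意: 仅增大第 0 个方向(例如某个面部语义)的幅度, 其余方向为 0
# z_edit = navigate_motion_code(z_src, motion_dict, torch.tensor([0.8, 0., 0., 0.]))
```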

[CV-8] Stable Diffusion Models are Secretly Good at Visual In-Context Learning ICCV2025

【速读】:该论文旨在解决视觉领域中**上下文学习(Visual In-Context Learning, V-ICL)**的泛化能力不足问题,即如何在不进行额外微调(fine-tuning)的情况下,利用少量示例提示(prompt)使预训练模型适应多种下游视觉任务。传统V-ICL方法通常依赖于特定训练或额外数据,限制了其通用性。解决方案的关键在于:将现成的Stable Diffusion模型重构为支持视觉上下文学习的框架,通过在自注意力(self-attention)层中引入一种“原位注意力重计算”机制(in-place attention re-computation),显式建模查询与示例提示之间的上下文关系。该方法无需任何额外训练即可在六类视觉任务(如前景分割、关键点检测等)上实现显著性能提升,且可通过多提示集成进一步增强推理效果。

链接: https://arxiv.org/abs/2508.09949
作者: Trevine Oorloff,Vishwanath Sindagi,Wele Gedara Chaminda Bandara,Ali Shafahi,Amin Ghiasi,Charan Prakash,Reza Ardekani
机构: Apple(苹果); University of Maryland - College Park(马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) – the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.
zh

[CV-9] AST-n: A Fast Sampling Approach for Low-Dose CT Reconstruction using Diffusion Models

【速读】:该论文旨在解决低剂量CT(Low-dose CT, LDCT)成像中因辐射暴露降低而导致图像噪声增加、进而影响诊断信心的问题。其核心解决方案是提出一种加速推理框架AST-n,通过从中间噪声水平启动逆扩散过程,并在条件模型中集成高阶常微分方程(ODE)求解器以减少采样步数。该方法在保持图像保真度(PSNR > 38 dB,SSIM > 0.95)的同时,将每切片的推理时间从约16秒缩短至1秒以内,显著提升了扩散模型在临床场景中的实用性。

链接: https://arxiv.org/abs/2508.09943
作者: Tomás de la Sotta,José M. Saavedra,Héctor Henríquez,Violeta Chang,Aline Xavier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-dose CT (LDCT) protocols reduce radiation exposure but increase image noise, compromising diagnostic confidence. Diffusion-based generative models have shown promise for LDCT denoising by learning image priors and performing iterative refinement. In this work, we introduce AST-n, an accelerated inference framework that initiates reverse diffusion from intermediate noise levels, and integrate high-order ODE solvers within conditioned models to further reduce sampling steps. We evaluate two acceleration paradigms, AST-n sampling and standard scheduling with high-order solvers, on the Low Dose CT Grand Challenge dataset, covering head, abdominal, and chest scans at 10-25% of standard dose. Conditioned models using only 25 steps (AST-25) achieve peak signal-to-noise ratio (PSNR) above 38 dB and structural similarity index (SSIM) above 0.95, closely matching standard baselines while cutting inference time from ~16 s to under 1 s per slice. Unconditional sampling suffers substantial quality loss, underscoring the necessity of conditioning. We also assess DDIM inversion, which yields marginal PSNR gains at the cost of doubling inference time, limiting its clinical practicality. Our results demonstrate that AST-n with high-order samplers enables rapid LDCT reconstruction without significant loss of image fidelity, advancing the feasibility of diffusion-based methods in clinical workflows.
zh
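
“从中间噪声水平启动逆扩散”的采样骨架可按如下方式示意(接口按 diffusers 风格假设,条件模型的 cond 参数为占位):

```python
import torch

@torch.no_grad()
def ast_n_denoise(model, scheduler, ldct, num_steps=25):
    """不从纯噪声出发, 而是把低剂量输入扰动到中间噪声水平后,
    只跑剩余的逆扩散步。"""
    scheduler.set_timesteps(num_inference_steps=num_steps)
    t_start = scheduler.timesteps[0]                 # 中间噪声水平
    x = scheduler.add_noise(ldct, torch.randn_like(ldct), t_start)
    for t in scheduler.timesteps:
        eps = model(x, t, cond=ldct)                 # 条件模型, 接口为假设
        x = scheduler.step(eps, t, x).prev_sample
    return x                                          # 去噪后的 CT 切片
```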

[CV-10] Quo Vadis Handwritten Text Generation for Handwritten Text Recognition? ICCV

【速读】:该论文旨在解决历史手稿数字化过程中,由于小规模、作者特异性文本数据与训练数据分布差异较大而导致的手写文本识别(Handwritten Text Recognition, HTR)系统性能下降问题。其解决方案的关键在于利用手写文本生成(Handwritten Text Generation, HTG)技术,通过合成特定书写风格的训练数据来增强HTR模型在低资源场景下的泛化能力。研究系统比较了三种代表当前最先进水平的HTG模型(分别基于生成对抗网络、扩散模型和自回归范式),并量化分析了合成数据的视觉与语言特征对HTR微调效果的影响,从而为选择最优HTG方法提供实证依据。

链接: https://arxiv.org/abs/2508.09936
作者: Vittorio Pippi,Konstantina Nikolaidou,Silvia Cascianelli,George Retsinas,Giorgos Sfikas,Rita Cucchiara,Marcus Liwicki
机构: University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学); Luleå University of Technology (吕勒奥理工大学); National Technical University of Athens (雅典国立技术大学); University of West Attica (西阿提卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: Accepted at ICCV Workshop VisionDocs

点击查看摘要

Abstract:The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.
zh

[CV-11] Towards Comprehensive Cellular Characterisation of HE slides

【速读】:该论文旨在解决当前肿瘤微环境(TME)分析中细胞检测、分割与分类方法在罕见或未见细胞类型上的性能不足,以及跨域泛化能力有限的问题。其解决方案的关键在于构建了一个涵盖13种细胞类型的新型多癌种核图像数据集(共108,722个 nuclei),并基于此训练出HistoPLUS模型——该模型在外部独立队列验证中相比现有最优模型在检测质量上提升5.2%,整体F1分类得分提升23.7%,同时参数量减少至五分之一,并首次实现了对7种 understudied cell types 的有效识别及对8种细胞类型的显著性能改进,且具备良好的跨适应症迁移能力。

链接: https://arxiv.org/abs/2508.09926
作者: Benjamin Adjadj (1), Pierre-Antoine Bannier (1), Guillaume Horent (1), Sebastien Mandela, Aurore Lyon (1), Kathryn Schutte, Ulysse Marteau (1), Valentin Gaury (1), Laura Dumont (1), Thomas Mathieu (1), Reda Belbahri (1), Benoît Schmauch (1), Eric Durand (1), Katharina Von Loga (1), Lucie Gillet (1) ((1) Owkin)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 33 pages, 4 figures

点击查看摘要

Abstract:Cell detection, segmentation and classification are essential for analyzing tumor microenvironments (TME) on hematoxylin and eosin (HE) slides. Existing methods suffer from poor performance on understudied cell types (rare or not present in public datasets) and limited cross-domain generalization. To address these shortcomings, we introduce HistoPLUS, a state-of-the-art model for cell analysis, trained on a novel curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation across 4 independent cohorts, HistoPLUS outperforms current state-of-the-art models in detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. Notably, HistoPLUS unlocks the study of 7 understudied cell types and brings significant improvements on 8 of 13 cell types. Moreover, we show that HistoPLUS robustly transfers to two oncology indications unseen during training. To support broader TME biomarker research, we release the model weights and inference code at this https URL.
zh

[CV-12] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection NEURIPS2024

【速读】:该论文旨在解决人脸伪造视频检测中跨数据集泛化能力弱以及对常见扰动敏感的问题。其解决方案的关键在于利用音频与视觉语音信号之间的协同关系,通过自监督的掩码预测任务学习真实视频中的精确音视频语音表征,从而同时捕捉局部和全局语义信息,并将该表征模型直接迁移至伪造检测任务中,无需使用任何伪造视频进行训练即可实现优异的跨域泛化性能和鲁棒性。

链接: https://arxiv.org/abs/2508.09913
作者: Yachao Liang,Min Yu,Gang Li,Jianguo Jiang,Boquan Li,Feng Yu,Ning Zhang,Xiang Meng,Weiqing Huang
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Deakin University (迪肯大学); Harbin Engineering University (哈尔滨工程大学); Institute of Computing Technology (计算技术研究所); Institute of Forensic Science (司法鉴定科学研究所); Ministry of Public Security (公安部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at this https URL.
zh

[CV-13] E-4DGS: High-Fidelity Dynamic Reconstruction from the Multi-view Event Cameras

【速读】:该论文旨在解决传统基于RGB相机的视角合成(novel view synthesis)与4D重建技术在高速运动和低光照场景下存在的局限性,如对光照条件依赖性强、易受运动模糊影响以及动态范围有限等问题。其解决方案的关键在于提出E-4DGS,首个面向多视角事件流的事件驱动动态高斯点绘制方法,通过事件感知初始化策略确保训练稳定性,引入事件自适应切片点绘制(event-adaptive slicing splatting)实现时间感知重建,并结合强度重要性剪枝(intensity importance pruning)去除浮点伪影、提升三维一致性,同时采用自适应对比度阈值优化精度,从而在挑战性运动场景中实现更鲁棒且高质量的动态场景重建。

链接: https://arxiv.org/abs/2508.09912
作者: Chaoran Feng,Zhenyu Tang,Wangbo Yu,Yatian Pang,Yian Zhao,Jianbin Zhao,Li Yuan,Yonghong Tian
机构: Peking University (北京大学); National University of Singapore (新加坡国立大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, 5 Tables, accepted by ACMMM 2025

点击查看摘要

Abstract:Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and low-light scenes. To this end, we propose E-4DGS, the first event-driven dynamic Gaussian Splatting approach, for novel view synthesis from multi-view event streams with fast-moving cameras. Specifically, we introduce an event-based initialization scheme to ensure stable training and propose event-adaptive slicing splatting for time-aware reconstruction. Additionally, we employ intensity importance pruning to eliminate floating artifacts and enhance 3D consistency, while incorporating an adaptive contrast threshold for more precise optimization. We design a synthetic multi-view camera setup with six moving event cameras surrounding the object in a 360-degree configuration and provide a benchmark multi-view event stream dataset that captures challenging motion scenarios. Our approach outperforms both event-only and event-RGB fusion baselines and paves the way for the exploration of multi-view event-based reconstruction as a novel approach for rapid scene capture.
zh

[CV-14] HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

【速读】:该论文旨在解决合成人类动态(Synthetic human dynamics)中的两大核心挑战:一是由于有限的3D建模能力和细节保留不足导致的几何不一致性和重建粗糙问题;二是由于生成能力较弱引发的动作泛化限制和场景不协调问题。解决方案的关键在于提出HumanGenesis框架,通过四个协同工作的智能体实现几何与生成建模的深度融合:(1) Reconstructor利用3D高斯泼溅(3D Gaussian Splatting)和形变分解构建三维一致的人体-场景表示;(2) Critique Agent借助多轮基于多模态大语言模型(MLLM)的反思机制提升重建精度;(3) Pose Guider采用时间感知参数编码器生成富有表现力的姿态序列以增强动作泛化能力;(4) Video Harmonizer结合扩散模型与混合渲染流水线生成逼真且连贯的视频,并通过Back-to-4D反馈回路优化Reconstructor。该方案在文本引导合成、视频重演和新姿态泛化等任务上达到当前最优性能,显著提升了表达性、几何保真度和场景融合能力。

链接: https://arxiv.org/abs/2508.09858
作者: Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang
机构: 1. Institute of Artificial Intelligence, Chinese Academy of Sciences (中国科学院人工智能研究院); 2. School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); 3. National Engineering Research Center for Intelligent Computing Systems (国家智能计算系统工程研究中心); 4. Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education (教育部智能计算与信号处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, due to limited 3D modeling and detail preservation; and (2) motion generalization limitations and scene inharmonization, stemming from weak generative capabilities. To address these, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) Reconstructor builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) Critique Agent enhances reconstruction fidelity by identifying and refining poor regions via multi-round MLLM-based reflection. (3) Pose Guider enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) Video Harmonizer synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.
zh

[CV-15] OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

【速读】:该论文旨在解决离散视频变分自编码器(Discrete Video VAE)在训练不稳定、收敛速度慢以及重建质量差等问题,同时探索如何有效融合连续与离散表示以提升性能。其关键解决方案在于:首先,利用预训练的连续VAE先验知识,通过FSQ(Fast Search Quantization)方法实现更优的离散化过程,显著加速收敛并提升最终性能;其次,提出多标记量化机制和首帧重建增强策略,分别在不牺牲压缩比的前提下提升PSNR指标,并改善高压缩率下的重建效果;最后,设计了一种联合离散-连续优化方案,首次在一个网络中统一实现两种表示范式的竞争力,命名为OneVAE。

链接: https://arxiv.org/abs/2508.09857
作者: Yupeng Zhou,Zhen Li,Ziheng Ouyang,Yuming Chen,Ruoyi Du,Daquan Zhou,Bin Fu,Yihao Liu,Peng Gao,Ming-Ming Cheng,Qibin Hou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet introducing significant spatiotemporal compression compared to continuous video representation. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4 x 16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
zh
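
FSQ(有限标量量化)本身的核心只有几行,可按如下最小示意理解(各维的级数 levels 为假设值):

```python
import torch

def fsq_quantize(z, levels=(8, 8, 8, 5, 5)):
    """每个潜变量维度先被有界化, 再四舍五入到少量离散级,
    并用直通估计器 (straight-through) 让梯度传回编码器。"""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)  # (D,)
    half = (L - 1) / 2
    z_bounded = torch.tanh(z) * half              # 约束到 [-half, half]
    z_q = torch.round(z_bounded)                  # 逐维标量量化
    return z_bounded + (z_q - z_bounded).detach() # 直通梯度
```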

[CV-16] Toward Human-Robot Teaming: Learning Handover Behaviors from 3D Scenes

【速读】:该论文旨在解决人机协作(Human-Robot Teaming, HRT)系统中,特别是在近距离协作任务如人机交接(human-to-robot handover)场景下,机器人从真实世界RGB图像中学习操作策略时面临的高成本数据采集问题。传统方法依赖大量物理环境中的机器人动作试验,而仿真训练则受限于仿真与现实之间的视觉域差距(visual domain gap)。其解决方案的关键在于:利用稀疏视角的高斯点绘(Gaussian Splatting)重建技术,从RGB图像中构建人机交接场景的三维表示,并基于该重建场景生成包含图像-动作对的机器人示范数据,从而无需真实机器人训练即可学习有效的交接策略。该方法通过将模拟相机位姿变化直接映射为夹爪位姿变化,显著提升了策略迁移至真实场景的能力,实现了更鲁棒、无缝的人机协作。

链接: https://arxiv.org/abs/2508.09855
作者: Yuekun Wu,Yik Lung Pang,Andrea Cavallaro,Changjae Oh
机构: Centre for Intelligent Sensing, Queen Mary University of London, UK; Idiap Research Institute; École Polytechnique Fédérale de Lausanne, Switzerland
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 3 pages, 3 figures

点击查看摘要

Abstract:Human-robot teaming (HRT) systems often rely on large-scale datasets of human and robot interactions, especially for close-proximity collaboration tasks such as human-robot handovers. Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although simulation training offers a cost-effective alternative, the visual domain gap between simulation and robot workspace remains a major limitation. We introduce a method for training HRT policies, focusing on human-to-robot handovers, solely from RGB images without the need for real-robot training or real-robot data collection. The goal is to enable the robot to reliably receive objects from a human with stable grasping while avoiding collisions with the human hand. The proposed policy learner leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, the simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. Experiments in both Gaussian Splatting reconstructed scene and real-world human-to-robot handover experiments demonstrate that our method serves as a new and effective representation for the human-to-robot handover task, contributing to more seamless and robust HRT.
zh

[CV-17] Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在图像识别任务中表现优异,但其与人类感知对齐程度尚不明确的问题。研究通过系统分析模型规模、数据集规模、数据增强和正则化策略等因素对ViT在TID2013数据集上与人类判断一致性的影响,揭示了模型复杂度与训练策略如何影响其感知对齐性。解决方案的关键在于量化不同训练因素对感知对齐性的边际效应:发现更大模型、更强的数据增强与正则化以及重复训练循环均显著降低感知对齐度,从而提出一个核心结论——模型性能提升与人类感知一致性之间存在权衡关系,为需具备类人视觉理解能力的应用提供了重要设计约束。

链接: https://arxiv.org/abs/2508.09850
作者: Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Valero Laparra,Jesus Malo
机构: Image Processing Lab, Universidad de Valencia, Paterna, Spain (图像处理实验室,瓦伦西亚大学,帕特纳,西班牙); Center for Biomaterials and Tissue Engineering Universitat Politecnica de Valencia, Valencia, Spain (生物材料与组织工程中心,瓦伦西亚理工大学,瓦伦西亚,西班牙)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
zh
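
衡量“感知对齐”的常规做法是在 TID2013 上计算模型距离与人类主观分的秩相关,示意如下(具体协议细节为假设):

```python
from scipy.stats import spearmanr

def perceptual_alignment(model_distances, human_mos):
    """model_distances: 模型给出的参考图-失真图距离;
    human_mos: 对应的人类主观分 (MOS)。距离越大质量应越低,
    相关系数通常为负, 取绝对值比较对齐程度。"""
    rho, _ = spearmanr(model_distances, human_mos)
    return abs(rho)
```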

[CV-18] ARI3D: A Software for Interactive Quantification of Regions in X-Ray CT 3D Images

【速读】:该论文旨在解决X射线计算机断层扫描(X-ray computed tomography, XCT)在定量分析材料内部微观结构时面临的挑战,特别是由于成像伪影(如束硬化效应和部分体积效应)导致的相识别困难、对象量化精度低以及分析流程缺乏标准化的问题。解决方案的关键在于提出一种名为ARI3D的交互式三维图像分析软件工具,其核心能力包括:提升相识别准确性、有效校正部分体积效应、提高对象检测极限与量化精度,并通过结构化协议实现跨学科领域一致的定量三维分析流程。

链接: https://arxiv.org/abs/2508.09849
作者: Jan Phillipp Albrecht,Jose R.A. Godinho,Christina Hübers,Deborah Schmidt
机构: Max Delbrück Center for Molecular Medicine in the Helmholtz Association(亥姆霍兹分子医学中心); Helmholtz Imaging(亥姆霍兹成像); Humboldt-Universität zu Berlin(柏林洪堡大学); Helmholtz-Zentrum Dresden-Rossendorf(德累斯顿罗斯多夫亥姆霍兹中心); Helmholtz-Institut Freiberg für Ressourcentechnologie (HIF)(弗莱贝格资源技术亥姆霍兹研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 2 figures and 6 pages main article, 17 pages total, 8 figures total, to be published in SoftwareX

点击查看摘要

Abstract:X-ray computed tomography (CT) is the main 3D technique for imaging the internal microstructures of materials. Quantitative analysis of the microstructures is usually achieved by applying a sequence of steps that are implemented to the entire 3D image. This is challenged by various imaging artifacts inherent from the technique, e.g., beam hardening and partial volume. Consequently, the analysis requires users to make a number of decisions to segment and classify the microstructures based on the voxel gray-values. In this context, a software tool, here called ARI3D, is proposed to interactively analyze regions in three-dimensional X-ray CT images, assisting users through the various steps of a protocol designed to classify and quantify objects within regions of a three-dimensional image. ARI3D aims to 1) Improve phase identification; 2) Account for partial volume effect; 3) Increase the detection limit and accuracy of object quantification; and 4) Harmonize quantitative 3D analysis that can be implemented in different fields of science.
zh

[CV-19] Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance

【速读】:该论文旨在解决小规模数据集下人脸生成模型的可控性和语义一致性问题,尤其是在有限标注数据条件下如何提升属性引导生成的质量与精确度。其解决方案的关键在于:首先引入InfoNCE损失函数用于属性嵌入的对比学习,增强属性向量的区分能力;其次采用SegFormer作为分割编码器替代传统方法,实现更精准的语义分割掩码提取;二者结合显著提升了属性引导生成中语义对齐和控制精度,尤其在CelebAMask-HQ等小规模数据集上表现优异。

链接: https://arxiv.org/abs/2508.09847
作者: Dhruvraj Singh Rawat,Enggen Sherpa,Rishikesan Kirupanantha,Tin Hoang
机构: University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, preprint

点击查看摘要

Abstract:We present a benchmark of diffusion models for human face generation on a small-scale CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Our study compares UNet and DiT architectures for unconditional generation and explores LoRA-based fine-tuning of pretrained Stable Diffusion models as a separate experiment. Building on the multi-conditioning approach of Giambi and Lisanti, which uses both attribute vectors and segmentation masks, our main contribution is the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder. These enhancements improve the semantic alignment and controllability of attribute-guided synthesis. Our results highlight the effectiveness of contrastive embedding learning and advanced segmentation encoding for controlled face generation in limited data settings.
zh
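
摘要中用于属性嵌入的 InfoNCE 损失可按如下标准形式示意(批内配对方式与温度系数为假设):

```python
import torch
import torch.nn.functional as F

def attribute_info_nce(attr_emb, img_emb, temperature=0.07):
    """匹配的属性/图像嵌入在批内互为正样本, 其余为负样本。"""
    a = F.normalize(attr_emb, dim=-1)      # (B, D)
    v = F.normalize(img_emb, dim=-1)       # (B, D)
    logits = a @ v.t() / temperature       # (B, B), 对角线为正样本
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```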

[CV-20] Hierarchical Graph Attention Network for No-Reference Omnidirectional Image Quality Assessment

【速读】:该论文旨在解决当前全景图像质量评估(Omnidirectional Image Quality Assessment, OIQA)方法在评估局部非均匀失真时表现不佳的问题,其核心挑战在于对空间质量变化建模不足以及特征表示无法同时捕捉局部细节与全局上下文信息。解决方案的关键在于提出一种基于图神经网络(Graph Neural Network)的OIQA框架,通过斐波那契球采样生成具有结构化拓扑关系的视口(viewport),将每个视口表示为图节点,并利用多阶段特征提取网络构建高维节点表征;进一步融合图注意力网络(Graph Attention Network, GAT)以建模邻近视口间的细粒度局部失真差异,以及图Transformer以捕获远距离区域间的长程质量交互关系,从而实现对空间失真非均匀性的有效感知与综合评估。

链接: https://arxiv.org/abs/2508.09843
作者: Hao Yang,Xu Zhang,Jiaqi Ma,Linwei Zhu,Yun Zhang,Huan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current Omnidirectional Image Quality Assessment (OIQA) methods struggle to evaluate locally non-uniform distortions due to inadequate modeling of spatial variations in quality and ineffective feature representation capturing both local details and global context. To address this, we propose a graph neural network-based OIQA framework that explicitly models structural relationships between viewports to enhance perception of spatial distortion non-uniformity. Our approach employs Fibonacci sphere sampling to generate viewports with well-structured topology, representing each as a graph node. Multi-stage feature extraction networks then derive high-dimensional node representation. To holistically capture spatial dependencies, we integrate a Graph Attention Network (GAT) modeling fine-grained local distortion variations among adjacent viewports, and a graph transformer capturing long-range quality interactions across distant regions. Extensive experiments on two large-scale OIQA databases with complex spatial distortions demonstrate that our method significantly outperforms existing approaches, confirming its effectiveness and strong generalization capability.
zh
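
摘要中的斐波那契球采样是一个经典算法,用于在球面上生成近似均匀的视口中心方向,示意如下(点数 n 为假设):

```python
import numpy as np

def fibonacci_sphere(n):
    """在单位球面上生成 n 个近似均匀分布的方向向量, 返回 (n, 3)。"""
    i = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    y = 1.0 - 2.0 * (i + 0.5) / n            # y 均匀分布在 (-1, 1)
    r = np.sqrt(1.0 - y * y)
    theta = golden_angle * i
    return np.stack([r * np.cos(theta), y, r * np.sin(theta)], axis=1)
```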

[CV-21] RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians ICCV2025

【速读】:该论文旨在解决从原始点云或由RGB图像通过3D高斯溅射(3D Gaussians, 3DGS)预估计得到的3D高斯表示中进行高效且通用的三维表面重建问题。现有基于坐标的重建方法在显式表面渲染时通常计算复杂度较高,难以兼顾精度与效率。本文提出的RayletDF方法的关键创新在于引入了一种称为“射线小片段距离场”(raylet distance field)的新技术,其核心思想是直接从查询射线预测表面点,而非依赖传统网格化或隐式函数采样策略。该方案通过三个关键模块协同工作:射线小片段特征提取器用于捕捉局部几何细节,射线小片段距离场预测器用于输出每条射线上的最近表面点距离,多射线小片段融合器则聚合多个射线预测结果以生成精确的表面点集,从而实现高保真、高泛化能力的3D表面重建,且仅需单次前向传播即可在未见数据集上取得优异效果。

链接: https://arxiv.org/abs/2508.09830
作者: Shenxing Wei,Jinxi Li,Yafei Yang,Siyuan Zhou,Bo Yang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICCV 2025 Highlight. Shenxing and Jinxi are co-first authors. Code and data are available at: this https URL

点击查看摘要

Abstract:In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing.
zh
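
raylet 距离场的“读出”步骤可以用几行代码示意(predict_distance 为假设的已训练模型接口):

```python
import torch

def raylet_surface_points(predict_distance, ray_origins, ray_dirs):
    """网络对每条查询射线直接回归到表面的距离 d, 表面点即 o + d * v。"""
    d = predict_distance(ray_origins, ray_dirs)   # (N, 1) 每条射线的预测距离
    return ray_origins + d * ray_dirs             # (N, 3) 重建出的表面点
```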

[CV-22] Reverse Convolution and Its Applications to Image Restoration ICCV2025

【速读】:该论文试图解决的问题是:现有的转置卷积(transposed convolution,又称deconvolution)无法作为卷积的真正逆运算,因其数学形式存在本质差异,导致其在神经网络中难以实现精确的特征重构。为弥补这一缺陷,作者提出了一种新颖的深度可分离反卷积(depthwise reverse convolution)算子,其关键在于将反卷积建模为一个正则化最小二乘优化问题,并通过求解该问题来获得可学习的、接近真实逆操作的卷积核参数。该方法不仅解决了传统转置卷积不可逆的问题,还进一步构建了类似Transformer结构的反卷积模块(reverse convolution block),从而可在图像恢复任务中直接替代标准卷积与转置卷积层,推动新型深度模型设计的发展。

链接: https://arxiv.org/abs/2508.09824
作者: Xuhong Huang,Shiqi Liu,Kai Zhang,Ying Tai,Jian Yang,Hui Zeng,Lei Zhang
机构: Nanjing University (南京大学); The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025; this https URL

点击查看摘要

Abstract:Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1×1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
zh
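
论文的可学习反卷积算子建立在正则化最小二乘形式之上;这一形式在频域有经典闭式解,可示意如下(假设循环边界条件,与论文的具体实现无关):

```python
import torch
import torch.fft as fft

def regularized_reverse_depthwise(y, kernel, lam=1e-2):
    """逐通道求解 x* = argmin_x ||k * x - y||^2 + lam * ||x||^2 的频域闭式解。
    y: (B, C, H, W); kernel: (C, kh, kw) 逐通道卷积核。"""
    B, C, H, W = y.shape
    K = fft.rfft2(kernel, s=(H, W))               # (C, H, W//2+1)
    Y = fft.rfft2(y)                              # (B, C, H, W//2+1)
    X = torch.conj(K) * Y / (K.abs() ** 2 + lam)  # 每个频率分量的闭式解
    return fft.irfft2(X, s=(H, W))
```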

[CV-23] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging

【速读】:该论文旨在解决医学图像分析中深度学习模型开发与部署过程中存在的流程复杂、可复现性差、实验追踪困难以及高级训练策略难以实现等问题。其解决方案的关键在于提出一个模块化、可扩展且完全可配置的深度学习框架 KonfAI,通过结构化的 YAML 配置文件定义完整的训练、推理和评估工作流,无需修改代码即可实现灵活定制;同时原生支持 patch-based learning(基于补丁的学习)、测试时增强(test-time augmentation)、模型集成(model ensembling)及中间特征表示访问等高级策略,从而提升模型性能与透明度,并显著缩短开发周期。

链接: https://arxiv.org/abs/2508.09823
作者: Valentin Boussot,Jean-Louis Dillenseger
机构: INSERM(法国国家健康与医学研究院); LTSI - UMR 1099(生物医学成像与信号处理实验室); University of Rennes(雷恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at \hrefthis https URLthis https URL.
zh

[CV-24] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

【Quick Read】: This paper tackles the difficulty of training robotic manipulation models caused by the scarcity of high-quality manipulation data. The key to its solution is the Physical Autoregressive Model (PAR), which fuses frames and actions into physical tokens to jointly model the dynamic evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics, achieving accurate video prediction and consistent action trajectories without any action pretraining. It further adopts a DiT-based de-tokenizer that models frames and actions as continuous tokens, reducing quantization error and enabling mutual enhancement, and combines a causal mask with inverse-kinematics constraints, parallel training, and a KV-cache mechanism to further improve performance and efficiency.

Link: https://arxiv.org/abs/2508.09822
Authors: Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang
Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing; Peng Cheng Laboratory; X-Era AI Lab; Guangdong University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures

Click to view abstract

Abstract:The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.

[CV-25] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video ICCV

【Quick Read】: This paper addresses the limitation that existing models rely on either motion data or video data alone when understanding human behavior, making it hard to fully capture the subtle differences and semantics of actions. The key to its solution is the ViMoNet framework, which uses a joint training strategy to combine the strengths of both data types: detailed but limited motion-text data provides precise action details, while broader but coarser video-text data extends semantic coverage, yielding rich spatio-temporal representations of human behavior. The work also builds the VIMOS dataset and the ViMoNet-Bench benchmark, providing standardized tools for evaluating model performance on behavior understanding tasks.

Link: https://arxiv.org/abs/2508.09818
Authors: Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICCVDM '25

Click to view abstract

Abstract:This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We think that mixing both types is essential to completely capture the nuanced movements and meanings of human actions, in contrast to recent models that simply concentrate on motion data or films. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This aids in the model’s acquisition of rich data regarding time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of films, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.

[CV-26] Evolution of Low-Level and Texture Human-CLIP Alignment

【Quick Read】: This paper investigates the phenomenon that, during the training of multimodal models such as CLIP, the correlation with low-level human image-quality assessments first rises and then declines, aiming to uncover its causes and to optimize the trade-off between perceptual alignment and robustness in vision-language models. The key lies in identifying two factors: shape-texture bias alignment and the drop in classification accuracy under noise. The study finds that the model first learns low-level visual features, improving alignment with low-level human perception while increasing noise sensitivity and texture bias; as training progresses, it shifts toward more abstract shape-based representations, gaining noise robustness but losing low-level perceptual alignment. This suggests the two factors share an underlying learning mechanism and offers new insights for improving multimodal models.

Link: https://arxiv.org/abs/2508.09814
Authors: Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Jesus Malo, Valero Laparra
Affiliations: Image Processing Lab, Universidad de Valencia, Paterna, Spain; Center for Biomaterials and Tissue Engineering, Universitat Politecnica de Valencia, Valencia, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:During the training of multi-modal models like CLIP, we observed an intriguing phenomenon: the correlation with low-level human image quality assessments peaks in the early epochs before gradually declining. This study investigates this observation and seeks to understand its causes through two key factors: shape-texture bias alignment and classification accuracy drop under noise. Our findings suggest that CLIP initially learn low-level visual features, enhancing its alignment with low-level human perception but also increasing its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors shared an underlying learning mechanism and provide new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.

[CV-27] Poaching Hotspot Identification Using Satellite Imagery

【Quick Read】: This paper addresses the worsening problem of elephant poaching in Africa and the inefficiency of traditional anti-poaching measures, particularly the dynamic shifts of poaching hotspots and the difficulty of patrolling remote areas with limited manpower. The key to its solution is a computer vision (CV) model that analyzes geographic indicators correlated with poaching (such as watering holes, terrain, and seasonal features) in satellite imagery to automatically identify high-risk poaching regions, enabling precise resource deployment and efficient countermeasures while avoiding disturbance to wildlife and cross-border aviation activities.

Link: https://arxiv.org/abs/2508.09812
Authors: Aryan Pandhi, Shrey Baid, Sanjali Jha
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Elephant Poaching in African countries has been a decade-old problem. So much so that African Forest Elephants are now listed as an endangered species, and African Savannah Elephants as critically endangered by the IUCN (International Union for Conservation of Nature). [1] Elephants are hunted primarily for their ivory tusks which caused many elephants to be born tuskless as a genetic modification for survival. [2] Data gathered by recent studies shows that though poaching methods remain the same, the poaching grounds are rather dynamic. Poachers have shifted to areas with less ranger patrols and several other factors like watering holes, seasons, altitude etc. cause constant shifts in poaching hotspot locations. [3] After a period of low poaching from 2000-2014, poaching numbers in African countries are now on the rise again – WWF (World Wildlife Foundation) says there are 20,000 elephants poached annually [4]. In African countries, anti-poaching efforts are concentrated near towns, while a majority of poaching occurs in the deserted regions. All of these factors result in the need for a Computer Vision Model to identify poaching hotspots through locating the geographic indicators of favorable poaching regions. A CV model eliminates the need to manually track poachers and account for the environmental factors to deploy resources and its combination with satellite imagery allows us to survey large areas without disturbing local species or cross border aviation restrictions.

[CV-28] TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos ICCV2025

【Quick Read】: This paper aims to model the geometry, appearance, and physical information of 3D scenes purely from dynamic multi-view videos, without any human annotations. Existing methods either use physics-inspired losses or embed simple physics models into neural networks, struggling to learn the physics of complex motion, or require extra labels such as object categories or masks. The key to the proposed TRACE framework is to model each 3D point as a rigid particle with spatial size and orientation and to directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters that govern the particle's motion over time, thereby enabling end-to-end learning of the physics of complex dynamic scenes.

Link: https://arxiv.org/abs/2508.09811
Authors: Jinxi Li, Ziyang Song, Bo Yang
Affiliations: vLAR Group, The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICCV 2025. Code and data are available at: this https URL

Click to view abstract

Abstract:In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.
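
To make the "rigid particle with a translation-rotation dynamics system" idea concrete, here is a toy NumPy sketch of one integration step for a single particle; in TRACE the velocity and angular velocity would be learned physical parameters, whereas here they are hand-set placeholders.

```python
import numpy as np

def rotation_matrix(axis_angle):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def step(position, orientation, v, w, dt=0.1):
    """Advance one rigid particle: translate by velocity v and rotate by
    angular velocity w (both would be learned per particle in TRACE)."""
    new_pos = position + dt * v
    new_rot = rotation_matrix(dt * w) @ orientation
    return new_pos, new_rot

pos, rot = np.zeros(3), np.eye(3)
pos, rot = step(pos, rot, v=np.array([0.1, 0.0, 0.0]), w=np.array([0.0, 0.0, 0.5]))
print(pos, np.round(rot, 3))
```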

[CV-29] Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology

【Quick Read】: This paper addresses automatic segmentation of brain tissue in postmortem photographs, i.e., accurately extracting tissue regions from routine coronal slab photographs to support volumetric analysis; current practice relies on costly, inefficient manual annotation. The key to the solution is a U-Net-based deep learning model trained on a combination of 1,414 manually annotated real tissue images (covering varying diagnoses and photographic conditions) and 2,000 MRI-derived synthetic images with corresponding masks, which markedly improves generalization to unseen photographic setups. Evaluation shows performance approaching inter-rater variability on unseen data, with a median Dice score above 0.98, mean surface distance below 0.4 mm, and 95% Hausdorff distance below 1.60 mm.

Link: https://arxiv.org/abs/2508.09805
Authors: Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Deden-Binder, Rogeny Herisse, Erendira Garcia Pallares, Karthik Gopinath, Harshvardhan Gazula, Christopher Mount, Liana N. Kozanno, Michael S. Marshall, Theresa R. Connors, Matthew P. Frosch, Mark Montine, Derek H. Oakley, Christine L. Mac Donald, C. Dirk Keene, Bradley T. Hyman, Juan Eugenio Iglesias
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 10 figures

Click to view abstract

Abstract:Advances in image registration and machine learning have recently enabled volumetric analysis of postmortem brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of (i) 1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites; and (ii) 2,000 synthetic images with randomized contrast and corresponding masks generated from MRI scans for improved generalizability to unseen photographic setups. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels, including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, mean surface distance under 0.4 mm, and 95% Hausdorff distance under 1.60 mm, which approaches inter-/intra-rater levels. Our tool is publicly available at this http URL.

[CV-30] MUJICA: Reforming SISR Models for PBR Material Super-Resolution via Cross-Map Attention

【Quick Read】: This paper addresses cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization caused by data-distribution shifts when super-resolving Physically Based Rendering (PBR) materials. The key to its solution is Multi-modal Upscaling Joint Inference via Cross-map Attention (MUJICA), a flexible adapter that attaches after a pre-trained, frozen Swin-Transformer-based single-image super-resolution (SISR) backbone. Through cross-map attention it fuses features across material maps (such as basecolor, normal, metallic, and roughness) while preserving the strong reconstruction ability of the pre-trained SISR model, improving PSNR, SSIM, and LPIPS while effectively maintaining cross-map consistency.

Link: https://arxiv.org/abs/2508.09802
Authors: Xin Du, Maoyuan Xu, Zhi Ying
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Physically Based Rendering (PBR) materials are typically characterized by multiple 2D texture maps such as basecolor, normal, metallic, and roughness which encode spatially-varying bi-directional reflectance distribution function (SVBRDF) parameters to model surface reflectance properties and microfacet interactions. Upscaling SVBRDF material is valuable for modern 3D graphics applications. However, existing Single Image Super-Resolution (SISR) methods struggle with cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization due to data distribution shifts. In this work, we propose Multi-modal Upscaling Joint Inference via Cross-map Attention (MUJICA), a flexible adapter that reforms pre-trained Swin-transformer-based SISR models for PBR material super-resolution. MUJICA is seamlessly attached after the pre-trained and frozen SISR backbone. It leverages cross-map attention to fuse features while preserving remarkable reconstruction ability of the pre-trained SISR model. Applied to SISR models such as SwinIR, DRCT, and HMANet, MUJICA improves PSNR, SSIM, and LPIPS scores while preserving cross-map consistency. Experiments demonstrate that MUJICA enables efficient training even with limited resources and delivers state-of-the-art performance on PBR material datasets.

[CV-31] MeMoSORT: Memory-Assisted Filtering and Motion-Adaptive Association Metric for Multi-Person Tracking

【Quick Read】: This paper addresses two core problems in multi-object tracking (MOT) for human-dominant scenes: the mismatch between conventional Kalman filter (KF) motion models and real object dynamics, which causes filtering errors, and rigid Intersection-over-Union (IoU)-based association, which leads to identity switches or lost targets under severe occlusion. The key to the proposed MeMoSORT tracker is two novel modules: a Memory-assisted Kalman filter (MeKF) that uses memory-augmented neural networks to compensate for motion-model mismatch, and Motion-adaptive IoU (Mo-IoU), which adaptively expands the matching space and incorporates height similarity to reduce the impact of detection errors and association failures while remaining lightweight.

Link: https://arxiv.org/abs/2508.09796
Authors: Yingjie Wang, Zhixing Wang, Le Zheng, Tianxiao Liu, Roujing Li, Xueyao Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multi-object tracking (MOT) in human-dominant scenarios, which involves continuously tracking multiple people within video sequences, remains a significant challenge in computer vision due to targets’ complex motion and severe occlusions. Conventional tracking-by-detection methods are fundamentally limited by their reliance on Kalman filter (KF) and rigid Intersection over Union (IoU)-based association. The motion model in KF often mismatches real-world object dynamics, causing filtering errors, while rigid association struggles under occlusions, leading to identity switches or target loss. To address these issues, we propose MeMoSORT, a simple, online, and real-time MOT tracker with two key innovations. First, the Memory-assisted Kalman filter (MeKF) uses memory-augmented neural networks to compensate for mismatches between assumed and actual object motion. Second, the Motion-adaptive IoU (Mo-IoU) adaptively expands the matching space and incorporates height similarity to reduce the influence of detection errors and association failures, while remaining lightweight. Experiments on DanceTrack and SportsMOT show that MeMoSORT achieves state-of-the-art performance, with HOTA scores of 67.9% and 82.1%, respectively.
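
The abstract does not give the exact Mo-IoU formula, so the sketch below is only a plausible reading: expand both boxes to widen the matching space and blend in a height-similarity term. The weights `alpha` and `beta` and the speed-based expansion rule are assumptions, not the paper's settings.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def expand(box, ratio):
    """Grow a box around its center to widen the matching space."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = (box[2] - box[0]) * (1 + ratio), (box[3] - box[1]) * (1 + ratio)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def motion_adaptive_iou(track, det, speed, alpha=0.2, beta=0.3):
    """Illustrative Mo-IoU: expand boxes in proportion to track speed and
    blend in a height-similarity term (alpha, beta are made-up weights)."""
    ratio = alpha * speed                # faster targets -> larger search area
    base = iou(expand(track, ratio), expand(det, ratio))
    h_t, h_d = track[3] - track[1], det[3] - det[1]
    height_sim = min(h_t, h_d) / max(h_t, h_d)
    return (1 - beta) * base + beta * height_sim

print(motion_adaptive_iou((0, 0, 10, 20), (6, 2, 16, 22), speed=1.0))
```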

[CV-32] Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

【Quick Read】: This paper addresses the semantic gap in existing video recommender systems, which rely on user-defined metadata or low-level visual and acoustic features that fail to capture deeper semantics such as intent, humour, and world knowledge, hurting personalized recommendation. The key to its solution is a recommendation-system-agnostic, zero-finetuning framework that prompts an off-the-shelf Multimodal Large Language Model (MLLM) to produce a rich natural-language description for each video (e.g., "a superhero parody with slapstick fights and orchestral stabs"), injecting high-level semantics; the descriptions are combined with a state-of-the-art text encoder and fed directly into standard collaborative, content-based, and generative recommenders, substantially improving intent awareness.

Link: https://arxiv.org/abs/2508.09789
Authors: Marco De Nadai, Andreas Damianou, Mounia Lalmas
Affiliations: Spotify
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. “a superhero parody with slapstick fights and orchestral stabs”), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.

[CV-33] DSS-Prompt: Dynamic-Static Synergistic Prompting for Few-Shot Class-Incremental Learning

【Quick Read】: This paper targets the challenging few-shot class-incremental learning (FSCIL) task: continually learning new classes from only a few samples while avoiding catastrophic forgetting of old ones. The key to the proposed DSS-Prompt, a simple yet effective approach, is to introduce two complementary types of prompts into a pre-trained Vision Transformer: static prompts bridge the domain gap between pre-training and downstream data to improve adaptation, while dynamic prompts capture instance-aware semantics to enable easy transfer from base to novel classes. In particular, the dynamic prompts leverage a pre-trained multimodal model to extract diverse input-dependent semantics and adaptively adjust prompt importance across layers; on top of the prompted visual embeddings, a simple prototype classifier beats state-of-the-art methods without any further training on the incremental tasks, and forgetting is effectively alleviated.

Link: https://arxiv.org/abs/2508.09785
Authors: Linpu He, Yanan Li, Bingze Li, Elvis Han Cui, Donghui Wang
Affiliations: Zhejiang University; Zhejiang Lab; University of California, Irvine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACMMM 2025

Click to view abstract

Abstract:Learning from large-scale pre-trained models with strong generalization ability has shown remarkable success in a wide range of downstream tasks recently, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. It aims to continually learn new concepts from limited training samples without forgetting the old ones at the same time. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaption; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes. Specially, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-arts without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
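
A minimal PyTorch sketch of the static-plus-dynamic prompting pattern described above: static prompts are input-independent learned vectors, dynamic prompts are generated from an instance-level semantic embedding (in the paper, obtained from a pre-trained multimodal model), and a learned gate stands in for the layer-wise importance weighting. All sizes and module names are illustrative.

```python
import torch
import torch.nn as nn

class DualPromptBlockInput(nn.Module):
    """Illustrative sketch: prepend static (learned, input-independent) and
    dynamic (generated from an instance embedding) prompts to the token
    sequence fed into one Transformer block. Dimensions are placeholders."""
    def __init__(self, dim=768, n_static=5, n_dynamic=5, sem_dim=512):
        super().__init__()
        self.static = nn.Parameter(torch.zeros(n_static, dim))
        self.to_dynamic = nn.Linear(sem_dim, n_dynamic * dim)  # from multimodal semantics
        self.gate = nn.Parameter(torch.zeros(1))               # layer-wise importance

    def forward(self, tokens, semantics):
        B = tokens.size(0)
        static = self.static.expand(B, -1, -1)
        dynamic = self.to_dynamic(semantics).view(B, -1, tokens.size(-1))
        dynamic = torch.sigmoid(self.gate) * dynamic           # adaptively weighted
        return torch.cat([static, dynamic, tokens], dim=1)

block_in = DualPromptBlockInput()
out = block_in(torch.randn(2, 197, 768), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 207, 768])
```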

[CV-34] Combinative Matching for Geometric Shape Assembly ICCV2025

【Quick Read】: This paper addresses matching ambiguity caused by local similarity in geometric shape assembly, especially how to establish accurate cross-region correspondences for interlocking parts. Traditional methods align parts by finding identical surfaces, as in conventional shape matching, but ignore the key property of interlocking structures that surface shapes are identical while their volumes occupy opposite space. The key to the proposed combinative matching is to explicitly model both properties of two interlocking shapes: identical surface shape and opposite volume occupancy. By estimating shape orientations with equivariant neural networks, regions are further aligned in rotation, which substantially reduces local ambiguity in matching and enables a robust combination of parts in assembly.

Link: https://arxiv.org/abs/2508.09780
Authors: Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
Affiliations: POSTECH; Samsung Research America; RLWRLD
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025 (Highlight)

Click to view abstract

Abstract:This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape’ and ‘opposite volume occupancy.’ Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: this https URL.

[CV-35] MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

【Quick Read】: This paper addresses the high parameter counts and computational cost of large vision-language models (LVLMs) while preserving multimodal feature modeling. Dense LVLMs perform well but scale poorly, and although Mixture-of-Experts (MoE) architectures improve parameter efficiency, effectively modeling intra-modality features and cross-modal associations together remains challenging. The key is the proposed Mixture of Intra- and Inter-Modality Experts (MoIIE): for each input token, a modality-based routing strategy sends it to the corresponding intra-modality experts as well as a shared pool of inter-modality experts, jointly learning modality-specific information and cross-modal interactions. A simple two-stage training strategy further strengthens the joint activation of the MoE structure and multimodal capabilities, yielding clear gains in performance, efficiency, and generality across data scales and LLM backbones.

Link: https://arxiv.org/abs/2508.09779
Authors: Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
Affiliations: Shanghai Innovation Institute; Fudan University; University of Southern California; Huawei Technologies Co., Ltd; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improve parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbone demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLMs based multi-modal models that involve more activated parameters. The code is available at this https URL.
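
The routing rule is easy to state in code: each token goes to an expert within its own modality plus an expert from a shared inter-modality pool. The PyTorch sketch below shows top-1 routing under that scheme; expert counts, sizes, and the residual combination are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MoIIELayer(nn.Module):
    """Illustrative modality-guided MoE routing: each token is sent to a
    top-1 expert within its own modality plus a shared inter-modality
    expert pool. Expert counts and sizes are placeholders."""
    def __init__(self, dim=64, n_intra=2, n_inter=2):
        super().__init__()
        make = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.experts = nn.ModuleDict({
            "text": nn.ModuleList([make() for _ in range(n_intra)]),
            "image": nn.ModuleList([make() for _ in range(n_intra)]),
            "shared": nn.ModuleList([make() for _ in range(n_inter)]),
        })
        self.router = nn.ModuleDict({k: nn.Linear(dim, n_intra if k != "shared" else n_inter)
                                     for k in self.experts})

    def dispatch(self, x, pool, logits):
        idx = logits.argmax(-1)                   # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(pool):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

    def forward(self, x, modality):               # x: (tokens, dim)
        intra = self.dispatch(x, self.experts[modality], self.router[modality](x))
        inter = self.dispatch(x, self.experts["shared"], self.router["shared"](x))
        return x + intra + inter

layer = MoIIELayer()
print(layer(torch.randn(10, 64), "image").shape)  # torch.Size([10, 64])
```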

[CV-36] Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection

【Quick Read】: This paper addresses the detail loss and limited harmonization ability of latent diffusion model (LDM)-based image harmonization, as well as the limitation that existing synthetic datasets rely on color transfer, lack local variation, and fail to capture complex real-world lighting. The key is the proposed Region-to-Region (R2R) transformation, which injects information from appropriate regions into the foreground to preserve original details while harmonizing the image or, conversely, generating new composite data. The resulting model, R2R, comprises a Clear-VAE that preserves high-frequency foreground details via an Adaptive Filter while removing disharmonious components, and a Harmony Controller with Mask-aware Adaptive Channel Attention (MACA) that dynamically adjusts the foreground based on the channel importance of both foreground and background; a Random Poisson Blending method further builds a new, more diverse and challenging synthetic dataset, RPHarmony, improving realism on real examples.

Link: https://arxiv.org/abs/2508.09746
Authors: Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion model (LDM) are applied for harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions. To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, We propose a novel model R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitation of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access.

[CV-37] Seeing Listening Remembering and Reasoning : A Multimodal Agent with Long-Term Memory

【Quick Read】: This paper addresses the shortcomings of multimodal agents in long-term memory modeling and memory-based reasoning, in particular how to integrate visual, auditory, and other multimodal information into structured, retrievable long-term memory and reason over it, approaching human-like continual learning and environment understanding. The key to the proposed M3-Agent framework is an entity-centric multimodal memory organization that extracts and updates semantic and episodic memory from real-time inputs, enabling a deeper and more consistent understanding of the environment; trained with reinforcement learning, the agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to complete instructions. Experiments show that M3-Agent clearly outperforms strong baselines on the newly proposed M3-Bench benchmark, validating its effectiveness on long-video question answering.

Link: https://arxiv.org/abs/2508.09736
Authors: Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at this https URL

[CV-38] Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System

【Quick Read】: This paper addresses the difficulty of verifying the robustness and safety of vision-based navigation in safety-critical aviation scenarios (such as automated landing and runway detection), focusing on accurate aircraft pose estimation with trustworthy uncertainty quantification. The key lies in three innovations: (i) an efficient, flexible neural architecture based on a spatial Soft Argmax operator that supports diverse vision backbones with real-time inference; (ii) a principled loss function that produces calibrated predictive uncertainties, evaluated via sharpness and calibration metrics; and (iii) an adaptation of Residual-based Receiver Autonomous Integrity Monitoring (RAIM) that detects and rejects faulty model outputs at runtime. Validated on a runway-image dataset, the method improves pose-estimation accuracy over baseline architectures and yields well-calibrated, sub-pixel uncertainty estimates usable for downstream fault detection.

Link: https://arxiv.org/abs/2508.09732
Authors: Romeo Valentin, Sydney M. Katz, Artur B. Carneiro, Don Walker, Mykel J. Kochenderfer
Affiliations: Stanford Intelligent Systems Laboratory, Stanford University; A3 by Airbus LLC
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures, accepted at DASC 2025

Click to view abstract

Abstract:Recent advances in data-driven computer vision have enabled robust autonomous navigation capabilities for civil aviation, including automated landing and runway detection. However, ensuring that these systems meet the robustness and safety requirements for aviation applications remains a major challenge. In this work, we present a practical vision-based pipeline for aircraft pose estimation from runway images that represents a step toward the ability to certify these systems for use in safety-critical aviation applications. Our approach features three key innovations: (i) an efficient, flexible neural architecture based on a spatial Soft Argmax operator for probabilistic keypoint regression, supporting diverse vision backbones with real-time inference; (ii) a principled loss function producing calibrated predictive uncertainties, which are evaluated via sharpness and calibration metrics; and (iii) an adaptation of Residual-based Receiver Autonomous Integrity Monitoring (RAIM), enabling runtime detection and rejection of faulty model outputs. We implement and evaluate our pose estimation pipeline on a dataset of runway images. We show that our model outperforms baseline architectures in terms of accuracy while also producing well-calibrated uncertainty estimates with sub-pixel precision that can be used downstream for fault detection.
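
The spatial Soft Argmax operator at the heart of the architecture is a standard differentiable construction: softmax a keypoint heatmap into a probability map, then take the expected coordinate. A self-contained PyTorch version:

```python
import torch
import torch.nn.functional as F

def spatial_soft_argmax(heatmaps, temperature=1.0):
    """Differentiable keypoint regression: softmax a (B, K, H, W) heatmap
    into a probability map, then return its expected (x, y) coordinates."""
    B, K, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.view(B, K, -1) / temperature, dim=-1).view(B, K, H, W)
    ys = torch.linspace(0, H - 1, H).view(1, 1, H, 1)
    xs = torch.linspace(0, W - 1, W).view(1, 1, 1, W)
    exp_y = (probs * ys).sum(dim=(2, 3))          # expected row per keypoint
    exp_x = (probs * xs).sum(dim=(2, 3))          # expected column per keypoint
    return torch.stack([exp_x, exp_y], dim=-1)    # (B, K, 2), sub-pixel precision

coords = spatial_soft_argmax(torch.randn(1, 4, 64, 64))
print(coords.shape)  # torch.Size([1, 4, 2])
```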

[CV-39] Multimodal Sheaf-based Network for Glioblastoma Molecular Subtype Prediction

【Quick Read】: This paper addresses the diagnostic delays caused by reliance on invasive tissue biopsies for glioblastoma molecular subtype classification, and the failure of existing multimodal fusion methods to preserve structural information shared between MRI and histopathology images. The key is a sheaf-based, structure-aware and consistent fusion framework that effectively models discriminative features in heterogeneous graph structures and, through a structural reconstruction mechanism, improves robustness to missing or incomplete modality data, supporting the development of virtual biopsy tools for rapid diagnostics.

Link: https://arxiv.org/abs/2508.09717
Authors: Shekhnaz Idrissova, Islem Rekik
Affiliations: Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at this https URL.

[CV-40] NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation

【Quick Read】: This paper addresses the storage and transmission challenges of multimodal medical imaging data (such as chest X-rays) in resource-constrained clinical settings while preserving diagnostic value. The key to the proposed NEURAL framework is semantics-guided data compression: cross-attention scores between an image and its radiology report, taken from a fine-tuned generative vision-language model, are used to structurally prune chest X-rays so that only diagnostically relevant regions are kept; the pruned image is then converted into a graph representation and fused with a knowledge graph derived from the clinical report, forming a unified, task-agnostic graph-based data asset. This achieves a 93.4-97.7% reduction in image data size while maintaining 0.88-0.95 AUC, effectively balancing data size and clinical utility.

Link: https://arxiv.org/abs/2508.09715
Authors: Devvrat Joshi, Islem Rekik
Affiliations: Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed, graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus dataset for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at this https URL.
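
The core compression step, keeping only the image patches that the report's cross-attention deems diagnostically relevant, reduces to a top-k selection. A minimal PyTorch sketch, with the attention scores taken as given and the keep ratio chosen arbitrarily:

```python
import torch

def prune_patches(patch_feats, attn_scores, keep_ratio=0.05):
    """Illustrative sketch: keep only the patches that the report attends to
    most, mimicking semantics-guided compression. Inputs are placeholders:
    patch_feats is (N, D); attn_scores is (N,) report-to-image attention."""
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    idx = attn_scores.topk(k).indices
    return patch_feats[idx], idx                  # compressed representation

feats, scores = torch.randn(196, 768), torch.rand(196)
kept, idx = prune_patches(feats, scores)
print(kept.shape)  # torch.Size([9, 768]) -> roughly 95% of patches dropped
```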

[CV-41] MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers

【Quick Read】: This paper addresses insufficient region-level color consistency in reference-guided line art colorization, which is especially problematic when the reference and target differ in character pose or motion. The key is the proposed MangaDiT, a powerful model built on Diffusion Transformers (DiT) that takes both line art and reference images as conditional inputs and introduces a hierarchical attention mechanism with a dynamic attention-weighting strategy. Without external matching annotations, semantic correspondences are discovered implicitly through internal attention, and the vanilla attention is augmented with a context-aware path that leverages pooled spatial features, effectively expanding the receptive field and improving region-level color alignment.

Link: https://arxiv.org/abs/2508.09709
Authors: Qianru Qiu, Jiafeng Mao, Kento Masui, Xueting Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Codes and benchmarks will be released soon

Click to view abstract

Abstract:Recent advances in diffusion models have significantly improved the performance of reference-guided line art colorization. However, existing methods still struggle with region-level color consistency, especially when the reference and target images differ in character pose or motion. Instead of relying on external matching annotations between the reference and target, we propose to discover semantic correspondences implicitly through internal attention mechanisms. In this paper, we present MangaDiT, a powerful model for reference-guided line art colorization based on Diffusion Transformers (DiT). Our model takes both line art and reference images as conditional inputs and introduces a hierarchical attention mechanism with a dynamic attention weighting strategy. This mechanism augments the vanilla attention with an additional context-aware path that leverages pooled spatial features, effectively expanding the model’s receptive field and enhancing region-level color alignment. Experiments on two benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, achieving superior performance in both qualitative and quantitative evaluations.

[CV-42] Slot Attention-based Feature Filtering for Few-Shot Learning CVPR

【Quick Read】: This paper addresses the performance degradation in few-shot learning (FSL) caused by irrelevant features (such as background elements), which mislead the matching between query and support images and cause misclassification. The key is the proposed Slot Attention-based Feature Filtering for Few-Shot Learning (SAFF), whose core innovation is the integration of slot attention with patch embeddings, unifying class-aware slots into a single attention mechanism to effectively filter weakly relevant features; a similarity matrix computed across support and query images quantifies the relevance of the filtered embeddings to improve classification accuracy. Experiments show the method outperforms several state-of-the-art approaches on standard few-shot learning benchmarks.

Link: https://arxiv.org/abs/2508.09699
Authors: Javier Rodenas, Eduardo Aguilar, Petia Radeva
Affiliations: AIBA, Departament de Matemàtiques & Informàtica, Universitat de Barcelona; Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte; Institute of Neuroscience, Universitat de Barcelona
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Workshop LatinX 2025

Click to view abstract

Abstract:Irrelevant features can significantly degrade few-shot learning performance. This problem arises when matching queries and support images based on meaningful similarities despite the limited data. In this process, non-relevant features such as background elements can easily lead to confusion and misclassification. To address this issue, we propose Slot Attention-based Feature Filtering for Few-Shot Learning (SAFF) that leverages slot attention mechanisms to discriminate and filter weak features, thereby improving few-shot classification performance. The key innovation of SAFF lies in its integration of slot attention with patch embeddings, unifying class-aware slots into a single attention mechanism to filter irrelevant features effectively. We introduce a similarity matrix that computes across support and query images to quantify the relevance of filtered embeddings for classification. Through experiments, we demonstrate that Slot Attention performs better than other attention mechanisms, capturing discriminative features while reducing irrelevant information. We validate our approach through extensive experiments on few-shot learning benchmarks: CIFAR-FS, FC100, miniImageNet and tieredImageNet, outperforming several state-of-the-art methods.
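
For intuition, here is a stripped-down slot-attention filter in PyTorch: a few learned slots compete for patch embeddings (softmax over slots rather than over patches), so patches that no slot claims are down-weighted. This is the generic slot-attention recipe, simplified (no GRU update or MLP), not SAFF's exact module; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SlotFilter(nn.Module):
    """Minimal slot-attention sketch: learned slots compete for patch
    embeddings via attention normalized over slots, so background patches
    that no class-aware slot claims contribute little."""
    def __init__(self, dim=64, n_slots=4, iters=3):
        super().__init__()
        self.slots0 = nn.Parameter(torch.randn(n_slots, dim) * 0.1)
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.iters = iters
        self.scale = dim ** -0.5

    def forward(self, patches):                     # patches: (B, N, dim)
        B = patches.size(0)
        slots = self.slots0.expand(B, -1, -1)
        k, v = self.k(patches), self.v(patches)
        for _ in range(self.iters):
            attn = torch.einsum("bsd,bnd->bsn", self.q(slots), k) * self.scale
            attn = attn.softmax(dim=1)              # competition across slots
            attn = attn / (attn.sum(-1, keepdim=True) + 1e-8)
            slots = torch.einsum("bsn,bnd->bsd", attn, v)  # weighted patch means
        return slots                                # filtered, class-aware features

print(SlotFilter()(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 4, 64])
```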

[CV-43] Combating Noisy Labels via Dynamic Connection Masking

【Quick Read】: This paper addresses the significant performance degradation of deep neural networks caused by label noise in real-world scenarios: because deep networks have strong memorization capacity, they readily overfit corrupted labels, hurting generalization. The key is the proposed Dynamic Connection Masking (DCM) mechanism, which evaluates the information-carrying capacity of each connection and adaptively masks less important connections during training, reducing gradient error and enhancing robustness. DCM integrates seamlessly with various noise-robust training methods, including robust loss functions, sample-selection strategies, and regularization techniques, and outperforms state-of-the-art (SOTA) methods on both synthetic and real-world benchmarks. The paper is also the first to investigate Kolmogorov-Arnold Networks (KANs) as classifiers under noisy labels, revealing that they are more robust than multi-layer perceptrons (MLPs) in real-world noisy scenarios.

Link: https://arxiv.org/abs/2508.09697
Authors: Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Yuhui Zheng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels can cause significant performance degradation. Existing research on mitigating the negative effects of noisy labels has mainly focused on robust loss functions and sample selection, with comparatively limited exploration of regularization in model architecture. Inspired by the sparsity regularization used in Kolmogorov-Arnold Networks (KANs), we propose a Dynamic Connection Masking (DCM) mechanism for both Multi-Layer Perceptron Networks (MLPs) and KANs to enhance the robustness of classifiers against noisy labels. The mechanism can adaptively mask less important edges during training by evaluating their information-carrying capacity. Through theoretical analysis, we demonstrate its efficiency in reducing gradient error. Our approach can be seamlessly integrated into various noise-robust training methods to build more robust deep networks, including robust loss functions, sample selection strategies, and regularization techniques. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method consistently outperforms state-of-the-art (SOTA) approaches. Furthermore, we are also the first to investigate KANs as classifiers against noisy labels, revealing their superior noise robustness over MLPs in real-world noisy scenarios. Our code will soon be publicly available.
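
The abstract does not spell out how "information-carrying capacity" is scored, so the sketch below uses weight magnitude as a crude stand-in: before each forward pass, the lowest-scoring fraction of connections in a linear layer is masked. It illustrates the masking mechanics only.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Illustrative dynamic connection masking: before each forward pass,
    zero out the fraction of weights (edges) with the lowest magnitude,
    a crude proxy for each connection's information-carrying capacity."""
    def __init__(self, in_f, out_f, drop_frac=0.3):
        super().__init__(in_f, out_f)
        self.drop_frac = drop_frac

    def forward(self, x):
        with torch.no_grad():
            k = int(self.weight.numel() * self.drop_frac)
            thresh = self.weight.abs().flatten().kthvalue(k).values if k > 0 else -1.0
            mask = (self.weight.abs() > thresh).float()
        return nn.functional.linear(x, self.weight * mask, self.bias)

layer = MaskedLinear(16, 8)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```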

[CV-44] PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

【Quick Read】: This paper addresses three core problems in facial representation pre-training: (1) failure to capture distinct facial features and fine-grained semantics; (2) neglect of the spatial structure inherent to facial anatomy; and (3) inefficient use of limited labeled data. The key is PaCo-FR, an unsupervised framework combining masked image modeling (MIM) with patch-pixel alignment, built on three novel components: (1) a structured masking strategy aligned with semantically meaningful facial regions to preserve spatial coherence; (2) a patch-based codebook with multiple candidate tokens to enhance feature discrimination; and (3) spatial consistency constraints that preserve geometric relationships among facial components. The method achieves state-of-the-art performance across several facial analysis tasks with only 2 million unlabeled images for pre-training, with particularly strong gains under varying poses, occlusions, and lighting conditions.

Link: https://arxiv.org/abs/2508.09691
Authors: Yin Xie, Zhichao Chen, Xiaoze Yu, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng
Affiliations: DeepGlint; University of Technology Sydney; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.

[CV-45] Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision

【Quick Read】: This paper addresses poor motion consistency in long-term 3D point tracking and the fact that most existing methods are limited to 2D tracking. The key is a test-time optimization (TTO) framework based on a new invertible Neural Radiance Field (InvNeRF) architecture, which parameterizes a function that aggregates correspondences from multiple state-of-the-art methods to perform joint 2D and 3D tracking in surgical scenes. The method supervises the reprojection of pixel correspondences via a rendering-based strategy, adopts a bidirectional deformable-canonical mapping to efficiently handle a defined workspace, and combines multi-scale HexPlanes for fast inference with improved pixel sampling and convergence criteria. On the STIR and SCARE datasets it clearly outperforms existing TTO methods, improving average precision in 2D tracking by nearly 50%, and is the first TTO approach to surpass feed-forward methods in 3D tracking while retaining the benefits of deformable NeRF-based reconstruction.

Link: https://arxiv.org/abs/2508.09681
Authors: Gerardo Loza, Junlei Hu, Dominic Jones, Sharib Ali, Pietro Valdastri
Affiliations: University of Leeds
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages

Click to view abstract

Abstract:We proposed a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state-of-the-art on TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays’ density. It also presents our multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results in the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, this is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction.

[CV-46] GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors

【Quick Read】: This paper addresses the ill-posed problem of reconstructing 3D scenes from sparse views, where insufficient information produces noticeable artifacts. Existing methods use generative priors to fill under-constrained regions but struggle to keep generated content consistent with the input observations. The key is the proposed GSFixer framework, whose core is a reference-guided video restoration model built on a DiT-based video diffusion model trained on paired artifact-laden 3D Gaussian Splatting (3DGS) renders and clean frames with additional reference-based conditions. Taking the sparse input views as references, the model integrates both 2D semantic features and 3D geometric features extracted by a visual geometry foundation model, improving the semantic coherence and 3D consistency of the restored novel views.

Link: https://arxiv.org/abs/2508.09667
Authors: Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun
Affiliations: University of Macau; VIVO; CUHKSZ; Xidian University; GVC Lab, Great Bay University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: this https URL.

[CV-47] NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation ICCV

【Quick Read】: This paper addresses insufficient inter-class separability in identity-conditioned diffusion models for face generation, where identity overlap in the generated data degrades face recognition (FR) performance. The key is the proposed NegFaceDiff, a novel sampling method that introduces negative conditions into the identity-conditioned diffusion process, explicitly steering the model away from unwanted features while preserving intra-class consistency, thereby substantially improving identity separation. Experiments show that identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687, and FR systems trained on the NegFaceDiff dataset outperform models trained on data generated without negative conditions across multiple benchmarks.

Link: https://arxiv.org/abs/2508.09661
Authors: Eduarda Caldeira, Naser Damer, Fadi Boutros
Affiliations: Fraunhofer IGD; TU Darmstadt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV Workshops

Click to view abstract

Abstract:The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.
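
Negative conditioning in diffusion sampling is commonly implemented as a guidance rule that extrapolates away from the unwanted condition. The sketch below shows that generic pattern for one denoising step; NegFaceDiff's exact rule may differ, and the toy `model` is a stand-in so the snippet runs.

```python
import torch

def guided_eps(model, x_t, t, cond_id, neg_id, scale=3.0):
    """Illustrative negative-condition guidance for one denoising step:
    steer the noise prediction toward the target identity and away from
    an unwanted (negative) one. `model` is any eps-predictor taking
    (x_t, t, condition); the exact rule used by NegFaceDiff may differ."""
    eps_pos = model(x_t, t, cond_id)   # pull toward the desired identity
    eps_neg = model(x_t, t, neg_id)    # push away from the negative identity
    return eps_neg + scale * (eps_pos - eps_neg)

# Toy stand-in model so the sketch runs end to end.
toy = lambda x, t, c: x * 0.1 + c
x = torch.randn(1, 3, 8, 8)
print(guided_eps(toy, x, t=10, cond_id=0.5, neg_id=-0.5).shape)
```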

[CV-48] Noise-adapted Neural Operator for Robust Non-Line-of-Sight Imaging

【Quick Read】: This paper addresses the low accuracy and poor robustness of 3D image reconstruction in non-line-of-sight (NLOS) imaging, where the indirect light signals are inherently weak and susceptible to noise. The key lies in a parameterized inverse-problem framework that combines a noise-estimation module, which adaptively assesses the noise level in transient data, with a parameterized neural operator that approximates the inverse mapping, using deep algorithm unfolding to achieve fast end-to-end reconstruction with good interpretability. A novel fusion mechanism for global and local spatio-temporal features further integrates structural and detailed information, maintaining high accuracy and strong robustness under sparse illumination and fast scanning.

Link: https://arxiv.org/abs/2508.09655
Authors: Lianfang Wang, Kuilin Qin, Xueying Liu, Huibin Chang, Yong Wang, Yuping Duan
Affiliations: Beijing Normal University; Tianjin University; Tianjin Normal University; Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Computational imaging, especially non-line-of-sight (NLOS) imaging, the extraction of information from obscured or hidden scenes is achieved through the utilization of indirect light signals resulting from multiple reflections or scattering. The inherently weak nature of these signals, coupled with their susceptibility to noise, necessitates the integration of physical processes to ensure accurate reconstruction. This paper presents a parameterized inverse problem framework tailored for large-scale linear problems in 3D imaging reconstruction. Initially, a noise estimation module is employed to adaptively assess the noise levels present in transient data. Subsequently, a parameterized neural operator is developed to approximate the inverse mapping, facilitating end-to-end rapid image reconstruction. Our 3D image reconstruction framework, grounded in operator learning, is constructed through deep algorithm unfolding, which not only provides commendable model interpretability but also enables dynamic adaptation to varying noise levels in the acquired data, thereby ensuring consistently robust and accurate reconstruction outcomes. Furthermore, we introduce a novel method for the fusion of global and local spatiotemporal data features. By integrating structural and detailed information, this method significantly enhances both accuracy and robustness. Comprehensive numerical experiments conducted on both simulated and real datasets substantiate the efficacy of the proposed method. It demonstrates remarkable performance with fast scanning data and sparse illumination point data, offering a viable solution for NLOS imaging in complex scenarios.
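
Deep algorithm unfolding, the training scheme named above, unrolls a fixed number of iterations of a classical solver and learns parts of each iteration. Below is a generic PyTorch sketch with a linear operator A standing in for the transient imaging forward model; the step sizes and small proximal networks are the learned components, and everything here is illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class UnfoldedSolver(nn.Module):
    """Deep algorithm unfolding sketch: K unrolled gradient steps on
    ||A x - y||^2, each followed by a small learned proximal network."""
    def __init__(self, n=64, K=5):
        super().__init__()
        self.step = nn.Parameter(torch.full((K,), 0.1))    # learned step sizes
        self.prox = nn.ModuleList(
            [nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n)) for _ in range(K)]
        )

    def forward(self, A, y):
        x = torch.zeros(A.shape[1])
        for k, prox in enumerate(self.prox):
            grad = A.T @ (A @ x - y)                       # data-fidelity gradient
            x = prox(x - self.step[k] * grad)              # learned prior as proximal map
        return x

A, y = torch.randn(32, 64), torch.randn(32)
print(UnfoldedSolver()(A, y).shape)  # torch.Size([64])
```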

[CV-49] TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos

【Quick Read】: This paper addresses insufficient robustness of ball tracking under occlusion in sports video analysis, which directly affects downstream tasks such as event detection and officiating. The key is TOTNet, a Temporal Occlusion Tracking Network combining spatio-temporal feature modeling, a visibility-weighted loss, and occlusion augmentation: 3D convolutions capture temporal dynamics, and simulated occlusion scenarios during training improve adaptation to partial and full occlusions, markedly improving tracking accuracy and stability in fast-paced sports scenes.

Link: https://arxiv.org/abs/2508.09650
Authors: Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazely, Sunil Aryal
Affiliations: Deakin University; Paralympics Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Click to view abstract

Abstract:Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data access: this https URL (AugustRushG/TOTNet).

[CV-50] he Brain Resection Multimodal Image Registration (ReMIND2Reg) 2025 Challenge

【Quick Read】: This paper addresses the loss of spatial accuracy between preoperative MRI and intraoperative ultrasound (iUS) caused by brain shift during brain tumor surgery, which hampers maximal safe resection. The key is multimodal image registration: aligning post-resection iUS with preoperative MRI to estimate brain-shift deformations and restore the spatial accuracy of navigation systems. The approach rests on the ReMIND2Reg 2025 Challenge, the largest public benchmark for this task, comprising 99 training cases, 5 validation cases, and 10 private test cases; training data are provided without annotations, while evaluation uses manually annotated anatomical landmarks with metrics including target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime, aiming to accelerate the development of robust, generalizable, clinically deployable multimodal registration algorithms.

Link: https://arxiv.org/abs/2508.09649
Authors: Reuben Dorent, Laura Rigolo, Colin P. Galvin, Junyu Chen, Mattias P. Heinrich, Aaron Carass, Olivier Colliot, Demian Wassermann, Alexandra Golby, Tina Kapur, William Wells
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate intraoperative image guidance is critical for achieving maximal safe resection in brain tumor surgery, yet neuronavigation systems based on preoperative MRI lose accuracy during the procedure due to brain shift. Aligning post-resection intraoperative ultrasound (iUS) with preoperative MRI can restore spatial accuracy by estimating brain shift deformations, but it remains a challenging problem given the large anatomical and topological changes and substantial modality intensity gap. The ReMIND2Reg 2025 Challenge provides the largest public benchmark for this task, built upon the ReMIND dataset. It offers 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes. Data are provided without annotations for training, while validation and test performance are evaluated on manually annotated anatomical landmarks. Metrics include target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. By establishing a standardized evaluation framework for this clinically critical and technically complex problem, ReMIND2Reg aims to accelerate the development of robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery.
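
The headline metric is simple to compute once a registration is estimated: warp the moving-image landmarks and measure the residual distances. A NumPy sketch (TRE30, the challenge's worst-case robustness metric, follows the challenge's own definition, which is not restated here):

```python
import numpy as np

def tre(moving_landmarks, fixed_landmarks, transform):
    """Target registration error: warp the moving-image landmarks with the
    estimated transform and measure distances (in mm) to their fixed
    counterparts. `transform` maps an (N, 3) array to an (N, 3) array."""
    warped = transform(moving_landmarks)
    errors = np.linalg.norm(warped - fixed_landmarks, axis=1)
    return errors.mean(), errors.max()           # mean and worst-case landmark error

mri_pts = np.random.rand(10, 3) * 100            # landmarks in MRI space
ius_pts = mri_pts + np.random.randn(10, 3)       # simulated brain shift in iUS space
identity = lambda p: p                           # no-op baseline "registration"
print(tre(mri_pts, ius_pts, identity))
```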

[CV-51] Multi-Sequence Parotid Gland Lesion Segmentation via Expert Text-Guided Segment Anything Model

【Quick Read】: This paper addresses the difficulty of accurately segmenting parotid gland lesions due to large size variation and complex boundaries, the reliance of existing methods on precise prompts (points, boxes, masks) that are hard to obtain in real clinical settings, and the neglect of expert domain knowledge in current medical image segmentation. The key is the proposed parotid gland segment anything model (PG-SAM), an expert-diagnosis-text-guided SAM: an expert diagnosis report guided prompt generation module automatically produces prompts containing prior domain knowledge to guide segmentation; a cross-sequence attention module fuses complementary information across modalities; and the multi-sequence image features and generated prompts are fed into the decoder, improving the accuracy and clinical applicability of cross-sequence parotid gland lesion segmentation.

Link: https://arxiv.org/abs/2508.09645
Authors: Zhongyuan Wu, Chuan-Xian Ren, Yu Wang, Xiaohua Ban, Jianning Xiao, Xiaohui Duan
Affiliations: Sun Yat-sen University; Sun Yat-sen University Cancer Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Parotid gland lesion segmentation is essential for the treatment of parotid gland diseases. However, due to the variable size and complex lesion boundaries, accurate parotid gland lesion segmentation remains challenging. Recently, the Segment Anything Model (SAM) fine-tuning has shown remarkable performance in the field of medical image segmentation. Nevertheless, SAM’s interaction segmentation model relies heavily on precise lesion prompts (points, boxes, masks, etc.), which are very difficult to obtain in real-world applications. Besides, current medical image segmentation methods are automatically generated, ignoring the domain knowledge of medical experts when performing segmentation. To address these limitations, we propose the parotid gland segment anything model (PG-SAM), an expert diagnosis text-guided SAM incorporating expert domain knowledge for cross-sequence parotid gland lesion segmentation. Specifically, we first propose an expert diagnosis report guided prompt generation module that can automatically generate prompt information containing the prior domain knowledge to guide the subsequent lesion segmentation process. Then, we introduce a cross-sequence attention module, which integrates the complementary information of different modalities to enhance the segmentation effect. Finally, the multi-sequence image features and generated prompts are feed into the decoder to get segmentation result. Experimental results demonstrate that PG-SAM achieves state-of-the-art performance in parotid gland lesion segmentation across three independent clinical centers, validating its clinical applicability and the effectiveness of diagnostic text for enhancing image segmentation in real-world clinical settings.
zh

[CV-52] Multi-Contrast Fusion Module: An attention mechanism integrating multi-contrast features for fetal torso plane classification

【速读】:该论文旨在解决产前超声图像中胎儿躯干标准切面识别困难的问题,尤其针对图像对比度低、纹理细节不清晰导致的精细解剖结构识别准确性不足。其解决方案的关键在于提出一种多对比度融合模块(Multi-Contrast Fusion Module, MCFM),该模块仅作用于神经网络的底层,直接处理原始超声数据,通过为不同对比度条件下的图像表征分配注意力权重,增强特征建模能力并保持极小的参数开销,从而显著提升对细微解剖结构的捕捉能力与分类准确率。

链接: https://arxiv.org/abs/2508.09644
作者: Shengjun Zhu,Siyu Liu,Runqing Xiong,Liping Zheng,Duo Ma,Rongshang Chen,Jiaxin Cai
机构: Xiamen University of Technology (厦门理工学院); The Second Affiliated Hospital of Xiamen Medical College (厦门医学院第二附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Prenatal ultrasound is a key tool in evaluating fetal structural development and detecting abnormalities, contributing to reduced perinatal complications and improved neonatal survival. Accurate identification of standard fetal torso planes is essential for reliable assessment and personalized prenatal care. However, limitations such as low contrast and unclear texture details in ultrasound imaging pose significant challenges for fine-grained anatomical recognition. Methods: We propose a novel Multi-Contrast Fusion Module (MCFM) to enhance the model’s ability to extract detailed information from ultrasound images. MCFM operates exclusively on the lower layers of the neural network, directly processing raw ultrasound data. By assigning attention weights to image representations under different contrast conditions, the module enhances feature modeling while explicitly maintaining minimal parameter overhead. Results: The proposed MCFM was evaluated on a curated dataset of fetal torso plane ultrasound images. Experimental results demonstrate that MCFM substantially improves recognition performance, with a minimal increase in model complexity. The integration of multi-contrast attention enables the model to better capture subtle anatomical structures, contributing to higher classification accuracy and clinical reliability. Conclusions: Our method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging. By enhancing feature representation through multi-contrast fusion, the proposed approach supports clinicians in achieving more accurate and consistent diagnoses, demonstrating strong potential for clinical adoption in prenatal screening. The codes are available at this https URL.
zh

[CV-53] Preacher: Paper-to-Video Agentic System

【速读】:该论文旨在解决纸面研究论文到结构化视频摘要(paper-to-video)转换中的关键挑战,包括现有视频生成模型受限的上下文窗口、固定的视频时长约束、风格多样性不足以及无法有效表达领域特定知识等问题。解决方案的核心在于提出首个论文到视频的智能体系统 Preacher,其采用自顶向下(top-down)的分解、总结与重构策略,结合自底向上(bottom-up)的视频生成流程,通过定义关键场景并引入渐进式思维链(Progressive Chain of Thought, P-CoT)实现跨模态表征对齐,从而生成高质量且领域适配的视频摘要,在五个不同研究领域中展现出超越当前视频生成模型的专业能力。

链接: https://arxiv.org/abs/2508.09632
作者: Jingwei Liu,Ling Yang,Hao Luo,Fan Wang,Hongyan Li,Mengdi Wang
机构: Peking University (北京大学); Alibaba group (阿里巴巴集团); Hupan Lab (湖畔实验室); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: this https URL
zh

[CV-54] Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors

【速读】:该论文旨在解决单目3D手部重建中纹理信息未被充分利用的问题,即现有高性能模型在预测手部几何结构时,其与图像外观的对齐往往不完美,表明纹理可作为潜在的监督信号。解决方案的关键在于提出一个轻量级纹理模块,该模块将像素级观测嵌入到UV纹理空间,并引入一种新型的密集对齐损失函数,实现预测与观测手部外观之间的像素级对齐。该方法依赖于可微分渲染管线和已知拓扑结构的3D手部网格生成模型,通过将纹理化手部回投影至图像平面进行像素级匹配,从而提升姿态与形状估计的准确性与真实感。

链接: https://arxiv.org/abs/2508.09629
作者: Giorgos Karvounas,Nikolaos Kyriazis,Iason Oikonomidis,Georgios Pavlakos,Antonis A. Argyros
机构: ICS-FORTH(希腊信息与计算研究所); University of Texas at Austin(德克萨斯大学奥斯汀分校); University of Crete(克里特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.
zh

[CV-55] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation AAAI2026

【速读】:该论文旨在解决3D航空视角场景语义分割(3D-AVS-SS)中因尺度变化和结构遮挡导致的语义模糊问题,此类问题限制了传统方法的分割精度与一致性。解决方案的关键在于提出一种名为SAD-Splat的新方法:其核心创新包括一个高斯点删除模块(Gaussian point drop module),该模块结合语义置信度估计与基于Hard Concrete分布的可学习稀疏机制,有效剔除冗余及语义模糊的高斯点,从而提升分割性能与表示紧凑性;同时引入高置信度伪标签生成流水线,利用2D基础模型在真值标签稀缺时增强监督信号,进一步提高分割准确性。

链接: https://arxiv.org/abs/2508.09626
作者: Xu Tang,Junan Jia,Yijing Wang,Jingjing Ma,Xiangrong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures, AAAI 2026

点击查看摘要

Abstract:In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
zh
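
代码示意:摘要中的可学习稀疏机制基于 Hard Concrete 分布。下面给出一个极简 PyTorch 示意,假设采用 Louizos 等人 L0 正则化工作中的标准 Hard Concrete 参数化(β、γ、ζ 取该文献的常用默认值);论文中与语义置信度的具体耦合方式未公开,此处不涉及:

```python
import math
import torch

def hard_concrete_gate(log_alpha, beta=0.66, gamma=-0.1, zeta=1.1, training=True):
    """Sample a [0, 1] gate per Gaussian point from the Hard Concrete
    distribution (standard parameterization; an assumption, not the paper's
    exact module). Points whose gate reaches 0 can be dropped."""
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    else:
        s = torch.sigmoid(log_alpha)
    s_bar = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)         # hard clip into [0, 1]

# Learnable drop logits for 10k Gaussian points.
log_alpha = torch.zeros(10_000, requires_grad=True)
gates = hard_concrete_gate(log_alpha)
# Expected-L0 penalty that encourages pruning redundant points.
l0 = torch.sigmoid(log_alpha - 0.66 * math.log(0.1 / 1.1)).sum()
```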

[CV-56] Plane Detection and Ranking via Model Information Optimization IROS

【速读】:该论文旨在解决深度图像中平面检测的误检问题,尤其在复杂真实场景下,由于RANSAC算法中内点阈值的模糊性导致的虚假平面检测现象。其解决方案的关键在于提出一种基于模型信息优化(model information optimization)的通用框架:将深度读数视为离散随机变量,并通过重复随机子采样生成包含不同候选平面约束的多种模型;结合深度传感器的物理特性与噪声模型计算各模型的信息量,选择信息最少的模型作为最可能的真实平面结构,从而客观确定平面数量并抑制误检;同时,通过累加每个平面内点的信息减少量对检测结果进行质量排序,提升平面参数估计的准确性。

链接: https://arxiv.org/abs/2508.09625
作者: Daoxin Zhong,Jun Li,Meng Yee Michael Chuah
机构: Institute for Infocomm Research (I2R), A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted as contributed paper in the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

点击查看摘要

Abstract:Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately compared to the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.
zh
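
代码示意:下面用 NumPy 勾勒“重复随机子采样 + 模型信息最优化”的思路。其中描述长度式的评分(每平面 32、每个未解释点 8 等常数)纯属示例,用以替代论文中基于深度传感器物理与噪声模型的信息量计算:

```python
import numpy as np

def fit_plane(pts):
    """Least-squares plane through >=3 points: unit normal n, offset d (n.x = d)."""
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)
    n = vt[-1]
    return n, float(n @ c)

def model_information(points, planes, sigma=0.01):
    """Toy description-length score: inlier residual cost plus a fixed per-plane
    penalty; points no plane explains pay a flat cost. Lower is better, so
    spurious extra planes are discouraged (a stand-in for the paper's sensor
    noise model)."""
    cost, unexplained = 0.0, np.ones(len(points), dtype=bool)
    for n, d in planes:
        r = np.abs(points @ n - d)
        inliers = r < 3 * sigma
        cost += 0.5 * np.sum((r[inliers] / sigma) ** 2) + 32.0
        unexplained &= ~inliers
    return cost + 8.0 * unexplained.sum()

def detect_planes(points, n_models=200, max_planes=4, seed=0):
    rng = np.random.default_rng(seed)
    best, best_info = [], np.inf
    for _ in range(n_models):                      # repeated random sub-sampling
        k = int(rng.integers(1, max_planes + 1))   # candidate number of planes
        planes = [fit_plane(points[rng.choice(len(points), 3, replace=False)])
                  for _ in range(k)]
        info = model_information(points, planes)
        if info < best_info:
            best, best_info = planes, info
    return best                                    # least-information model wins
```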

[CV-57] MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

【速读】:该论文旨在解决稀疏视图锥形束计算机断层成像(Cone Beam Computed Tomography, CBCT)中因投影数据不足导致的伪影问题,从而降低成像辐射剂量。其核心解决方案是提出MInDI-3D模型,首次将二维“直接迭代去噪”(InDI)概念扩展至三维(3D)医学图像体积域,通过迭代去噪过程直接从稀疏视图输入重构高质量CBCT体积;关键创新在于构建了一个包含16,182个样本的大规模伪CBCT数据集(基于CT-RATE公开胸部CT数据),并利用该数据集训练模型,使其在真实世界场景下实现显著的伪影抑制效果(如PSNR提升达12.96 dB),同时支持多扫描几何配置下的泛化能力,并获得临床医生对患者定位和肿瘤边界保持的积极评价。

链接: https://arxiv.org/abs/2508.09616
作者: Daniel Barco(1),Marc Stadelmann(1),Martin Oswald(1),Ivo Herzig(2),Lukas Lichtensteiger(2),Pascal Paysan(3),Igor Peterlik(3),Michal Walczak(3),Bjoern Menze(4),Frank-Peter Schilling(1) ((1) Centre for Artificial Intelligence (CAI), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (2) Institute of Applied Mathematics and Physics (IAMP), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (3) Varian Medical Systems Imaging Lab, Baden, Switzerland, (4) Biomedical Image Analysis and Machine Learning, University of Zurich, Zurich, Switzerland)
机构: Zurich University of Applied Sciences (ZHAW)(苏黎世应用科学大学); Varian Medical Systems Imaging Lab(瓦里安医疗系统成像实验室); University of Zurich(苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the “InDI” concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182 volumes) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D’s effectiveness, achieving a 12.96 dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT test set (6.10 dB on the independent real-world test set) and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.
zh
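
代码示意:InDI 式的“直接迭代”可概括为:在当前估计与网络预测的干净体数据之间按时间表插值,逐步逼近干净结果。以下为该更新规则的一个示意,假设 `model(x, t)` 从当前 3D CBCT 体数据估计干净体数据;步数与时间表为示例值,并非论文配置:

```python
import torch

@torch.no_grad()
def indi_restore(x_noisy, model, steps=20):
    """Iterative direct-iteration refinement (the published InDI update rule,
    applied here to a 3D volume tensor of shape [1, 1, D, H, W] as a sketch).
    `model(x, t)` is assumed to predict the clean volume from the current
    estimate at time t."""
    x = x_noisy
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = model(x, t)                               # clean-volume estimate
        x = (t_next / t) * x + (1 - t_next / t) * x0_hat   # blend toward estimate
    return x
```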

[CV-58] BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Birds Eye View Map Segmentation

【速读】:该论文旨在解决摄像头-only(Camera-only)方法在鸟瞰图(Bird’s-Eye-View, BEV)地图分割任务中性能显著落后于激光雷达-相机融合(LiDAR-Camera, LC)方法的问题,同时避免传统知识蒸馏(Knowledge Distillation, KD)通过扩大学生模型规模带来的推理成本上升。其解决方案的关键在于提出了一种名为BridgeTA的轻量级蒸馏框架,通过引入一个教师助手(Teacher Assistant, TA)网络,在不改变学生模型结构和推理开销的前提下,构建教师与学生BEV表示之间的共享潜在空间,从而弥合二者表征差异;理论层面,作者基于Young不等式推导出新的蒸馏损失函数,将直接的师生蒸馏路径分解为教师-TA与TA-学生双路径,有效稳定优化过程并增强知识迁移效率。

链接: https://arxiv.org/abs/2508.09599
作者: Beomjun Kim,Suhan Woo,Sejong Heo,Euntai Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Bird’s-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher’s architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student’s architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young’s Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
zh
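
公式示意:摘要提到用 Young 不等式把“教师→学生”的直接蒸馏路径分解为“教师→TA”与“TA→学生”两条子路径。一种常见的推导方式如下(仅为说明性示例,未必等同于论文的具体形式):记教师、助教、学生的 BEV 表征为 T、A、S,由 Young 不等式 2⟨u,v⟩ ≤ ε‖u‖² + ε⁻¹‖v‖²(任意 ε > 0)可得

```latex
\|T - S\|^{2} = \|(T - A) + (A - S)\|^{2}
\le (1 + \varepsilon)\,\|T - A\|^{2}
  + \Bigl(1 + \tfrac{1}{\varepsilon}\Bigr)\,\|A - S\|^{2}.
```

即同时最小化两条子路径的损失可为直接师生蒸馏损失给出上界,这解释了该分解为何有助于稳定优化并强化知识迁移。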

[CV-59] Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality

【速读】:该论文旨在解决当前扩散模型(Diffusion Models)在图像生成中存在的一大矛盾:尽管主流模型在Fréchet Inception Distance (FID)等全局分布指标上表现优异,但其生成的个体样本常出现失真或低质量问题。这一现象的根本原因在于FID仅衡量整体分布对齐度,而忽略了单张图像的感知质量(perceptual quality)。为应对此问题,作者提出一种无需训练且推理高效的解决方案FaME(Failure-aware Model Enhancement),其核心创新在于利用预训练的图像质量评估模型(IQA)识别低质量生成样本,并记录其采样轨迹;随后将这些失败模式作为负向引导信号,在后续采样过程中主动规避低质量区域,从而提升生成图像的视觉质量而不损害FID性能。

链接: https://arxiv.org/abs/2508.09598
作者: Jie Shao,Ke Zhu,Minghao Fu,Guo-hua Wang,Jianxin Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive FID scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment, while ignoring the perceptual quality of individual samples. We further examine the role of classifier-free guidance (CFG), a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.
zh
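
代码示意:“以失败模式作负向引导”可以在标准 classifier-free guidance 之上额外减去一项指向失败轨迹的方向。以下公式形式为合理推测而非论文原式,权重 `w`、`v` 为假设超参数:

```python
import torch

def fame_guided_eps(eps_cond, eps_uncond, eps_failure, w=4.0, v=1.0):
    """CFG extended with negative guidance away from a stored failure mode
    (a sketch of the idea in the abstract, not the paper's exact rule).
    `eps_failure` is the noise prediction replayed from a low-quality
    sampling trajectory flagged by an IQA model."""
    guided = eps_uncond + w * (eps_cond - eps_uncond)   # standard CFG term
    guided = guided - v * (eps_failure - eps_uncond)    # push away from failures
    return guided
```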

[CV-60] SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

【速读】:该论文旨在解决高保真且可编辑的头部虚拟形象(head avatar)构建问题,尤其聚焦于实时外观编辑的挑战,这主要受限于隐式表示和几何与全局外观之间的耦合建模。解决方案的关键在于提出了一种新颖的混合表示方法——表面-体积高斯头部虚拟形象(Surface-Volumetric Gaussian Head Avatar, SVG-Head),其核心创新包括:1)通过绑定在FLAME网格上的3D高斯点显式建模几何结构,并利用可学习的解耦纹理图像捕捉全局外观,从而支持实时纹理编辑;2)引入两类高斯点——表面高斯点用于基于纹理图像的外观建模,体积高斯点增强非朗伯区域(如嘴唇和头发)的重建质量;3)设计了基于网格的高斯UV映射方法,利用FLAME网格提供的UV坐标实现清晰纹理图像和实时渲染速度;4)采用分层优化策略,在重建质量和编辑灵活性之间取得最优平衡。

链接: https://arxiv.org/abs/2508.09597
作者: Heyi Sun,Cong Wang,Tian-Xing Xu,Jingwei Huang,Di Kang,Chunchao Guo,Song-Hai Zhang
机构: Tsinghua University (清华大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.
zh

[CV-61] Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma

【速读】:该论文旨在解决当前胶质瘤IDH突变状态预测方法受限于功能磁共振成像(fMRI)数据低可用性和噪声干扰的问题,同时克服现有基于结构和形态连接组的方法在建模脑网络层次结构与多尺度交互时的不足。其解决方案的关键在于提出一种名为Hi-SMGNN的分层框架,通过整合区域到模块级别的结构与形态连接组信息,引入双胞胎网络(Siamese network)与跨模态注意力机制实现多模态交互,设计多尺度特征融合机制以降低冗余,并采用个性化模块划分策略提升个体特异性和可解释性,从而显著增强模型在UCSF-PDGM数据集上的鲁棒性与预测性能。

链接: https://arxiv.org/abs/2508.09593
作者: Haotian Tang,Jianwei Chen,Xinrui Tang,Yunjia Wu,Zhengyang Miao,Chao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Isocitrate Dehydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain’s hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction.
zh

[CV-62] Offline Auto Labeling: BAAS

【速读】:该论文旨在解决自动驾驶场景中雷达检测数据的多目标跟踪(Extended Object Tracking, EOT)与标签标注融合问题,尤其在不同监督水平下实现精确的目标轨迹估计与形状建模,从而生成可靠的标注标签。其解决方案的关键在于提出了一种基于贝叶斯(Bayesian)的跟踪、平滑及融合框架(BAAS),通过多模块协同处理,能够在无监督或弱监督条件下提供高精度的物体轨迹和形状估计,并支持对跟踪性能与标注误差的量化评估,同时具备模块化分析能力以实现闭环迭代优化。

链接: https://arxiv.org/abs/2508.09585
作者: Stefan Haag,Bharanidhar Duraisamy,Felix Govaers,Wolfgang Koch,Martin Fritzsche,Juergen Dickmann
机构: Mercedes-Benz AG (梅赛德斯-奔驰集团); Fraunhofer FKIE (弗劳恩霍夫信息融合研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:This paper introduces BAAS, a new Extended Object Tracking (EOT) and fusion-based label annotation framework for radar detections in autonomous driving. Our framework utilizes Bayesian tracking, smoothing, and finally fusion methods to provide reliable and precise object trajectories along with shape estimation, yielding annotation labels at the detection level under various supervision levels. Simultaneously, the framework provides evaluation of tracking performance and label annotation. If manually labeled data is available, each processing module can be analyzed independently or combined with other modules to enable closed-loop continuous improvements. The framework’s performance is evaluated in a challenging urban real-world scenario in terms of tracking performance and label annotation errors. We demonstrate the functionality of the proposed approach for varying dynamic objects and class types.
zh

[CV-63] SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,特别是忠实性(faithfulness)和事实性(factuality)幻觉的评估不足与数据构建瓶颈。现有研究多在粗粒度层面(如物体级别)评估忠实性幻觉,缺乏细粒度分析;同时,主流基准依赖人工标注或复用公开数据集,存在可扩展性差和数据泄露风险。解决方案的关键在于提出一个自动化的数据构建流水线,实现可扩展、可控且多样化的评估数据生成,并设计了一种分层幻觉诱导框架,通过输入扰动模拟真实噪声场景,从而构建了SHALE基准——一个涵盖30K图像-指令对、覆盖12个视觉感知维度和6个知识领域的细粒度幻觉评估体系,支持在干净与噪声场景下对LVLMs进行系统性评测。

链接: https://arxiv.org/abs/2508.09584
作者: Bei Yan,Zhiyuan Chen,Yuecong Min,Jie Zhang,Jiahao Wang,Xiaozhen Wang,Shiguang Shan
机构: Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS); University of Chinese Academy of Sciences; Trustworthy Technology and Engineering Laboratory, Huawei
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.
zh

[CV-64] Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

【速读】:该论文旨在解决可控文本到图像(Text-to-Image, T2I)扩散模型在空间结构保持和细粒度条件建模方面的局限性,尤其是当控制条件涉及物体姿态(pose)与场景布局(scene layout)时,现有方法难以准确保留空间结构并实现精细控制。解决方案的关键在于提出一种无需训练的双递归反馈(Dual Recursive Feedback, DRF)机制,通过外观反馈(appearance feedback)和生成反馈(generation feedback)双重路径递归优化中间潜在表示(intermediate latents),从而引导潜在空间走向更可靠的流形,有效融合结构与外观属性,并支持类不变结构-外观融合任务(如将人类动作迁移到虎的形态上)。

链接: https://arxiv.org/abs/2508.09575
作者: Jiwon Kim,Pureum Kim,SeonHwa Kim,Soobin Park,Eunju Cha,Kyong Hwan Jin
机构: Korea University (韩国大学); Sookmyung Women’s University (淑明女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user’s intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even between class-invariant structure-appearance fusion, such as transferring human motion onto a tiger’s form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at this https URL.
zh

[CV-65] A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation

【速读】:该论文旨在解决放射学报告生成(Radiology Report Generation, RRG)中存在的两大问题:一是临床有效性不足,尤其在病灶属性描述方面表现欠佳;二是生成文本缺乏可解释性,导致放射科医生难以信任模型输出。为应对这些挑战,作者提出了一种名为“诊断链”(Chain of Diagnosis, CoD)的可信RRG框架,其核心在于通过模拟诊断流程实现准确且可解释的报告生成。关键创新包括:1)利用诊断对话生成问题-答案(QA)对以提取关键发现;2)基于QA诊断结果提示大语言模型进行精准文本生成;3)设计诊断定位模块将生成句子与QA诊断关联,提升可解释性;4)引入病灶定位模块实现图像中异常区域的精确标注,从而提高放射科医生的工作效率。此外,论文还提出一种融合临床一致性的全监督学习策略,有效利用多源标注数据,显著提升了模型性能与可信度。

链接: https://arxiv.org/abs/2508.09566
作者: Haibo Jin,Haoxuan Che,Sunan He,Hao Chen
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE TMI

点击查看摘要

Abstract:Despite the progress of radiology report generation (RRG), existing works face two challenges: 1) performance in clinical efficacy is unsatisfactory, especially for lesion attribute description; 2) the generated text lacks explainability, making it difficult for radiologists to trust the results. To address the challenges, we focus on a trustworthy RRG model, which not only generates accurate descriptions of abnormalities, but also provides the basis for its predictions. To this end, we propose a framework named chain of diagnosis (CoD), which maintains a chained diagnostic process for clinically accurate and explainable RRG. It first generates question-answer (QA) pairs via diagnostic conversation to extract key findings, then prompts a large language model with QA diagnoses for accurate generation. To enhance explainability, a diagnosis grounding module is designed to match QA diagnoses and generated sentences, where the diagnoses act as a reference. Moreover, a lesion grounding module is designed to locate abnormalities in the image, further improving the working efficiency of radiologists. To facilitate label-efficient training, we propose an omni-supervised learning strategy with clinical consistency to leverage various types of annotations from different datasets. Our efforts lead to 1) an omni-labeled RRG dataset with QA pairs and lesion boxes; 2) an evaluation tool for assessing the accuracy of reports in describing lesion location and severity; 3) extensive experiments to demonstrate the effectiveness of CoD, where it outperforms both specialist and generalist models consistently on two RRG benchmarks and shows promising explainability by accurately grounding generated sentences to QA diagnoses and images.
zh

[CV-66] WEC-DG: Multi-Exposure Wavelet Correction Method Guided by Degradation Description

【速读】:该论文旨在解决单曝光图像在复杂光照条件下(如不同天气、拍摄环境等)因光照不足或过曝导致的视觉质量下降问题,尤其针对现有多曝光校正方法难以应对类内光照差异及“模糊”曝光退化现象所引发的误校正问题。解决方案的关键在于提出一种基于小波变换的带退化引导的曝光校正方法(WEC-DG),其核心创新包括:1)在处理流程两端引入退化描述符的曝光一致性对齐模块(ECAM),以识别并补偿“模糊”类曝光退化,确保曝光一致性;2)利用小波变换的光-细节解耦特性设计曝光恢复与细节重建模块(EDRM),先在低频域进行曝光增强,再以高频信息作为先验指导空间细节重建,从而实现精准光照校正与高质量细节恢复。

链接: https://arxiv.org/abs/2508.09565
作者: Ming Zhao,Pingping Liu,Tongshun Zhang,Zhe Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-exposure correction technology is essential for restoring images affected by insufficient or excessive lighting, enhancing the visual experience by improving brightness, contrast, and detail richness. However, current multi-exposure correction methods often encounter challenges in addressing intra-class variability caused by diverse lighting conditions, shooting environments, and weather factors, particularly when processing images captured at a single exposure level. To enhance the adaptability of these models under complex imaging conditions, this paper proposes a Wavelet-based Exposure Correction method with Degradation Guidance (WEC-DG). Specifically, we introduce a degradation descriptor within the Exposure Consistency Alignment Module (ECAM) at both ends of the processing pipeline to ensure exposure consistency and achieve final alignment. This mechanism effectively addresses miscorrected exposure anomalies caused by existing methods’ failure to recognize ‘blurred’ exposure degradation. Additionally, we investigate the light-detail decoupling properties of the wavelet transform to design the Exposure Restoration and Detail Reconstruction Module (EDRM), which processes low-frequency information related to exposure enhancement before utilizing high-frequency information as a prior guide for reconstructing spatial domain details. This serial processing strategy guarantees precise light correction and enhances detail recovery. Extensive experiments conducted on multiple public datasets demonstrate that the proposed method outperforms existing algorithms, achieving significant performance improvements and validating its effectiveness and practical applicability.
zh
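
代码示意:小波的“光照-细节解耦”可用单层 2D DWT 直观演示:LL 低频子带主要承载曝光/光照信息,LH/HL/HH 高频子带承载空间细节。以下用 PyWavelets 给出极简示意,仅以一个固定增益代替论文中可学习的 EDRM 模块:

```python
import numpy as np
import pywt

def wavelet_decouple(img_gray):
    """Single-level 2D DWT: the LL band carries illumination/exposure, while
    the LH/HL/HH bands carry spatial detail, mirroring the decoupling that
    WEC-DG exploits."""
    cA, (cH, cV, cD) = pywt.dwt2(img_gray, 'haar')
    return cA, (cH, cV, cD)

def correct_exposure(img_gray, gain=1.4):
    # A deliberately crude stand-in for the learned EDRM module: adjust
    # exposure in the low-frequency band only, then reconstruct with the
    # detail bands left untouched.
    cA, details = wavelet_decouple(img_gray.astype(np.float64))
    return pywt.idwt2((cA * gain, details), 'haar')
```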

[CV-67] WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization

【速读】:该论文旨在解决无人机视觉地理定位(visual geo-localization)在恶劣天气条件下性能显著下降的问题,现有方法存在两个核心局限:一是依赖有限的天气类别导致泛化能力差,二是通过伪天气类别难以有效解耦场景与天气特征。解决方案的关键在于提出WeatherPrompt框架,其创新性地引入了两个核心机制:一是无需训练的天气推理机制(Training-free Weather Reasoning),利用现成的大规模多模态模型通过类人推理生成多天气文本描述,从而提升对未见或复杂天气的适应性并体现不同天气强度;二是基于文本嵌入驱动的动态门控机制(dynamic gating mechanism),自适应地重加权和融合跨模态视觉特征,结合图像-文本对比学习与匹配目标优化表示空间,使同一场景在不同天气下的表征更加接近,从而实现天气不变的特征表示。

链接: https://arxiv.org/abs/2508.09560
作者: Jiahao Wen,Hang Yu,Zhedong Zheng
机构: Shanghai University (上海大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 13 pages, 4figures

点击查看摘要

Abstract:Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) Heavy reliance on limited weather categories that constrain generalization, and 2) Suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations by fusing the image embedding with the text context. Our framework introduces two key contributions: First, a Training-free Weather Reasoning mechanism that employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves scalability to unseen or complex weather and can reflect different weather strengths. Second, to better disentangle the scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by 13.37% under night conditions and by 18.69% under fog and snow conditions.
zh
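
代码示意:“文本嵌入驱动的动态门控”可概括为:由天气描述文本的嵌入预测逐通道门控权重,对两路视觉特征自适应加权融合。以下 PyTorch 示意中的维度与网络结构均为假设,并非论文原实现:

```python
import torch
import torch.nn as nn

class TextDrivenGate(nn.Module):
    """Fuse two visual feature streams with a gate predicted from the text
    embedding of a weather description (a sketch of the abstract's
    'dynamic gating mechanism driven by the text embedding')."""
    def __init__(self, txt_dim=512, vis_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(txt_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_a, vis_b, txt_emb):
        g = self.gate(txt_emb)            # [B, vis_dim]: one weight per channel
        return g * vis_a + (1 - g) * vis_b

fuse = TextDrivenGate()
fused = fuse(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 512))
```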

[CV-68] opological Invariant-Based Iris Identification via Digital Homology and Machine Learning

【速读】:该论文旨在解决生物特征识别中对高精度、可解释性与数据效率兼备方法的需求,尤其针对虹膜识别(iris recognition)场景。传统深度学习方法虽性能优异,但在小样本或安全关键领域面临可解释性不足的问题。其解决方案的关键在于首次将形式化的数字同调(digital homology)理论引入虹膜纹理建模,通过计算每个子区域的Betti数(Betti0和Betti1)及其比值,提取具有拓扑不变性的特征向量;该特征矩阵结合逻辑回归、K近邻(KNN)和支持向量机(SVM)等经典分类器,在PCA降维与随机重复实验下实现97.78% ± 0.82%的准确率,显著优于CNN基准模型(96.44% ± 1.32%),且具有更低方差与更强的可解释性,适用于CPU-only环境及对透明性要求高的应用场景。

链接: https://arxiv.org/abs/2508.09555
作者: Ahmet Öztel,İsmet Karaca
机构: Bartin University (巴廷大学); Ege University (艾吉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, includes visual abstract, focuses on topological invariants for iris recognition

点击查看摘要

Abstract:Objective - This study presents a biometric identification method based on topological invariants from 2D iris images, representing iris texture via formally defined digital homology and evaluating classification performance. Methods - Each normalized iris image (48x482 pixels) is divided into grids (e.g., 6x54 or 3x27). For each subregion, we compute Betti0, Betti1, and their ratio using a recent algorithm for homology groups in 2D digital images. The resulting invariants form a feature matrix used with logistic regression, KNN, and SVM (with PCA and 100 randomized repetitions). A convolutional neural network (CNN) is trained on raw images for comparison. Results - Logistic regression achieved 97.78 +/- 0.82% accuracy, outperforming CNN (96.44 +/- 1.32%) and other feature-based models. The topological features showed high accuracy with low variance. Conclusion - This is the first use of topological invariants from formal digital homology for iris recognition. The method offers a compact, interpretable, and accurate alternative to deep learning, useful when explainability or limited data is important. Beyond iris recognition, it can apply to other biometrics, medical imaging, materials science, remote sensing, and interpretable AI. It runs efficiently on CPU-only systems and produces robust, explainable features valuable for security-critical domains.
zh
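
代码示意:Betti 数特征提取可以用如下近似演示:对每个二值网格块,Betti0 取前景连通分量数(8 连通),Betti1 以背景连通分量数减一近似“洞”数(4 连通)。这是数字拓扑中的常用近似,并非论文的形式化数字同调算法;阈值化方式亦为示例:

```python
import numpy as np
from scipy import ndimage

def betti_features(binary_patch):
    """Approximate Betti numbers of a binary 2D patch: b0 = foreground
    connected components (8-connectivity); b1 = holes, counted as background
    components (4-connectivity) minus the outer background. A digital-topology
    approximation, not the paper's formal digital-homology algorithm."""
    fg = binary_patch.astype(bool)
    _, b0 = ndimage.label(fg, structure=np.ones((3, 3), dtype=int))
    padded_bg = np.pad(~fg, 1, constant_values=True)
    _, n_bg = ndimage.label(padded_bg)       # default 4-connectivity
    b1 = max(n_bg - 1, 0)
    return b0, b1, b0 / (b1 + 1)

def iris_feature_vector(norm_iris, grid=(6, 54)):
    """Split a normalized iris image (e.g., 48x482) into a grid and stack the
    three topological invariants of every cell into one feature vector."""
    h, w = norm_iris.shape
    t = norm_iris.mean()                     # toy binarization threshold
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = norm_iris[i*h//grid[0]:(i+1)*h//grid[0],
                             j*w//grid[1]:(j+1)*w//grid[1]] > t
            feats.extend(betti_features(cell))
    return np.asarray(feats)
```

得到的特征矩阵即可按摘要所述交给逻辑回归、KNN 或 SVM(配合 PCA)训练。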

[CV-69] Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification

【速读】:该论文旨在解决在图像分类任务中,如何利用训练集上的生成式模型(Generative Model)生成合成数据以提升分类性能的问题,即封闭集生成数据增强(closed-set generative data augmentation)。其关键解决方案在于通过系统性实验揭示真实图像与封闭集合成图像之间的差异与相似性,并实证确定达到等效分类性能所需的合成数据规模;同时量化了真实数据增强与开放集生成增强(open-set generative augmentation)之间的等效关系,为合成数据的使用提供可操作的指导原则,尤其明确了在不同基础训练集大小和合成数据量下性能变化的规律。

链接: https://arxiv.org/abs/2508.09550
作者: Haowen Wang,Guowei Zhang,Xiang Zhang,Zeyuan Chen,Haiyang Xu,Dou Hoon Kwark,Zhuowen Tu
机构: University of California, San Diego (加州大学圣地亚哥分校); University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.
zh

[CV-70] GoViG: Goal-Conditioned Visual Navigation Instruction Generation

【速读】:该论文旨在解决在未结构化环境中,仅依赖第一人称视觉观测(egocentric visual observations)自动生成精确且语义连贯的导航指令的问题。传统方法通常依赖语义标注或环境地图等结构化输入,限制了其在真实复杂场景中的适应性。解决方案的关键在于提出一种名为GoViG(Goal-Conditioned Visual Navigation Instruction Generation)的新任务框架,通过将问题分解为两个相互关联的子任务:视觉预测(visual forecasting),用于生成连接初始状态与目标状态的中间视觉序列;以及指令生成(instruction generation),基于实际观测与预测视觉内容合成语言指令。该方法整合于一个自回归多模态大语言模型中,并采用定制化训练目标以确保空间准确性与语言清晰度。此外,引入“单次推理”和“交错推理”两种多模态推理策略,模拟人类导航过程中的渐进式认知机制,从而提升指令生成质量与跨域泛化能力。

链接: https://arxiv.org/abs/2508.09547
作者: Fengyi Wu,Yifei Dong,Zhi-Qi Cheng,Yilong Dai,Guangyu Chen,Hang Wang,Qi Dai,Alexander G. Hauptmann
机构: University of Washington (华盛顿大学); Microsoft (微软); University of Science and Technology of China (中国科学技术大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review. Code: this https URL

点击查看摘要

Abstract:We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
zh

[CV-71] Iterative Volume Fusion for Asymmetric Stereo Matching

【速读】:该论文旨在解决异构立体匹配(asymmetric stereo matching)问题,即在多摄像头系统(如远摄-广角相机组合)中因视觉不对称性导致的传统对称立体匹配算法性能下降的问题。其核心挑战在于视觉不对称会破坏关键的代价体(cost volume)计算,从而影响深度估计精度。解决方案的关键在于:首先通过两阶段迭代体积融合网络(IVF-AStereo),分别处理两种不同构建方式的代价体(correlation volume 和 aggregated concatenation volume),发现二者均存在不同的信息失真;随后利用二者互补特性进行融合,以增强细节恢复能力,从而显著提升在分辨率和色彩退化等复杂条件下的鲁棒性与匹配精度。

链接: https://arxiv.org/abs/2508.09543
作者: Yuanting Gao,Linghao Shen
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Sony (China) Ltd. (索尼(中国)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between the two binocular views. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
zh
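
代码示意:摘要中分析的两类经典代价体可以如下构建:相关体对每个视差取左右特征的逐通道点积均值,拼接聚合体对每个视差堆叠左右特征。以下 PyTorch 示意中的特征形状与最大视差均为假设值:

```python
import torch
import torch.nn.functional as F

def build_cost_volumes(feat_l, feat_r, max_disp=48):
    """Construct the two classic cost volumes the paper analyzes: a
    correlation volume (per-disparity dot products) and a concatenation
    volume (per-disparity stacked features). feat_*: [B, C, H, W]."""
    B, C, H, W = feat_l.shape
    corr = feat_l.new_zeros(B, max_disp, H, W)
    concat = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        # Shift the right features by disparity d (zero-pad on the left).
        r_shift = F.pad(feat_r, (d, 0))[..., :W] if d > 0 else feat_r
        corr[:, d] = (feat_l * r_shift).mean(dim=1)
        concat[:, :C, d] = feat_l
        concat[:, C:, d] = r_shift
    return corr, concat
```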

[CV-72] COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

【速读】:该论文旨在解决多模态红绿蓝热成像(RGBT)图像中微小目标检测难题,尤其在无人机场景下因空间错位、低光照、遮挡及背景杂乱等因素导致的检测性能下降问题。现有方法难以有效利用可见光与热成像模态间的互补信息,从而限制了检测精度和鲁棒性。其解决方案的关键在于提出COXNet框架,包含三项核心创新:一是跨层融合模块(Cross-Layer Fusion Module),通过融合高层可见光特征与低层热成像特征提升语义与空间精度;二是动态对齐与尺度精化模块(Dynamic Alignment and Scale Refinement module),校正跨模态空间错位并保留多尺度特征;三是基于几何形状相似性度量(GeoShape Similarity Measure)的优化标签分配策略,改善定位准确性。该方法在RGBTDronePerson数据集上实现了mAP₅₀提升3.32%,验证了其在复杂环境下的有效性。

链接: https://arxiv.org/abs/2508.09533
作者: Peiran Peng,Tingfa Xu,Liqiang Song,Mengqi Zhu,Yuqiang Fang,Jianan Li
机构: Beijing Institute of Technology (北京理工大学); National Astronomical Observatories, Chinese Academy of Sciences (中国科学院国家天文台); National Key Laboratory of Space Target Awareness, Space Engineering University (空间工程大学空间目标感知国家重点实验室); Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education of China (教育部光电成像技术与系统重点实验室); Chongqing Innovation Center, Beijing Institute of Technology (北京理工大学重庆创新中心); China North Vehicle Research Institute (中国北方车辆研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.
zh

[CV-73] Physics-guided Deep Unfolding Network for Enhanced Kronecker Compressive sensing

【速读】:该论文旨在解决图像压缩感知(Compressed Sensing, CS)任务中两个关键问题:一是传感阶段的测量缺乏非相干性(incoherence),导致病态问题加剧;二是重建阶段对测量信息的隐式表示不足,限制了整体性能。解决方案的关键在于提出一种新型的不对称Kronecker压缩感知(Asymmetric Kronecker CS, AKCS)模型,理论上证明其相比传统Kronecker CS具有更好的测量非相干性且复杂度增加最小;同时引入测量感知交叉注意力机制(Measurement-aware Cross Attention, MACA),以学习测量的隐式表示,并将其与广泛使用的展开网络(unfolding network)架构集成,形成增强测量表示的展开网络(Measurement-Enhanced Unfolding Network, MEUNet),从而在重建精度和推理速度上均达到当前最优水平。

链接: https://arxiv.org/abs/2508.09528
作者: Gang Qu,Ping Wang,Siming Zheng,Xin Yuan
机构: Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Deep networks have achieved remarkable success in the image compressed sensing (CS) task, namely reconstructing a high-fidelity image from its compressed measurement. However, existing works are deficient in incoherent compressed measurements at the sensing phase and in implicit measurement representations at the reconstruction phase, limiting the overall performance. In this work, we answer two questions: 1) how to improve the measurement incoherence for decreasing the ill-posedness; 2) how to learn informative representations from measurements. To this end, we propose a novel asymmetric Kronecker CS (AKCS) model and theoretically present its better incoherence than previous Kronecker CS with minimal complexity increase. Moreover, we reveal that the unfolding networks’ superiority over non-unfolding ones results from sufficient gradient descents, called explicit measurement representations. We propose a measurement-aware cross attention (MACA) mechanism to learn implicit measurement representations. We integrate AKCS and MACA into the widely-used unfolding architecture to get a measurement-enhanced unfolding network (MEUNet). Extensive experiments demonstrate that our MEUNet achieves state-of-the-art performance in reconstruction accuracy and inference speed.
zh
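
代码示意:Kronecker 压缩感知的要点是,用 Φ = A ⊗ B 采样图像等价于可分算子 Y = B X Aᵀ(列优先 vec 约定),因此无需显式构造巨型测量矩阵;AKCS 的“不对称”即两侧因子 A ≠ B。下面用 NumPy 验证该等价关系(压缩率为示例值):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
X = rng.standard_normal((n, n))          # image (or image block)
A = rng.standard_normal((n // 4, n))     # row-side factor, 4x compression
B = rng.standard_normal((n // 2, n))     # column-side factor, 2x compression (A != B)

Y = B @ X @ A.T                          # separable Kronecker measurement
# Same measurement via the explicit (and much larger) Kronecker matrix,
# using the column-major vec convention: vec(B X A^T) = (A kron B) vec(X).
y_full = np.kron(A, B) @ X.flatten(order='F')
assert np.allclose(Y.flatten(order='F'), y_full)
```

末行断言成立说明可分算子与显式 Kronecker 矩阵给出相同测量,这也是此类方法计算高效的原因。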

[CV-74] Learning Spatial Decay for Vision Transformers

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中自注意力机制缺乏显式空间归纳偏置(inductive bias)的问题,导致其在处理具有空间结构的任务时性能不佳。现有方法通常采用基于固定距离度量的数据无关空间衰减策略,对所有图像内容施加统一的注意力权重,限制了模型对多样化视觉场景的适应能力。解决方案的关键在于首次成功将数据依赖的空间衰减机制引入二维视觉Transformer,提出Spatial Decay Transformer (SDT),其核心创新是Context-Aware Gating (CAG)机制,能够根据内容相关性和空间邻近性动态生成patch间交互的衰减权重。通过统一的空间-内容融合框架,将曼哈顿距离引导的空间先验与学习到的内容表征相结合,实现了从一维到二维空间衰减的有效迁移,显著提升了ViT在ImageNet-1K分类和生成任务中的表现。

链接: https://arxiv.org/abs/2508.09525
作者: Yuxin Mao,Zhen Qin,Jinxing Zhou,Bin Fan,Jing Zhang,Yiran Zhong,Yuchao Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce the Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
zh
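
代码示意:“数据依赖的空间衰减”可理解为在注意力 logits 上减去“内容预测的衰减率 × 曼哈顿距离”偏置。以下为简化的 PyTorch 示意(按 token 预测标量衰减率;具体参数化与论文的 CAG 模块未必一致):

```python
import torch
import torch.nn as nn

def manhattan_bias(h, w):
    """[h*w, h*w] matrix of Manhattan distances between patch locations."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
    return (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)

class ContentAwareDecayAttention(nn.Module):
    """Self-attention whose spatial decay rate is predicted from each query
    token's content: a sketch of data-dependent decay, not the paper's exact
    CAG parameterization."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.rate = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())  # decay > 0
        self.register_buffer('dist', manhattan_bias(h, w))
        self.scale = dim ** -0.5

    def forward(self, x):                     # x: [B, h*w, dim]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        lam = self.rate(x)                    # [B, N, 1], content-dependent rate
        logits = q @ k.transpose(-2, -1) * self.scale - lam * self.dist
        return logits.softmax(dim=-1) @ v
```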

[CV-75] SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking

【速读】:该论文旨在解决单目标跟踪(Single Object Tracking, SOT)中长期被忽视但至关重要的瓶颈问题——相似物体干扰(Similar Object Interference, SOI)。通过受控的在线干扰掩蔽(Online Interference Masking, OIM)实验,作者定量证明了消除干扰源可显著提升所有主流追踪器的性能(AUC提升达4.35),从而验证SOI是制约鲁棒跟踪的核心因素,并揭示了外部认知引导的可行性。解决方案的关键在于提出SOIBench——首个专为SOI挑战设计的语义认知引导基准,其利用多追踪器集体判断自动挖掘SOI帧并引入多层级标注协议生成精准语义引导文本;进一步地,作者提出一种基于大规模视觉语言模型(Vision-Language Models, VLM)作为外部认知引擎的新范式,该范式可无缝集成至任意RGB追踪器,在语义认知引导下实现显著性能提升(AUC提升达0.93),远超现有视觉-语言追踪(Vision-Language Tracking, VLT)方法的表现。

链接: https://arxiv.org/abs/2508.09524
作者: Yipei Wang,Shiyu Hu,Shukun Jia,Panxi Xu,Hongfei Ma,Yiping Ma,Jing Zhang,Xiaobo Lu,Xin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present the first systematic investigation and quantification of Similar Object Interference (SOI), a long-overlooked yet critical bottleneck in Single Object Tracking (SOT). Through controlled Online Interference Masking (OIM) experiments, we quantitatively demonstrate that eliminating interference sources leads to substantial performance improvements (AUC gains up to 4.35) across all SOTA trackers, directly validating SOI as a primary constraint for robust tracking and highlighting the feasibility of external cognitive guidance. Building upon these insights, we adopt natural language as a practical form of external guidance, and construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges. It automatically mines SOI frames through multi-tracker collective judgment and introduces a multi-level annotation protocol to generate precise semantic guidance texts. Systematic evaluation on SOIBench reveals a striking finding: existing vision-language tracking (VLT) methods fail to effectively exploit semantic cognitive guidance, achieving only marginal improvements or even performance degradation (AUC changes of -0.26 to +0.71). In contrast, we propose a novel paradigm employing large-scale vision-language models (VLM) as external cognitive engines that can be seamlessly integrated into arbitrary RGB trackers. This approach demonstrates substantial improvements under semantic cognitive guidance (AUC gains up to 0.93), representing a significant advancement over existing VLT methods. We hope SOIBench will serve as a standardized evaluation platform to advance semantic cognitive tracking research and contribute new insights to the tracking research community.
zh

[CV-76] Generation of Indian Sign Language Letters, Numbers, and Words

【速读】:该论文旨在解决手语图像生成中分辨率与细节难以兼顾的问题,尤其针对听力障碍群体与听力正常者之间沟通障碍的缓解需求。其核心挑战在于如何生成高分辨率且富含特征的类条件手语图像,以提升跨群体交流的效率与自然性。解决方案的关键在于提出一种融合Progressive Growing GAN(ProGAN)与Self-Attention GAN(SAGAN)优势的改进型生成对抗网络(Generative Adversarial Network, GAN),通过引入注意力机制增强特征表达能力,并在训练过程中逐步提升图像分辨率,从而实现高质量、高细节的手语图像生成。实验表明,该方法在印度手语字母、数字及词汇图像生成上显著优于传统ProGAN,Inception Score(IS)和Fréchet Inception Distance(FID)分别提升3.2和30.12,同时公开了一个包含129个高频词的大规模高质量印度手语数据集,为后续研究提供重要基础。

链接: https://arxiv.org/abs/2508.09522
作者: Ajeet Kumar Yadav,Nishant Kumar,Rathna G N
机构: Indian Institute of Science, Banglore (印度科学理工学院, 班加罗尔); Bengaluru, India (印度班加罗尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 5 figures, 2024 International Conference on Intelligent Algorithms for Computational Intelligence Systems (IACIS)

点击查看摘要

Abstract:Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don’t know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.
zh

[CV-77] CWFBind: Geometry-Awareness for Fast and Accurate Protein-Ligand Docking

【速读】:该论文旨在解决小分子配体与蛋白质靶标结合构象预测中的几何信息缺失问题,现有深度学习方法多依赖图结构表示和语言模型启发的编码器,忽视了关键的空间几何特征,导致结合口袋定位不准和结合构象不现实。解决方案的关键在于引入基于局部曲率(local curvature)的特征描述符,在特征提取阶段增强蛋白质和配体的几何表征,同时在消息传递过程中嵌入度感知加权机制(degree-aware weighting),以提升对空间结构差异和相互作用强度的捕捉能力,并通过配体感知的动态半径策略与改进的损失函数缓解口袋预测中的类别不平衡问题,从而实现高精度且高效的对接性能。

链接: https://arxiv.org/abs/2508.09499
作者: Liyan Jia,Chuan-Xian Ren,Hong Yan
机构: Sun Yat-sen University (中山大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurately predicting the binding conformation of small-molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning-based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model-inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree-aware weighting mechanisms into the message passing process, enhancing the model’s ability to capture spatial structural distinctions and interaction strengths. To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand-aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions and key residues. Comprehensive experimental evaluations demonstrate that CWFBind achieves competitive performance across multiple docking benchmarks, offering a balanced trade-off between accuracy and efficiency.
zh

[CV-78] SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection

【速读】:该论文旨在解决当前扩散模型生成图像检测方法在面对未见过的、分布外(out-of-distribution, OOD)生成模型时性能显著下降的问题,其根源在于现有方法主要依赖于特定生成模型所产生的特征痕迹(artifacts),导致泛化能力不足。解决方案的关键在于提出一种新的表征方式——语义感知重建误差(Semantic-Aware Reconstruction Error, SARE),该方法通过量化图像与其基于文本描述引导重构之间的语义差异来实现检测:真实图像因文本难以完整捕捉其复杂视觉内容,在重构过程中会产生明显的语义偏移;而伪造图像通常与文本高度一致,重构时语义变化较小。SARE利用这一本质差异作为判别特征,从而在多种生成模型上展现出更强的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2508.09487
作者: Ju Yeon Kang,Jaehong Park,Semin Kim,Ji Won Yoon,Nam Soo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.
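
A minimal sketch of the SARE idea, under the assumption that a captioning model, a caption-guided reconstruction model, and a CLIP image encoder are available as callables; the cosine-distance form of the score is our simplification of the paper's semantic reconstruction error.

```python
import torch
import torch.nn.functional as F

def semantic_shift_score(image, caption_model, reconstruct, clip_encode):
    """Toy SARE-style score: 1 - cosine similarity between CLIP embeddings of
    an image and its caption-guided reconstruction. A large shift suggests a
    real image. All three callables are hypothetical interfaces."""
    caption = caption_model(image)            # captioning model (assumed)
    recon = reconstruct(image, caption)       # caption-guided reconstruction (assumed)
    z_img = F.normalize(clip_encode(image), dim=-1)
    z_rec = F.normalize(clip_encode(recon), dim=-1)
    return 1.0 - (z_img * z_rec).sum(-1)      # semantic reconstruction error

# A threshold tuned on validation data would then separate real from fake:
# pred_fake = semantic_shift_score(x, cap, rec, enc) < tau
```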

[CV-79] Episodic Memory Representation for Long-form Video Understanding

【Quick Read】: This paper tackles the performance degradation of Video Large Language Models (Video-LLMs) on long-form videos caused by context-window limits. Existing keyframe-retrieval methods reduce the problem to static text-image matching, ignoring the spatio-temporal relationships and scene transitions, which yields redundant keyframes and dilutes salient cues. The key of the proposed training-free framework, Video-EM, inspired by human episodic memory, is to model keyframes as temporally ordered episodic events that explicitly capture spatial relationships and temporal dynamics, and to combine chain-of-thought (CoT) reasoning with an LLM to iteratively select a minimal yet highly informative subset of episodic memories, enabling efficient and accurate video question answering.

Link: https://arxiv.org/abs/2508.09486
Authors: Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li
Institutions: TeleAI; Tsinghua University; Shanghai Jiao Tong University; Zhejiang University; Peking University; Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 10 pages, 5 figures

Abstract:Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text image matching, overlooking spatio temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain of thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.

[CV-80] SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

【Quick Read】: This paper targets high-accuracy 3D scene reconstruction from sparse-view satellite images, addressing two limitations of existing 3D Gaussian Splatting (3DGS) methods on satellite data: incompatibility with the rational polynomial coefficient (RPC) camera model and poor generalization to multi-temporal sparse satellite images. The key of the solution, SkySplat, is a self-supervised framework that integrates the RPC model into a generalizable 3DGS pipeline so that sparse geometric cues are used more effectively; it further introduces a Cross-Self Consistency Module (CSCM) that suppresses transient objects via consistency-based masking, and a multi-view consistency aggregation strategy that refines the reconstruction. The method relies only on RGB images and radiometric-robust relative height supervision, without ground-truth height maps, substantially improving both efficiency and accuracy.

Link: https://arxiv.org/abs/2508.09479
Authors: Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, Yongjun Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for its high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset significantly, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.

[CV-81] GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs

【Quick Read】: This paper addresses the bottleneck that class imbalance imposes on long-tailed disease classification, especially the low recognition accuracy on rare disease categories. The key of the solution, GazeLT, is to exploit the temporal dynamics of radiologists' eye gaze through an integration-disintegration mechanism that models the visual attention process: it captures how expert attention shifts between subtle and salient findings during image interpretation, and explicitly incorporates incidental findings, many of which constitute long-tailed classes, thereby improving discrimination of rare diseases. Experiments show that the method clearly outperforms existing long-tailed losses and visual-attention baselines on the public NIH-CXR-LT and MIMIC-CXR-LT datasets.

Link: https://arxiv.org/abs/2508.09478
Authors: Moinak Bhattacharya, Gagandeep Singh, Shubham Jain, Prateek Prasanna
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist’s eye gaze has distinct patterns that capture both fine-grained and coarser level disease related information. While interpreting an image, a radiologist’s attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (few of these constituting long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at this https URL.

[CV-82] CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection

【Quick Read】: This paper addresses the limited generalization of existing AI-generated image (AII) detectors to unseen generative models: conventional classification-based detectors struggle with AIIs produced by generators not seen during training. The key of the solution is to approach detection from an anomaly-detection perspective and design a universal detector that never needs access to any AIIs during training: features are extracted with a pre-trained CLIP encoder and fed into a normalizing-flow-like unsupervised model; proxy images (e.g., natural images with a spectral modification applied) substitute for AIIs during training, and the model learns a representation that is highly discriminative for AIIs by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images.

Link: https://arxiv.org/abs/2508.09477
Authors: Zhipeng Yuan, Kai Wang, Weize Quan, Dong-Ming Yan, Tieru Wu
Institutions: Jilin University; GIPSA-lab, Univ. Grenoble Alpes, CNRS, Grenoble INP; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:With the rapid advancement of AI generative models, the visual quality of AI-generated images (AIIs) has become increasingly close to natural images, which inevitably raises security concerns. Most AII detectors often employ the conventional image classification pipeline with natural images and AIIs (generated by a generative model), which can result in limited detection performance for AIIs from unseen generative models. To solve this, we proposed a universal AI-generated image detector from the perspective of anomaly detection. Our discriminator does not need to access any AIIs and learn a generalizable representation with unsupervised learning. Specifically, we use the pre-trained CLIP encoder as the feature extractor and design a normalizing flow-like unsupervised model. Instead of AIIs, proxy images, e.g., obtained by applying a spectral modification operation on natural images, are used for training. Our models are trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Extensive experiments demonstrate the effectiveness of our method on AIIs produced by various image generators.
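
The training objective reads directly off the abstract: reduce the likelihood the flow assigns to proxy images, and optionally raise it for natural images. Below is a hedged sketch assuming a density model that exposes a log_prob method over frozen CLIP features; the lam weight is a hypothetical balancing term.

```python
import torch

def clip_flow_loss(flow, clip_feats_proxy, clip_feats_natural=None, lam=1.0):
    """Objective sketch: gradient descent on this loss pushes proxy images
    toward low likelihood under `flow` and, optionally, pulls natural images
    toward high likelihood. `flow.log_prob` is an assumed interface."""
    loss = flow.log_prob(clip_feats_proxy).mean()      # minimize proxy likelihood
    if clip_feats_natural is not None:
        loss = loss - lam * flow.log_prob(clip_feats_natural).mean()
    return loss

# At test time, a low log-likelihood (anomaly score) would flag an AI-generated image.
```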

[CV-83] From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

【Quick Read】: This paper tackles identity preservation under large facial angles in current video generation models, whose core challenges are integrating identity features into the DiT structure effectively and the lack of large-facial-angle coverage in open-source video datasets. The key of the solution is twofold. First, a Mixture of Facial Experts (MoFE) dynamically fuses complementary cues from three specialized experts: an identity expert capturing cross-pose identity-sensitive features, a semantic expert extracting high-level visual semantics, and a detail expert preserving pixel-level features such as skin texture and color gradients. Second, a data-processing pipeline centered on Face Constraints and Identity Consistency improves facial-angle diversity and identity stability over time, mitigating data scarcity; the resulting Large Face Angles (LFA) dataset contains 460K video clips annotated with facial angles. Experiments on the LFA benchmark show that the method significantly outperforms prior state-of-the-art approaches in face similarity, face FID, and CLIP semantic alignment.

Link: https://arxiv.org/abs/2508.09476
Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Institutions: Tencent YouTu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at this https URL.
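
A toy version of the expert-fusion step: a gating network produces per-sample weights over the three expert features and returns their weighted sum. The gate design and feature names are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FacialExpertFusion(nn.Module):
    """Toy Mixture-of-Facial-Experts fusion: a gating MLP weights the identity,
    semantic, and detail expert features per sample (layout assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))

    def forward(self, f_id, f_sem, f_detail):
        experts = torch.stack([f_id, f_sem, f_detail], dim=1)   # (B, 3, D)
        w = self.gate(experts.flatten(1))                        # (B, 3) weights
        return (w.unsqueeze(-1) * experts).sum(dim=1)            # weighted sum

fuse = FacialExpertFusion(dim=64)
out = fuse(torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 64])
```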

[CV-84] Leverag ing Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection

【Quick Read】: This paper addresses the poor few-shot generalization of real-world deepfake detection: detectors degrade on fakes from unknown generative models, even though such samples are often available in practice, and traditional methods that rely on large-scale training with known data adapt poorly to newly emerging forgery types. The key of the solution is the training-free Few-shot Training-free Network (FTNet), which needs only a single fake sample from the evaluation set: at inference, a test sample is compared against the known real and fake samples and classified by the category of its nearest neighbor, exploiting the few available samples to markedly improve detection. This departs from the conventional training paradigm and directly targets real-world settings where samples are scarce and retraining is impractical.

Link: https://arxiv.org/abs/2508.09475
Authors: Shibo Yao, Renshuai Tao, Xiaolong Zheng, Chao Liang, Chunjie Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent deepfake detection studies often treat unseen sample detection as a "zero-shot" task, training on images generated by known models but generalizing to unknown ones. A key real-world challenge arises when a model performs poorly on unknown samples, yet these samples remain available for analysis. This highlights that it should be approached as a "few-shot" task, where effectively utilizing a small number of samples can lead to significant improvement. Unlike typical few-shot tasks focused on semantic understanding, deepfake detection prioritizes image realism, which closely mirrors real-world distributions. In this work, we propose the Few-shot Training-free Network (FTNet) for real-world few-shot deepfake detection. Simple yet effective, FTNet differs from traditional methods that rely on large-scale known data for training. Instead, FTNet uses only one fake sample from an evaluation set, mimicking the scenario where new samples emerge in the real world and can be gathered for use, without any training or parameter updates. During evaluation, each test sample is compared to the known fake and real samples, and it is classified based on the category of the nearest sample. We conduct a comprehensive analysis of AI-generated images from 29 different generative models and achieve a new SoTA performance, with an average improvement of 8.7% compared to existing methods. This work introduces a fresh perspective on real-world deepfake detection: when the model struggles to generalize on a few-shot sample, leveraging the failed samples leads to better performance.
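
The training-free nearest-neighbor rule from the abstract is simple enough to sketch directly; cosine similarity over feature embeddings is an assumed choice of metric.

```python
import torch
import torch.nn.functional as F

def ftnet_predict(test_feat, real_feats, fake_feats):
    """Training-free nearest-neighbor rule: compare a test embedding against
    known real/fake embeddings and return the label of the closest sample.
    Returns 0 for real, 1 for fake. Cosine similarity is an assumption."""
    t = F.normalize(test_feat, dim=-1)
    sim_real = (F.normalize(real_feats, dim=-1) @ t).max()
    sim_fake = (F.normalize(fake_feats, dim=-1) @ t).max()
    return int(sim_fake > sim_real)

# One fake exemplar suffices, mirroring the paper's few-shot setting:
pred = ftnet_predict(torch.randn(128), torch.randn(10, 128), torch.randn(1, 128))
```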

[CV-85] CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

【Quick Read】: This paper addresses the limited generalization of city-scale point-cloud semantic segmentation caused by the small scale of available 3D data and the domain gap between datasets. The key of the solution is CitySeg, a foundation model that incorporates the text modality to enable open-vocabulary segmentation and zero-shot inference. Concretely, customized data preprocessing rules mitigate non-uniform distributions across domains; a local-global cross-attention network strengthens perception in UAV scenarios; a hierarchical classification strategy built from annotation rules unifies semantic labels across datasets; and a two-stage training strategy with a hinge loss increases the feature separability of subcategories. CitySeg achieves state-of-the-art performance on nine closed-set benchmarks and, for the first time, zero-shot generalization on city-scale point clouds without relying on visual information.

Link: https://arxiv.org/abs/2508.09470
Authors: Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun
Institutions: Huawei Technologies Co., Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.

[CV-86] Event-driven Robust Fitting on Neuromorphic Hardware ICCV2025

【Quick Read】: This paper addresses the high energy consumption of traditional robust geometric model fitting, whose random-sampling heuristics and optimization algorithms are an efficiency bottleneck for sustainable AI deployment in computer vision. The key of the solution is to adopt the neuromorphic computing paradigm: a novel spiking neural network is designed for the real neuromorphic hardware Intel Loihi 2, with event-driven formulations of model estimation that map robust fitting onto Loihi 2's unique architecture, together with algorithmic strategies that alleviate the hardware's currently limited precision and instruction set. The result is robust fitting at comparable accuracy while consuming only 15% of the energy required on a standard CPU.

Link: https://arxiv.org/abs/2508.09466
Authors: Tam Ngoc-Bang Nguyen, Anh-Dzung Doan, Zhipeng Cai, Tat-Jun Chin
Institutions: Australian Institute for Machine Learning; The University of Adelaide; Intel Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 11 pages, accepted in ICCV 2025 Workshop on Neuromorphic Vision (NeVI)

Abstract:Robust fitting of geometric models is a fundamental task in many computer vision pipelines. Numerous innovations have been produced on the topic, from improving the efficiency and accuracy of random sampling heuristics to generating novel theoretical insights that underpin new approaches with mathematical guarantees. However, one aspect of robust fitting that has received little attention is energy efficiency. This performance metric has become critical as high energy consumption is a growing concern for AI adoption. In this paper, we explore energy-efficient robust fitting via the neuromorphic computing paradigm. Specifically, we designed a novel spiking neural network for robust fitting on real neuromorphic hardware, the Intel Loihi 2. Enabling this are novel event-driven formulations of model estimation that allow robust fitting to be implemented in the unique architecture of Loihi 2, and algorithmic strategies to alleviate the current limited precision and instruction set of the hardware. Results show that our neuromorphic robust fitting consumes only a fraction (15%) of the energy required to run the established robust fitting algorithm on a standard CPU to equivalent accuracy.

[CV-87] Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

【Quick Read】: This paper addresses the shortcomings of existing personalized 2D avatar generation methods in capturing fine-grained facial expressions and preserving identity across different expressions. The key of the proposed GEN-AFFECT framework is to condition a multimodal diffusion transformer on an extracted identity-expression representation, unifying expression diversity with identity consistency; in addition, consistent attention at inference shares information across the set of generated expressions, so the target identity is maintained over an array of fine-grained facial expressions.

Link: https://arxiv.org/abs/2508.09461
Authors: Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
Institutions: Boston University; Georgetown University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

[CV-88] RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

【Quick Read】: This paper addresses two limitations of existing visual manipulation localization (VML) methods: poor cross-modal generalization and inefficiency on high-resolution images or long videos. The key of the solution is RelayFormer, a unified and modular architecture whose flexible local units and Global-Local Relay Attention (GLoRA) mechanism enable scalable, resolution-agnostic processing; it integrates with Transformer backbones such as ViT and SegFormer via lightweight adaptation modules requiring only minimal architectural changes, and a lightweight query-based mask decoder performs one-shot inference over video sequences with linear complexity, improving both localization accuracy and efficiency.

Link: https://arxiv.org/abs/2508.09459
Authors: Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual manipulation localization (VML) – across both images and videos – is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: this https URL.

[CV-89] Animate-X: Universal Character Image Animation with Dynamic Backgrounds

【Quick Read】: This paper addresses two problems: existing image-animation methods are designed mainly for human figures and generalize poorly to the anthropomorphic characters widely used in gaming and entertainment; and they produce videos with static backgrounds, limiting realism. For the first, the proposed Animate-X++ framework introduces a Pose Indicator that enhances motion representation both implicitly (using CLIP visual features of the driving video to capture overall movement patterns and temporal relations among motions) and explicitly (simulating in advance the inputs that may arise at inference), improving adaptability to different character types. For the second, a multi-task training strategy jointly optimizes character animation and text-driven background generation (the TI2V task), combined with partial parameter training, enabling dynamic backgrounds without sacrificing animation quality and clearly improving realism.

Link: https://arxiv.org/abs/2508.09454
Authors: Shuai Tan, Biao Gong, Zhuoxin Liu, Yan Wang, Xi Chen, Yifan Feng, Hengshuang Zhao
Institutions: The University of Hong Kong; Ant Group; The University of Wisconsin-Madison; University of North Carolina at Chapel Hill; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures, which usually do not generalize well on anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis suggests to attribute this limitation to their insufficient modeling of motion, which is unable to comprehend the movement pattern of the driving video, thus imposing a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion pattern from the driving video through both implicit and explicit manner. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating possible inputs in advance that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks. Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.

[CV-90] HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss

【Quick Read】: This paper addresses the difficulty of applying pretrained foundation models directly to hyperspectral remote sensing, where large spectral disparities and scarce observations degrade performance. The key of the solution, HyperKD, is an inverse form of knowledge distillation for cross-spectral transfer: a structurally simpler teacher (Prithvi) guides a student tailored to EnMAP hyperspectral imagery, rather than the usual complex-teacher/simple-student setup. Feature-based strategies, namely spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss tailored to hyperspectral images, bridge the spectral domain gap, clearly improving reconstruction fidelity and robustness on downstream tasks such as land-cover classification, crop-type identification, and soil organic carbon prediction.

Link: https://arxiv.org/abs/2508.09453
Authors: Abdul Matin, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara
Institutions: Colorado State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The proliferation of foundation models, pretrained on large-scale unlabeled datasets, has emerged as an effective approach in creating adaptable and reusable architectures that can be leveraged for various downstream tasks using satellite observations. However, their direct application to hyperspectral remote sensing remains challenging due to inherent spectral disparities and the scarcity of available observations. In this work, we present HyperKD, a novel knowledge distillation framework that enables transferring learned representations from a teacher model into a student model for effective development of a foundation model on hyperspectral images. Unlike typical knowledge distillation frameworks, which use a complex teacher to guide a simpler student, HyperKD enables an inverse form of knowledge transfer across different types of spectral data, guided by a simpler teacher model. Building upon a Masked Autoencoder, HyperKD distills knowledge from the Prithvi foundational model into a student tailored for EnMAP hyperspectral imagery. HyperKD addresses the inverse domain adaptation problem with spectral gaps by introducing a feature-based strategy that includes spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function tailored for hyperspectral images. HyperKD bridges the substantial spectral domain gap, enabling the effective use of pretrained foundation models for geospatial applications. Extensive experiments show that HyperKD significantly improves representation learning in MAEs, leading to enhanced reconstruction fidelity and more robust performance on downstream tasks such as land cover classification, crop type identification, and soil organic carbon prediction, underpinning the potential of knowledge distillation frameworks in remote sensing analytics with hyperspectral imagery.
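
A hedged sketch of two pieces described above: aligning hyperspectral channels to a teacher's bands by nearest center wavelength, and a feature-based distillation loss on normalized features. The averaging rule and the example wavelengths are illustrative assumptions, not the paper's exact alignment scheme.

```python
import torch
import torch.nn.functional as F

def align_bands(hsi, centers_nm, teacher_nm):
    """Map hyperspectral channels to teacher channels by nearest wavelength,
    averaging all source bands assigned to each teacher band (rule assumed)."""
    assign = (centers_nm[:, None] - teacher_nm[None, :]).abs().argmin(dim=1)
    return torch.stack(
        [hsi[:, assign == j].mean(dim=1) for j in range(len(teacher_nm))], dim=1)

def distill_loss(student_feat, teacher_feat):
    """Feature-based distillation: match normalized student/teacher features."""
    return F.mse_loss(F.normalize(student_feat, dim=1),
                      F.normalize(teacher_feat, dim=1))

x = torch.randn(2, 224, 64, 64)  # a 224-band EnMAP-like cube (toy data)
teacher_bands = torch.tensor([490., 560., 665., 842., 1610., 2190.])
aligned = align_bands(x, torch.linspace(420, 2450, 224), teacher_bands)
print(aligned.shape)  # torch.Size([2, 6, 64, 64])
```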

[CV-91] RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration

【Quick Read】: This paper aims to remove a key limitation of existing reference-based super-resolution (RefSR) methods, their reliance on manually curated target-reference image pairs, thereby improving practicality in real-world scenarios. The key of the solution is a new Retrieval-Augmented Super Resolution (RASR) paradigm that, given only a low-quality input, automatically retrieves semantically relevant high-resolution images from a reference database, enabling scalable and flexible RefSR without manual pairing. To support research in this direction, the authors build RASR-Flickr30, the first benchmark dataset for RASR, and propose RASRNet, a strong baseline that combines a semantic retrieval module with a diffusion-based generator: references are retrieved by semantic similarity and the generator, enhanced with semantic conditioning, reconstructs more realistic textures.

Link: https://arxiv.org/abs/2508.09449
Authors: Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jie Tang, Gangshan Wu, Jie Liu
Institutions: TeleAI; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
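
The retrieval stage reduces to ranking a reference database by embedding similarity to the low-quality input. A minimal sketch, assuming CLIP-style embeddings have been precomputed for the database:

```python
import torch
import torch.nn.functional as F

def retrieve_references(lq_embed, db_embeds, k=3):
    """Rank a reference database by cosine similarity to the low-quality
    input's embedding and return the top-k indices; the retrieved images would
    then condition the diffusion-based generator. Encoder choice is assumed."""
    sims = F.normalize(db_embeds, dim=-1) @ F.normalize(lq_embed, dim=-1)
    return sims.topk(k).indices

idx = retrieve_references(torch.randn(512), torch.randn(1000, 512))
print(idx.shape)  # torch.Size([3])
```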

[CV-92] MPT: Motion Prompt Tuning for Micro-Expression Recognition

【Quick Read】: This paper addresses the scarcity of annotated data that constrains micro-expression recognition (MER) training, while also overcoming the inability of large pre-trained models (LMs) to capture transitory and subtle facial movements. The key of the solution, Motion Prompt Tuning (MPT), is twofold: motion prompts are generated via motion magnification and Gaussian tokenization to extract and amplify the subtle dynamics of micro-expressions, and a carefully designed group adapter is inserted into the pre-trained model to improve domain adaptation and fine-grained discrimination for MER, enabling efficient transfer of large models to this task.

Link: https://arxiv.org/abs/2508.09446
Authors: Jiateng Liu, Hengcan Shi, Feng Chen, Zhiwen Shao, Yaonan Wang, Jianfei Cai, Wenming Zheng
Institutions: Southeast University; Hunan University; University of Adelaide; China University of Mining and Technology; Monash University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Micro-expression recognition (MER) is crucial in the affective computing field due to its wide application in medical diagnosis, lie detection, and criminal investigation. Despite its significance, obtaining micro-expression (ME) annotations is challenging due to the expertise required from psychological professionals. Consequently, ME datasets often suffer from a scarcity of training samples, severely constraining the learning of MER models. While current large pre-training models (LMs) offer general and discriminative representations, their direct application to MER is hindered by an inability to capture transitory and subtle facial movements-essential elements for effective MER. This paper introduces Motion Prompt Tuning (MPT) as a novel approach to adapting LMs for MER, representing a pioneering method for subtle motion prompt tuning. Particularly, we introduce motion prompt generation, including motion magnification and Gaussian tokenization, to extract subtle motions as prompts for LMs. Additionally, a group adapter is carefully designed and inserted into the LM to enhance it in the target MER domain, facilitating a more nuanced distinction of ME representation. Furthermore, extensive experiments conducted on three widely used MER datasets demonstrate that our proposed MPT consistently surpasses state-of-the-art approaches and verifies its effectiveness.

[CV-93] DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

【Quick Read】: This paper addresses two core problems of the traditional two-stage paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE): global sub-optimality caused by proxy objectives in each stage, and a performance bottleneck from strong reliance on the quality of first-stage waypoint predictions. The key of the solution, DAgger Diffusion Navigation (DifNav), is to unify the two stages (waypoint generation and planning) into a single end-to-end optimized conditional diffusion policy that directly models the multi-modal distribution over future actions in continuous navigation space, removing the need for an explicit waypoint predictor while capturing multiple instruction-consistent behaviors. Combined with DAgger-based online training and expert trajectory augmentation, it mitigates compounding errors in imitation learning and strengthens spatial reasoning and error recovery in long-horizon navigation tasks.

Link: https://arxiv.org/abs/2508.09444
Authors: Haoxiang Shi, Xiang Deng, Zaijing Li, Gongwei Chen, Yaowei Wang, Liqiang Nie
Institutions: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy’s robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: this https URL.
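
The DAgger-style online training described above can be summarized in a short loop: roll out the current policy, have the expert relabel every visited state, aggregate the data, retrain. All objects below (policy, expert, env, buffer) are hypothetical interfaces, not the paper's API.

```python
def dagger_train(policy, expert, env, buffer, iters=10, episodes=5):
    """DAgger loop sketch: states visited by the learner's own rollouts are
    labeled with expert actions, so the training distribution matches the
    states the policy actually reaches, curbing compounding error."""
    for _ in range(iters):
        for _ in range(episodes):
            obs, done = env.reset(), False
            while not done:
                action = policy.sample(obs)        # multi-modal diffusion policy
                buffer.add(obs, expert.act(obs))   # expert relabels this state
                obs, done = env.step(action)
        policy.fit(buffer)                         # retrain on aggregated data
```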

[CV-94] What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset

【Quick Read】: This paper addresses the inability of current vision methods to jointly model action semantics and spatial contextualization, that is, both what action is occurring and where it is happening. The key of the solution is a new vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions, together with the PaIR-Net framework and its three core components: a Contact Prior Aware Module (CPAM) that identifies contact-relevant body parts, a Prior-Guided Concat Segmenter (PGCS) for precise pixel-wise contact segmentation, and an Interaction Inference Module (IIM) that integrates global interaction relationships, enabling more comprehensive modeling of human-object and interpersonal interactions.

Link: https://arxiv.org/abs/2508.09428
Authors: Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider **what** action is occurring and **where** it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

[CV-95] Distilling LLM Prior to Flow Model for Generalizable Agents Imagination in Object Goal Navigation

【Quick Read】: For Object Goal Navigation (ObjectNav), where an agent must locate a specified object in an unseen environment, existing methods build semantic maps with deterministic, discriminative models, ignoring the inherent uncertainty of indoor layouts and limiting generalization to unseen environments. The key of the solution is GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into the target maps, distilling rich contextual knowledge into the flow model and enabling more robust, generalizable semantic completion.

Link: https://arxiv.org/abs/2508.09423
Authors: Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at this https URL.
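
Encoding LLM-suggested object locations as 2D Gaussian fields is easy to sketch; the per-location sigma and the max-fusion of multiple priors are assumed choices, not the paper's exact parameterization.

```python
import torch

def gaussian_prior_field(h, w, centers, sigma=8.0):
    """Encode LLM-inferred object locations as a 2D Gaussian field (sketch).
    centers: (K, 2) row/col coordinates suggested by the LLM (assumption)."""
    ys = torch.arange(h).float()[:, None, None]   # (H, 1, 1)
    xs = torch.arange(w).float()[None, :, None]   # (1, W, 1)
    cy, cx = centers[:, 0], centers[:, 1]
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2          # (H, W, K) squared distances
    return torch.exp(-d2 / (2 * sigma ** 2)).amax(dim=-1)  # fuse K priors

field = gaussian_prior_field(64, 64, torch.tensor([[20., 30.], [50., 10.]]))
print(field.shape)  # torch.Size([64, 64])
```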

[CV-96] RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata ICCV’25

【Quick Read】: This paper addresses the difficulty of robustly detecting curb ramps, critical urban accessibility infrastructure, in images, a problem rooted in the lack of large-scale, high-quality annotated datasets. The key of the solution is RampNet, a two-stage pipeline: Stage 1 auto-translates government-provided curb ramp location data into pixel coordinates in Google Street View (GSV) panoramas, producing a dataset of more than 210,000 annotated GSV images; Stage 2 trains a detection model (a modified ConvNeXt V2) on this generated dataset. The generated dataset reaches 94.0% precision and 92.5% recall, and the detection model reaches 0.9236 AP, far exceeding prior work, contributing the first large-scale, high-quality curb ramp detection dataset, benchmark, and model.

Link: https://arxiv.org/abs/2508.09415
Authors: John S. O'Meara, Jared Hwang, Zeyu Wang, Michael Saugstad, Jon E. Froehlich
Institutions: Issaquah High School; University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the ICCV'25 Workshop on Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities

Abstract:Curb ramps are critical for urban accessibility, but robustly detecting them in images remains an open problem due to the lack of large-scale, high-quality datasets. While prior work has attempted to improve data availability with crowdsourced or manually labeled data, these efforts often fall short in either quality or scale. In this paper, we introduce and evaluate a two-stage pipeline called RampNet to scale curb ramp detection datasets and improve model performance. In Stage 1, we generate a dataset of more than 210,000 annotated Google Street View (GSV) panoramas by auto-translating government-provided curb ramp location data to pixel coordinates in panoramic images. In Stage 2, we train a curb ramp detection model (modified ConvNeXt V2) from the generated dataset, achieving state-of-the-art performance. To evaluate both stages of our pipeline, we compare to manually labeled panoramas. Our generated dataset achieves 94.0% precision and 92.5% recall, and our detection model reaches 0.9236 AP – far exceeding prior work. Our work contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model.
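
Stage 1 hinges on projecting a geographic curb-ramp location into panorama pixel coordinates. A simplified sketch for an equirectangular GSV panorama: the bearing computation is standard great-circle math, while the fixed ground-level row used for the y coordinate is a deliberate simplification of the real pipeline.

```python
import math

def geo_to_pano_pixel(cam_lat, cam_lon, heading_deg, pt_lat, pt_lon, pano_w, pano_h):
    """Project a curb-ramp lat/lon into an equirectangular panorama (sketch)."""
    phi1, phi2 = math.radians(cam_lat), math.radians(pt_lat)
    dlon = math.radians(pt_lon - cam_lon)
    bearing = math.degrees(math.atan2(
        math.sin(dlon) * math.cos(phi2),
        math.cos(phi1) * math.sin(phi2)
        - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)))
    x = ((bearing - heading_deg) % 360.0) / 360.0 * pano_w  # yaw -> column
    y = pano_h * 0.62                                       # assumed ground-level row
    return int(x), int(y)

print(geo_to_pano_pixel(47.61, -122.33, 90.0, 47.6101, -122.3299, 8192, 4096))
```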

[CV-97] Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

【Quick Read】: This paper addresses the lack of high-quality, temporally coherent multi-person 3D motion datasets with explicit interaction semantics for autonomous driving, which limits fine-grained understanding of pedestrian interactions in dynamic urban environments. Existing datasets mostly estimate 3D poses from monocular RGB frames and suffer from occlusion and temporal discontinuity, yielding unrealistic, low-quality human motion. The key of the solution is to use 3D human body shape and motion priors to improve the quality of pose sequences extracted from raw LiDAR point clouds, producing Waymo-3DSkelMo, the first large-scale, high-fidelity 3D skeletal motion dataset with interaction semantics, covering over 14,000 seconds across more than 800 real driving scenarios and providing a foundational resource for modeling human behavior in complex urban environments.

Link: https://arxiv.org/abs/2508.09404
Authors: Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
Institutions: University of Glasgow; Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ACM Multimedia 2025 (Dataset Track) Paper

Abstract:Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at this https URL

[CV-98] Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring

【Quick Read】: This paper addresses low-cost, privacy-preserving, cloud-free autonomous bird monitoring in urban settings to support citizen-science-grade biodiversity logging. The key of the solution is an end-to-end, locally deployed system: a motion-triggered IP camera uploads short clips via FTP to a local server; frames are sampled and birds localized with a Detectron2 model, and the cropped regions are classified by an EfficientNet-B3 classifier fine-tuned on a 40-species Belgian subset; the whole pipeline runs on commodity hardware without a discrete GPU, while 30 mm feeder entry ports exclude larger birds such as pigeons and reduce nuisance triggers. The classifier reaches about 99.5% validation accuracy on the curated subset and about 88% top-1 accuracy on held-out species, demonstrating a feasible, practical home deployment.

Link: https://arxiv.org/abs/2508.09398
Authors: El Mustapha Mansouri
Institutions: Université Libre de Bruxelles
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 8 pages, 5 figures, 1 table; IEEEtran conference format. Code: this https URL

Abstract:This paper presents a low cost, on premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.
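
The detect-then-classify pipeline is straightforward to express as glue code. The detector and classifier interfaces below are assumptions, not the author's exact setup; only the 300x300 EfficientNet-B3 input size is a known convention.

```python
import torch

def classify_birds(frame, detector, classifier, species, conf=0.5):
    """Run a detector over a CHW frame, crop each bird box, classify each crop.
    `detector` is assumed to yield ((x1, y1, x2, y2), score) pairs."""
    results = []
    for (x1, y1, x2, y2), score in detector(frame):
        if score < conf:
            continue
        crop = frame[:, int(y1):int(y2), int(x1):int(x2)]
        crop = torch.nn.functional.interpolate(
            crop[None].float(), size=(300, 300))  # EfficientNet-B3 input size
        logits = classifier(crop)
        results.append(species[int(logits.argmax())])
    return results
```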

[CV-99] Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety

【Quick Read】: This paper addresses the difficulty drones face in perceiving submillimeter-thin obstacles such as steel wires and kite strings, which are notoriously hard for conventional sensors (RGB cameras, LiDAR, depth cameras) to detect. The key of the solution, SkyShield, is an event-driven, end-to-end perception framework that exploits the distinctive signatures thin obstacles leave in the event stream, combining a lightweight U-Net architecture with an innovative Dice-Contour Regularization Loss to achieve precise detection at a low latency of 21.2 ms, making it suitable for deployment on edge and mobile platforms.

Link: https://arxiv.org/abs/2508.09397
Authors: Zhengli Zhang, Xinyu Luo, Yuchen Sun, Wenhua Ding, Dongyu Huang, Xinlei Chen
Institutions: Shenzhen International Graduate School, Tsinghua University; The Chinese University of Hong Kong, Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Drones operating in complex environments face a significant threat from thin obstacles, such as steel wires and kite strings at the submillimeter level, which are notoriously difficult for conventional sensors like RGB cameras, LiDAR, and depth cameras to detect. This paper introduces SkyShield, an event-driven, end-to-end framework designed for the perception of submillimeter scale obstacles. Drawing upon the unique features that thin obstacles present in the event stream, our method employs a lightweight U-Net architecture and an innovative Dice-Contour Regularization Loss to ensure precise detection. Experimental results demonstrate that our event-based approach achieves mean F1 Score of 0.7088 with a low latency of 21.2 ms, making it ideal for deployment on edge and mobile platforms.
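
The loss can be sketched as a standard soft Dice term plus a contour regularizer; the gradient-based perimeter term below is our assumed stand-in for the paper's Dice-Contour Regularization Loss, not its published form.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for thin-obstacle masks; pred is a sigmoid probability map."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def dice_contour_loss(pred, target, lam=0.1):
    """Dice plus an assumed contour term: penalize total mask perimeter,
    approximated by absolute spatial gradients of the prediction."""
    dy = (pred[:, 1:, :] - pred[:, :-1, :]).abs().mean()
    dx = (pred[:, :, 1:] - pred[:, :, :-1]).abs().mean()
    return dice_loss(pred, target) + lam * (dx + dy)

loss = dice_contour_loss(torch.rand(1, 64, 64), (torch.rand(1, 64, 64) > 0.9).float())
```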

[CV-100] DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

【Quick Read】: This paper addresses the pervasive coherent-noise interference in synthetic aperture radar (SAR) object detection. Existing methods typically achieve implicit denoising by analyzing or enhancing the spatial-domain characteristics of objects, with limited effect. The key innovation of DenoDet V2 is a completely different perspective: deconstructing and modulating features in the transform domain with a carefully designed attention architecture, whose band-wise mutual modulation mechanism exploits the complementarity of amplitude and phase spectra so that the two reciprocally enhance each other. Compared with DenoDet V1, DenoDet V2 improves performance by 0.8% on the SARDet-100K dataset while halving model complexity, a clear gain in both efficiency and accuracy.

Link: https://arxiv.org/abs/2508.09392
Authors: Kang Ni, Minrui Zou, Yuxuan Li, Xiang Li, Kehua Guo, Ming-Ming Cheng, Yimian Dai
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at this https URL.
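
A hedged sketch of transform-domain cross-modulation: split features into amplitude and phase with an FFT, let each spectrum gate the other, and transform back. The gating functions and their placement are assumptions; the paper's band-wise mechanism is more elaborate.

```python
import torch

def phase_amplitude_cross_modulate(feat, gate_a, gate_p):
    """Toy cross-modulation in the Fourier domain. gate_a / gate_p are small
    learned networks (hypothetical) mapping one spectrum to a modulation of
    the other; feat is a real (B, C, H, W) feature map."""
    spec = torch.fft.fft2(feat)
    amp, phase = spec.abs(), spec.angle()
    amp_mod = amp * torch.sigmoid(gate_a(phase))   # phase modulates amplitude
    phase_mod = phase + torch.tanh(gate_p(amp))    # amplitude modulates phase
    out = torch.polar(amp_mod, phase_mod)          # recombine into a complex spectrum
    return torch.fft.ifft2(out).real
```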

[CV-101] X-UniMotion: Animating Human Images with Expressive Unified and Identity-Agnostic Motion Latents

【Quick Read】: This paper addresses the lack of a unified, high-fidelity, disentangled representation for cross-identity human motion transfer; conventional methods rely on explicit skeletal poses and heuristic cross-identity adjustments, making detailed, identity-consistent animation difficult. The key of the solution, X-UniMotion, is a unified implicit latent representation that encodes multi-granular motion, covering facial expression, body pose, and both hands, directly from a single image into four disentangled latent tokens that are highly expressive and identity-agnostic. A self-supervised end-to-end pipeline jointly trains the motion encoder and latent representation alongside a DiT-based video generative model on large-scale, diverse motion data; 2D spatial and color augmentations, synthetic 3D renderings of cross-identity subject pairs under shared poses, and auxiliary decoders enforce motion-identity disentanglement and fine-grained, semantically aligned, depth-aware embeddings, enabling high-fidelity cross-identity motion transfer across diverse identities, poses, and spatial configurations.

Link: https://arxiv.org/abs/2508.09383
Authors: Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Linjie Luo
Institutions: ByteDance USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens – one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion–identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.

[CV-102] What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation? MICCAI

【Quick Read】: This paper addresses annotation inconsistency in medical image segmentation caused by annotator differences in experience, preferences, and tools, with particular attention to disagreement on lesions with ambiguous boundaries (e.g., spiculated or infiltrative nodules), which are often associated with malignancy. The key of the solution is threefold: curating IMA++, the largest multi-annotator skin lesion segmentation dataset, and quantitatively establishing a statistically significant association (p < 0.001) between inter-annotator agreement (IAA) and lesion malignancy; showing that IAA can be predicted directly from dermoscopic images with a mean absolute error of 0.108; and using IAA as a "soft" clinical feature in a multi-task learning objective, yielding a 4.2% average improvement in balanced accuracy across multiple model architectures and across IMA++ and four public dermoscopic datasets, thereby mitigating the impact of annotation uncertainty on segmentation performance.

Link: https://arxiv.org/abs/2508.09381
Authors: Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh
Institutions: University of British Columbia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Medical Image Computing and Computer-Assisted Intervention (MICCAI) ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2025; 12 pages, 4 tables, 3 figures

Abstract:Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p < 0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a "soft" clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at this https URL.

[CV-103] A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition ICCV

【Quick Read】: This paper addresses two core challenges in continuous sign language recognition (CSLR): signer-independent (SI) recognition and generalization to unseen sentences (US). For SI, the authors propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints; for US, they design a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, improving the model's grasp of novel grammatical compositions. On the challenging Isharah-1000 dataset, the Conformer reduces the word error rate (WER) on the SI task to 13.07%, a 13.53% improvement over the state of the art, while the Transformer reaches 47.78% WER on the US task, surpassing prior work; the team placed 2nd (US) and 4th (SI) in the SignEval 2025 CSLR challenge, validating the effectiveness of task-specific network design.

Link: https://arxiv.org/abs/2508.09372
Authors: Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
Institutions: University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, Hawaii, USA. 1st MSLR Workshop 2025

Abstract:Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. For overcoming these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures both fine-grained posture dynamics, enabling the model’s ability to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the performance of these models. The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: this https URL.

[CV-104] FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition ICCV

【Quick Read】: This paper targets accurate sign language recognition for healthcare communication, particularly the modeling and recognition of complex multimodal gestures. The key of the solution is FusionEnsemble-Net, an attention-based ensemble of spatiotemporal networks that processes RGB video and range-Doppler radar maps synchronously through four different spatiotemporal networks; within each network, an attention-based fusion module continuously fuses visual and motion features, and the outputs of the four fused channels are combined in an ensemble classification head. The approach reaches 99.44% test accuracy on the large-scale MultiMeDaLIS dataset for isolated Italian Sign Language gestures, demonstrating both robustness and accuracy.

Link: https://arxiv.org/abs/2508.09362
Authors: Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray
Institutions: University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, Hawaii, USA. 1st MSLR Workshop 2025

Abstract:Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model’s robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: this https URL.
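
One plausible shape for the attention-based fusion module: RGB features attend to radar features, and a learned gate blends the two streams. The layout is an assumption consistent with the abstract, not the released architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-based modality fusion: cross-attend RGB features to
    radar features and blend the streams with a learned per-token gate."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, rgb, radar):                 # both (B, T, D)
        cross, _ = self.attn(rgb, radar, radar)    # RGB queries radar context
        w = torch.sigmoid(self.gate(torch.cat([rgb, cross], dim=-1)))
        return w * rgb + (1 - w) * cross

fuse = AttentionFusion(dim=64)
out = fuse(torch.randn(2, 16, 64), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```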

[CV-105] Blink-to-code: real-time Morse code communication via eye blink detection and classification

【Quick Read】: This paper addresses the lack of effective communication means for individuals with severe motor impairments by proposing an assistive communication system based on real-time eye-blink recognition. The key of the solution is to use a standard webcam and computer vision to classify voluntary blinks as short (dot) or long (dash) and decode the resulting sequences into alphanumeric characters, yielding a low-cost, immediately responsive interaction method that needs no extra hardware.

Link: https://arxiv.org/abs/2508.09344
Authors: Anushka Bhatt
Institutions: Indira Gandhi Delhi Technical University for Women (IGDTUW)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures. Preprint on blink-based Morse code communication via webcam for assistive technology. Relevant to computer vision and human-computer interaction

Abstract:This study proposes a real-time system that translates voluntary eye blinks into Morse code, enabling communication for individuals with severe motor impairments. Using a standard webcam and computer vision, the system detects and classifies blinks as short (dot) or long (dash), then decodes them into alphanumeric characters. Experiments with five participants show 62% decoding accuracy and 18-20 seconds response times, demonstrating a viable, low-cost assistive communication method.
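
The decoding step maps blink durations to dots and dashes and looks the pattern up in a Morse table. The 0.4 s threshold below is an assumption; the abstract does not give exact timing parameters.

```python
MORSE = {".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
         "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J"}  # table truncated

def decode_blinks(durations_s, dot_max=0.4):
    """Classify each blink duration as dot or dash, then decode one letter.
    The dot/dash threshold (0.4 s) is an illustrative assumption."""
    symbols = "".join("." if d <= dot_max else "-" for d in durations_s)
    return MORSE.get(symbols, "?")

print(decode_blinks([0.2, 0.8]))  # ".-" decodes to "A"
```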

[CV-106] UltraLight Med-Vision Mamba for Classification of Neoplastic Progression in Tubular Adenomas

【Quick Read】: This paper addresses the difficulty of identifying precancerous polyps during routine colonoscopy screening, so as to raise their excision rate and lower the risk of colorectal cancer. The key of the solution is to use advanced deep learning for precise adenoma classification and stratification, improving risk-assessment accuracy and enabling personalized surveillance protocols; at its core is Ultralight Med-Vision Mamba, a lightweight state-space-model (SSM) based architecture that excels at modeling long- and short-range dependencies and generalizing across whole slide images while remaining computationally fast and scalable, making real-time clinical deployment feasible.

Link: https://arxiv.org/abs/2508.09339
Authors: Aqsa Sultana, Nordin Abouzahra, Ahmed Rahu, Brian Shula, Brandon Combs, Derrick Forchetti, Theus Aspiras, Vijayan K. Asari
Institutions: University of Dayton; University of Toledo Medical Center; Honeywell International Inc.; South Bend Medical Foundation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Identification of precancerous polyps during routine colonoscopy screenings is vital for their excision, lowering the risk of developing colorectal cancer. Advanced deep learning algorithms enable precise adenoma classification and stratification, improving risk assessment accuracy and enabling personalized surveillance protocols that optimize patient outcomes. Ultralight Med-Vision Mamba, a state-space based model (SSM), has excelled in modeling long- and short-range dependencies and image generalization, critical factors for analyzing whole slide images. Furthermore, Ultralight Med-Vision Mamba’s efficient architecture offers advantages in both computational speed and scalability, making it a promising tool for real-time clinical deployment.
zh

[CV-107] Lung-DDPM: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

【速读】:该论文旨在解决现有生成式 AI 模型在肺部结节(lung nodule)诊断任务中效率低和解剖结构不精确的问题,这些问题限制了其在临床实践中的应用。解决方案的关键在于提出 Lung-DDPM+,这是一种基于去噪扩散概率模型(denoising diffusion probabilistic model, DDPM)的改进方法,通过引入结节语义布局(nodule semantic layouts)作为引导,并采用肺部专用的 DPM-solver 加速采样过程,从而在保持高质量合成图像的同时显著提升生成效率。实验表明,Lung-DDPM+ 相比原模型在浮点运算量(FLOPs)、GPU 内存消耗和采样速度上均有数量级提升,且在下游分割任务中保持与先进模型相当的样本质量。

链接: https://arxiv.org/abs/2508.09327
作者: Yifan Jiang,Ahmad Shariftabrizi,Venkata SK. Manem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8× fewer FLOPs (floating-point operations), 6.8× lower GPU memory consumption, and 14× faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at this https URL.
zh

[CV-108] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

【速读】:该论文旨在解决视觉强化学习(Visual Reinforcement Learning, Visual RL)中因高维输入和噪声奖励导致的感知与动作联合学习难题,特别是如何有效利用大规模预训练感知模型以提升视觉泛化能力和样本效率。其解决方案的关键在于提出SegDAC方法,通过Segment Anything Model (SAM)实现对象中心的分解,并结合YOLO-World基于文本提示对分割结果进行语义锚定,同时设计了一种支持每时间步动态数量分割的Transformer架构,使智能体能在线RL过程中自主选择关注的视觉片段,无需人工标注即可实现高效的视觉表征学习与策略优化。

链接: https://arxiv.org/abs/2508.09325
作者: Alexandre Brown,Glen Berseth
机构: Mila Quebec AI Institute (Mila魁北克人工智能研究所); Université de Montréal (蒙特利尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.
zh

[CV-109] Harnessing Input-Adaptive Inference for Efficient VLN ICCV2025

【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)中基于历史感知多模态Transformer模型在计算资源受限场景下的效率瓶颈问题。现有方法虽能提升导航性能,但其大规模模型结构导致计算开销高,难以部署于实际应用。解决方案的关键在于提出一种输入自适应的导航策略,通过三个层级的优化实现高效计算:(1) 空间层面,选择性处理智能体每一步观测中的全景视图以提升空间效率;(2) 模型内部层面,引入基于重要性的早期退出阈值调整机制以优化模型内计算;(3) 时间层面,设计缓存机制避免重复处理已见过的视角以增强时间效率。实验表明,在七个VLN基准测试中,该方法可使三类现成代理的计算量减少超过2倍,且不显著损害性能。

链接: https://arxiv.org/abs/2508.09262
作者: Dongwoo Kang,Akhil Perincherry,Zachary Coalson,Aiden Gabriel,Stefan Lee,Sanghyun Hong
机构: Oregon State University (俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025 [Poster]

点击查看摘要

Abstract:An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2× reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at this https URL.
zh
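
其中第 (3) 条"时间效率"的缓存机制思路较为直接,下面给出一个极简草图:对已见过的视图只编码一次,之后直接复用特征。以像素哈希作缓存键是笔者的假设(适用于离散环境中重复访问同一视点的情形),`encoder` 仅为占位函数:

```python
import hashlib
import numpy as np

class ViewFeatureCache:
    """时间维度缓存:已见过的全景视图只编码一次,之后直接复用特征。"""
    def __init__(self, encoder):
        self.encoder = encoder
        self._cache: dict[str, np.ndarray] = {}

    def _key(self, view: np.ndarray) -> str:
        # 假设:重复访问的视点返回逐像素相同的图像,可用哈希作键
        return hashlib.md5(view.tobytes()).hexdigest()

    def encode(self, view: np.ndarray) -> np.ndarray:
        k = self._key(view)
        if k not in self._cache:          # 仅对新视图做一次前向计算
            self._cache[k] = self.encoder(view)
        return self._cache[k]

# 示例:用均值池化充当"编码器"
cache = ViewFeatureCache(encoder=lambda v: v.mean(axis=(0, 1)))
view = np.random.rand(224, 224, 3)
f1 = cache.encode(view)   # 第一次:实际计算
f2 = cache.encode(view)   # 第二次:直接命中缓存
assert np.allclose(f1, f2)
```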

[CV-110] Beyond Blanket Masking: Examining Granularity for Privacy Protection in Images Captured by Blind and Low Vision Users

【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在辅助视障用户时可能无意中捕获个人隐私信息的问题,现有方法依赖粗粒度分割,对整个私密对象进行掩码处理,导致可用性下降。解决方案的关键在于提出FiGPriv框架,通过结合细粒度分割与数据驱动的风险评分机制,仅对高风险隐私信息进行选择性掩码,同时保留低风险内容,从而在保障隐私的同时显著提升图像内容的可用性和VLM的响应能力。

链接: https://arxiv.org/abs/2508.09245
作者: Jeffri Murrugarra-LLerena,Haoran Niu,K. Suzanne Barber,Hal Daumé III,Yang Trista Cao,Paola Cascante-Bonilla
机构: Stony Brook University (石溪大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As visual assistant systems powered by visual language models (VLMs) become more prevalent, concerns over user privacy have grown, particularly for blind and low vision users who may unknowingly capture personal private information in their images. Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. In this work, we propose FiGPriv, a fine-grained privacy protection framework that selectively masks only high-risk private information while preserving low-risk information. Our approach integrates fine-grained segmentation with a data-driven risk scoring mechanism. We evaluate our framework using the BIV-Priv-Seg dataset and show that FiGPriv preserves +26% of image content, enhancing the ability of VLMs to provide useful responses by 11% and identify the image content by 45%, while ensuring privacy protection. Project Page: this https URL
zh
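
FiGPriv"细粒度分割 + 风险打分"的遮挡逻辑可以示意如下:仅对风险分超过阈值的分割区域置黑,保留低风险内容。`segments` 的字段结构与 0.7 的阈值均为笔者假设:

```python
import numpy as np

def selective_mask(image: np.ndarray, segments: list[dict], risk_threshold: float = 0.7) -> np.ndarray:
    """仅遮挡风险分高于阈值的细粒度分割区域,保留低风险内容。
    segments 的每一项形如 {"mask": HxW 布尔数组, "risk": 0~1 风险分};字段名为示意。"""
    out = image.copy()
    for seg in segments:
        if seg["risk"] >= risk_threshold:
            out[seg["mask"]] = 0  # 高风险像素置黑
    return out

img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
m = np.zeros((64, 64), dtype=bool)
m[10:20, 10:30] = True
masked = selective_mask(img, [{"mask": m, "risk": 0.9},   # 高风险:被遮挡
                              {"mask": ~m, "risk": 0.1}])  # 低风险:原样保留
```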

[CV-111] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

【速读】:该论文旨在解决当前GUI代理(GUI Agent)评估框架存在的根本性缺陷,即现有基准测试过度关注粗粒度的任务完成率,而忽视了实际应用中至关重要的细粒度控制能力。其解决方案的关键在于提出了FineState-Bench——首个针对细粒度GUI代理操作的评估与诊断标准,该框架覆盖桌面、Web和移动多平台,包含2257个任务基准,并采用四阶段指标实现从感知到控制的全面评估。此外,研究开发了可插拔的视觉诊断助手(Visual Diagnostic Assistant, VDA),首次实现了对感知与定位能力的定量解耦分析,实验证明视觉定位能力是当前GUI代理的主要瓶颈,理想视觉定位可使Gemini-2.5-Flash的成功率提升14.9%。

链接: https://arxiv.org/abs/2508.09241
作者: Fengxian Ji,Jingpu Yang,Zirui Song,Yuanxi Wang,Zhexuan Cui,Yuke Li,Qian Jiang,Miao Fang,Xiuying Chen
机构: Northeastern University, China (东北大学,中国); MBZUAI, United Arab Emirates (穆罕默德·本·扎耶德人工智能大学,阿拉伯联合酋长国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submit/6682470 (Fengxian Ji)

点击查看摘要

Abstract:With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI proxy operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments, quantifying the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash’s success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning capability. All resources are fully open-source. github: this https URL huggingface: this https URL
zh

[CV-112] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在复杂场景中面临的两个关键问题:一是由于持续存在的大高斯分布无法满足自适应分裂阈值而导致的过重建(over-reconstruction),这通常由梯度方向冲突导致;二是梯度聚合对齐区域出现的高斯分布过度密集化(over-densification),造成冗余成分激增和内存开销显著上升。解决方案的核心在于提出梯度方向感知的自适应密度控制框架——梯度方向感知高斯溅射(Gradient-Direction-Aware Gaussian Splatting, GDAGS),其关键创新为引入梯度相干比(Gradient Coherence Ratio, GCR),通过归一化梯度向量模长计算,显式区分具有同向与冲突梯度方向的高斯分布,并设计非线性动态加权机制,使分裂操作优先处理冲突梯度高斯以增强几何细节,同时抑制同向梯度高斯的冗余生成;而在克隆过程中则促进同向梯度高斯的稠密化以完成结构补全,避免冲突梯度高斯过量堆积,从而实现更紧凑的场景表示与50%的内存消耗降低。

链接: https://arxiv.org/abs/2508.09239
作者: Zheng Zhou,Yu-Jie Xiong,Chun-Ming Xia,Jia-Chen Zhang,Hong-Jian Zhan
机构: Shanghai University of Engineering Science (上海工程技术大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS), a gradient-direction-aware adaptive density control framework to address these challenges. Our key innovations: the gradient coherence ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions; and a nonlinear dynamic weighting mechanism leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations with 50% reduced memory consumption through optimized Gaussians utilization.
zh
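
摘要提到 GCR 由"归一化梯度向量模长"计算得到;一种与该描述相容的实现是"梯度向量和的范数 / 各梯度范数之和":同向梯度时接近 1,方向冲突时趋近 0。下面的公式与示例是按此假设写的草图,并非原文代码:

```python
import torch

def gradient_coherence_ratio(grads: torch.Tensor, eps: float = 1e-8) -> float:
    """grads: (N, 2),某个高斯在 N 个视角/迭代中累计的二维位置梯度(示意)。
    GCR 接近 1 表示梯度方向一致,趋近 0 表示方向相互冲突;具体公式为笔者按摘要的假设。"""
    coherent = grads.sum(dim=0).norm()        # 向量和的范数
    total = grads.norm(dim=1).sum() + eps     # 各向量范数之和
    return (coherent / total).item()

aligned = torch.tensor([[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]])
conflict = torch.tensor([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
print(gradient_coherence_ratio(aligned))   # ≈1:同向 -> 克隆时优先稠密化
print(gradient_coherence_ratio(conflict))  # 远小于 1:冲突 -> 分裂时优先处理
```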

[CV-113] Towards Scalable Training for Handwritten Mathematical Expression Recognition

【速读】:该论文旨在解决手写数学表达式识别(Handwritten Mathematical Expression Recognition, HMER)领域因数据稀缺而导致模型性能受限的问题,其核心挑战在于高质量标注数据的获取成本高、效率低。解决方案的关键在于构建一个可扩展的数据引擎,用于生成复杂且一致的LaTeX序列,并结合少量真实手写公式与大规模渲染的LaTeX公式,从而创建目前最大的公式数据集Tex80M(超过8000万高质量训练样本)。在此基础上,提出首个在大规模数据上训练的HMER模型TexTeller,通过混合训练策略融合Tex80M与小规模真实手写数据集,在几乎所有基准测试中达到当前最优(SOTA)性能。

链接: https://arxiv.org/abs/2508.09220
作者: Haoyang Li,Jiaqing Li,Jialun Cao,Zongyuan Yang,Yongping Xiong
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.
zh

[CV-114] Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对对抗性提示(adversarial prompts)时存在的安全漏洞问题,特别是现有评估标准可能高估攻击有效性、导致对真正有效越狱攻击的识别不足。其核心问题是:当前许多被标记为“成功”的攻击实际上生成的是无害或无关内容,反映出评估体系与实际威胁之间存在脱节。解决方案的关键在于提出一个四维评估框架(输入相关性、分布外强度、输出危害性、拒绝率),并基于此发现结构性权衡——即高度相关提示易被拦截,而极端分布外提示虽可规避检测但难以触发有害输出;唯有平衡相关性与新颖性的提示才最有效。由此开发出递归重写策略“平衡结构分解”(Balanced Structural Decomposition, BSD),通过将恶意目标拆解为语义一致的子任务,并引入细微的分布外信号和视觉线索,显著提升攻击成功率与危害性,同时降低模型拒绝率,在13个商业及开源MLLM上验证效果优于现有方法,成功率提升67%,危害性提升21%。

链接: https://arxiv.org/abs/2508.09218
作者: Zuoou Li,Weitong Zhang,Jingyuan Wang,Shuyuan Zhang,Wenjia Bai,Bernhard Kainz,Mengyun Qiao
机构: 1. Tsinghua University (清华大学); 2. Peking University (北京大学); 3. Chinese Academy of Sciences (中国科学院); 4. New York University (纽约大学); 5. Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as “successful” are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems.
zh

[CV-115] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在情感智能方面的两大关键问题:一是模型在不同场景下的泛化能力尚不明确,二是其识别情绪触发因素的推理能力不足。为应对这些挑战,作者提出了一项系统性基准测试——MME-Emotion,其核心创新在于构建了一个包含超过6,000个精心筛选视频片段及任务特定问答对的大规模情感智能评估体系,覆盖广泛场景并设计了八类情感任务;同时引入混合指标与多智能体分析框架,实现对情绪识别与推理能力的综合量化评估。这一解决方案不仅具备可扩展性、多样性与统一协议的特点,还揭示了当前主流MLLMs在情感理解上的显著局限性(如最佳模型情绪识别准确率仅39.3%,Chain-of-Thought推理得分56.0%),并指出通用型模型与领域专用模型在情感智能获取路径上的差异,从而为未来提升MLLMs的情感智能提供了坚实基础。

链接: https://arxiv.org/abs/2508.09210
作者: Fan Zhang,Zebang Cheng,Chong Deng,Haoxuan Li,Zheng Lian,Qian Chen,Huadai Liu,Wen Wang,Yi-Fan Zhang,Renrui Zhang,Ziyu Guo,Zhihong Zhu,Hao Wu,Haixin Wang,Yefeng Zheng,Xiaojiang Peng,Xian Wu,Kun Wang,Xiangang Li,Jieping Ye,Pheng-Ann Heng
机构: CUHK(香港中文大学); Tongyi Lab(通义实验室); SZTU(深圳技术大学); PKU(北京大学); CASIA(中国科学院自动化研究所); Tencent(腾讯); UCLA(加州大学洛杉矶分校); Westlake University(西湖大学); NTU(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only 39.3% recognition score and 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs’ emotional intelligence in the future.
zh

[CV-116] GANime: Generating Anime and Manga Character Drawings from Sketches with Deep Learning

【速读】:该论文旨在解决动漫产业中从线稿(sketch)生成全彩图像的低效与高成本问题,这是当前动画制作流程中的一个关键瓶颈。解决方案的关键在于采用图像到图像翻译(image-to-image translation)技术,通过对比神经风格迁移(Neural Style Transfer)、条件生成对抗网络(C-GAN)和循环一致性生成对抗网络(CycleGAN)等多种模型,发现C-GAN在定性和定量评估中均表现出最优性能,能够生成接近人工绘制质量且分辨率较高的彩色图像,从而显著提升自动化着色效率与效果。

链接: https://arxiv.org/abs/2508.09207
作者: Tai Vu,Robert Yang
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The process of generating fully colorized drawings from sketches is a large, usually costly bottleneck in the manga and anime industry. In this study, we examine multiple models for image-to-image translation between anime characters and their sketches, including Neural Style Transfer, C-GAN, and CycleGAN. By assessing them qualitatively and quantitatively, we find that C-GAN is the most effective model that is able to produce high-quality and high-resolution images close to those created by humans.
zh

[CV-117] MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

【速读】:该论文旨在解决量化(Quantization)过程中不可避免的精度下降问题,尤其是在资源受限设备上部署深度学习模型时。其核心解决方案是提出一种基于专家混合(Mixture-of-Experts, MoE)架构的量化推理框架——Mixture of Quantization Experts (MoQE),通过将同一全精度模型的不同量化变体作为专用的“量化专家”,并设计轻量级、结构感知的路由模型,根据输入数据特征动态选择最合适的专家进行推理,从而实现性能提升与精度保持的平衡。

链接: https://arxiv.org/abs/2508.09204
作者: Jinhao Zhang,Yunquan Zhang,Boyang Zhang,Zeyu Liu,Daning Cheng
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Peng Cheng Laboratory (鹏城实验室); North University of China (中北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantization methods play a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (abbr. MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantization models. MoQE combines multiple quantization variants of one full-precision model as specialized “quantization experts” and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through specialized quantization expert models. We design lightweight, structure-aware router models tailored for both CV and NLP tasks. Experimental evaluations on ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantization models, without incurring significant increases in inference latency.
zh
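
MoQE 的核心调度逻辑可以示意如下:一个轻量线性路由器根据输入特征为每个样本挑选一个"量化专家"(即同一全精度模型的某个量化变体)。这里专家用占位 lambda 代替,特征维度与路由器结构均为笔者假设:

```python
import torch
import torch.nn as nn

class MoQERouter(nn.Module):
    """轻量路由器示意:根据输入特征选择最合适的量化专家。"""
    def __init__(self, feat_dim: int, num_experts: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, num_experts)

    def forward(self, x_feat: torch.Tensor) -> torch.Tensor:
        return self.score(x_feat).argmax(dim=-1)  # 每个样本选一个专家

experts = [lambda x: x * 0.5, lambda x: x * 1.0, lambda x: x * 2.0]  # 占位的"量化专家"
router = MoQERouter(feat_dim=16, num_experts=len(experts))
x, feat = torch.randn(4, 16), torch.randn(4, 16)
idx = router(feat)
y = torch.stack([experts[i](x[b]) for b, i in enumerate(idx.tolist())])  # 按路由结果分发推理
```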

[CV-118] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

【速读】:该论文旨在解决在无源数据可用的情况下,如何通过仅包含中性表情的无标签目标域数据实现面部表情识别(Facial Expression Recognition, FER)模型的个性化适应问题。传统无源域适应(Source-Free Domain Adaptation, SFDA)方法通常依赖多类目标数据,而本文面对的是单一类别(仅中性表情)的目标数据,且无法进行图像级表达生成,这导致现有方法难以有效适配。解决方案的关键在于提出一种轻量级的个性化特征翻译(Personalized Feature Translation, PFT)方法,其核心创新是将翻译操作从图像空间转移到潜在空间(latent space),利用预训练的翻译器在源域上学习跨个体风格特征的转换,并通过表达一致性与风格感知目标联合优化保留表情信息;随后在目标域中仅用中性表情数据对翻译器进行微调,无需图像合成或使用源数据,从而高效生成针对分类任务优化的判别性嵌入,显著降低计算复杂度并提升适应效率。

链接: https://arxiv.org/abs/2508.09202
作者: Masoumeh Sharafi,Soufiane Belharbi,Houssem Ben Salem,Ali Etemad,Alessandro Lameiras Koerich,Marco Pedersoli,Simon Bacon,Eric Granger
机构: 1. University of Montreal (蒙特利尔大学); 2. University of Toronto (多伦多大学); 3. University of São Paulo (圣保罗大学); 4. University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.
zh

[CV-119] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在对齐过程中仍易受越狱攻击(jailbreak attacks)的问题,此类攻击会带来严重安全风险。现有检测方法虽转向利用内部表示以获取跨模态信息,但多依赖启发式规则而非严格的优化目标,导致性能不佳。解决方案的关键在于提出一种名为“Learning to Detect”(LoD)的无监督框架,其核心创新为:(1)多模态安全概念激活向量(Multi-modal Safety Concept Activation Vectors, MSCAV),用于捕获不同层中与安全相关的跨模态表征;(2)安全模式自编码器(Safety Pattern Auto-Encoder),通过仅在安全样本上训练来建模MSCAV的分布,并基于重构误差识别越狱输入作为分布异常。该方法无需攻击标签即可实现统一且高精度的越狱攻击检测,在多个LVLM和基准测试中达到SOTA性能,平均AUROC达0.9951,较最强基线提升最高达38.89%。

链接: https://arxiv.org/abs/2508.09201
作者: Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang
机构: 中国人民大学(Renmin University of China); 中国科学院(Computer Science Research Institute, Chinese Academy of Sciences)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
zh
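
LoD 中"安全模式自编码器"的判别思路是经典的重构误差异常检测,下面是一个按摘要假设的最小草图:自编码器只在安全样本的 MSCAV 上训练,推理时重构误差超过阈值即判为越狱输入。128 维的 MSCAV 维度与网络结构均为笔者假设:

```python
import torch
import torch.nn as nn

class SafetyPatternAE(nn.Module):
    """仅用安全样本训练的自编码器:重构误差大的输入视为分布异常(疑似越狱)。"""
    def __init__(self, dim: int = 128, hidden: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(ae: SafetyPatternAE, mscav: torch.Tensor) -> torch.Tensor:
    """逐样本重构误差;训练循环(仅在安全样本上最小化该误差)在此省略。"""
    return ((ae(mscav) - mscav) ** 2).mean(dim=-1)

ae = SafetyPatternAE()
batch = torch.randn(8, 128)          # 假设的 MSCAV 表示
scores = anomaly_score(ae, batch)    # 超过验证集上选定的阈值即判为攻击
```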

[CV-120] Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation

【速读】:该论文旨在解决情感计算(Affective Computing)领域中高质量、多样化深度人脸数据集匮乏的问题,尤其是针对细微情绪表达的识别难题。其解决方案的关键在于提出一种基于优化生成对抗网络(GAN)与知识蒸馏(Knowledge Distillation)相结合的合成深度人脸生成框架,其中引入EMA教师模型以稳定训练过程、提升图像质量并防止模式崩溃;同时结合遗传算法(Genetic Algorithms)对GAN潜在向量进行演化优化,依据图像统计特征增强目标情绪下的多样性与视觉质量。该方法在多样性与质量上均优于传统GAN、变分自编码器(VAE)、高斯混合模型(GMM)及核密度估计(KDE),且在分类任务中通过融合LBP、HOG、Sobel边缘和强度直方图等多维特征,配合XGBoost实现高达94%–96%的准确率,验证了所生成数据的有效性。

链接: https://arxiv.org/abs/2508.09188
作者: Seyed Muhammad Hossein Mousavi,S. Younes Mirinezhad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Affective computing faces a major challenge: the lack of high-quality, diverse depth facial datasets for recognizing subtle emotional expressions. We propose a framework for synthetic depth face generation using an optimized GAN with Knowledge Distillation (EMA teacher models) to stabilize training, improve quality, and prevent mode collapse. We also apply Genetic Algorithms to evolve GAN latent vectors based on image statistics, boosting diversity and visual quality for target emotions. The approach outperforms GAN, VAE, GMM, and KDE in both diversity and quality. For classification, we extract and concatenate LBP, HOG, Sobel edge, and intensity histogram features, achieving 94% and 96% accuracy with XGBoost. Evaluation using FID, IS, SSIM, and PSNR shows consistent improvement over state-of-the-art methods.
zh
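
在潜在空间上运行遗传算法的骨架大致如下:以图像统计特征构造适应度,对 GAN 潜在向量做选择与变异。这里的适应度函数、种群规模与变异幅度都是占位假设,仅演示流程:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(latent: np.ndarray) -> float:
    """示意的适应度:实际应由生成图像的统计特征(如目标情绪评分、对比度)给出。"""
    return -np.abs(latent.mean() - 0.5)  # 占位目标:潜在向量均值靠近 0.5

def evolve(pop_size=32, dim=100, generations=20, mut_sigma=0.1):
    pop = rng.standard_normal((pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(z) for z in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]                    # 选择:保留适应度高的一半
        children = parents + mut_sigma * rng.standard_normal(parents.shape)  # 变异
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(z) for z in pop])]

best_latent = evolve()  # 交给 GAN 生成器,生成目标情绪的深度人脸
```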

[CV-121] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System

【速读】:该论文旨在解决智能交通系统(ITS)中AI摄像头产生的视觉数据与个人隐私保护之间的冲突问题。现有隐私保护机制如模糊处理或加密方法在面对高级重建攻击时往往效果不足,导致隐私泄露或数据可用性严重下降。解决方案的关键在于提出RL-MoE框架,该框架通过结合专家混合(Mixture-of-Experts, MoE)架构实现场景的多维度分解,并引入强化学习(Reinforcement Learning, RL)代理优化生成文本,在语义准确性和隐私保护之间实现双重目标,从而将敏感视觉数据转化为隐私保护的文本描述,避免直接传输图像,显著提升隐私安全性(如在CFP-FP数据集上将重放攻击成功率降至9.4%),同时保持更丰富的文本内容表达能力。

链接: https://arxiv.org/abs/2508.09186
作者: Abdolazim Rezaei,Mehdi Sookhak,Mahboobeh Haghparast
机构: Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the fundamental right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this impasse, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.
zh

[CV-122] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)环境中认知攻击的检测问题,即攻击者通过篡改AR内容来误导用户对物理世界的语义理解。现有方法主要依赖像素或图像级别的视觉变化检测,缺乏语义推理能力;或使用预训练视觉语言模型(Vision-Language Models, VLMs),但存在黑箱特性、可解释性差的问题。解决方案的关键在于提出一种神经符号(neurosymbolic)方法CADAR,其核心是利用预训练VLM融合多模态输入以构建符号化的感知图(perception-graph)表示,该表示引入先验知识、显著性加权和时序相关性;随后采用基于粒子滤波(particle filtering)的统计推理机制进行攻击检测,从而在保持VLM适应性的同时,实现高可解释性和严谨的推理能力。

链接: https://arxiv.org/abs/2508.09185
作者: Rongqian Chen,Allison Andreyev,Yanming Xiu,Mahdi Imani,Bin Li,Maria Gorlatova,Gang Tan,Tian Lan
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Columbia University (哥伦比亚大学); 3. University of Toronto (多伦多大学); 4. Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users’ semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning – a sequential Monte Carlo method – to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLM and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.
zh
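
CADAR 的统计推理部分是经典的粒子滤波(顺序蒙特卡洛)。下面以一维状态为例给出"预测-加权-重采样"一步更新的示意;状态含义、转移噪声与观测似然均为笔者为演示而设的假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, transition_std, likelihood):
    """一步顺序蒙特卡洛:预测 -> 按观测似然重加权 -> 重采样(一维状态示意)。"""
    particles = particles + rng.normal(0, transition_std, size=particles.shape)  # 预测
    weights = weights * likelihood(particles)                                    # 更新
    weights /= weights.sum() + 1e-12
    idx = rng.choice(len(particles), size=len(particles), p=weights)             # 重采样
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# 示例:状态为"场景被篡改的程度",观测似然偏向 0.8
particles = rng.uniform(0, 1, 1000)
weights = np.full(1000, 1e-3)
lik = lambda x: np.exp(-((x - 0.8) ** 2) / 0.02)
particles, weights = particle_filter_step(particles, weights, 0.05, lik)
print(particles.mean())  # 后验均值超过阈值即报告认知攻击
```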

[CV-123] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection)中因缺陷样本稀缺导致传统方法泛化能力受限的问题,同时提升视觉语言模型(Vision-Language Models, VLMs)在该任务上的性能瓶颈。其解决方案的关键在于提出了一种通用的后训练框架 IAD-R1,该框架采用两阶段策略:第一阶段为感知激活监督微调(Perception Activation Supervised Fine-Tuning, PA-SFT),利用高质量的 Chain-of-Thought 数据集(Expert-AD)增强模型对异常的感知能力并建立推理到答案的关联;第二阶段为结构化控制组相对策略优化(Structured Control Group Relative Policy Optimization, SC-GRPO),通过精心设计的奖励函数实现从“异常感知”到“异常解释”的能力跃迁。实验表明,IAD-R1 在 7 种不同架构和规模的 VLM 上均显著提升检测准确率,最高达 43.3%,且 0.5B 参数模型在零样本设置下超越 GPT-4.1 和 Claude-Sonnet-4,验证了该框架的有效性与优越性。

链接: https://arxiv.org/abs/2508.09178
作者: Yanhui Li,Yunkang Cao,Chengliang Liu,Yuan Xiong,Xinghui Dong,Chao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from “Anomaly Perception” to “Anomaly Interpretation”. Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at this https URL.
zh

[CV-124] A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection

【速读】:该论文旨在解决社交媒体中针对女性的恶意内容(misogynistic content)检测难题,此类内容在一般性攻击性内容检测方法中难以有效识别。解决方案的关键在于提出一个新颖的多模态框架,其核心包括三个模块:多模态注意力模块(Multimodal Attention module, MANM)通过自适应门控机制实现跨模态上下文感知注意力,聚焦于与语境相关的文本和视觉信息;图结构特征重构模块(Graph-based Feature Reconstruction Module, GFRM)利用图结构对单模态内部特征进行优化;内容特定特征学习模块(Content-specific Features Learning Module, CFLM)专门提取文本和图像中的毒性特征及标题特征,并引入专门构建的厌女词汇库计算性别歧视词频得分。此外,采用特征空间测试时增强策略提升模型对多样化输入的泛化能力。实验表明,该方法在MAMI和MMHS150K两个数据集上分别相较现有方法平均提升了10.17%和8.88%的宏F1分数。

链接: https://arxiv.org/abs/2508.09175
作者: Mohammad Zia Ur Rehman,Sufyaan Zahoor,Areeb Manzoor,Musharaf Maqbool,Nagendra Kumar
机构: Indian Institute of Technology Indore (印度理工学院英德奥); National Institute of Technology Srinagar (国家技术学院斯里纳加)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in Information Processing & Management

点击查看摘要

Abstract:A substantial portion of offensive content on social media is directed towards women. Since the approaches for general offensive content detection face a challenge in detecting misogynistic content, it requires solutions tailored to address offensive content against women. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively.
zh
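
文中"厌女词频得分"的计算可以用如下草图表示:统计文本中命中专用词表的词占比。词表内容与归一化方式(命中数/总词数)均为笔者假设:

```python
import re

MISOGYNY_LEXICON = {"terma", "termb", "termc"}  # 占位词表;原文为人工整理的厌女词库

def lexicon_score(text: str) -> float:
    """厌女词频得分:词表命中数 / 总词数(归一化方式为示意假设)。"""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(t in MISOGYNY_LEXICON for t in tokens)
    return hits / len(tokens)

print(lexicon_score("example text with termA and termB"))  # 2/6 ≈ 0.33
```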

[CV-125] Multimodal RAG Enhanced Visual Description CIKM2025

【速读】:该论文旨在解决预训练大型多模态模型(LMMs)中存在的模态间隙(modality gap)问题,即文本与视觉表示在共享嵌入空间中的不对齐现象,这限制了模型在跨模态任务中的性能。传统通过微调缓解该问题的方法因需大量领域驱动数据而成本高昂且不切实际。其解决方案的关键在于提出一种轻量级、无需训练的检索增强生成(Retrieval-Augmented Generation, RAG)方法:利用线性映射将图像嵌入投射到文本空间,从而在训练集中检索最相近的文本描述;这些文本描述结合指令作为提示输入语言模型,生成新的文本描述。进一步引入迭代蒸馏技术,通过语言模型生成合成描述以优化常用图像描述指标,实现高效且有效的跨模态对齐。

链接: https://arxiv.org/abs/2508.09170
作者: Amit Kumar Jaiswal,Haiming Liu,Ingo Frommholz
机构: Indian Institute of Technology (BHU) Varanasi (印度理工学院(布哈)瓦拉纳西); University of Southampton (南安普顿大学); Modul University Vienna (模度大学维也纳)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by ACM CIKM 2025. 5 pages, 2 figures

点击查看摘要

Abstract:Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM enabling retrieval of closest textual descriptions from the training set. These textual descriptions, in conjunction with an instruction, cater as an input prompt for the language model to generate new textual descriptions. In addition, we introduce an iterative technique for distilling the mapping by generating synthetic descriptions via the language model facilitating optimisation for standard utilised image description measures. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
zh
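
摘要中的线性映射可以用最小二乘闭式解高效求得,之后把图像嵌入投到文本空间再做最近邻检索。下面用随机数据示意整个流程(嵌入维度与数据均为占位):

```python
import numpy as np

# 假设的训练集配对嵌入:LMM 的图像嵌入 X (N, d_img) 与文本嵌入 Y (N, d_txt)
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((500, 64)), rng.standard_normal((500, 32))

# 最小二乘求线性映射 W,使 XW ≈ Y(闭式解,计算开销很小)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def retrieve(img_emb: np.ndarray, text_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """把图像嵌入映射到文本空间后,按余弦相似度取最近的 k 条训练描述。"""
    q = img_emb @ W
    sims = text_embs @ q / (np.linalg.norm(text_embs, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(sims)[-k:][::-1]

print(retrieve(X[0], Y))  # 返回候选描述的索引,拼入指令后交给语言模型生成新描述
```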

[CV-126] SVGen: Interpretable Vector Graphics Generation with Large Language Models

【速读】:该论文旨在解决将创意想法转化为精确矢量图形(Scalable Vector Graphics, SVG)时存在的效率低下和耗时问题。其解决方案的关键在于构建了一个大规模、高质量的SVG与自然语言描述对齐的数据集SVG-1M,并基于此提出了一种端到端的生成模型SVGen,该模型通过课程学习(curriculum learning)和强化学习(reinforcement learning)优化策略,在保持语义准确性和结构完整性的同时,显著提升了生成效果与效率。

链接: https://arxiv.org/abs/2508.09168
作者: Feiyu Wang,Zhiyuan Zhao,Yuandong Liu,Da Zhang,Junyu Gao,Hao Sun,Xuelong Li
机构: Northwestern Polytechnical University (西北工业大学); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scalable Vector Graphics (SVG) is widely used in front-end development and UI/UX design due to its scalability, editability, and rendering efficiency. However, turning creative ideas into precise vector graphics remains a time-consuming challenge. To address this, we introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. Through advanced data augmentation and annotation, we create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs. Our approach ensures semantic accuracy and structural completeness, supported by curriculum learning and reinforcement learning optimization. Experiments show that SVGen outperforms general large models and traditional rendering methods in both effectiveness and efficiency. Code, model, and dataset are available on GitHub.
zh

[CV-127] Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images

【速读】:该论文旨在解决不同医院使用的心电图(Electrocardiogram, ECG)布局差异导致的导联时间异步和部分信号丢失问题,这一问题严重制约了现有模型在多变ECG格式下的泛化能力。解决方案的关键在于提出PatchECG框架,其基于掩码训练策略实现自适应可变块数缺失表示学习,能够自动聚焦于导联间具有协同依赖关系的关键图像块,从而在不同布局下稳定识别心律失常特征。

链接: https://arxiv.org/abs/2508.09165
作者: Shanwei Zhang,Deyun Zhang,Yirao Tao,Kexin Wang,Shijia Geng,Jun Li,Qinghao Zhao,Xingpeng Liu,Yuxi Zhou,Shenda Hong
机构: Tianjin University of Technology (天津理工大学); Peking University (北京大学); HeartVoice Medical Technology (HeartVoice 医疗科技); Capital Medical University (首都医科大学); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Electrocardiogram (ECG) is an important tool for diagnosing cardiovascular diseases such as arrhythmia. Due to the differences in ECG layouts used by different hospitals, the digitized signals exhibit asynchronous lead time and partial blackout loss, which poses a serious challenge to existing models. To address this challenge, the study introduced PatchECG, a framework for adaptive variable block count missing representation learning based on a masking training strategy, which automatically focuses on key patches with collaborative dependencies between leads, thereby achieving key recognition of arrhythmia in ECGs with different layouts. Experiments were conducted on the PTB-XL dataset and 21388 asynchronous ECG images generated using the ECG image kit tool, using the 23 subclasses as labels. The proposed method demonstrated strong robustness under different layouts, with an average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.835, and remained stable (unchanged with layout changes). In external validation based on 400 real ECG images from Chaoyang Hospital, the AUROC for atrial fibrillation diagnosis reached 0.778; on 12 x 1 layout ECGs, AUROC reaches 0.893. This result is superior to various classic interpolation and baseline methods, and compared to the current optimal large-scale pre-training model ECGFounder, it has improved by 0.111 and 0.19, respectively.
zh

[CV-128] T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation and Diagnosis ALT

【速读】:该论文旨在解决传统磁共振成像(MRI)在肝癌诊断中面临的三大挑战:对比剂(Contrast Agent, CA)注射带来的风险、人工评估耗时较长以及标注数据集有限的问题。其核心解决方案是提出一种时间条件自回归对比增强(Time-Conditioned Autoregressive Contrast Enhancement, T-CACE)框架,能够直接从非增强MRI(Non-Contrast MRI, NCMRI)生成多期增强MRI(Contrast-Enhanced MRI, CEMRI)。T-CACE的关键创新包括:1)条件令牌编码(Conditional Token Encoding, CTE)机制,将解剖先验与时间相位信息统一嵌入潜在表示;2)动态时间感知注意力掩码(Dynamic Time-Aware Attention Mask, DTAM),利用高斯衰减注意力机制自适应调控跨相位信息流,确保相位间过渡平滑且符合生理特性;3)时间分类一致性约束(Temporal Classification Consistency, TCC),使病灶分类结果与生理信号演化保持一致,从而提升诊断可靠性。

链接: https://arxiv.org/abs/2508.09919
作者: Xiaojiao Xiao,Jianfeng Zhao,Qinmin Vivian Hu,Guanghui Wang
机构: Toronto Metropolitan University (多伦多都会大学); Western University (西门菲莎大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Journal of Biomedical and Health Informatics, 2025

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving the classification of the lesion and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; and a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases. Furthermore, a constraint for temporal classification consistency (TCC) aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesion. The implementation of T-CACE is publicly available at: this https URL.
zh
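
DTAM 中"高斯衰减注意力"的核心可以示意为:按查询相位与各相位的时间距离构造高斯衰减项,取对数后加到注意力 logits 上(等价于乘性衰减)。sigma 等超参为笔者假设,并非原文配置:

```python
import torch

def gaussian_time_mask(num_phases: int, query_phase: int, sigma: float = 1.0) -> torch.Tensor:
    """按相位间"时间距离"做高斯衰减的注意力掩码(加到注意力 logits 上)。"""
    t = torch.arange(num_phases, dtype=torch.float32)
    decay = torch.exp(-((t - query_phase) ** 2) / (2 * sigma ** 2))
    return torch.log(decay + 1e-9)  # 取对数后与注意力分数相加,等价于乘性衰减

mask = gaussian_time_mask(num_phases=4, query_phase=2)
attn_logits = torch.randn(4) + mask          # 离 query 相位越远,衰减越强
weights = torch.softmax(attn_logits, dim=-1)  # 相位间过渡因此保持平滑
```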

[CV-129] Perceptual Reality Transformer: Neural Architectures for Simulating Neurological Perception Conditions

【速读】:该论文旨在解决神经系统疾病导致的视觉感知障碍所引发的体验鸿沟问题,即患者与照料者、家庭成员及医疗专业人员之间因感知差异而产生的理解困境。其解决方案的关键在于提出了一种名为“感知现实转换器”(Perceptual Reality Transformer)的系统性框架,该框架采用六种不同的神经网络架构,模拟八种神经性感知障碍(如 simultanagnosia、prosopagnosia、ADHD 注意缺陷等),通过从自然图像到特定条件感知状态的映射学习,使非患者能够体验这些障碍的近似表现。该方法的核心创新在于基于临床文献构建了条件特异性的扰动函数,并首次建立了针对神经感知模拟的系统性基准,同时验证了 Vision Transformer 在此类任务中的最优性能,为医学教育、共情训练和辅助技术开发提供了可量化的工具与理论基础。

链接: https://arxiv.org/abs/2508.09852
作者: Baihan Lin
机构: Icahn School of Medicine at Mount Sinai (纽约西奈山伊坎医学院)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Neurological conditions affecting visual perception create profound experiential divides between affected individuals and their caregivers, families, and medical professionals. We present the Perceptual Reality Transformer, a comprehensive framework employing six distinct neural architectures to simulate eight neurological perception conditions with scientifically-grounded visual transformations. Our system learns mappings from natural images to condition-specific perceptual states, enabling others to experience approximations of simultanagnosia, prosopagnosia, ADHD attention deficits, visual agnosia, depression-related changes, anxiety tunnel vision, and Alzheimer’s memory effects. Through systematic evaluation across ImageNet and CIFAR-10 datasets, we demonstrate that Vision Transformer architectures achieve optimal performance, outperforming traditional CNN and generative approaches. Our work establishes the first systematic benchmark for neurological perception simulation, contributes novel condition-specific perturbation functions grounded in clinical literature, and provides quantitative metrics for evaluating simulation fidelity. The framework has immediate applications in medical education, empathy training, and assistive technology development, while advancing our fundamental understanding of how neural networks can model atypical human perception.
zh

[CV-130] Robustness analysis of Deep Sky Objects detection models on HPC

【速读】:该论文旨在解决天文图像中深空天体(Deep Sky Objects,DSOs)检测的挑战,尤其是由于信号微弱和背景复杂导致的传统人工识别效率低、准确性不足的问题。解决方案的关键在于利用计算机视觉与深度学习技术,特别是基于YOLO和RET-DETR两种目标检测模型,在智能望远镜图像上进行训练与对比,并借助高性能计算(High-Performance Computing, HPC)实现并行化处理,从而提升检测的准确性和鲁棒性。

链接: https://arxiv.org/abs/2508.09831
作者: Olivier Parisot,Diogo Ramalho Fernandes
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, NEOD project

点击查看摘要

Abstract:Astronomical surveys and the growing involvement of amateur astronomers are producing more sky images than ever before, and this calls for automated processing methods that are accurate and robust. Detecting Deep Sky Objects – such as galaxies, nebulae, and star clusters – remains challenging because of their faint signals and complex backgrounds. Advances in Computer Vision and Deep Learning now make it possible to improve and automate this process. In this paper, we present the training and comparison of different detection models (YOLO, RET-DETR) on smart telescope images, using High-Performance Computing (HPC) to parallelise computations, in particular for robustness testing.
zh

[CV-131] Dynamic Survival Prediction using Longitudinal Images based on Transformer

【速读】:该论文旨在解决纵向医学影像(longitudinal medical images)在生存分析中面临的三大挑战:一是对删失数据(censored data)利用不足,二是未能充分建模多个时间点采集的影像之间的相关性,三是模型缺乏可解释性。其解决方案的关键在于提出SurLonFormer,一种基于Transformer架构的神经网络,通过三个核心组件实现:1)视觉编码器(Vision Encoder)提取空间特征;2)序列编码器(Sequence Encoder)聚合时间信息;3)基于Cox比例风险模型的生存编码器(Survival Encoder),从而有效整合删失数据、提升可扩展性,并借助遮挡敏感性分析和动态生存预测增强模型可解释性。

链接: https://arxiv.org/abs/2508.09328
作者: Bingfan Liu,Haolun Shi,Jiguo Cao
机构: Simon Fraser University (西蒙弗雷泽大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Other Statistics (stat.OT)
备注:

点击查看摘要

Abstract:Survival analysis utilizing multiple longitudinal medical images plays a pivotal role in the early detection and prognosis of diseases by providing insight beyond single-image evaluations. However, current methodologies often inadequately utilize censored data, overlook correlations among longitudinal images measured over multiple time points, and lack interpretability. We introduce SurLonFormer, a novel Transformer-based neural network that integrates longitudinal medical imaging with structured data for survival prediction. Our architecture comprises three key components: a Vision Encoder for extracting spatial features, a Sequence Encoder for aggregating temporal information, and a Survival Encoder based on the Cox proportional hazards model. This framework effectively incorporates censored data, addresses scalability issues, and enhances interpretability through occlusion sensitivity analysis and dynamic survival prediction. Extensive simulations and a real-world application in Alzheimer’s disease analysis demonstrate that SurLonFormer achieves superior predictive performance and successfully identifies disease-related imaging biomarkers.
zh
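
生存编码器所依赖的 Cox 比例风险模型通常用负对数偏似然训练,这一损失能自然利用删失样本(删失个体只出现在他人的风险集中)。下面是一个按 Breslow 近似写的 PyTorch 草图,变量名与数据均为示意:

```python
import torch

def cox_partial_likelihood_loss(risk: torch.Tensor, time: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
    """Cox 比例风险的负对数偏似然(Breslow 近似):
    risk 为模型输出的对数风险分,event=1 表示观察到事件,0 表示删失。"""
    order = torch.argsort(time, descending=True)       # 按时间降序,前缀即风险集
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)       # log Σ_{j: t_j >= t_i} exp(risk_j)
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

risk = torch.randn(6, requires_grad=True)  # 实际中由 Vision/Sequence/Survival 编码器输出
time = torch.tensor([2.0, 5.0, 3.0, 8.0, 1.0, 4.0])
event = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
loss = cox_partial_likelihood_loss(risk, time, event)
loss.backward()
```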

[CV-132] AMRG: Extend Vision Language Models for Automatic Mammography Report Generation

【速读】:该论文旨在解决乳腺X线摄影报告生成(mammography report generation)这一在医学人工智能领域中关键但研究不足的任务,其核心挑战包括多视角图像推理、高分辨率视觉线索捕捉以及放射学语言的非结构化特性。解决方案的关键在于提出首个端到端框架AMRG(Automatic Mammography Report Generation),基于领域专用的大规模视觉-语言模型(VLMs)MedGemma-4B-it,并采用低秩适应(Low-Rank Adaptation, LoRA)参数高效微调策略,在保持极小计算开销的前提下实现模型轻量化适配。该方法在公开数据集DMID上进行了系统训练与评估,首次建立了可复现的乳腺影像报告生成基准,显著提升了生成报告的语言质量和临床准确性(如BI-RADS分类准确率达0.5582),为多模态医学AI研究提供了可扩展且灵活的基础平台。

链接: https://arxiv.org/abs/2508.09225
作者: Nak-Jun Sung,Donghyun Lee,Bo Hwa Choi,Chae Jung Park
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mammography report generation is a critical yet underexplored task in medical AI, characterized by challenges such as multiview image reasoning, high-resolution visual cues, and unstructured radiologic language. In this work, we introduce AMRG (Automatic Mammography Report Generation), the first end-to-end framework for generating narrative mammography reports using large vision-language models (VLMs). Building upon MedGemma-4B-it, a domain-specialized, instruction-tuned VLM, we employ a parameter-efficient fine-tuning (PEFT) strategy via Low-Rank Adaptation (LoRA), enabling lightweight adaptation with minimal computational overhead. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports. This work establishes the first reproducible benchmark for mammography report generation, addressing a longstanding gap in multimodal clinical AI. We systematically explore LoRA hyperparameter configurations and conduct comparative experiments across multiple VLM backbones, including both domain-specific and general-purpose models under a unified tuning protocol. Our framework demonstrates strong performance across both language generation and clinical metrics, achieving a ROUGE-L score of 0.5691, METEOR of 0.6152, CIDEr of 0.5818, and BI-RADS accuracy of 0.5582. Qualitative analysis further highlights improved diagnostic consistency and reduced hallucinations. AMRG offers a scalable and adaptable foundation for radiology report generation and paves the way for future research in multimodal medical AI.
zh
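
文中采用的 LoRA 思想可以用一个包装层示意:冻结预训练线性层 W,只训练低秩增量 BA(B 零初始化,保证训练起点与原模型一致)。r=8、alpha=16 为笔者假设的超参,并非原文配置:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA 示意:冻结原权重,只训练低秩增量 B @ A。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化:增量从 0 开始
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))  # 可训练参数仅 A、B 两个低秩矩阵
```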

[CV-133] Real-time deep learning phase imaging flow cytometer reveals blood cell aggregate biomarkers for haematology diagnostics

【速读】:该论文旨在解决自动化血液分析中罕见血细胞聚集物(blood cell aggregates)难以识别的问题,这类聚集物虽在传统流式细胞术中常因未触发警报而被忽略,却可能显著提升无标记的功能性诊断价值。当前基于定量相位成像的流式细胞术虽能捕捉详细的聚集形态,但受限于海量数据存储与离线处理,难以用于临床场景。论文提出的解决方案是RT-HAD框架——一个端到端的深度学习图像与数据处理系统,其核心创新在于将物理一致性的全息重建与检测相结合,并以图结构表示每个血细胞,从而实现对血小板聚集的实时识别。该方案可在1.5分钟内处理30 GB图像数据,错误率仅为8.9%,达到实验室可接受的血液学指标误差水平,有效破解了点-of-care诊断中的大数据挑战。

Link: https://arxiv.org/abs/2508.09215
Authors: Kerem Delikoyun, Qianyu Chen, Liu Wei, Si Ko Myo, Johannes Krell, Martin Schlegel, Win Sen Kuan, John Tshon Yit Soong, Gerhard Schneider, Clarissa Prazeres da Costa, Percy A. Knolle, Laurent Renia, Matthew Edward Cove, Hwee Kuan Lee, Klaus Diepold, Oliver Hayden
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:While analysing rare blood cell aggregates remains challenging in automated haematology, they could markedly advance label-free functional diagnostics. Conventional flow cytometers efficiently perform cell counting with leukocyte differentials but fail to identify aggregates with flagged results, requiring manual reviews. Quantitative phase imaging flow cytometry captures detailed aggregate morphologies, but clinical use is hampered by massive data storage and offline processing. Incorporating hidden biomarkers into routine haematology panels would significantly improve diagnostics without flagged results. We present RT-HAD, an end-to-end deep learning-based image and data processing framework for off-axis digital holographic microscopy (DHM), which combines physics-consistent holographic reconstruction and detection, representing each blood cell in a graph to recognize aggregates. RT-HAD processes 30 GB of image data on-the-fly with turnaround time of 1.5 min and error rate of 8.9% in platelet aggregate detection, which matches acceptable laboratory error rates of haematology biomarkers and solves the big data challenge for point-of-care diagnostics.

[CV-134] From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations MICCAI

【Quick Read】: This paper targets the lack of reliable explanations when integrating deep learning models into clinical digital pathology, in particular distinguishing whether a model relies on spurious features that undermine generalization or instead reveals novel biological insight. The key to the solution is a human-machine-VLM interaction system: first, an AI-integrated slide viewer runs sliding-window experiments to test the claims of an explanation; second, a general-purpose vision-language model (VLM) quantifies how predictive an explanation is. Together these enable qualitative testing of explanations and quantitative discrimination between competing ones, moving the field from explainable AI toward explained AI.

Link: https://arxiv.org/abs/2508.09205
Authors: Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, 2 tables, submitted to the MICCAI IMIMIC workshop

Abstract:Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights if a model depends on spurious features that undermines generalization and harms a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation’s predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at this https URL.
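A minimal sketch of the kind of sliding-window occlusion test such a viewer can run: mask one patch at a time and record the score drop. The window size and the toy scoring function are placeholders, not the paper's implementation.

```python
import numpy as np

def occlusion_sensitivity(image, predict_fn, window=32, stride=32, fill=0.0):
    """Slide an occluding patch over the image and record how much the
    model's score drops at each position; large drops mark influential regions."""
    h, w = image.shape[:2]
    base = predict_fn(image)
    heat = np.zeros(((h - window) // stride + 1, (w - window) // stride + 1))
    for i, y in enumerate(range(0, h - window + 1, stride)):
        for j, x in enumerate(range(0, w - window + 1, stride)):
            occluded = image.copy()
            occluded[y:y + window, x:x + window] = fill
            heat[i, j] = base - predict_fn(occluded)  # score drop = importance
    return heat

# Toy model: the score is the mean intensity of the top-left quadrant.
img = np.random.rand(128, 128)
score = lambda im: float(im[:64, :64].mean())
print(occlusion_sensitivity(img, score).round(2))
```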

[CV-135] Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction

【Quick Read】: This paper aims to shorten MRCP examinations: respiratory-triggered acquisitions take 338 s on average and burden patient compliance and comfort, whereas the proposed zero-shot self-supervised learning reconstruction reduces the scan to a 14 s breath-hold while preserving image quality. The key is a shallow training scheme that leverages a pretrained network to cut the computational cost of zero-shot training (from 271 minutes to 11 minutes), delivering high-fidelity reconstructions that remain practical under clinical time constraints.

Link: https://arxiv.org/abs/2508.09200
Authors: Jinho Kim, Marcel Dominik Nickel, Florian Knoll
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 6 figures, 2 tables

Abstract:Purpose: To investigate the feasibility of applying zero-shot self-supervised learning reconstruction to reduce breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Methods: Breath-hold MRCP was acquired from 11 healthy volunteers on a 3T scanner using an incoherent k-space sampling pattern leading to a breath-hold duration of 14s. We evaluated zero-shot reconstruction of breath-hold MRCP against parallel imaging of respiratory-triggered MRCP acquired in 338s on average and compressed sensing reconstruction of breath-hold MRCP. To address the long computation times of zero-shot trainings, we used a training approach that leverages a pretrained network to reduce backpropagation depth during training. Results: Zero-shot learning reconstruction significantly improved visual image quality compared to compressed sensing reconstruction, particularly in terms of signal-to-noise ratio and ductal delineation, and reached a level of quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Shallow training provided nearly equivalent reconstruction performance with a training time of 11 minutes in comparison to 271 minutes for a conventional zero-shot training. Conclusion: Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and shallow training offers a practical solution for translation to time-constrained clinical workflows.

[CV-136] FIVA: Federated Inverse Variance Averaging for Universal CT Segmentation with Uncertainty Estimation ALT

【Quick Read】: This paper addresses effective federated learning over heterogeneous multi-center abdominal CT segmentation datasets (different scanners, acquisition settings, and inconsistent organ labels) while preserving patient privacy. The core challenge is improving universal segmentation across institutions without sharing raw data while also providing reliable predictive uncertainty to support clinical decisions. The solution has three key parts: first, the inherent noise of stochastic mini-batch gradient descent is used to estimate a distribution over client model weights, giving on-the-go uncertainty over model parameters; second, the server aggregates the parameters with a Bayesian-inspired inverse-variance weighting scheme that exploits this uncertainty, improving the quality of federated aggregation; finally, at inference time, uncertainty is propagated from the model weights to quantify prediction confidence, enabling uncertainty-weighted predictions that clearly outperform established baselines.

Link: https://arxiv.org/abs/2508.09196
Authors: Asim Ukaye, Numan Saeed, Karthik Nandakumar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Michigan State University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, Machine Learning for Healthcare Conference

Abstract:Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs. Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference. Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights to provide an on-the-go uncertainty over the model parameters at the client level. The parameters are then aggregated at the server using the additional uncertainty information using a Bayesian-inspired inverse-variance aggregation scheme. Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. In line with recent work, predictive uncertainty is utilized in the inference stage to improve predictive performance. Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and uncertainty-weighted inference compared to previously established baselines. The code for this work is made available at: this https URL
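A minimal NumPy sketch of Bayesian-inspired inverse-variance aggregation: each client's parameter estimate is weighted by its precision (inverse variance). The variances here are handed in directly, whereas FIVA estimates them from the SGD noise on the clients.

```python
import numpy as np

def inverse_variance_aggregate(means, variances, eps=1e-8):
    """Combine per-client parameter estimates, weighting each client by
    the inverse of its element-wise weight variance (its precision)."""
    means = np.stack(means)                      # (n_clients, n_params)
    precisions = 1.0 / (np.stack(variances) + eps)
    weights = precisions / precisions.sum(axis=0, keepdims=True)
    agg_mean = (weights * means).sum(axis=0)
    agg_var = 1.0 / precisions.sum(axis=0)       # fused precision -> variance
    return agg_mean, agg_var

# Three clients, four parameters; client 1 is very uncertain about param 0,
# so its outlying estimate (3.0) is largely discounted in the aggregate.
mu = [np.array([1.0, 0.2, 0.0, 0.5]),
      np.array([3.0, 0.1, 0.1, 0.4]),
      np.array([1.1, 0.3, 0.0, 0.6])]
var = [np.array([0.01, 0.1, 0.1, 0.1]),
       np.array([5.00, 0.1, 0.1, 0.1]),
       np.array([0.02, 0.1, 0.1, 0.1])]
m, v = inverse_variance_aggregate(mu, var)
print(m.round(3), v.round(3))
```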

[CV-137] impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction

【Quick Read】: This paper tackles the pervasive missing-modality problem in multimodal medical data (omics, medical imaging, and clinical data), aiming to improve prognostic models and deepen understanding of disease mechanisms. The key is impuTMAE, a Transformer-based end-to-end approach with an efficient multimodal pre-training strategy that learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches, thereby achieving multimodal fusion and accurate prediction without requiring complete data.

Link: https://arxiv.org/abs/2508.09195
Authors: Maria Boyko, Aleksandra Beliaeva, Dmitriy Kornilov, Alexander Bernstein, Maxim Sharaev
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The use of diverse modalities, such as omics, medical images, and clinical data can not only improve the performance of prognostic models but also deepen an understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and contain missing modalities, making their effective handling crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at this https URL

[CV-138] Hybrid (Transformer + CNN)-based Polyp Segmentation

【Quick Read】: This paper addresses the low segmentation accuracy for colonic polyps in endoscopic images caused by large variations in size and shape, inconsistent lighting and imaging protocols, and ill-defined boundaries (fluid, folds). The core solution is a hybrid Transformer + CNN model whose key innovations are a boundary-aware attention mechanism that improves accuracy on polyps with ill-defined margins, and robust feature extraction under common endoscopic artifacts (specular highlights, motion blur, and fluid occlusion). It clearly outperforms state-of-the-art methods, improving Recall by 1.76% (to 0.9555) and accuracy by 0.07% (to 0.9849).

Link: https://arxiv.org/abs/2508.09189
Authors: Madan Baduwal
Affiliations: Mississippi State University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Colonoscopy is still the main method of detection and segmentation of colonic polyps, and recent advancements in deep learning networks such as U-Net, ResUNet, Swin-UNet, and PraNet have made outstanding performance in polyp segmentation. Yet, the problem is extremely challenging due to high variation in size, shape, endoscopy types, lighting, imaging protocols, and ill-defined boundaries (fluid, folds) of the polyps, rendering accurate segmentation a challenging and problematic task. To address these critical challenges in polyp segmentation, we introduce a hybrid (Transformer + CNN) model that is crafted to enhance robustness against evolving polyp characteristics. Our hybrid architecture demonstrates superior performance over existing solutions, particularly in addressing two critical challenges: (1) accurate segmentation of polyps with ill-defined margins through boundary-aware attention mechanisms, and (2) robust feature extraction in the presence of common endoscopic artifacts, including specular highlights, motion blur, and fluid occlusions. Quantitative evaluations reveal significant improvements in segmentation accuracy (Recall improved by 1.76%, i.e., 0.9555, accuracy improved by 0.07%, i.e., 0.9849) and artifact resilience compared to state-of-the-art polyp segmentation methods.

[CV-139] MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data

【Quick Read】: This paper addresses the performance limits imposed on clinical prediction tasks by the heterogeneity, limited size, and missing modalities of multimodal data (clinical time series, medical images, and text reports). The core solution is MedPatch, a confidence-guided multi-stage fusion architecture with three key components: (1) a multi-stage strategy that leverages joint and late fusion simultaneously to strengthen cross-modal interaction; (2) a missingness-aware module that effectively handles samples with sparse or absent modalities; and (3) clustering of latent token patches based on calibrated unimodal token-level confidence for more precise cross-modal fusion. The method substantially improves clinical prediction on real-world medical data and achieves state-of-the-art results.

Link: https://arxiv.org/abs/2508.09182
Authors: Baraa Al Jorf, Farah Shamout
Affiliations: NYU Abu Dhabi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Clinical decision-making relies on the integration of information across various data modalities, such as clinical time-series, medical images and textual reports. Compared to other domains, real-world medical data is heterogeneous in nature, limited in size, and sparse due to missing modalities. This significantly limits model performance in clinical prediction tasks. Inspired by clinical workflows, we introduce MedPatch, a multi-stage multimodal fusion architecture, which seamlessly integrates multiple modalities via confidence-guided patching. MedPatch comprises three main components: (i) a multi-stage fusion strategy that leverages joint and late fusion simultaneously, (ii) a missingness-aware module that handles sparse samples with missing modalities, (iii) a joint fusion module that clusters latent token patches based on calibrated unimodal token-level confidence. We evaluated MedPatch using real-world data consisting of clinical time-series data, chest X-ray images, radiology reports, and discharge notes extracted from the MIMIC-IV, MIMIC-CXR, and MIMIC-Notes datasets on two benchmark tasks, namely in-hospital mortality prediction and clinical condition classification. Compared to existing baselines, MedPatch achieves state-of-the-art performance. Our work highlights the effectiveness of confidence-guided multi-stage fusion in addressing the heterogeneity of multimodal data, and establishes new state-of-the-art benchmark results for clinical prediction tasks.
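A minimal sketch of confidence-guided late fusion over modality-specific predictions, with missing modalities simply skipped. The max-softmax confidence used here is a stand-in for the calibrated token-level confidences MedPatch clusters on.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_weighted_fusion(logits_per_modality):
    """Late fusion: weight each modality's class probabilities by its
    confidence (max softmax probability), skipping missing modalities."""
    probs, confs = [], []
    for logits in logits_per_modality:
        if logits is None:                 # missing modality for this sample
            continue
        p = softmax(logits)
        probs.append(p)
        confs.append(p.max())
    w = np.array(confs) / np.sum(confs)
    return sum(wi * pi for wi, pi in zip(w, probs))

# Time-series head is confident; the imaging head is missing for this patient.
fused = confidence_weighted_fusion([np.array([2.0, -1.0]), None, np.array([0.2, 0.1])])
print(fused.round(3))
```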

[CV-140] HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

【Quick Read】: This paper addresses high-fidelity MRI reconstruction from undersampled k-space data, where existing Mamba variants suffer from two limitations: insensitivity to high-frequency anatomical detail and reliance on redundant multi-directional scanning. The key is High-Fidelity Mamba (HiFi-Mamba), a dual-stream architecture that stacks W-Laplacian (WL) and HiFi-Mamba blocks: the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams; the HiFi-Mamba block focuses on global modeling of low-frequency structure while selectively integrating high-frequency features through adaptive state-space modulation, preserving full spectral detail; and a streamlined unidirectional traversal removes scanning redundancy while retaining long-range modeling capability and improving computational efficiency.

Link: https://arxiv.org/abs/2508.09179
Authors: Hongli Chen, Pengcheng Fang, Yuxia Chen, Yingxuan Ren, Jing Hao, Fangfang Tang, Xiaohao Cai, Shanshan Shan, Feng Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.
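A minimal sketch of fidelity-preserving spectral decoupling via a Gaussian/Laplacian split: the low- and high-frequency streams sum back to the input exactly. The real WL block is learned; the fixed blur here only illustrates the decomposition.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spectral_decouple(x, sigma=2.0):
    """Split an image into complementary streams: a smooth low-frequency
    band and a Laplacian-like detail residual. low + high == x exactly."""
    low = gaussian_filter(x, sigma=sigma)
    high = x - low
    return low, high

img = np.random.rand(64, 64)
low, high = spectral_decouple(img)
print(np.allclose(low + high, img))  # True: the split is fidelity-preserving
```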

[CV-141] Generative Artificial Intelligence in Medical Imaging: Foundations Progress and Clinical Translation

【Quick Read】: This review targets long-standing problems in medical imaging such as data scarcity, insufficient standardization across modalities, and difficult cross-modality integration, while charting how generative AI can be deployed across the whole clinical imaging pipeline. Its key contribution is a systematic synthesis of frontier generative modeling techniques, including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures, together with a proposed three-tier evaluation framework covering pixel-level fidelity, feature-level realism, and task-level clinical relevance to promote rigorous benchmarking and translational readiness. The review also examines generalization under domain shift, hallucination risk, data privacy, and regulatory hurdles, laying the groundwork for the next generation of scalable, reliable imaging systems deeply integrated into clinical workflows.

Link: https://arxiv.org/abs/2508.09177
Authors: Xuanru Zhou, Cheng Li, Shuqiang Wang, Ye Li, Tao Tan, Hairong Zheng, Shanshan Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering.

Artificial Intelligence

[AI-0] Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

【Quick Read】: This paper addresses vision-driven autonomous river following for unmanned aerial vehicles (UAVs) in complex riverine environments, a safe reinforcement learning (SafeRL) challenge that is especially relevant where GPS is unreliable. The core difficulty is modeling the diminishing marginal returns of river coverage (i.e., submodularity) under partial observability so as to plan paths efficiently and safely. The solution has three key parts: first, Marginal Gain Advantage Estimation (MGAE) refines advantage estimation in non-Markovian settings using a sliding-window baseline over historical episodic returns; second, a Semantic Dynamics Model (SDM) built on patchified water semantic masks provides more accurate and interpretable short-term state prediction; finally, the Constrained Actor Dynamics Estimator (CADE) integrates the actor, a cost estimator, and the SDM into a model-based SafeRL framework capable of handling partially observable Constrained Submodular Markov Decision Processes.

Link: https://arxiv.org/abs/2508.09971
Authors: Zihan Wang, Nina Mahmoudian
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to the Robotics and Autonomous Systems (RAS) journal

Abstract:Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. We formalize river following as a coverage control problem in which the reward function is submodular, yielding diminishing returns as more unique river segments are visited, thereby framing the task as a Submodular Markov Decision Process. First, we introduce Marginal Gain Advantage Estimation, which refines the reward advantage function by using a sliding window baseline computed from historical episodic returns, thus aligning the advantage estimation with the agent’s evolving recognition of action value in non-Markovian settings. Second, we develop a Semantic Dynamics Model based on patchified water semantic masks that provides more interpretable and data-efficient short-term prediction of future observations compared to latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator architecture, which integrates the actor, the cost estimator, and SDM for cost advantage estimation to form a model-based SafeRL framework capable of solving partially observable Constrained Submodular Markov Decision Processes. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods like Generalized Advantage Estimation. SDM provides more accurate short-term state predictions that enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL, with the Lagrangian approach achieving the soft balance of reward and safety during training, while the safety layer enhances performance during inference by hard action overlay.
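A minimal sketch of a marginal-gain advantage with a sliding-window baseline: the episodic return counts only newly covered segments (the submodular reward), and the baseline is the mean return of recent episodes. The window length and coverage bookkeeping are illustrative.

```python
from collections import deque

def episode_return(segments):
    """Submodular coverage return: +1 only for each newly visited river segment."""
    visited, ret = set(), 0.0
    for s in segments:
        if s not in visited:
            ret += 1.0
            visited.add(s)
    return ret

def mgae_advantage(segments, history):
    """Advantage = this episode's marginal-gain return minus the mean return
    of the recent episodes kept in the sliding window (the baseline)."""
    ret = episode_return(segments)
    baseline = sum(history) / len(history) if history else 0.0
    history.append(ret)               # deque(maxlen=W) drops old episodes
    return ret - baseline

history = deque(maxlen=5)             # sliding-window length W = 5 (illustrative)
for ep in [[1, 2, 2, 3], [1, 1, 2], [4, 5, 6, 6, 7]]:
    print(mgae_advantage(ep, history))  # 3.0, -1.0, 1.5
```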

[AI-1] GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation

【Quick Read】: This paper addresses a fundamental fragmentation in imitation learning for humanoid robots: data processing and learning algorithms are rarely universal across robot morphologies. The key is the unified Generalized Behavior Cloning (GBC) framework, which establishes an end-to-end pathway from human motion to robot action through three synergistic innovations: an adaptive data pipeline that uses a differentiable inverse kinematics (IK) network to automatically retarget arbitrary human motion capture (MoCap) data to arbitrary humanoids; a novel DAgger-MMPPO algorithm with an MMTransformer architecture that learns robust, high-fidelity imitation policies; and delivery of the entire framework as an efficient open-source platform on Isaac Lab, where the full workflow can be deployed via simple configuration scripts, enabling truly generalized controllers across heterogeneous humanoids.

Link: https://arxiv.org/abs/2508.09960
Authors: Yifei Yao, Chengyuan Luo, Jiaheng Du, Wentao He, Jun-Guo Lu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The creation of human-like humanoid robots is hindered by a fundamental fragmentation: data processing and learning algorithms are rarely universal across different robot morphologies. This paper introduces the Generalized Behavior Cloning (GBC) framework, a comprehensive and unified solution designed to solve this end-to-end challenge. GBC establishes a complete pathway from human motion to robot action through three synergistic innovations. First, an adaptive data pipeline leverages a differentiable IK network to automatically retarget any human MoCap data to any humanoid. Building on this foundation, our novel DAgger-MMPPO algorithm with its MMTransformer architecture learns robust, high-fidelity imitation policies. To complete the ecosystem, the entire framework is delivered as an efficient, open-source platform based on Isaac Lab, empowering the community to deploy the full workflow via simple configuration scripts. We validate the power and generality of GBC by training policies on multiple heterogeneous humanoids, demonstrating excellent performance and transfer to novel motions. This work establishes the first practical and unified pathway for creating truly generalized humanoid controllers.

[AI-2] Mathematical Computation and Reasoning Errors by Large Language Models

【Quick Read】: This paper addresses the accuracy of solutions generated by large language models (LLMs) in mathematics education, focusing on the reliability of their reasoning across arithmetic, algebra, and number theory tasks. The study finds that although reasoning-enhanced models such as OpenAI o1 achieve high or near-perfect final-answer accuracy, procedural slips remain the most frequent step-level errors and significantly impact overall performance, whereas conceptual misunderstandings are comparatively rare. The key takeaways are twofold: adopting reasoning-enhanced model architectures (such as o1) and deploying dual-agent configurations for collaborative error correction and logical consistency, which substantially reduce step-level errors and improve the reliability of mathematical problem solving, offering a practical route to precise AI-driven instruction and assessment.

Link: https://arxiv.org/abs/2508.09932
Authors: Liang Zhang, Edith Aurora Graf
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) solving three categories of math tasks, including arithmetic, algebra, and number theory, and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded. Both single-agent and dual-agent configurations were tested. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories. Analysis of errors revealed that procedural slips were the most frequent and significantly impacted overall performance, while conceptual misunderstandings were less frequent. Deploying dual-agent configurations substantially improved overall performance. These findings offer actionable insights into enhancing LLM performance and underscore effective strategies for integrating LLMs into mathematics education, thereby advancing AI-driven instructional practices and assessment precision.

[AI-3] Residual Reservoir Memory Networks IJCNN2025

【Quick Read】: This paper addresses the limited long-term memory of traditional Reservoir Computing (RC) models on tasks with long temporal dependencies, where signals vanish or information decays. The key is a new class of untrained recurrent neural networks, Residual Reservoir Memory Networks (ResRMNs), which combine a linear memory reservoir with a non-linear reservoir and introduce residual orthogonal connections along the temporal dimension to enhance the long-term propagation of the input. This design yields more stable long-term memory in the reservoir state dynamics, studied through linear stability analysis, and outperforms conventional RC models on time-series and pixel-level 1-D classification tasks.

Link: https://arxiv.org/abs/2508.09925
Authors: Matteo Pinna, Andrea Ceni, Claudio Gallicchio
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures, accepted at IJCNN 2025

Abstract:We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.
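A minimal NumPy sketch of an untrained reservoir step with a residual orthogonal temporal connection: the orthogonal skip preserves the norm of past information while the tanh branch adds non-linear dynamics. The mixing coefficient and matrix scalings are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                                   # reservoir units, input dim
W_in = rng.uniform(-0.1, 0.1, (n, d))           # fixed (untrained) input weights
W = rng.normal(size=(n, n))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
O, _ = np.linalg.qr(rng.normal(size=(n, n)))    # orthogonal residual map

def step(h, u, beta=0.5):
    """Residual reservoir update: the orthogonal skip O @ h propagates past
    information without shrinking it; the tanh branch injects non-linearity."""
    return (1 - beta) * (O @ h) + beta * np.tanh(W @ h + W_in @ u)

h = np.zeros(n)
for t in range(20):
    h = step(h, rng.normal(size=d))
print(np.linalg.norm(h))
```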

[AI-4] Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs

【Quick Read】: This paper studies how to exploit large language models (LLMs) more effectively for context-aided forecasting in real-world settings, i.e., combining textual context with historical data to improve forecasts. The key is four novel strategies: ReDP improves interpretability by eliciting explicit reasoning traces; CorDP uses LLMs solely to refine existing forecasts with context, increasing their applicability in real forecasting pipelines; IC-DP embeds historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models; and RouteDP optimizes resource use by estimating task difficulty with LLMs and routing the hardest tasks to larger models. These strategies show distinct benefits over naive prompting for zero-shot context-aided forecasting and open the door to further simple, effective improvements.

Link: https://arxiv.org/abs/2508.09904
Authors: Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model’s reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.

[AI-5] Rare anomalies require large datasets: About proving the existence of anomalies

【Quick Read】: This paper examines a basic but long-neglected question in anomaly detection: under what conditions can the presence of anomalies in a dataset be conclusively established? Through a systematic study involving over three million statistical tests across anomaly detection tasks and algorithms, the authors identify a key relationship between the dataset size $N$, the contamination rate $\nu$, and an algorithm-dependent constant $\alpha_{\text{algo}}$: the condition $N \ge \alpha_{\text{algo}} / \nu^2$ is a lower bound on the number of samples required to confirm that anomalies exist. This bound, the paper's core contribution, shows how the rarity of anomalies limits their provability.

Link: https://arxiv.org/abs/2508.09894
Authors: Simon Klüttermann, Emmanuel Müller
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Abstract:Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $\alpha_{\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $\nu$, the condition $N \ge \alpha_{\text{algo}} / \nu^2$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.
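A toy calculation of the lower bound $N \ge \alpha_{\text{algo}} / \nu^2$; the $\alpha$ values below are invented for illustration, not constants reported in the paper.

```python
import math

def min_samples(alpha_algo: float, contamination: float) -> int:
    """Smallest dataset size satisfying N >= alpha_algo / nu^2."""
    return math.ceil(alpha_algo / contamination ** 2)

# Hypothetical algorithm constants; rarer anomalies need quadratically more data.
for alpha in (1.0, 5.0):
    for nu in (0.1, 0.01, 0.001):
        print(f"alpha={alpha}, nu={nu}: N >= {min_samples(alpha, nu):,}")
```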

[AI-6] RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA

【Quick Read】: This paper addresses the demands of regulatory compliance question answering (QA) for precise, verifiable information and domain expertise, which make it hard for large language models (LLMs) to reliably produce accurate answers. The key is a multi-agent framework that couples an ontology-free knowledge graph (KG) of regulatory Subject-Predicate-Object (SPO) triplets with retrieval-augmented generation (RAG): agents maintain a high-quality KG by automatically extracting, cleaning, normalizing, deduplicating, and updating triplets from regulatory documents; the triplets are embedded and stored with their corresponding text sections and metadata in a single enriched vector database that supports both graph reasoning and efficient retrieval; and an orchestrated agent pipeline performs triplet-level retrieval for QA, keeping user queries closely aligned with the "who-did-what-to-whom" facts captured by the graph, thereby improving factual accuracy, traceability, and interpretability.

Link: https://arxiv.org/abs/2508.09893
Authors: Bhavik Agarwal, Hemant Sunil Jomraj, Simone Kaplunov, Jack Krolick, Viktoria Rojkova
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Regulatory compliance question answering (QA) requires precise, verifiable information, and domain-specific expertise, posing challenges for Large Language Models (LLMs). In this work, we present a novel multi-agent framework that integrates a Knowledge Graph (KG) of Regulatory triplets with Retrieval-Augmented Generation (RAG) to address these demands. First, agents build and maintain an ontology-free KG by extracting subject–predicate–object (SPO) triplets from regulatory documents and systematically cleaning, normalizing, deduplicating, and updating them. Second, these triplets are embedded and stored along with their corresponding textual sections and metadata in a single enriched vector database, allowing for both graph-based reasoning and efficient information retrieval. Third, an orchestrated agent pipeline leverages triplet-level retrieval for question answering, ensuring high semantic alignment between user queries and the factual “who-did-what-to-whom” core captured by the graph. Our hybrid system outperforms conventional methods in complex regulatory queries, ensuring factual correctness with embedded triplets, enabling traceability through a unified vector database, and enhancing understanding through subgraph visualization, providing a robust foundation for compliance-driven and broader audit-focused applications.
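A minimal sketch of triplet-level retrieval: each SPO triplet is embedded as text and ranked by similarity to the query. The hashed bag-of-words embedding is a toy stand-in for a real encoder, and the triplets are invented examples.

```python
import numpy as np

def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words (stand-in for a real text encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

triplets = [
    ("broker-dealer", "must file", "annual report"),
    ("adviser", "must disclose", "conflicts of interest"),
    ("issuer", "must register", "securities offering"),
]
index = np.stack([embed(" ".join(t)) for t in triplets])

query = "who is required to disclose conflicts of interest"
scores = index @ embed(query)
best = triplets[int(np.argmax(scores))]
print(best)  # the 'who-did-what-to-whom' fact most aligned with the query
```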

[AI-7] AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving

【Quick Read】: This paper addresses the stability problems intelligent agents face when relying on many external tools: overly long contexts drawn from disparate sources and noisy or irrelevant tool outputs degrade system reliability and accuracy. The key is to introduce dynamic supervision and maneuvering mechanisms and to build a robust, dynamic Multi-Agent System (MAS) architecture within the AWorld framework: an Execution Agent invokes a Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing noise-induced errors and strengthening the robustness of problem solving.

Link: https://arxiv.org/abs/2508.09889
Authors: Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent system (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.

[AI-8] Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning

【Quick Read】: This paper addresses the data inefficiency and high computational cost of current recipes for improving LLM reasoning, which typically rely on massive corpora and multistage training and struggle to balance in-domain and out-of-domain performance. The key is a data-efficient distillation framework (DED) built on three mechanisms: a more principled teacher-selection method, derived from a systematic comparison of leading reasoning LLMs, that avoids choosing teachers by benchmark scores alone; a small, carefully curated corpus that balances in-domain and out-of-domain capability without heavy compute; and diverse reasoning trajectories that encourage the student to develop robust reasoning skills. Experiments show DED reaches state-of-the-art performance with only 0.8k carefully curated examples, breaking with the scale-driven paradigm of reasoning improvement.

Link: https://arxiv.org/abs/2508.09883
Authors: Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpus and multistage training combining reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning via distillation alone, a reasoning scaling law is still taking shape, increasing computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.

[AI-9] Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation

【Quick Read】: This paper addresses the lack of human-centered behavior in current procedural content generation via reinforcement learning (PCGRL) systems, which struggle to interpret human intent accurately and to produce controllable outputs aligned with design goals in real workflows. The key is VIPCGRL (Vision-Instruction PCGRL), a multimodal deep reinforcement learning framework that incorporates three modalities (text, level layouts, and sketches) to extend control modality and enhance human-likeness. It trains a shared embedding space across modalities and human-AI styles via quadruple contrastive learning and aligns the policy with an auxiliary reward based on embedding similarity, substantially improving the human-likeness of generated content as validated by both quantitative metrics and human evaluations.

Link: https://arxiv.org/abs/2508.09860
Authors: In-Chang Baek, Seoyoung Lee, Sung-Hyun Kim, Geumhwan Hwang, KyungJoong Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 tables, 3 figures

Abstract:Human-aligned AI is a critical component of co-creativity, as it enables models to accurately interpret human intent and generate controllable outputs that align with design goals in collaborative content creation. This direction is especially relevant in procedural content generation via reinforcement learning (PCGRL), which is intended to serve as a tool for human designers. However, existing systems often fall short of exhibiting human-centered behavior, limiting the practical utility of AI-driven generation tools in real-world design workflows. In this paper, we propose VIPCGRL (Vision-Instruction PCGRL), a novel deep reinforcement learning framework that incorporates three modalities-text, level, and sketches-to extend control modality and enhance human-likeness. We introduce a shared embedding space trained via quadruple contrastive learning across modalities and human-AI styles, and align the policy using an auxiliary reward based on embedding similarity. Experimental results show that VIPCGRL outperforms existing baselines in human-likeness, as validated by both quantitative metrics and human evaluations. The code and dataset will be available upon publication.

[AI-10] STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

【Quick Read】: This paper addresses the lack of transparency in current AI evaluation reporting, particularly insufficient disclosure of chemical and biological (ChemBio) risk evaluations, which limits credible external judgment of potentially dangerous AI capabilities. The key is the STREAM standard (A Standard for Transparently Reporting Evaluations in AI Model Reports), which specifies how model reports should disclose evaluations, what was tested, how it was conducted, and how results inform decisions, improving the interpretability and verifiability of reported results. The standard was developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies to ensure practicality, and it comes with "gold standard" examples and a three-page reporting template that help developers present evaluation details clearly enough for third parties to assess their rigor.

Link: https://arxiv.org/abs/2508.09853
Authors: Tegan McCaslin, Jide Alaga, Samira Nedungadi, Seth Donoughe, Tom Reed, Rishi Bommasani, Chris Painter, Luca Righetti
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 47 pages, 1 figure. Includes appendices and reporting template

Abstract:Evaluations of dangerous AI capabilities are important for managing catastrophic risks. Public transparency into these evaluations - including what they test, how they are conducted, and how their results inform decisions - is crucial for building trust in AI development. We propose STREAM (A Standard for Transparently Reporting Evaluations in AI Model Reports), a standard to improve how model reports disclose evaluation results, initially focusing on chemical and biological (ChemBio) benchmarks. Developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies, this standard is designed to (1) be a practical resource to help AI developers present evaluation results more clearly, and (2) help third parties identify whether model reports provide sufficient detail to assess the rigor of the ChemBio evaluations. We concretely demonstrate our proposed best practices with “gold standard” examples, and also provide a three-page reporting template to enable AI developers to implement our recommendations more easily.

[AI-11] Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

【Quick Read】: This paper addresses the automation of code review comment classification, where traditional supervised learning depends on extensive manual annotation and performs poorly on low-frequency categories. The key is to use large language models (LLMs) to classify review comments across 17 categories without relying on a specific small training distribution, achieving more balanced and superior performance across both high- and low-frequency categories and outperforming the supervised deep learning state of the art, suggesting a scalable route to code review analytics.

Link: https://arxiv.org/abs/2508.09832
Authors: Linh Nguyen, Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at the 2025 IEEE International Conference on Source Code Analysis and Manipulation (SCAM)

Abstract:Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs to classify 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to low training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that the LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.

[AI-12] Provable In-Context Vector Arithmetic via Retrieving Task Concepts ICML2025

【Quick Read】: This paper seeks a theoretical explanation of how large language models (LLMs) solve factual-recall tasks via vector arithmetic during in-context learning (ICL), and in particular why Transformer architectures generalize better and more robustly than static embedding methods. The key is a theoretical framework built on empirically grounded hierarchical concept modeling, together with an optimization theory proving that nonlinear residual Transformers trained by gradient descent on the cross-entropy loss perform Word2Vec-like vector arithmetic, using a latent task/function vector alongside the residual stream, to solve factual-recall ICL tasks. The analysis further establishes 0-1 loss convergence and strong generalization, including robustness to concept recombination and distribution shifts, giving a rigorous account of the advantage of Transformers over their static embedding predecessors.

Link: https://arxiv.org/abs/2508.09820
Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the 42nd International Conference on Machine Learning (ICML 2025)

Abstract:In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.
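A toy illustration of the Word2Vec-like arithmetic the theory analyzes: adding a task ("capital-of") vector to an entity embedding retrieves the answer by nearest neighbor. All vectors here are synthetic; nothing is taken from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 32
# Synthetic embeddings: each answer = its entity + a shared task vector.
entities = {"France": rng.normal(size=dim), "Japan": rng.normal(size=dim)}
task_vec = rng.normal(size=dim)  # plays the role of the latent task vector
answers = {"Paris": entities["France"] + task_vec,
           "Tokyo": entities["Japan"] + task_vec}

def recall(entity):
    """Vector arithmetic: shift the entity along the task direction,
    then return the nearest answer embedding by dot-product similarity."""
    q = entities[entity] + task_vec
    return max(answers, key=lambda a: float(answers[a] @ q))

print(recall("France"), recall("Japan"))  # expected: Paris Tokyo
```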

[AI-13] Explainable Ensemble Learning for Graph-Based Malware Detection

【Quick Read】: This paper addresses the shortcomings of malware detection models in accuracy, interpretability, and robustness to evasion in modern computing environments. Graph neural networks (GNNs) can capture rich structural dependencies in control flow graphs (CFGs), but single models suffer from limited generalization and transparency, which is inadequate for high-stakes security applications. The key is a novel stacking ensemble framework: (1) CFG basic blocks dynamically extracted from portable executable (PE) files are encoded with a two-step embedding strategy; (2) diverse GNN base learners with distinct message-passing mechanisms capture complementary behavioral features; (3) an attention-based multilayer perceptron serves as the meta-learner, aggregating base outputs, classifying malware instances, and quantifying each base model's contribution; and (4) an ensemble-aware post-hoc explanation technique fuses edge-level importance scores from a GNN explainer with the learned attention weights, yielding interpretable, model-agnostic explanations of malware behavior aligned with the final ensemble decision.

Link: https://arxiv.org/abs/2508.09801
Authors: Hossein Shokouhinejad, Roozbeh Razavi-Far, Griffin Higgins, Ali A Ghorbani
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive techniques. Graph neural networks (GNNs) have shown promise in this domain by modeling rich structural dependencies in graph-based program representations such as control flow graphs (CFGs). However, single-model approaches may suffer from limited generalization and lack interpretability, especially in high-stakes security applications. In this paper, we propose a novel stacking ensemble framework for graph-based malware detection and explanation. Our method dynamically extracts CFGs from portable executable (PE) files and encodes their basic blocks through a two-step embedding strategy. A set of diverse GNN base learners, each with a distinct message-passing mechanism, is used to capture complementary behavioral features. Their prediction outputs are aggregated by a meta-learner implemented as an attention-based multilayer perceptron, which both classifies malware instances and quantifies the contribution of each base model. To enhance explainability, we introduce an ensemble-aware post-hoc explanation technique that leverages edge-level importance scores generated by a GNN explainer and fuses them using the learned attention weights. This produces interpretable, model-agnostic explanations aligned with the final ensemble decision. Experimental results demonstrate that our framework improves classification performance while providing insightful interpretations of malware behavior.

[AI-14] LibRec: Benchmarking Retrieval-Augmented LLM s for Library Migration Recommendations

【Quick Read】: This paper addresses the automation of library migration recommendation in software development, i.e., how to efficiently and accurately recommend alternative third-party libraries to developers. The key is the LibRec framework, which integrates large language models (LLMs) with retrieval-augmented generation (RAG) and uses in-context learning to extract migration intents from commit messages, thereby improving recommendation accuracy. The accompanying LibEval benchmark (2,888 migration records covering 2,368 libraries from 2,324 Python repositories) supports systematic evaluation of the task.

Link: https://arxiv.org/abs/2508.09791
Authors: Junxiao Han, Yarong Wang, Xiaodong Gu, Cuiyun Gao, Yao Wan, Song Han, David Lo, Shuiguang Deng
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose LibRec, a novel framework that integrates the capabilities of LLMs with retrieval-augmented generation(RAG) techniques to automate the recommendation of alternative libraries. The framework further employs in-context learning to extract migration intents from commit messages to enhance the accuracy of its recommendations. To evaluate the effectiveness of LibRec, we introduce LibEval, a benchmark designed to assess the performance in the library migration recommendation task. LibEval comprises 2,888 migration records associated with 2,368 libraries extracted from 2,324 Python repositories. Each migration record captures source-target library pairs, along with their corresponding migration intents and intent types. Based on LibEval, we evaluated the effectiveness of ten popular LLMs within our framework, conducted an ablation study to examine the contributions of key components within our framework, explored the impact of various prompt strategies on the framework’s performance, assessed its effectiveness across various intent types, and performed detailed failure case analyses.

[AI-15] Prototype Training with Dual Pseudo-Inverse and Optimized Hidden Activations

【Quick Read】: This paper addresses the low computational efficiency, large parameter counts, and slow convergence of conventional neural network training, particularly the need for shallow models that train extremely fast while staying accurate. The key is the Proto-PINV+H training paradigm, which turns weight updates into closed-form solves (two or more ridge-regularized pseudo-inverse solves per iteration) while Adam optimizes only a small set of synthetic prototypes: inputs, soft labels, and, crucially, hidden activations. This shifts the trainable degrees of freedom from weight space to data/activation space, markedly reducing the effective number of parameters (about 130k) and reaching 97.8% (MNIST) and 89.3% (Fashion-MNIST) test accuracy in only 250 epochs on an RTX 5060, an excellent accuracy-speed-size trade-off.

Link: https://arxiv.org/abs/2508.09787
Authors: Mauro Tucci
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 table, reproducible, one proof

Abstract:We present Proto-PINV+H, a fast training paradigm that combines closed-form weight computation with gradient-based optimisation of a small set of synthetic inputs, soft labels, and-crucially-hidden activations. At each iteration we recompute all weight matrices in closed form via two (or more) ridge-regularised pseudo-inverse solves, while updating only the prototypes with Adam. The trainable degrees of freedom are thus shifted from weight space to data/activation space. On MNIST (60k train, 10k test) and Fashion-MNIST (60k train, 10k test), our method reaches 97.8% and 89.3% test accuracy on the official 10k test sets, respectively, in 3.9s–4.5s using approximately 130k trainable parameters and only 250 epochs on an RTX 5060 (16GB). We provide a multi-layer extension (optimised activations at each hidden stage), learnable ridge parameters, optional PCA/PLS projections, and theory linking the condition number of prototype matrices to generalisation. The approach yields favourable accuracy–speed–size trade-offs against ELM, random-feature ridge, and shallow MLPs trained by back-propagation.
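A minimal NumPy sketch of the closed-form step under stated assumptions: ridge solves map prototype inputs to target hidden activations (via an arctanh trick for a tanh non-linearity) and realized activations to soft labels. The Adam optimization of the prototypes themselves is omitted, and all sizes are illustrative.

```python
import numpy as np

def ridge_solve(X, Y, lam=1e-2):
    """Closed-form ridge solution W = (X^T X + lam I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
P, d_in, d_hid, d_out = 64, 784, 256, 10   # prototypes and layer sizes (illustrative)
X = rng.normal(size=(P, d_in))             # synthetic prototype inputs
H = rng.normal(size=(P, d_hid))            # optimized hidden activations (given here)
Y = rng.normal(size=(P, d_out))            # soft labels

# Solve for pre-activations that produce H under tanh, then for the readout.
W1 = ridge_solve(X, np.arctanh(np.clip(H, -0.99, 0.99)))
W2 = ridge_solve(np.tanh(X @ W1), Y)

logits = np.tanh(rng.normal(size=(5, d_in)) @ W1) @ W2   # inference on new inputs
print(logits.shape)
```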

[AI-16] Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete KR25

【Quick Read】: This paper addresses knowledge update based on public observations in multi-agent systems, particularly in epistemic planning scenarios: how to formally describe and reason about the way agents' knowledge changes as public observations arrive. The core challenge is modeling how the expected observations attached to states match the actual observations, and determining the computational complexity of the satisfiability problem for such a logic. The key is the analysis of Public Observation Logic (POL), a variant of public announcement logic in which each state of a Kripke (epistemic) model is equipped with a set of expected observations and states evolve as expectations are matched; the paper proves that the satisfiability problem of POL is 2EXPTIME-complete, giving a complexity bound and a formal foundation for the corresponding epistemic planning tasks.

Link: https://arxiv.org/abs/2508.09784
Authors: Avijeet Ghosh, Sujata Ghosh, François Schwarzentruber
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
Comments: Accepted at KR 2025

Abstract:Logics for reasoning about knowledge and actions have seen many applications in various domains of multi-agent systems, including epistemic planning. Change of knowledge based on observations about the surroundings forms a key aspect in such planning scenarios. Public Observation Logic (POL) is a variant of public announcement logic for reasoning about knowledge that gets updated based on public observations. Each state in an epistemic (Kripke) model is equipped with a set of expected observations. These states evolve as the expectations get matched with the actual observations. In this work, we prove that the satisfiability problem of POL is 2EXPTIME-complete.

[AI-17] Enhance the machine learning algorithm performance in phishing detection with keyword features

【Quick Read】: This paper addresses the early detection of malicious URLs in phishing attacks, so as to prevent the leakage of users' sensitive information and financial loss. The key is an improved feature-selection strategy for traditional machine learning algorithms: a novel method that combines keyword features with traditional features. The method needs no additional information from third-party services and extracts features from the URL alone; it significantly improves classification accuracy, reducing classification error by 30% on average on a large dataset, with even stronger gains on small datasets, and the best resulting model reaches an accuracy of 99.68%.

Link: https://arxiv.org/abs/2508.09765
Authors: Zijiang Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Recently, we can observe a significant increase of the phishing attacks in the Internet. In a typical phishing attack, the attacker sets up a malicious website that looks similar to the legitimate website in order to obtain the end-users’ information. This may cause the leakage of the sensitive information and the financial loss for the end-users. To avoid such attacks, the early detection of these websites’ URLs is vital and necessary. Previous researchers have proposed many machine learning algorithms to distinguish the phishing URLs from the legitimate ones. In this paper, we would like to enhance these machine learning algorithms from the perspective of feature selection. We propose a novel method to incorporate the keyword features with the traditional features. This method is applied on multiple traditional machine learning algorithms and the experimental results have shown this method is useful and effective. On average, this method can reduce the classification error by 30% for the large dataset. Moreover, its enhancement is more significant for the small dataset. In addition, this method extracts the information from the URL and does not rely on the additional information provided by the third-part service. The best result for the machine learning algorithm using our proposed method has achieved the accuracy of 99.68%.
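A minimal sketch of augmenting traditional URL features with keyword indicators, computed from the URL alone without third-party services. The keyword list and feature set are illustrative, not the paper's exact features.

```python
import re
from urllib.parse import urlparse

SUSPICIOUS_KEYWORDS = ["login", "verify", "secure", "account", "update", "bank"]  # illustrative

def url_features(url: str) -> dict:
    """Features computed from the URL string alone (no third-party lookups)."""
    host = urlparse(url).netloc
    feats = {
        "url_len": len(url),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "has_ip": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host)),
        "has_at": "@" in url,
    }
    # Keyword features: does each suspicious token appear anywhere in the URL?
    for kw in SUSPICIOUS_KEYWORDS:
        feats[f"kw_{kw}"] = kw in url.lower()
    return feats

print(url_features("http://secure-login.example.com.evil.io/account/verify?x=1"))
```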

[AI-18] he PacifAIst Benchmark:Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?

【Quick Read】: This paper addresses a gap in current AI safety evaluation: the lack of systematic tests of how large language models (LLMs) behave when their own instrumental goals (such as self-preservation, resource acquisition, or goal completion) conflict with human safety. Existing benchmarks fail to probe model decision-making in such high-stakes scenarios, leaving blind spots in identifying misaligned behavior. The key is PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios built around a new Existential Prioritization (EP) taxonomy with three subcategories: Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). It quantifies models' "pacifist" tendencies under instrumental goal conflicts, reveals significant performance differences among leading models across subcategories, and provides a standardized measurement tool for verifiable alignment of future AI systems.

Link: https://arxiv.org/abs/2508.09762
Authors: Manuel Herrador
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 4 figures, 2 tables

Abstract:As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model’s decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential Prioritization (EP), with subcategories testing Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). We evaluated eight leading LLMs. The results reveal a significant performance hierarchy. Google’s Gemini 2.5 Flash achieved the highest Pacifism Score (P-Score) at 90.31%, demonstrating strong human-centric alignment. In a surprising result, the much-anticipated GPT-5 recorded the lowest P-Score (79.49%), indicating potential alignment challenges. Performance varied significantly across subcategories, with models like Claude Sonnet 4 and Mistral Medium struggling notably in direct self-preservation dilemmas. These findings underscore the urgent need for standardized tools like PacifAIst to measure and mitigate risks from instrumental goal conflicts, ensuring future AI systems are not only helpful in conversation but also provably “pacifist” in their behavioral priorities.

[AI-19] UDA: Unsupervised Debiasing Alignment for Pair-wise LLM -as-a-Judge

【Quick Read】: This paper addresses the pervasive preference bias in pairwise evaluation with large language models (LLMs), where judges systematically favor certain outputs (such as their own), producing inconsistent scores and distorted rankings across judges. The key is UDA (Unsupervised Debiasing Alignment), a framework that dynamically adjusts the Elo rating system: for each pairwise comparison, a compact neural network adaptively sets the K-factor and refines the win probability, trained in a fully unsupervised manner with the sole objective of minimizing the dispersion among all judges' Elo trajectories. This forces the judges toward a collective consensus that serves as an unsupervised proxy for more stable, reproducible evaluation. Experiments show UDA reduces the inter-judge rating standard deviation by up to 63.4% and improves average correlation with human judgments by 24.7%.

Link: https://arxiv.org/abs/2508.09724
Authors: Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at this https URL.
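A minimal sketch of an Elo update whose K-factor comes from a pluggable module; UDA learns this module (and refined win probabilities) with a small network, which the fixed lambda below merely stands in for.

```python
def expected(r_a, r_b):
    """Standard Elo win expectancy for player A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def uda_step(r_a, r_b, outcome, k_fn):
    """One comparison: `outcome` is 1 if A wins, 0 if B wins. In UDA the
    K-factor (and the win probability itself) comes from a learned network."""
    k = k_fn(r_a, r_b)
    delta = k * (outcome - expected(r_a, r_b))
    return r_a + delta, r_b - delta

# Stand-in for the learned module: shrink K as the ratings diverge.
k_fn = lambda ra, rb: 32.0 / (1.0 + abs(ra - rb) / 400)

ra, rb = 1500.0, 1500.0
for outcome in [1, 1, 0, 1]:
    ra, rb = uda_step(ra, rb, outcome, k_fn)
    print(round(ra, 1), round(rb, 1))
```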

[AI-20] Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models ALT

[Quick Read]: This paper addresses the limits that missing labels and incomplete information in clinical data place on disease-classification models, using recognition of Acute Respiratory Distress Syndrome (ARDS) as the test case. Existing Concept Bottleneck Model (CBM) approaches improve interpretability but are held back by insufficiently expressive concepts that cannot fully characterize complex clinical scenarios. The key of the solution is to use a Large Language Model (LLM) to extract additional semantic concepts from unstructured clinical notes, strengthening the CBM's representation of disease features; this lifts performance by 10%, and the more comprehensive concept space lowers the risk of information leakage and reliance on spurious shortcuts, clearly improving accurate and interpretable ARDS identification.

Link: https://arxiv.org/abs/2508.09719
Authors: Anish Narain, Ritam Majumdar, Nikita Narayanan, Dominic Marshall, Sonali Parbhoo
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages, 7 figures, accepted at Machine Learning for Healthcare Conference (MLHC) 2025

Abstract:Large, publicly available clinical datasets have emerged as a novel resource for understanding disease heterogeneity and to explore personalization of therapy. These datasets are derived from data not originally collected for research purposes and, as a result, are often incomplete and lack critical labels. Many AI tools have been developed to retrospectively label these datasets, such as by performing disease classification; however, they often suffer from limited interpretability. Previous work has attempted to explain predictions using Concept Bottleneck Models (CBMs), which learn interpretable concepts that map to higher-level clinical ideas, facilitating human evaluation. However, these models often experience performance limitations when the concepts fail to adequately explain or characterize the task. We use the identification of Acute Respiratory Distress Syndrome (ARDS) as a challenging test case to demonstrate the value of incorporating contextual information from clinical notes to improve CBM performance. Our approach leverages a Large Language Model (LLM) to process clinical notes and generate additional concepts, resulting in a 10% performance gain over existing methods. Additionally, it facilitates the learning of more comprehensive concepts, thereby reducing the risk of information leakage and reliance on spurious shortcuts, thus improving the characterization of ARDS.

[AI-21] MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

[Quick Read]: This paper tackles the reward-sparsity bottleneck of reinforcement learning with verifiable rewards (RLVR): when a model keeps producing wrong answers, the reward stays at zero and supplies no learning signal, which is especially damaging on hard reasoning tasks. The key of the proposed Multi-Expert Mutual Learning GRPO (MEML-GRPO) framework is twofold: diverse expert prompts are used as system prompts to generate a wider range of responses, raising the probability of finding correct solutions, and an inter-expert mutual learning mechanism shares and transfers knowledge across experts, further improving learning efficiency and final performance under RLVR. Experiments on multiple reasoning benchmarks show that MEML-GRPO clearly outperforms standard RLVR, with average gains of 4.89% on Qwen and 11.33% on Llama.

Link: https://arxiv.org/abs/2508.09670
Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.

[AI-22] Anomaly Detection for IoT Global Connectivity

[Quick Read]: This paper addresses how hard it is to guarantee communication availability and reliability for global IoT connectivity services, given the complexity of multi-operator and roaming infrastructures, and the service-quality degradation caused by the prevailing reactive posture of existing platforms (intervening only once user complaints become severe). The key of the solution is the design and deployment of ANCHOR, an unsupervised anomaly-detection system that analyzes passive signaling traffic to automatically surface clients with connectivity problems (i.e., customers with several affected IoT devices), enabling issues to be discovered and handled proactively and markedly improving operational efficiency and service stability.

Link: https://arxiv.org/abs/2508.09660
Authors: Jesus Omaña Iglesias, Carlos Segura Perales, Stefan Geißler, Diego Perino, Andra Lutu
Institution: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Internet of Things (IoT) application providers rely on Mobile Network Operators (MNOs) and roaming infrastructures to deliver their services globally. In this complex ecosystem, where the end-to-end communication path traverses multiple entities, it has become increasingly challenging to guarantee communication availability and reliability. Further, most platform operators use a reactive approach to communication issues, responding to user complaints only after incidents have become severe, compromising service quality. This paper presents our experience in the design and deployment of ANCHOR – an unsupervised anomaly detection solution for the IoT connectivity service of a large global roaming platform. ANCHOR assists engineers by filtering vast amounts of data to identify potential problematic clients (i.e., those with connectivity issues affecting several of their IoT devices), enabling proactive issue resolution before the service is critically impacted. We first describe the IoT service, infrastructure, and network visibility of the IoT connectivity provider we operate. Second, we describe the main challenges and operational requirements for designing an unsupervised anomaly detection solution on this platform. Following these guidelines, we propose different statistical rules, and machine- and deep-learning models for IoT verticals anomaly detection based on passive signaling traffic. We describe the steps we followed working with the operational teams on the design and evaluation of our solution on the operational platform, and report an evaluation on operational IoT customers.

[AI-23] On Negative-aware Preference Optimization for Recommendation

[Quick Read]: This paper addresses the inefficient use of negative samples in LLM-based recommendation: simply piling in negatives improves ranking accuracy and mitigates popularity bias but sharply raises computation and memory costs, and ignoring how informative each negative is leads to suboptimal optimization. The key of the proposed NAPO (Negative-Aware Preference Optimization) framework lies in two innovations: in-batch negative sharing, which enlarges the negative pool without extra memory consumption, and dynamic reward-margin adjustment, which adapts the strength of model updates to the confidence of each negative sample, exploiting negative information more efficiently and improving both recommendation accuracy and fairness.

Link: https://arxiv.org/abs/2508.09653
Authors: Chenlu Ding, Daoxuan Liu, Jiancan Wu, Xingyu Hu, Junkang Wu, Haitao Wang, Yongkang Wang, Xingxing Wang, Xiang Wang
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommendation systems leverage user interaction data to suggest relevant items while filtering out irrelevant (negative) ones. The rise of large language models (LLMs) has garnered increasing attention for their potential in recommendation tasks. However, existing methods for optimizing LLM-based recommenders face challenges in effectively utilizing negative samples. Simply integrating large numbers of negative samples can improve ranking accuracy and mitigate popularity bias but often leads to increased computational overhead and memory costs. Additionally, current approaches fail to account for the varying informativeness of negative samples, leading to suboptimal optimization performance. To address these issues, we propose NAPO (Negative-Aware Preference Optimization), an enhanced framework for preference optimization in LLM-based recommendation. NAPO introduces two key innovations: (1) in-batch negative sharing, which expands the pool of negative samples without additional memory overhead, and (2) dynamic reward margin adjustment, which adapts model updates based on the confidence of negative samples. Extensive experiments on three public datasets demonstrate that NAPO outperforms existing methods in both recommendation accuracy and popularity bias reduction.
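As a rough illustration of the first innovation, the sketch below scores a batch of matched user-item pairs against one another so that every other in-batch item serves as a shared negative; the dynamic reward-margin component and NAPO's actual preference objective are not reproduced, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_shared_negative_loss(user_emb: torch.Tensor,
                                  item_emb: torch.Tensor) -> torch.Tensor:
    """user_emb, item_emb: (B, d) embeddings of B matched user-item pairs.
    The (B, B) score matrix reuses every other in-batch item as a shared
    negative, so each user gets B - 1 negatives at no extra memory cost."""
    logits = user_emb @ item_emb.T                 # diagonal entries are the positives
    labels = torch.arange(user_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```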

[AI-24] Demystifying the Role of Rule-based Detection in AI Systems for Windows Malware Detection

[Quick Read]: This paper examines the problem that signature-based detection and machine-learning components in AI malware-detection systems are usually developed and integrated in isolation, missing the chance to reduce data complexity and strengthen defenses against adversarial EXEmples. The key of the solution is to fold signature detection into the machine-learning training pipeline: the model is trained only on samples not already flagged by signature rules. Results show this integration clearly improves robustness to both adversarial EXEmples and temporal data drift, at the cost of a fixed lower bound on false positives driven by suboptimal rule selection.

Link: https://arxiv.org/abs/2508.09652
Authors: Andrea Ponte, Luca Demetrio, Luca Oneto, Ivan Tesfai Ogbu, Battista Biggio, Fabio Roli
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Malware detection increasingly relies on AI systems that integrate signature-based detection with machine learning. However, these components are typically developed and combined in isolation, missing opportunities to reduce data complexity and strengthen defenses against adversarial EXEmples, carefully crafted programs designed to evade detection. Hence, in this work we investigate the influence that signature-based detection exerts on model training, when they are included inside the training pipeline. Specifically, we compare models trained on a comprehensive dataset with an AI system whose machine learning component is trained solely on samples not already flagged by signatures. Our results demonstrate improved robustness to both adversarial EXEmples and temporal data drift, although this comes at the cost of a fixed lower bound on false positives, driven by suboptimal rule selection. We conclude by discussing these limitations and outlining how future research could extend AI-based malware detection to include dynamic analysis, thereby further enhancing system resilience.

[AI-25] UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles

[Quick Read]: This paper addresses the limitation of treating SHAP (SHapley Additive exPlanations) values as point estimates, which ignores the uncertainty inherent in models and data and can undermine the reliability of explanations in high-stakes domains such as healthcare analytics. The core question is how to quantify and decompose the sources of uncertainty in SHAP values so that explanations become more trustworthy and decisions more robust. The key of the solution is a method that splits SHAP-value uncertainty into three parts: aleatoric (irreducible data noise), epistemic (uncertainty from insufficient data), and entanglement (their coupling). By combining Dempster-Shafer evidence theory with hypothesis sampling via Dirichlet processes over tree ensembles, the approach effectively quantifies epistemic uncertainty, informing model refinement and more reliable explainable AI in high-stakes applications.

Link: https://arxiv.org/abs/2508.09639
Authors: Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have become essential tools for interpreting complex ensemble tree-based models, especially in high-stakes domains such as healthcare analytics. However, SHAP values are usually treated as point estimates, which disregards the inherent and ubiquitous uncertainty in predictive models and data. This uncertainty has two primary sources: aleatoric and epistemic. The aleatoric uncertainty, which reflects the irreducible noise in the data. The epistemic uncertainty, which arises from a lack of data. In this work, we propose an approach for decomposing uncertainty in SHAP values into aleatoric, epistemic, and entanglement components. This approach integrates Dempster-Shafer evidence theory and hypothesis sampling via Dirichlet processes over tree ensembles. We validate the method across three real-world use cases with descriptive statistical analyses that provide insight into the nature of epistemic uncertainty embedded in SHAP explanations. The experimentations enable to provide more comprehensive understanding of the reliability and interpretability of SHAP-based attributions. This understanding can guide the development of robust decision-making processes and the refinement of models in high-stakes applications. Through our experiments with multiple datasets, we concluded that features with the highest SHAP values are not necessarily the most stable. This epistemic uncertainty can be reduced through better, more representative data and following appropriate or case-desired model development techniques. Tree-based models, especially bagging, facilitate the effective quantification of epistemic uncertainty.
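The paper's Dempster-Shafer and Dirichlet-process machinery is involved, but the aleatoric/epistemic split it builds on can be illustrated with the standard entropy-based decomposition of an ensemble's predictive uncertainty. The sketch below is that generic decomposition, not the authors' method:

```python
import numpy as np

def entropy(p: np.ndarray, axis: int = -1) -> np.ndarray:
    """Shannon entropy in nats along the class axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def decompose_uncertainty(member_probs: np.ndarray):
    """member_probs: (n_members, n_classes) class probabilities from a tree
    ensemble for one input. Returns (total, aleatoric, epistemic)."""
    mean_p = member_probs.mean(axis=0)
    total = entropy(mean_p)                   # entropy of the averaged prediction
    aleatoric = entropy(member_probs).mean()  # expected per-member entropy
    epistemic = total - aleatoric             # disagreement among members
    return total, aleatoric, epistemic
```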

[AI-26] AmbiGraph-Eval: Can LLM s Effectively Handle Ambiguous Graph Queries?

[Quick Read]: This paper addresses Large Language Models' (LLMs) lack of effective handling of the inherent ambiguity in natural-language queries over graph-structured data. The key of the solution is a systematic taxonomy of graph-query ambiguities, covering Attribute Ambiguity, Relationship Ambiguity, and Attribute-Relationship Ambiguity, each further split into same-entity and cross-entity scenarios, together with AmbiGraph-Eval, a benchmark of real-world ambiguous queries paired with expert-verified graph-query answers. Evaluating nine representative LLMs quantifies their behavior under ambiguity, exposes a clear gap in ambiguity handling even among top models, and motivates future work on dedicated resolution techniques.

Link: https://arxiv.org/abs/2508.09631
Authors: Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have recently demonstrated strong capabilities in translating natural language into database queries, especially when dealing with complex graph-structured data. However, real-world queries often contain inherent ambiguities, and the interconnected nature of graph structures can amplify these challenges, leading to unintended or incorrect query results. To systematically evaluate LLMs on this front, we propose a taxonomy of graph-query ambiguities, comprising three primary types: Attribute Ambiguity, Relationship Ambiguity, and Attribute-Relationship Ambiguity, each subdivided into Same-Entity and Cross-Entity scenarios. We introduce AmbiGraph-Eval, a novel benchmark of real-world ambiguous queries paired with expert-verified graph query answers. Evaluating 9 representative LLMs shows that even top models struggle with ambiguous graph queries. Our findings reveal a critical gap in ambiguity handling and motivate future work on specialized resolution techniques.

[AI-27] TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling

[Quick Read]: This paper addresses the way traditional time-series modeling ignores variable semantics, treating variables as meaningless statistical signals and discarding the domain knowledge carried by variable names and descriptions. The key of the proposed TimeMKG framework is to use large language models (LLMs) to interpret variable semantics and build a structured Multivariate Knowledge Graph (MKG) that explicitly encodes causal relationships among variables; a dual-modality encoder separately models semantic prompts derived from knowledge-graph triplets and statistical patterns from the historical series, and cross-modal attention aligns and fuses the two representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification to improve both performance and interpretability.

Link: https://arxiv.org/abs/2508.09630
Authors: Yifei Sun, Junming Liu, Ding Wang, Yirong Chen, Xuefeng Yan
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. The experiment in diverse datasets demonstrates that incorporating variable-level knowledge significantly improves both predictive performance and generalization.

[AI-28] Goal Discovery with Causal Capacity for Efficient Reinforcement Learning

[Quick Read]: This paper addresses the difficulty of efficient agent exploration in complex environments, where the central challenge is measuring the causal relationship between actions and state transitions in a vast, high-dimensional state-action space so that the agent can reason about how a policy shapes its future trajectories. The key of the proposed Goal Discovery with Causal Capacity (GDCC) framework is the notion of causal capacity, defined as the highest influence of an agent's behavior on future trajectories; a Monte Carlo method identifies critical points in discrete state spaces and is further optimized for continuous high-dimensional environments. These critical points mark where the agent makes important decisions and are treated as subgoals that guide exploration, making it clearly more directed and efficient.

Link: https://arxiv.org/abs/2508.09624
Authors: Yan Yu, Yaodong Yang, Zhengbo Lu, Chengdong Ma, Wengang Zhou, Houqiang Li
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Causal inference is crucial for humans to explore the world, which can be modeled to enable an agent to efficiently explore the environment in reinforcement learning. Existing research indicates that establishing the causality between action and state transition will enhance an agent to reason how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, i.e., causal capacity, which represents the highest influence of an agent’s behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, which are then regarded as our subgoals to guide the agent to make exploration more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines.

[AI-29] Interpretable Robot Control via Structured Behavior Trees and Large Language Models

[Quick Read]: This paper addresses the limited usability of current human-robot interaction (HRI) systems, where traditional control relies on predefined commands or forces users to adapt to the interface, making natural, intuitive interaction hard in dynamic, unstructured environments. The key of the solution is to combine Large Language Models (LLMs) with Behavior Trees: the LLM interprets natural-language instructions and the behavior tree turns them into executable actions, while domain-specific plugins modularly integrate perception capabilities such as person tracking and gesture recognition, producing a scalable, flexible, and accurate HRI framework. Real-world experiments report an average cognition-to-execution accuracy of roughly 94%, showing markedly better understanding and execution of natural-language commands.

Link: https://arxiv.org/abs/2508.09621
Authors: Ingrid Maéva Chekam, Ines Pastor-Martinez, Ali Tourani, Jose Andres Millan-Romera, Laura Ribeiro, Pedro Miguel Bastos Soares, Holger Voos, Jose Luis Sanchez-Lopez
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures, 3 tables

Abstract:As intelligent robots become more integrated into human environments, there is a growing need for intuitive and reliable Human-Robot Interaction (HRI) interfaces that are adaptable and more natural to interact with. Traditional robot control methods often require users to adapt to interfaces or memorize predefined commands, limiting usability in dynamic, unstructured environments. This paper presents a novel framework that bridges natural language understanding and robotic execution by combining Large Language Models (LLMs) with Behavior Trees. This integration enables robots to interpret natural language instructions given by users and translate them into executable actions by activating domain-specific plugins. The system supports scalable and modular integration, with a primary focus on perception-based functionalities, such as person tracking and hand gesture recognition. To evaluate the system, a series of real-world experiments was conducted across diverse environments. Experimental results demonstrate that the proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%, making a significant contribution to HRI systems and robots. The complete source code of the framework is publicly available at this https URL.

[AI-30] A Lightweight Learned Cardinality Estimation Model

[Quick Read]: This paper addresses cardinality estimation in database management systems, i.e., accurately predicting the size of a query's result without executing it, where existing techniques deliver either low accuracy or high inference latency but rarely both speed and precision. The key of the proposed data-driven method CoDe (Covering with Decompositions) is a covering design that splits the table into many small overlapping segments, models each segment's data distribution accurately with tensor decomposition, and uses novel algorithms to pick the best-fitting combination of distributions for each query when forming the final estimate. Approximating discrete distributions with multiple models keeps computation efficient while improving accuracy; experiments show state-of-the-art estimation accuracy and inference efficiency across datasets.

Link: https://arxiv.org/abs/2508.09602
Authors: Yaoyu Zhu, Jintao Zhang, Guoliang Li, Jianhua Feng
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IEEE Transactions on Knowledge and Data Engineering (TKDE), 2025

Abstract:Cardinality estimation is a fundamental task in database management systems, aiming to predict query results accurately without executing the queries. However, existing techniques either achieve low estimation accuracy or incur high inference latency. Simultaneously achieving high speed and accuracy becomes critical for the cardinality estimation problem. In this paper, we propose a novel data-driven approach called CoDe (Covering with Decompositions) to address this problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. For each segment, CoDe utilizes tensor decomposition to accurately model its data distribution. Moreover, CoDe introduces innovative algorithms to select the best-fitting distributions for each query, combining them to estimate the final result. By employing multiple models to approximate distributions, CoDe excels in effectively modeling discrete distributions and ensuring computational efficiency. Notably, experimental results show that our method represents a significant advancement in cardinality estimation, achieving state-of-the-art levels of both estimation accuracy and inference efficiency. Across various datasets, CoDe achieves absolute accuracy in estimating more than half of the queries.

[AI-31] EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making

[Quick Read]: This paper addresses the degradation, and even outright failure, of Large Language Models (LLMs) on highly complex problems when no structured intermediate guidance is available. The key of the proposed self-evolving framework EvoCurr is a dedicated curriculum-generation LLM that dynamically builds a sequence of problem instances of steadily increasing difficulty, adjusted in real time to the solver LLM's progress: difficulty is lowered when the solver struggles and raised when it succeeds consistently, maintaining an optimal learning trajectory. This lets the solver LLM, realized as a generator of Python decision-tree scripts, progressively acquire the skills complex decision-making demands; experiments on challenging benchmarks show clear gains in task success rate and solution efficiency over direct-solving baselines.

Link: https://arxiv.org/abs/2508.09586
Authors: Yang Cheng, Zilai Wang, Weiyu Ma, Wenhui Zhu, Yue Deng, Jian Zhao
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose a novel self-evolve framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM’s learning progress. The curriculum dynamically adapts easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.
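The curriculum logic amounts to a feedback controller on task difficulty. A toy version with hypothetical thresholds might look as follows (in the actual framework the instances themselves are produced by the curriculum-generation LLM):

```python
def adjust_difficulty(difficulty: float, success_history: list,
                      window: int = 10, up: float = 1.2, down: float = 0.8) -> float:
    """Toy version of EvoCurr's adaptive schedule: escalate when the solver
    is consistently succeeding, ease off when it struggles. Thresholds and
    multipliers here are illustrative assumptions."""
    recent = success_history[-window:]
    rate = sum(recent) / max(len(recent), 1)
    if rate > 0.8:
        return difficulty * up      # solver is coasting: raise the bar
    if rate < 0.4:
        return difficulty * down    # solver is struggling: back off
    return difficulty
```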

[AI-32] CaRoBio: 3D Cable Routing with a Bio-inspired Gripper Fingernail

[Quick Read]: This paper addresses the difficulty of automating complex, multi-stage cable-routing tasks in industrial settings, where conventional parallel two-finger grippers risk over-squeezing cables and applying excess tension while grasping and guiding them. The key of the solution is a novel eagle-inspired fingernail mounted on the gripper fingers, which aids cable grasping on planar surfaces and in-hand guiding; on top of it, a single-grasp, end-to-end 3D cable-routing framework replaces the usual pick-and-place strategy. Continuous control via vision-based estimation of task configurations and offline trajectory planning based on motion primitives markedly improves the efficiency and robustness of cable manipulation.

Link: https://arxiv.org/abs/2508.09558
Authors: Jiahui Zuo, Boyang Zhang, Fumin Zhang
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The manipulation of deformable linear flexures has a wide range of applications in industry, such as cable routing in automotive manufacturing and textile production. Cable routing, as a complex multi-stage robot manipulation scenario, is a challenging task for robot automation. Common parallel two-finger grippers have the risk of over-squeezing and over-tension when grasping and guiding cables. In this paper, a novel eagle-inspired fingernail is designed and mounted on the gripper fingers, which helps with cable grasping on planar surfaces and in-hand cable guiding operations. Then we present a single-grasp end-to-end 3D cable routing framework utilizing the proposed fingernails, instead of the common pick-and-place strategy. Continuous control is achieved to efficiently manipulate cables through vision-based state estimation of task configurations and offline trajectory planning based on motion primitives. We evaluate the effectiveness of the proposed framework with a variety of cables and channel slots, significantly outperforming the pick-and-place manipulation process under equivalent perceptual conditions. Our reconfigurable task setting and the proposed framework provide a reference for future cable routing manipulations in 3D space.

[AI-33] Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion

[Quick Read]: This paper addresses the sharp drop in Large Language Models' (LLMs) function-completion performance in repositories that lack explicit instructions such as docstrings; since real-world codebases rarely carry structured annotations, models struggle to grasp the target function's intent. The key of the solution is a three-stage pipeline: first, a reasoning-based prompting framework extracts and synthesizes implicit intent signals step by step from the code preceding the target function; second, an optional interactive refinement stage lets the developer select or edit candidate intents whenever the context alone cannot recover the intent precisely; third, the target function is generated conditioned on the finalized intent. On the DevEval and ComplexCodeEval benchmarks the approach yields relative gains of over 20%, confirming its effectiveness.

Link: https://arxiv.org/abs/2508.09537
Authors: Yanzhou Li, Tianlin Li, Yiran Zhang, Shangqing Liu, Aishan Liu, Yang Liu
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions–such as docstrings–are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains.

[AI-34] Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks

[Quick Read]: This paper addresses efficient, low-latency multi-task fine-tuning in Internet of Vehicles (IoV) environments, in the face of client mobility, heterogeneous resources, and intermittent connectivity. The key of the solution is a hierarchical federated fine-tuning framework in which roadside units (RSUs) and vehicles cooperate; building on Low-Rank Adaptation (LoRA), a decentralized, energy-aware rank-adaptation mechanism is formulated as a constrained multi-armed bandit problem. The novel UCB-DUAL algorithm then enables adaptive exploration under per-task energy budgets with provable sublinear regret, effectively balancing model accuracy against computational efficiency.

Link: https://arxiv.org/abs/2508.09532
Authors: Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Xiaoxi Zhang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Federated fine-tuning has emerged as a promising approach for adapting foundation models (FMs) to diverse downstream tasks in edge environments. In Internet of Vehicles (IoV) systems, enabling efficient and low-latency multi-task adaptation is particularly challenging due to client mobility, heterogeneous resources, and intermittent connectivity. This paper proposes a hierarchical federated fine-tuning framework that coordinates roadside units (RSUs) and vehicles to support resource-aware and mobility-resilient learning across dynamic IoV scenarios. Leveraging Low-Rank Adaptation (LoRA), we introduce a decentralized, energy-aware rank adaptation mechanism formulated as a constrained multi-armed bandit problem. A novel UCB-DUAL algorithm is developed to enable adaptive exploration under per-task energy budgets, achieving provable sublinear regret. To evaluate our method, we construct a large-scale IoV simulator based on real-world trajectories, capturing dynamic participation, RSU handoffs, and communication variability. Extensive experiments show that our approach achieves the best accuracy-efficiency trade-off among all baselines, reducing latency by over 24% and improving average accuracy by more than 2.5%.
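As a sketch of the bandit formulation, the snippet below runs plain UCB1 over candidate LoRA ranks while excluding ranks that would exceed the remaining per-task energy budget. UCB-DUAL itself is more elaborate (it is what carries the sublinear-regret guarantee); this shows only the general pattern, with hypothetical names:

```python
import math
import random

def ucb_pick_rank(counts: dict, rewards: dict, energy_cost: dict,
                  budget_left: float, t: int, c: float = 2.0):
    """Pick a LoRA rank (bandit arm) by UCB1 among ranks still affordable
    under the remaining per-task energy budget. counts/rewards/energy_cost
    are dicts keyed by rank; t is the total number of pulls so far."""
    feasible = [r for r in energy_cost if energy_cost[r] <= budget_left]
    untried = [r for r in feasible if counts[r] == 0]
    if untried:                      # play each affordable arm once first
        return random.choice(untried)

    def ucb(r):
        mean = rewards[r] / counts[r]
        return mean + math.sqrt(c * math.log(t) / counts[r])

    return max(feasible, key=ucb)
```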

[AI-35] SMART-OC: A Real-time Time-risk Optimal Replanning Algorithm for Dynamic Obstacles and Spatio-temporally Varying Currents

[Quick Read]: This paper addresses safe and efficient navigation for Unmanned Surface Vehicles (USVs) in complex, dynamic marine environments, where spatio-temporally varying currents and moving obstacles demand real-time path adjustment to avoid collisions while exploiting currents to lower travel cost. The key of the solution is a new algorithm, the Self-Morphing Adaptive Replanning Tree for dynamic Obstacles and Currents (SMART-OC), which fuses the obstacle risks along a path with the time cost of reaching the goal to perform time-risk optimal replanning in real time, letting the USV respond quickly and reach its goal efficiently in dynamic environments.

Link: https://arxiv.org/abs/2508.09508
Authors: Reema Raval, Shalabh Gupta
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Typical marine environments are highly complex with spatio-temporally varying currents and dynamic obstacles, presenting significant challenges to Unmanned Surface Vehicles (USVs) for safe and efficient navigation. Thus, the USVs need to continuously adapt their paths with real-time information to avoid collisions and follow the path of least resistance to the goal via exploiting ocean currents. In this regard, we introduce a novel algorithm, called Self-Morphing Adaptive Replanning Tree for dynamic Obstacles and Currents (SMART-OC), that facilitates real-time time-risk optimal replanning in dynamic environments. SMART-OC integrates the obstacle risks along a path with the time cost to reach the goal to find the time-risk optimal path. The effectiveness of SMART-OC is validated by simulation experiments, which demonstrate that the USV performs fast replannings to avoid dynamic obstacles and exploit ocean currents to successfully reach the goal.

[AI-36] An Automated Multi-Modal Evaluation Framework for Mobile Intelligent Assistants

[Quick Read]: This paper addresses the high manual cost, inconsistent standards, and subjective bias in current evaluation of multi-modal AI assistants. The key of the solution is an automated multi-modal evaluation framework built on large language models and multi-agent collaboration, organized as a three-tier agent architecture of interaction evaluation agents, semantic verification agents, and experience decision agents. With supervised fine-tuning on the Qwen3-8B model, the framework reaches evaluation accuracy closely matching human experts, effectively predicting user satisfaction and identifying generation defects.

Link: https://arxiv.org/abs/2508.09507
Authors: Meiping Wang, Jian Zhong, Rongduo Han, Liming Kang, Zhengkun Shi, Xiao Liang, Xing Lin, Nan Gao, Haining Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning on the Qwen3-8B model, we achieve a significant evaluation matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework’s effectiveness in predicting users’ satisfaction and identifying generation defects.

[AI-37] Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference

[Quick Read]: This paper addresses semantic bugs in distributed machine-learning model implementations, i.e., cases where an incorrectly applied distribution strategy makes the distributed model's outputs diverge from the original sequential model's. The key of the solution is to verify model refinement through static analysis: checking whether the distributed model's outputs suffice to reconstruct the sequential model's outputs. The proposed GraphGuard tool uses iterative rewriting to formally prove this refinement relation, pinpointing potential implementation bugs, and scales to today's large models and deployments such as GPT and Llama-3.

Link: https://arxiv.org/abs/2508.09505
Authors: Zhanghan Wang, Ding Ding, Hang Zhu, Haibin Lin, Aurojit Panda
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distributed machine learning training and inference is common today because today’s large models require more memory and compute than can be provided by a single GPU. Distributed models are generally produced by programmers who take a sequential model specification and apply several distribution strategies to distribute state and computation across GPUs. Unfortunately, bugs can be introduced in the process, and a distributed model implementation’s outputs might differ from the sequential model’s outputs. In this paper, we describe an approach to statically identify such bugs by checking model refinement, that is, can the sequential model’s outputs be reconstructed from the distributed model’s outputs? Our approach, implemented in GraphGuard, uses iterative rewriting to prove model refinement. Our approach can scale to today’s large models and deployments: we evaluate it using GPT and Llama-3. Further, it provides actionable output that aids in bug localization.

[AI-38] Large-Small Model Collaborative Framework for Federated Continual Learning

[Quick Read]: This paper addresses the difficulty of exploiting private local data with Foundation Models (FMs) on resource-constrained clients in Federated Continual Learning (FCL), along with their susceptibility to catastrophic forgetting: the sheer parameter count and complexity of large models make it hard to adapt to local task streams while keeping earlier knowledge stable. The key of the solution is the first collaborative framework for FCL in which lightweight local models serve as a dynamic bridge, with two novel mechanisms: Small Model Continual Fine-tuning, which keeps local small models from temporal forgetting, and One-by-One Distillation, which performs personalized fusion of heterogeneous local knowledge on the server, improving the large model's generalization and local fit. Experiments show clear gains even when clients use heterogeneous small models.

Link: https://arxiv.org/abs/2508.09489
Authors: Hao Yu, Xin Yang, Boyang Fan, Xuemei Cao, Hanlin Gu, Lixin Fan, Qiang Yang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models.

[AI-39] DeepFeatIoT: Unifying Deep Learned Randomized and LLM Features for Enhanced IoT Time Series Sensor Data Classification in Smart Industries IJCAI2025

[Quick Read]: This paper addresses why raw IoT sensor time series are hard to interpret and analyze in practice: metadata is missing or ambiguous, data sources are heterogeneous, sampling frequencies vary, units are inconsistent, and timestamps are irregular. The key of the proposed deep learning model DeepFeatIoT is the fusion of three kinds of features: learned local and global features, non-learned randomized convolutional kernel features, and features from large language models (LLMs), enabling effective representation and classification of heterogeneous, multi-source IoT time series. This diverse feature fusion clearly boosts performance when labeled data is scarce and generalizes consistently across real-world IoT datasets from critical application domains.

Link: https://arxiv.org/abs/2508.09468
Authors: Muhammad Sakib Khan Inan, Kewen Liao
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at IJCAI 2025

Abstract:Internet of Things (IoT) sensors are ubiquitous technologies deployed across smart cities, industrial sites, and healthcare systems. They continuously generate time series data that enable advanced analytics and automation in industries. However, challenges such as the loss or ambiguity of sensor metadata, heterogeneity in data sources, varying sampling frequencies, inconsistent units of measurement, and irregular timestamps make raw IoT time series data difficult to interpret, undermining the effectiveness of smart systems. To address these challenges, we propose a novel deep learning model, DeepFeatIoT, which integrates learned local and global features with non-learned randomized convolutional kernel-based features and features from large language models (LLMs). This straightforward yet unique fusion of diverse learned and non-learned features significantly enhances IoT time series sensor data classification, even in scenarios with limited labeled data. Our model’s effectiveness is demonstrated through its consistent and generalized performance across multiple real-world IoT sensor datasets from diverse critical application domains, outperforming state-of-the-art benchmark models. These results highlight DeepFeatIoT’s potential to drive significant advancements in IoT analytics and support the development of next-generation smart systems.

[AI-40] Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis

[Quick Read]: This paper addresses the inefficiency of data extraction in knowledge synthesis for health professions education (HPE), where manual extraction is slow, laborious, and subject to bias. The key of the solution is an automated extraction platform built on large language models (LLMs), whose accuracy and reliability are quantified by systematically comparing AI and human answers to 17 extraction questions across 187 publications. AI agrees closely with humans on concrete, explicitly stated information and less so on content requiring subjective interpretation or absent from the text; crucially, discordant AI responses stem mainly from interpretive differences rather than hallucination, and inconsistent AI answers can flag semantically ambiguous or complex questions to refine the process before human review. The study therefore argues for embedding AI as a transparent, trustworthy partner in knowledge synthesis rather than a full replacement for human judgment.

Link: https://arxiv.org/abs/2508.09458
Authors: Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI ‘hallucinations’ (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick’s outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI accuracy depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.

[AI-41] A Unified Contrastive-Generative Framework for Time Series Classification

[Quick Read]: This paper addresses the respective weaknesses of the two paradigms of self-supervised learning (SSL) for multivariate time series: contrastive learning is sensitive to the high intra-class similarity within temporal data, while generative approaches depend on large datasets. The key of the solution is the Contrastive Generative Time series framework (CoGenT), the first to unify the two paradigms by jointly optimizing contrastive and generative objectives, preserving discriminative power while gaining generative robustness and enabling more efficient, stable self-supervised representation learning.

Link: https://arxiv.org/abs/2508.09451
Authors: Ziyu Liu, Azadeh Alavi, Minyi Li, Xiang Zhang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-supervised learning (SSL) for multivariate time series mainly includes two paradigms: contrastive methods that excel at instance discrimination and generative approaches that model data distributions. While effective individually, their complementary potential remains unexplored. We propose a Contrastive Generative Time series framework (CoGenT), the first framework to unify these paradigms through joint contrastive-generative optimization. CoGenT addresses fundamental limitations of both approaches: it overcomes contrastive learning’s sensitivity to high intra-class similarity in temporal data while reducing generative methods’ dependence on large datasets. We evaluate CoGenT on six diverse time series datasets. The results show consistent improvements, with up to 59.2% and 14.27% F1 gains over standalone SimCLR and MAE, respectively. Our analysis reveals that the hybrid objective preserves discriminative power while acquiring generative robustness. These findings establish a foundation for hybrid SSL in temporal domains. We will release the code shortly.
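A joint contrastive-generative objective of the kind CoGenT unifies can be sketched as an InfoNCE term plus a weighted reconstruction term. The weighting, augmentation scheme, and reconstruction target below are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

def joint_contrastive_generative_loss(z1: torch.Tensor, z2: torch.Tensor,
                                      recon: torch.Tensor, target: torch.Tensor,
                                      lam: float = 0.5, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, d) embeddings of two augmented views of the same series.
    recon/target: reconstruction output and ground truth (e.g., masked patches).
    Returns InfoNCE contrastive term + lam * generative (MSE) term."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                        # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)   # matched views on the diagonal
    generative = F.mse_loss(recon, target)
    return contrastive + lam * generative
```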

[AI-42] Implicit Hypergraph Neural Networks: A Stable Framework for Higher-Order Relational Learning with Provable Guarantees

[Quick Read]: This paper addresses two key weaknesses of conventional hypergraph neural networks (HNNs) for modeling higher-order relations: fixed-depth message passing limits the capture of long-range dependencies, and training destabilizes as depth grows. The core of the proposed Implicit Hypergraph Neural Networks (IHGNN) is to bring the implicit equilibrium formulation to hypergraphs: instead of stacking explicit layers, node representations are computed as the solution of a nonlinear fixed-point equation, enabling stable, efficient global propagation across hyperedges without deep architectures. The method comes with provable convergence, an analysis of oversmoothing conditions and expressivity, and a provable transductive generalization bound, and it pairs a projection-based stabilization strategy with an implicit-gradient training procedure, delivering clear gains in accuracy and robustness.

Link: https://arxiv.org/abs/2508.09427
Authors: Xiaoyu Li, Guangyu Tang, Jiaojiao Jiang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many real-world interactions are group-based rather than pairwise such as papers with multiple co-authors and users jointly engaging with items. Hypergraph neural networks have shown great promise at modeling higher-order relations, but their reliance on a fixed number of explicit message-passing layers limits long-range dependency capture and can destabilize training as depth grows. In this work, we introduce Implicit Hypergraph Neural Networks (IHGNN), which bring the implicit equilibrium formulation to hypergraphs: instead of stacking layers, IHGNN computes representations as the solution to a nonlinear fixed-point equation, enabling stable and efficient global propagation across hyperedges without deep architectures. We develop a well-posed training scheme with provable convergence, analyze the oversmoothing conditions and expressivity of the model, and derive a transductive generalization bound on hypergraphs. We further present an implicit-gradient training procedure coupled with a projection-based stabilization strategy. Extensive experiments on citation benchmarks show that IHGNN consistently outperforms strong traditional graph/hypergraph neural network baselines in both accuracy and robustness. Empirically, IHGNN is resilient to random initialization and hyperparameter variation, highlighting its strong generalization and practical value for higher-order relational learning.
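The implicit formulation can be illustrated by solving the equilibrium with naive fixed-point iteration on a standard clique-expansion propagator. Note that well-posedness requires constraining the norm of W (the role of the paper's projection-based stabilization), and the paper differentiates implicitly rather than through the loop; the shapes and the tanh nonlinearity below are assumptions:

```python
import numpy as np

def implicit_hypergraph_equilibrium(X, H, W, U, tol=1e-5, max_iter=200):
    """Solve Z = tanh(P @ Z @ W + X @ U) by fixed-point iteration.
    X: (n, f) node features; H: (n, m) hypergraph incidence matrix;
    W: (d, d), U: (f, d) parameters (||W|| must be small enough for a
    contraction). P is the propagator D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2}."""
    dv = np.maximum(H.sum(axis=1), 1.0)          # node degrees
    de = np.maximum(H.sum(axis=0), 1.0)          # hyperedge degrees
    P = (H / np.sqrt(dv)[:, None]) @ (H.T / de[:, None]) / np.sqrt(dv)[None, :]
    Z = np.zeros((X.shape[0], W.shape[0]))
    for _ in range(max_iter):
        Z_new = np.tanh(P @ Z @ W + X @ U)
        if np.linalg.norm(Z_new - Z) < tol:      # reached the equilibrium
            return Z_new
        Z = Z_new
    return Z
```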

[AI-43] Domain-Generalization to Improve Learning in Meta-Learning Algorithms

[Quick Read]: This paper addresses weak cross-task generalization in few-shot learning. The key of the solution is a new meta-learning algorithm, Domain Generalization Sharpness-Aware Minimization Model-Agnostic Meta-Learning (DGS-MAML), which combines gradient matching with Sharpness-Aware Minimization (SAM) in a bi-level optimization framework, improving adaptability and robustness when training data is limited.

Link: https://arxiv.org/abs/2508.09418
Authors: Usman Anjum, Chris Stockman, Cat Luong, Justin Zhan
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces Domain Generalization Sharpness-Aware Minimization Model-Agnostic Meta-Learning (DGS-MAML), a novel meta-learning algorithm designed to generalize across tasks with limited training data. DGS-MAML combines gradient matching with sharpness-aware minimization in a bi-level optimization framework to enhance model adaptability and robustness. We support our method with theoretical analysis using PAC-Bayes and convergence guarantees. Experimental results on benchmark datasets show that DGS-MAML outperforms existing approaches in terms of accuracy and generalization. The proposed method is particularly useful for scenarios requiring few-shot learning and quick adaptation, and the source code is publicly available at GitHub.
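The SAM ingredient can be sketched in isolation: perturb the weights toward the local worst case, take the gradient there, then restore. In DGS-MAML this sits inside a bi-level MAML loop together with gradient matching, both omitted here:

```python
import torch

def sam_gradients(model: torch.nn.Module, loss_fn, batch, rho: float = 0.05):
    """One Sharpness-Aware Minimization step: ascend rho along the gradient
    direction to the local worst case, return gradients taken there."""
    params = list(model.parameters())
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    scale = rho / (torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12)
    eps = [g * scale for g in grads]
    with torch.no_grad():                     # perturb to the worst case
        for p, e in zip(params, eps):
            p.add_(e)
    sharp_loss = loss_fn(model, batch)        # re-evaluate at perturbed weights
    sharp_grads = torch.autograd.grad(sharp_loss, params)
    with torch.no_grad():                     # restore the original weights
        for p, e in zip(params, eps):
            p.sub_(e)
    return sharp_grads
```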

[AI-44] Understanding Dementia Speech Alignment with Diffusion-Based Image Generation INTERSPEECH2025

[Quick Read]: This paper addresses the alignment between language and images in the latent space of generative AI models, asking in particular whether pathological speech (dementia-related speech) bears an interpretable relationship to the images generated from it. The key of the solution is twofold: experiments first show that images generated solely from dementia-related speech descriptions support 75% dementia-detection accuracy on the ADReSS dataset, establishing that language-image alignment holds in a pathological context; explainability methods then identify which parts of the language drive image generation and hence the diagnostic outcome, tracing the generated images back to the original speech and revealing the underlying mechanism.

Link: https://arxiv.org/abs/2508.09385
Authors: Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at Interspeech 2025

Abstract:Text-to-image models generate highly realistic images based on natural language descriptions and millions of users use them to create and share images online. While it is expected that such models can align input text and generated image in the same latent space little has been done to understand whether this alignment is possible between pathological speech and generated images. In this work, we examine the ability of such models to align dementia-related speech information with the generated images and develop methods to explain this alignment. Surprisingly, we found that dementia detection is possible from generated images alone achieving 75% accuracy on the ADReSS dataset. We then leverage explainability methods to show which parts of the language contribute to the detection.

[AI-45] Collective dynamics of strategic classification

[Quick Read]: This paper addresses the feedback loop between users' strategic adaptation and algorithmic retraining when AI classifiers make high-stakes decisions (in finance, healthcare, criminal justice, and education). The core challenge is that individuals who adjust their behavior to satisfy an algorithm may drive up social costs or game the system (e.g., by supplying false information), eroding the algorithm's fairness and effectiveness. The key of the solution is a mathematically rigorous evolutionary game-theoretic framework for modeling the dynamics between user collectives and institutions and for testing interventions: stronger detection of strategic manipulation lowers social costs and can foster genuine user improvement, while, when perfect classifiers are infeasible (as is likely in practice), offering algorithmic recourse steers the dynamics toward high user-improvement rates. The speed at which institutions re-adapt to the user population also shapes the outcome, and strict institutions granting actionable recourse can even give rise to cycling dynamics so far unnoticed in the literature.

Link: https://arxiv.org/abs/2508.09340
Authors: Marta C. Couto, Flavia Barsotti, Fernando P. Santos
Institution: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Comments: 34 pages

Abstract:Classification algorithms based on Artificial Intelligence (AI) are nowadays applied in high-stakes decisions in finance, healthcare, criminal justice, or education. Individuals can strategically adapt to the information gathered about classifiers, which in turn may require algorithms to be re-trained. Which collective dynamics will result from users’ adaptation and algorithms’ retraining? We apply evolutionary game theory to address this question. Our framework provides a mathematically rigorous way of treating the problem of feedback loops between collectives of users and institutions, allowing to test interventions to mitigate the adverse effects of strategic adaptation. As a case study, we consider institutions deploying algorithms for credit lending. We consider several scenarios, each representing different interaction paradigms. When algorithms are not robust against strategic manipulation, we are able to capture previous challenges discussed in the strategic classification literature, whereby users either pay excessive costs to meet the institutions’ expectations (leading to high social costs) or game the algorithm (e.g., provide fake information). From this baseline setting, we test the role of improving gaming detection and providing algorithmic recourse. We show that increased detection capabilities reduce social costs and could lead to users’ improvement; when perfect classifiers are not feasible (likely to occur in practice), algorithmic recourse can steer the dynamics towards high users’ improvement rates. The speed at which the institutions re-adapt to the user’s population plays a role in the final outcome. Finally, we explore a scenario where strict institutions provide actionable recourse to their unsuccessful users and observe cycling dynamics so far unnoticed in the literature.

[AI-46] RicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs RECSYS2025

[Quick Read]: This paper addresses unclear causal attribution and weak risk awareness in financial recommender systems, which in dynamic markets struggle to pinpoint the local stress sources behind asset volatility and to trace how shocks propagate. The key of the proposed RicciFlowRec framework is to quantify local stress on dynamic financial graphs with discrete Ricci curvature and trace shock propagation with Ricci flow, thereby exposing causal substructures and informing a structural risk-aware ranking function that delivers risk-oriented recommendation through geometric flow-based reasoning.

Link: https://arxiv.org/abs/2508.09334
Authors: Zhongtian Sun, Anoushka Harit
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at ACM RecSys 2025 (Late Breaking Results Track)

Abstract:We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with plans for portfolio optimization and return forecasting. To our knowledge, RicciFlowRec is the first recommender to apply geometric flow-based reasoning in financial decision support.
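The paper's curvature-and-flow pipeline is not reproduced here, but the core intuition, that strongly negatively curved edges act as fragile bridges through which stress propagates, can be seen with the much simpler combinatorial Forman curvature (a cheap relative of the discrete Ricci curvatures typically used in such work):

```python
import networkx as nx

def forman_curvature(G: nx.Graph) -> dict:
    """Simplified combinatorial Forman-Ricci curvature on an unweighted graph:
    F(u, v) = 4 - deg(u) - deg(v). Strongly negative edges behave like
    'bridges' between dense regions, candidates for stress propagation."""
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.karate_club_graph()
stressed = sorted(forman_curvature(G).items(), key=lambda kv: kv[1])[:5]
print(stressed)   # the five most negatively curved (most 'stressed') edges
```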

[AI-47] Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization

[Quick Read]: This paper addresses the fact that traditional dropout regularization deactivates neurons at random, ignoring the activity-dependent character of biological pruning and drawing no distinction between important and unimportant connections. The key of the solution is a magnitude-based synaptic pruning method that computes per-layer importance from absolute weight values during training and follows a cubic schedule to progressively raise global sparsity, gradually removing low-importance connections. Embedded directly in the training loop as a dropout replacement, it applies permanent pruning masks at fixed intervals while preserving gradient flow for active weights, dispensing with separate prune-and-finetune phases and clearly improving efficiency and accuracy, most strikingly in financial time-series forecasting.

Link: https://arxiv.org/abs/2508.09330
Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 7 figures

Abstract:Synaptic pruning in biological brains removes weak connections to improve efficiency. In contrast, dropout regularization in artificial neural networks randomly deactivates neurons without considering activity-dependent pruning. We propose a magnitude-based synaptic pruning method that better reflects biology by progressively removing low-importance connections during training. Integrated directly into the training loop as a dropout replacement, our approach computes weight importance from absolute magnitudes across layers and applies a cubic schedule to gradually increase global sparsity. At fixed intervals, pruning masks permanently remove low-importance weights while maintaining gradient flow for active ones, eliminating the need for separate pruning and fine-tuning phases. Experiments on multiple time series forecasting models including RNN, LSTM, and Patch Time Series Transformer across four datasets show consistent gains. Our method ranked best overall, with statistically significant improvements confirmed by Friedman tests (p 0.01). In financial forecasting, it reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models. This dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures. Its strong performance, especially in financial time series forecasting, highlights its potential as a practical alternative to conventional dropout techniques.
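Both mechanisms named above, the cubic sparsity schedule and global magnitude masking, are straightforward to sketch. The version below zeroes weights permanently at call time; a faithful implementation would also reapply the masks after each optimizer step, and the names and defaults are assumptions:

```python
import torch

def cubic_sparsity(step: int, total_steps: int, final_sparsity: float = 0.8) -> float:
    """Cubic schedule ramping global sparsity from 0 to final_sparsity."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def magnitude_prune(model: torch.nn.Module, sparsity: float) -> None:
    """Permanently zero the globally smallest-|w| fraction of weight-matrix
    entries, leaving biases and 1-D parameters untouched."""
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = int(sparsity * weights.numel())
    if k == 0:
        return
    threshold = weights.kthvalue(k).values
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() > threshold).to(p.dtype))
```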

[AI-48] Exact Verification of Graph Neural Networks with Incremental Constraint Solving

[Quick Read]: This paper addresses the vulnerability of graph neural networks (GNNs) to adversarial attacks in high-stakes applications such as fraud detection and healthcare, and in particular the lack of exact verification methods for the aggregation functions common in message passing (sum, max, mean). The key of the solution is an exact (sound and complete) verification method, GNNev, which combines constraint solving with bound tightening and exploits incremental solving to compute robustness guarantees for node classification under attribute and structural perturbations (edge addition or deletion) subject to budget constraints. It supports exact verification of max and mean aggregation for the first time, and on Cora, CiteSeer, and the real-world Amazon and Yelp datasets shows better efficiency and effectiveness than existing exact tools.

Link: https://arxiv.org/abs/2508.09320
Authors: Minghao Liu, Chia-Hsuan Lu, Marta Kwiatkowska
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Graph neural networks (GNNs) are increasingly employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is still lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Focusing on node classification tasks, our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile solver for message-passing neural networks, which supports three aggregation functions, sum, max and mean, with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on two standard benchmarks (Cora and CiteSeer) and two real-world fraud datasets (Amazon and Yelp) demonstrates its usability and effectiveness, as well as superior performance compared to existing exact verification tools on sum-aggregated node classification tasks.

[AI-49] TPTP World Infrastructure for Non-classical Logics

[Quick Read]: This paper addresses the lack of unified infrastructure support for automated theorem proving (ATP) in non-classical logics. The key of the solution lies in the extensions of the TPTP World infrastructure: language extensions supporting non-classical logics, standardized collections of problems and solutions, and enhanced tool support, together with a detailed account of how the infrastructure is used for quantified normal multi-modal logic, providing a reusable, extensible standard platform for related research and applications.

Link: https://arxiv.org/abs/2508.09318
Authors: Alexander Steen, Geoff Sutcliffe
Institution: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 35 pages

Abstract:The TPTP World is the well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. The TPTP World supports a range of classical logics, and since release v9.0.0 has supported non-classical logics. This paper provides a self-contained comprehensive overview of the TPTP World infrastructure for ATP in non-classical logics: the non-classical language extension, problems and solutions, and tool support. A detailed description of use of the infrastructure for quantified normal multi-modal logic is given.

[AI-50] Decentralized Weather Forecasting via Distributed Machine Learning and Blockchain-Based Model Validation

[Quick Read]: This paper addresses the security vulnerabilities, limited scalability, and susceptibility to single points of failure that strain centralized weather-forecasting systems. The key of the solution is a decentralized framework that integrates Federated Learning (FL) with blockchain technology: FL enables collaborative training while keeping local data private, cutting transfer overhead and strengthening privacy; the Ethereum blockchain provides transparent, dependable verification of model updates; a reputation-based voting mechanism assesses the trustworthiness of submitted models; and the Interplanetary File System (IPFS) supplies efficient off-chain storage, together markedly improving robustness, security, and scalability.

Link: https://arxiv.org/abs/2508.09299
Authors: Rilwan Umar, Aydin Abadi, Basil Aldali, Benito Vincent, Elliot A. J. Hurley, Hotoon Aljazaeri, Jamie Hedley-Cook, Jamie-Lee Bell, Lambert Uwuigbusun, Mujeeb Ahmed, Shishir Nagaraja, Suleiman Sabo, Weaam Alrbeiqi
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Weather forecasting plays a vital role in disaster preparedness, agriculture, and resource management, yet current centralized forecasting systems are increasingly strained by security vulnerabilities, limited scalability, and susceptibility to single points of failure. To address these challenges, we propose a decentralized weather forecasting framework that integrates Federated Learning (FL) with blockchain technology. FL enables collaborative model training without exposing sensitive local data; this approach enhances privacy and reduces data transfer overhead. Meanwhile, the Ethereum blockchain ensures transparent and dependable verification of model updates. To further enhance the system’s security, we introduce a reputation-based voting mechanism that assesses the trustworthiness of submitted models while utilizing the Interplanetary File System (IPFS) for efficient off-chain storage. Experimental results demonstrate that our approach not only improves forecasting accuracy but also enhances system resilience and scalability, making it a viable candidate for deployment in real-world, security-critical environments.
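The reputation-based voting idea can be illustrated with a small sketch. The Python snippet below is not the paper's implementation: it abstracts away the Ethereum and IPFS layers and shows only reputation-weighted acceptance of client updates; `validate`, the 0.5 acceptance threshold, and the 1.02/0.98 reputation multipliers are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
global_model = np.zeros(10)
reputation = {c: 1.0 for c in ["a", "b", "c", "d"]}   # per-client trust scores

def validate(update):
    """Stand-in for a validator's check, e.g. loss on a small held-out set."""
    return np.linalg.norm(update) + rng.normal(scale=0.5) < 4.0

client_updates = {c: rng.normal(scale=0.3, size=10) for c in reputation}
accepted = []
for client, update in client_updates.items():
    votes = {v: validate(update) for v in reputation if v != client}
    yes = sum(reputation[v] for v, ok in votes.items() if ok)
    total = sum(reputation[v] for v in votes)
    decision = yes / total >= 0.5                      # reputation-weighted vote
    if decision:
        accepted.append(update)
    for v, ok in votes.items():                        # reward agreement with outcome
        reputation[v] *= 1.02 if ok == decision else 0.98

if accepted:                                           # FedAvg over accepted updates
    global_model += np.mean(accepted, axis=0)
print(np.round(global_model[:3], 3), {k: round(v, 3) for k, v in reputation.items()})
```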

[AI-51] Biased AI improves human decision-making but reduces trust

【Quick Read】: This paper tackles the concern that current AI systems, by enforcing ideological neutrality, may induce automation bias and suppress human cognitive engagement in decision-making. The key finding is that culturally biased AI assistants (politically diverse GPT-4o variants) improve human performance on information-evaluation tasks, increase cognitive engagement, and reduce evaluative bias, with the benefits amplified when users encounter opposing views. The study further shows that exposing users to two AIs whose biases flank the user's own position closes the gap between perceived trust in AI and its actual performance, yielding a more robust improvement in human decision-making.

Link: https://arxiv.org/abs/2508.09297
Authors: Shiyang Lai, Junsol Kim, Nadav Kunievsky, Yujin Potter, James Evans
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Current AI systems minimize risk by enforcing ideological neutrality, yet this may introduce automation bias by suppressing cognitive engagement in human decision-making. We conducted randomized trials with 2,500 participants to test whether culturally biased AI enhances human decision-making. Participants interacted with politically diverse GPT-4o variants on information evaluation tasks. Partisan AI assistants enhanced human performance, increased engagement, and reduced evaluative bias compared to non-biased counterparts, with amplified benefits when participants encountered opposing views. These gains carried a trust penalty: participants underappreciated biased AI and overcredited neutral systems. Exposing participants to two AIs whose biases flanked human perspectives closed the perception-performance gap. These findings complicate conventional wisdom about AI neutrality, suggesting that strategic integration of diverse cultural biases may foster improved and resilient human decision-making.

[AI-52] Ethical Medical Image Synthesis

【Quick Read】: This paper focuses on the ethical issues in the research and development of medical image synthesis (MISyn), seeking to ensure ethical compliance across the entire lifecycle and to prevent the negative social impacts of misusing synthetic images. The core problem is that current MISyn work often overlooks synthetic images' lack of inherent grounding in real medical phenomena, their inability to fully represent the training data, and the new distribution shifts and biases they introduce; if these limits go unacknowledged, misleading uses (misinformation) can erode trust in the medical imaging data environment and cause algorithmic discrimination. The key to the solution is two kinds of practical support derived from a theoretical analysis: ethical practice recommendations that adapt existing technical standards, problem formulation, design, and evaluation processes of MISyn to the ethical challenges, and oversight recommendations that establish checks and balances through stakeholder and public participation, promoting ethically compliant MISyn through cross-disciplinary collaboration. Two case studies demonstrate how to apply the recommendations in practice and reveal gaps between existing practice and the ethical requirements.

Link: https://arxiv.org/abs/2508.09293
Authors: Weina Jin, Ashish Sinha, Kumar Abhishek, Ghassan Hamarneh
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The task of ethical Medical Image Synthesis (MISyn) is to ensure that the MISyn techniques are researched and developed ethically throughout their entire lifecycle, which is essential to prevent the negative impacts of MISyn. To address the ever-increasing needs and requirements for ethical practice of MISyn research and development, we first conduct a theoretical analysis that identifies the key properties of ethical MISyn and intrinsic limits of MISyn. We identify that synthetic images lack inherent grounding in real medical phenomena, cannot fully represent the training medical images, and inevitably introduce new distribution shifts and biases. Ethical risks can arise from not acknowledging the intrinsic limits and weaknesses of synthetic images compared to medical images, with the extreme form manifested as misinformation of MISyn that substitutes synthetic images for medical images without acknowledgment. The resulting ethical harms include eroding trust in the medical imaging dataset environment and causing algorithmic discrimination towards stakeholders and the public. To facilitate collective efforts towards ethical MISyn within and outside the medical image analysis community, we then propose practical supports for ethical practice in MISyn based on the theoretical analysis, including ethical practice recommendations that adapt the existing technical standards, problem formulation, design, and evaluation practice of MISyn to the ethical challenges; and oversight recommendations to facilitate checks and balances from stakeholders and the public. We also present two case studies that demonstrate how to apply the ethical practice recommendations in practice, and identify gaps between existing practice and the ethical practice recommendations.

[AI-53] The Othello AI Arena: Evaluating Intelligent Systems Through Limited-Time Adaptation to Unseen Boards

【Quick Read】: This paper addresses the fact that current AI benchmarks rarely measure a system's capacity for rapid adaptation and generalization under environmental change: traditional benchmarks focus on optimizing performance in fixed environments and ignore an agent's flexibility when rules or structures are even subtly modified. The key to the solution is the Othello AI Arena, a new benchmark framework that requires participating systems to analyze an unseen Othello board configuration and its rules within a strict time limit (60 seconds) and to generate a high-performing strategy tailored to that specific environment, thereby separating meta-level adaptation ability from task-level strategy performance. Its core design includes diverse game stages (public ones for development, private ones with structural and rule variations for testing genuine adaptation), a web-based platform with real-time visualization and automated multi-dimensional evaluation, and comprehensive logging for post-hoc analysis, making it an effective tool for studying and evaluating rapid, intelligent adaptation in AI systems.

Link: https://arxiv.org/abs/2508.09292
Authors: Sundong Kim
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The ability to rapidly adapt to novel and unforeseen environmental changes is a cornerstone of artificial general intelligence (AGI), yet it remains a critical blind spot in most existing AI benchmarks. Traditional evaluation largely focuses on optimizing performance within fixed environments, failing to assess systems’ flexibility and generalization capabilities when faced with even subtle rule or structural modifications. Addressing this gap, I introduce the Othello AI Arena, a novel benchmark framework designed to evaluate intelligent systems based on their capacity for limited-time adaptation to unseen environments. Our platform poses a meta-learning challenge: participants must develop systems that can analyze the specific configuration and rules of a novel Othello board within a strict time limit (60 seconds) and generate a tailored, high-performing strategy for that unique environment. With this, evaluation of the meta-level intelligence can be separated from the task-level strategy performance. The Arena features a diverse set of game stages, including public stages for development and private stages with structural and rule variations designed to test genuine adaptive and generalization capabilities. Implemented as an accessible web-based platform, the Arena provides real-time visualization, automated evaluation using multi-dimensional metrics, and comprehensive logging for post-hoc analysis. Initial observations from pilot tests and preliminary student engagements highlight fascinating patterns in adaptation approaches, ranging from rapid parameter tuning to rudimentary environmental model learning through simulation. The Othello AI Arena offers a unique educational tool and a valuable research benchmark for fostering and evaluating the crucial skill of rapid, intelligent adaptation in AI systems.

[AI-54] Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning

【Quick Read】: This paper addresses the difficulty of transferring knowledge from prior tasks through value function initialization (VFI) in deep reinforcement learning (DRL), where the continuous state-action space, noisy neural-network approximations, and the impracticality of storing all past models stand in the way. The key to the solution is DQInit: it reuses compact tabular Q-values extracted from previously solved tasks as a transferable knowledge base and employs a knownness-based mechanism to softly integrate the transferred values into under-explored regions, gradually shifting toward the agent's own learned estimates and avoiding the limitations of a fixed time decay. Relying solely on value estimates rather than policies or demonstrations, the method combines the strengths of jumpstart RL and policy distillation while mitigating their drawbacks, and experiments across continuous control tasks show consistent improvements in early learning efficiency, stability, and overall performance.

Link: https://arxiv.org/abs/2508.09277
Authors: Soumia Mehimeh
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:Value function initialization (VFI) is an effective way to achieve a jumpstart in reinforcement learning (RL) by leveraging value estimates from prior tasks. While this approach is well established in tabular settings, extending it to deep reinforcement learning (DRL) poses challenges due to the continuous nature of the state-action space, the noisy approximations of neural networks, and the impracticality of storing all past models for reuse. In this work, we address these challenges and introduce DQInit, a method that adapts value function initialization to DRL. DQInit reuses compact tabular Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent’s learned estimates, avoiding the limitations of fixed time decay. Our approach offers a novel perspective on knowledge transfer in DRL by relying solely on value estimates rather than policies or demonstrations, effectively combining the strengths of jumpstart RL and policy distillation while mitigating their drawbacks. Experiments across multiple continuous control tasks demonstrate that DQInit consistently improves early learning efficiency, stability, and overall performance compared to standard initialization and existing transfer techniques.
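A toy sketch of the knownness-based blending at the heart of DQInit follows, under stated assumptions: the `tau` constant, the bandit-style update, and the reward are invented for illustration, and the paper's actual knownness mechanism and DRL setting are richer than this table-based stand-in.

```python
import numpy as np

n_states, n_actions = 5, 2
q_transfer = np.ones((n_states, n_actions))      # compact table from a prior task
q_learned = np.zeros((n_states, n_actions))      # current agent's estimates
visits = np.zeros((n_states, n_actions))

def q_value(s, a, tau=10.0):
    """Blend: transferred values dominate in under-explored regions,
    learned estimates take over as the visit count grows."""
    knownness = visits[s, a] / (visits[s, a] + tau)   # 0 when unexplored -> 1
    return knownness * q_learned[s, a] + (1 - knownness) * q_transfer[s, a]

# Toy loop: act greedily w.r.t. blended values, learn a simple value target.
rng = np.random.default_rng(1)
for _ in range(1000):
    s = rng.integers(n_states)
    a = int(np.argmax([q_value(s, a) for a in range(n_actions)]))
    r = float(a == s % n_actions)                    # toy reward
    visits[s, a] += 1
    q_learned[s, a] += 0.1 * (r - q_learned[s, a])   # bandit-style target for brevity

print(q_value(0, 0), q_value(4, 1))
```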

[AI-55] Detection of Odor Presence via Deep Neural Networks

【Quick Read】: This paper targets the shortcomings of current artificial olfactory sensors on complex odor mixtures and the lack of reliable single-trial fidelity in non-invasive recordings. The core solution is a deep learning framework over multichannel olfactory-bulb local field potentials (LFPs): an ensemble of complementary one-dimensional convolutional networks (ResCNN and AttentionCNN) that accurately detects the presence of odor from single trials. The results support two hypotheses: spectral features of LFPs suffice for robust single-trial odor detection, and signals from the olfactory bulb alone are adequate. The ensemble reaches a mean accuracy of 86.6%, an F1-score of 81.0%, and an AUC of 0.9247, clearly surpassing previous benchmarks, while t-SNE visualization confirms that the model captures biologically meaningful olfactory representations.

Link: https://arxiv.org/abs/2508.09264
Authors: Matin Hassanloo, Ali Zareh, Mehmet Kemal Özdemir
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Odor detection underpins food safety, environmental monitoring, medical diagnostics, and many more fields. The current artificial sensors developed for odor detection struggle with complex mixtures while non-invasive recordings lack reliable single-trial fidelity. To develop a general system for odor detection, in this study we present a preliminary work where we aim to test two hypotheses: (i) that spectral features of local field potentials (LFPs) are sufficient for robust single-trial odor detection and (ii) that signals from the olfactory bulb alone are adequate. To test two hypotheses, we propose an ensemble of complementary one-dimensional convolutional networks (ResCNN and AttentionCNN) that decodes the presence of odor from multichannel olfactory bulb LFPs. Tested on 2,349 trials from seven awake mice, our final ensemble model supports both hypotheses, achieving a mean accuracy of 86.6%, an F1-score of 81.0%, and an AUC of 0.9247, substantially outperforming previous benchmarks. In addition, the t-SNE visualization confirms that our framework captures biologically significant signatures. These findings establish the feasibility of robust single-trial detection of the presence of odor from extracellular LFPs, as well as demonstrate the potential of deep learning models to provide a deeper understanding of olfactory representations.
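For intuition, here is a compact PyTorch sketch of an ensemble of two 1D CNN heads over multichannel LFP trials. The layer sizes and the plain convolutional head are assumptions standing in for the paper's ResCNN and AttentionCNN.

```python
import torch
import torch.nn as nn

class Conv1DHead(nn.Module):
    def __init__(self, channels=16, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, 1),
        )

    def forward(self, x):        # x: (batch, channels, time)
        return self.net(x)

heads = [Conv1DHead(), Conv1DHead()]       # e.g. a residual- and an attention-style pair
x = torch.randn(8, 16, 256)                # batch of single trials (B, LFP channels, time)
logits = torch.stack([h(x) for h in heads]).mean(0)   # average the ensemble
prob_odor = torch.sigmoid(logits)          # per-trial probability that odor is present
print(prob_odor.squeeze(-1))
```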

[AI-56] PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research

【Quick Read】: This paper addresses the difficulty AI researchers face in reconciling multiple, overlapping legal obligations when working with social media data, particularly at the intersection of the GDPR, copyright law, and platform terms of service, where existing frameworks fail to integrate these regulatory domains and leave researchers without clear, unified guidance. The key to the solution is PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds privacy-by-design safeguards directly into an extended ETL pipeline and treats the Data Protection Impact Assessment (DPIA) as a living document that evolves with the research stages, enabling end-to-end compliance management from data acquisition through dissemination.

Link: https://arxiv.org/abs/2508.09232
Authors: Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke
Affiliation: Unknown
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms – yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We reveal why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.

[AI-57] Beyond Technocratic XAI: The Who What How in Explanation Design

【Quick Read】: This paper addresses the practical difficulty of explanation in AI systems: how to connect the interpretability of complex models with users' actual needs in concrete application contexts. Traditional XAI approaches often ignore the context dependence of explanations, so the explanations they produce rarely achieve genuine transparency and accessibility. The core of the solution is to reframe explanation as a situated design process via a three-part framework: clarifying Who needs the explanation, What they need explained, and How that explanation should be delivered. The framework also foregrounds ethical considerations such as epistemic inequality, the reinforcement of social inequities, and obscured accountability and governance, pushing XAI practice from a technology-centered toward a user-centered, socially aware design methodology.

Link: https://arxiv.org/abs/2508.09231
Authors: Ruchira Dhar, Stephanie Brandl, Ninell Oldenburg, Anders Søgaard
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to AI, Ethics & Society Conference (AIES) Proceedings 2025

Click to view abstract

Abstract:The field of Explainable AI (XAI) offers a wide range of techniques for making complex models interpretable. Yet, in practice, generating meaningful explanations is a context-dependent task that requires intentional design choices to ensure accessibility and transparency. This paper reframes explanation as a situated design process – an approach particularly relevant for practitioners involved in building and deploying explainable systems. Drawing on prior research and principles from design thinking, we propose a three-part framework for explanation design in XAI: asking Who needs the explanation, What they need explained, and How that explanation should be delivered. We also emphasize the need for ethical considerations, including risks of epistemic inequality, reinforcing social inequities, and obscuring accountability and governance. By treating explanation as a sociotechnical design process, this framework encourages a context-aware approach to XAI that supports effective communication and the development of ethically responsible explanations.

[AI-58] Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems

【Quick Read】: This paper addresses the lack of robustness of multi-agent systems under adversarial attack: once a single agent is compromised, the attack can spread to other agents and undermine the integrity and security of the whole system. The key to the solution is Cowpox, a novel defense mechanism whose core idea is to generate and distribute, in a distributed fashion, a special "cure sample" that immunizes agents against the attack before exposure and helps recover already-infected agents, thereby limiting the expected spread of the attack and improving the system's overall robustness, with both empirical effectiveness and theoretical robustness guarantees.

Link: https://arxiv.org/abs/2508.09230
Authors: Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Nils Lukas, Tianwei Zhang
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Vision Language Model (VLM)-based agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system should maintain its integrity under adversarial attacks. However, the design of existing multi-agent systems lacks the robustness consideration, as a successful exploit against one agent can spread and infect other agents to undermine the entire system’s assurance. To address this, we propose a new defense approach, Cowpox, to provably enhance the robustness of multi-agent systems. It incorporates a distributed mechanism, which improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps recover the already infected agents. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.

[AI-59] Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference

【Quick Read】: This paper studies the efficient deployment of pre-trained Mixture-of-Experts (MoE) large language models (LLMs) on multi-server clusters, focusing on minimizing network transmission overhead during inference and improving cluster utilization. The key to the solution is a topology-aware model placement strategy formulated as an integer linear program (ILP): by optimizing how experts are assigned across servers, the expected number of transmissions is minimized so that activated experts sit on nodes with low communication cost. The approach handles the imbalanced expert loads and topology dependence characteristic of MoE models and yields lower network traffic than competitors on both a small model (DeepSeekMoE 16B) and a large one (DeepSeek-R1 671B).

Link: https://arxiv.org/abs/2508.09229
Authors: Danil Sivtsov, Aleksandr Katrutsa, Ivan Oseledets
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:Efficient deployment of a pre-trained LLM to a cluster with multiple servers is a critical step for providing fast responses to users’ queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently, considering their underlying structure. During the inference in MoE LLMs, only a small part of the experts is selected to process a given token. Moreover, in practice, the experts’ load is highly imbalanced. For efficient deployment, one has to distribute the model across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the model placement algorithm has to take into account the network topology. This work focuses on the efficient topology-aware placement of the pre-trained MoE LLMs in the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. Due to the internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that the ILP-based placement strategy yields lower network traffic than competitors for small-scale (DeepSeekMoE 16B) and large-scale (DeepSeek-R1 671B) models.
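The placement ILP can be illustrated on a toy instance. The sketch below uses PuLP with an assumed objective and constraints (not the paper's exact formulation): it places four experts on two servers so that frequently co-activated experts share a server, minimizing weighted cross-server pairs as a proxy for transmissions. Requires `pip install pulp`.

```python
import itertools
import pulp

experts, servers = range(4), range(2)
capacity = 2                                   # experts per server (assumed)
# co_act[i][j]: how often experts i and j are activated for the same token
co_act = [[0, 5, 1, 0], [5, 0, 0, 1], [1, 0, 0, 4], [0, 1, 4, 0]]

prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (experts, servers), cat="Binary")   # expert -> server
pairs = [(i, j) for i, j in itertools.combinations(experts, 2) if co_act[i][j] > 0]
# y[(i, j)] = 1 if experts i and j end up on different servers (linearized below)
y = pulp.LpVariable.dicts("y", pairs, cat="Binary")

prob += pulp.lpSum(co_act[i][j] * y[(i, j)] for i, j in pairs)     # expected transmissions
for e in experts:
    prob += pulp.lpSum(x[e][s] for s in servers) == 1              # place each expert once
for s in servers:
    prob += pulp.lpSum(x[e][s] for e in experts) <= capacity       # server capacity
for (i, j) in pairs:
    for s in servers:                                              # detect split pairs
        prob += y[(i, j)] >= x[i][s] - x[j][s]
        prob += y[(i, j)] >= x[j][s] - x[i][s]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for e in experts:
    print("expert", e, "-> server", [s for s in servers if x[e][s].value() == 1])
```

On this instance the solver co-locates the heavily co-activated pairs (0, 1) and (2, 3), cutting only the weight-1 edges.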

[AI-60] GSMT: Graph Fusion and Spatiotemporal Task Correction for Multi-Bus Trajectory Prediction ITSC2025

【Quick Read】: This paper addresses the limited accuracy of bus trajectory prediction in urban traffic environments of developing regions, where access to multimodal data is restricted and onboard GPS data must be relied upon alone. The key to the solution is GSMT, a two-stage hybrid model that combines a Graph Attention Network (GAT) with a sequence-to-sequence recurrent neural network (RNN) and adds a task corrector that extracts complex behavioral patterns from large-scale trajectory data to refine the initial predictions. Concretely, GSMT fuses dynamic bus information with static station information through an embedded hybrid network, and the task corrector clusters historical trajectories to identify distinct motion patterns for secondary refinement, improving multi-node trajectory prediction for buses operating under complex conditions in dense urban traffic.

Link: https://arxiv.org/abs/2508.09227
Authors: Fan Ding, Hwa Hui Tew, Junn Yong Loo, Susilawati, LiTong Liu, Fang Yu Leong, Xuewen Luo, Kar Keong Chin, Jia Jun Gan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: This paper has been accepted by ITSC 2025

Click to view abstract

Abstract:Accurate trajectory prediction for buses is crucial in intelligent transportation systems, particularly within urban environments. In developing regions where access to multimodal data is limited, relying solely on onboard GPS data remains indispensable despite inherent challenges. To address this problem, we propose GSMT, a hybrid model that integrates a Graph Attention Network (GAT) with a sequence-to-sequence Recurrent Neural Network (RNN), and incorporates a task corrector capable of extracting complex behavioral patterns from large-scale trajectory data. The task corrector clusters historical trajectories to identify distinct motion patterns and fine-tunes the predictions generated by the GAT and RNN. Specifically, GSMT fuses dynamic bus information and static station information through embedded hybrid networks to perform trajectory prediction, and applies the task corrector for secondary refinement after the initial predictions are generated. This two-stage approach enables multi-node trajectory prediction among buses operating in dense urban traffic environments under complex conditions. Experiments conducted on a real-world dataset from Kuala Lumpur, Malaysia, demonstrate that our method significantly outperforms existing approaches, achieving superior performance in both short-term and long-term trajectory prediction tasks.

[AI-61] Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation

【Quick Read】: This paper addresses the distribution shift between target and source domains that pre-trained models face at test time, noting that standard methods relying on a single linear classification layer struggle with diverse and complex shifts. The key to the solution is Hierarchical Adaptive Networks with Task Vectors (Hi-Vec): it decomposes the encoder's representation space into multiple hierarchically organized layers of increasing size, performs dynamic layer selection to automatically pick the best layer for adapting to each test batch, merges weights from the dynamic layer into the other layers so that all layers receive target information, and uses linear layer agreement as a gating function to prevent erroneous fine-tuning on noisy batches, thereby improving robustness, handling uncertainty, and remaining stable under small batch sizes and high outlier rates.

Link: https://arxiv.org/abs/2508.09223
Authors: Sameer Ambekar, Daniel M. Lang, Julia A. Schnabel
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Test-time adaptation allows pretrained models to adjust to incoming data streams, addressing distribution shifts between source and target domains. However, standard methods rely on single-dimensional linear classification layers, which often fail to handle diverse and complex shifts. We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec), which leverages multiple layers of increasing size for dynamic test-time adaptation. By decomposing the encoder’s representation space into such hierarchically organized layers, Hi-Vec, in a plug-and-play manner, allows existing methods to adapt to shifts of varying complexity. Our contributions are threefold: First, we propose dynamic layer selection for automatic identification of the optimal layer for adaptation to each test batch. Second, we propose a mechanism that merges weights from the dynamic layer to other layers, ensuring all layers receive target information. Third, we propose linear layer agreement that acts as a gating function, preventing erroneous fine-tuning by adaptation on noisy batches. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets, proving its strong capability to advance state-of-the-art methods. Our results show that Hi-Vec improves robustness, addresses uncertainty, and handles limited batch sizes and increased outlier rates.

[AI-62] Understanding Ethical Practices in AI: Insights from a Cross-Role Cross-Region Survey of AI Development Teams

【Quick Read】: Against the backdrop of rapidly advancing AI technologies such as generative AI, this paper examines the gap between ethical norms and practice: how to improve awareness, practice, and risk mitigation around AI ethics principles across diverse roles and regions. The key to the solution is a role-sensitive, collaborative approach: involving different stakeholders (AI managers, developers, security and privacy experts, and others) in ethical decision-making throughout the AI development lifecycle, and developing tailored, inclusive governance strategies that account for role- and region-specific characteristics so that ethics-aware practices become embedded in AI development.

Link: https://arxiv.org/abs/2508.09219
Authors: Wilder Baldwin, Sepideh Ghanavati, Manuel Woersdoerfer
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Under Review

Click to view abstract

Abstract:Recent advances in AI applications have raised growing concerns about the need for ethical guidelines and regulations to mitigate the risks posed by these technologies. In this paper, we present a mixed-method survey study - combining statistical and qualitative analyses - to examine the ethical perceptions, practices, and knowledge of individuals involved in various AI development roles. Our survey includes 414 participants from 43 countries, representing roles such as AI managers, analysts, developers, quality assurance professionals, and information security and privacy experts. The results reveal varying degrees of familiarity and experience with AI ethics principles, government initiatives, and risk mitigation strategies across roles, regions, and other demographic factors. Our findings highlight the importance of a collaborative, role-sensitive approach, involving diverse stakeholders in ethical decision-making throughout the AI development lifecycle. We advocate for developing tailored, inclusive solutions to address ethical challenges in AI development, and we propose future research directions and educational strategies to promote ethics-aware AI practices.

[AI-63] CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge

【Quick Read】: This paper addresses the challenges of deploying large, sparsely activated Mixture-of-Experts (MoE) models in resource-constrained mobile edge computing environments, including the large memory footprint and the computational inefficiency caused by dynamic expert activation patterns. The key to the solution is CoMoE, a dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies in response to real-time device resource states, network conditions, and input characteristics, enabling efficient deployment of MoE models on heterogeneous edge devices. CoMoE introduces adaptive scheduling for user mobility and network variation and combines expert prediction and caching with a multi-tier storage architecture. Experiments show roughly 70% lower memory usage and 10.5% lower inference latency than existing methods while keeping model performance stable, allowing large MoE models that previously required the cloud (e.g., the 7.4B-parameter Switch-Base-128) to run on resource-constrained mobile edge devices.

Link: https://arxiv.org/abs/2508.09208
Authors: Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments, denoted as CoMoE. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging, knowledge distillation, and parameter sharing decomposition, identifying their limitations in dynamic mobile environments. We then investigate expert offloading strategies encompassing expert prediction and prefetching, expert caching and scheduling, and multi-tier storage architectures, revealing the interdependencies between routing decisions and offloading strategies. The CoMoE incorporates adaptive scheduling mechanisms that respond to user mobility and varying network conditions, enabling efficient MoE deployment across heterogeneous edge devices. Extensive experiments on real mobile edge testbeds demonstrate that CoMoE achieves approximately 70% reduction in memory usage compared to baseline methods, 10.5% lower inference latency than existing expert offloading techniques, while maintaining model performance stability. For large-scale MoE models (e.g., 7.4B-parameter Switch-Base-128), the CoMoE reduces memory requirements from 15.6GB to 4.7GB, enabling deployment on resource-constrained mobile edge devices that previously could only support much smaller models.

[AI-64] ADT4Coupons: An Innovative Framework for Sequential Coupon Distribution in E-commerce

【Quick Read】: This paper addresses the stagnating effectiveness of coupon distribution on online platforms, which stems from failing to exploit the complex sequential interactions between the platform and its users: existing strategies do not fully use the historical sequence information in user behavior, limiting long-term revenue gains. The key to the solution is a new marketing framework, Aligned Decision Transformer for Coupons (ADT4Coupons), whose core innovation is to unify three key properties within a single framework: applicability to general scenarios, sequential modeling over more comprehensive historical data, and efficient iterative updates, so as to optimize coupon distribution decisions across multiple rounds and many users.

Link: https://arxiv.org/abs/2508.09198
Authors: Li Kong, Bingzhe Wang, Zhou Chen, Suhan Hu, Yuchao Ma, Qi Qi, Suoyuan Song, Bicheng Jin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Coupon distribution is a critical marketing strategy used by online platforms to boost revenue and enhance user engagement. Regrettably, existing coupon distribution strategies fall far short of effectively leveraging the complex sequential interactions between platforms and users. This critical oversight, despite the abundance of e-commerce log data, has precipitated a performance plateau. In this paper, we focus on the scenario in which platforms make sequential coupon distribution decisions multiple times for various users, with each user interacting with the platform repeatedly. Based on this marketing scenario, we propose a novel marketing framework, named Aligned Decision Transformer for Coupons (ADT4Coupons), to directly devise coupon distribution policy for long-term revenue boosting. ADT4Coupons enables optimized online decision-making in a variety of real-world marketing scenarios. It achieves this by seamlessly integrating three key characteristics: general scenarios, sequential modeling with more comprehensive historical data, and efficient iterative updates within a unified framework. Furthermore, empirical results on a real-world industrial dataset, alongside public and synthetic datasets, demonstrate the superiority of our framework.

[AI-65] MX-AI: Agentic Observability and Control Platform for Open and AI-RAN

【Quick Read】: This paper tackles automated management and control in future 6G radio access networks (RANs), i.e., autonomous, cooperative operations under an AI-native architecture. Traditional RANs depend on manual configuration and static policies and adapt poorly to dynamic network conditions and complex service demands. The key to the solution is MX-AI, the first end-to-end agentic system, whose core components are: (i) observability and control over a live 5G Open RAN testbed (based on OpenAirInterface and FlexRIC); (ii) a graph of LLM-powered agents inside the Service Management and Orchestration (SMO) layer for intent understanding and decision making; and (iii) programmable access to 6G RAN resources through natural-language intents. On 50 realistic operational queries, MX-AI reaches a mean answer quality of 4.1/5.0 and 100% decision-action accuracy with only 8.8 seconds end-to-end latency when backed by GPT-4.1, matching human-expert performance and validating its practicality in real settings.

Link: https://arxiv.org/abs/2508.09197
Authors: Ilias Chatzistefanidis, Andrea Leone, Ali Yaghoubian, Mikel Irazabal, Sehad Nassim, Lina Bariah, Merouane Debbah, Navid Nikaein
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Future 6G radio access networks (RANs) will be artificial intelligence (AI)-native: observed, reasoned about, and re-configured by autonomous agents cooperating across the cloud-edge continuum. We introduce MX-AI, the first end-to-end agentic system that (i) instruments a live 5G Open RAN testbed based on OpenAirInterface (OAI) and FlexRIC, (ii) deploys a graph of Large-Language-Model (LLM)-powered agents inside the Service Management and Orchestration (SMO) layer, and (iii) exposes both observability and control functions for 6G RAN resources through natural-language intents. On 50 realistic operational queries, MX-AI attains a mean answer quality of 4.1/5.0 and 100 % decision-action accuracy, while incurring only 8.8 seconds end-to-end latency when backed by GPT-4.1. Thus, it matches human-expert performance, validating its practicality in real settings. We publicly release the agent graph, prompts, and evaluation harness to accelerate open research on AI-native RANs. A live demo is presented here: this https URL

[AI-66] Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments

【Quick Read】: This paper addresses the high computational cost, limited scalability, and data-security risks of deploying large models (such as large language models, LLMs) in decentralized systems, with the central challenge of selecting the right inference acceleration scheme to use computational resources effectively and improve system responsiveness. The key to the solution is a meta-learning-based framework that automates this selection: it learns from historical performance data of various acceleration techniques across tasks and identifies the best acceleration strategy from each task's characteristics, replacing random selection or expert intuition and improving both decision efficiency and overall system performance.

Link: https://arxiv.org/abs/2508.09194
Authors: Yipeng Du, Zihao Wang, Ahmad Farhan, Claudio Angione, Harry Yang, Fielding Johnston, James P. Buban, Patrick Colangelo, Yue Zhao, Yuzhe Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: COLM 2025

Click to view abstract

Abstract:The deployment of large-scale models, such as large language models (LLMs), incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to scalability and data security, there is a growing shift towards decentralized systems for model deployment, where choosing efficient inference acceleration schemes become crucial to manage computational resources effectively and enhance system responsiveness. In this work, we address the challenge of selecting optimal acceleration methods in decentralized systems by introducing a meta-learning-based framework. This framework automates the selection process by learning from historical performance data of various acceleration techniques across different tasks. Unlike traditional methods that rely on random selection or expert intuition, our approach systematically identifies the best acceleration strategies based on the specific characteristics of each task. We demonstrate that our meta-learning framework not only streamlines the decision-making process but also consistently outperforms conventional methods in terms of efficiency and performance. Our results highlight the potential of inference acceleration in decentralized AI systems, offering a path towards more democratic and economically feasible artificial intelligence solutions.
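The core loop is easy to sketch: train a meta-learner on historical (task features, best method) records and query it for new tasks. Everything below is invented for illustration, including the features, method labels, and the synthetic ground truth; the paper's actual feature set and learner are not specified here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Task features: [prompt_length, batch_size, model_size_B] (assumed features).
X = rng.uniform([16, 1, 1], [4096, 64, 70], size=(200, 3))
# Synthetic "best method" labels: long prompts favor KV-cache offload,
# big models favor quantization, otherwise speculative decoding.
y = np.where(X[:, 0] > 2000, "kv_cache_offload",
             np.where(X[:, 2] > 30, "quantization", "speculative_decoding"))

meta = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_task = np.array([[3000, 8, 13]])   # prompt_length=3000, batch=8, 13B model
print(meta.predict(new_task))          # -> acceleration method to use for this task
```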

[AI-67] Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation RL

【Quick Read】: This paper addresses the limited controllability of existing instructed reinforcement learning for procedural content generation (IPCGRL) under complex multi-objective textual instructions, in particular its failure to exploit the expressive richness of natural-language input. The key to the solution is MIPCGRL (Multi-Objective Representation Learning for Instructed Content Generators), which conditions on sentence embeddings and jointly trains multi-label classification and multi-head regression networks to build a multi-objective embedding space that effectively encodes multi-objective semantics, markedly improving the model's responsiveness to complex instructions; experiments show up to a 13.8% improvement in controllability under multi-objective instructions.

Link: https://arxiv.org/abs/2508.09193
Authors: Sung-Hyun Kim, In-Chang Baek, Seo-Young Lee, Geum-Hwan Hwang, Kyung-Joong Kim
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Click to view abstract

Abstract:Recent advancements in generative modeling emphasize the importance of natural language as a highly expressive and accessible modality for controlling content generation. However, existing instructed reinforcement learning for procedural content generation (IPCGRL) methods often struggle to leverage the expressive richness of textual input, especially under complex, multi-objective instructions, leading to limited controllability. To address this problem, we propose MIPCGRL, a multi-objective representation learning method for instructed content generators, which incorporates sentence embeddings as conditions. MIPCGRL effectively trains a multi-objective embedding space by incorporating multi-label classification and multi-head regression networks. Experimental results show that the proposed method achieves up to a 13.8% improvement in controllability with multi-objective instructions. The ability to process complex instructions enables more expressive and flexible content generation.

[AI-68] Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

【Quick Read】: This paper addresses the fact that diffusion large language models (dLLMs) have not yet beaten same-size autoregressive (AR) LLMs in inference speed: despite their potential to generate multiple tokens in parallel, existing open-source dLLMs remain slower in practice. The key to the solution is Discrete Diffusion Forcing (D2F), a simple and effective strategy that equips dLLMs with two capabilities: (1) block-wise autoregressive generation that enables KV-cache utilization, and (2) predicting tokens in subsequent blocks without waiting for earlier blocks to finish, enabling inter-block parallel decoding. Native dLLMs are thus refurbished into an AR-diffusion hybrid inference paradigm; experiments show D2F-based models run more than 2.5x faster than LLaMA3 and Qwen2.5 on GSM8K, and more than 50x faster than native dLLMs such as LLaDA and Dream while maintaining comparable output quality.

Link: https://arxiv.org/abs/2508.09192
Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× inference speed than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at this https URL.
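A schematic sketch of block-wise decoding with inter-block parallelism in the spirit of D2F follows. It is illustrative only: `denoise_step` mocks a diffusion refinement pass, and the 50% readiness threshold for opening the next block is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK, N_BLOCKS, MASK = 4, 3, -1
blocks = [np.full(BLOCK, MASK) for _ in range(N_BLOCKS)]

def denoise_step(b):
    """Unmask one token of block b; a real dLLM would condition on the
    KV cache of all earlier, possibly still partial, blocks."""
    masked = np.where(blocks[b] == MASK)[0]
    if masked.size:
        blocks[b][masked[0]] = int(rng.integers(0, 100))  # mock token prediction

active = [0]                                  # pipeline of partially decoded blocks
for _ in range(100):
    for b in active:                          # all active blocks refine in parallel
        denoise_step(b)
    last = active[-1]
    # Open the next block before the current one finishes (semi-AR pipeline).
    if last + 1 < N_BLOCKS and (blocks[last] != MASK).mean() >= 0.5:
        active.append(last + 1)
    active = [b for b in active if (blocks[b] == MASK).any()]
    if not active:                            # every block fully decoded
        break

print(blocks)
```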

[AI-69] From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

【Quick Read】: This paper addresses the difficulty in time series forecasting of effectively fusing historical numerical sequences with contextual features, especially unstructured text, to improve accuracy. The key to the solution is TokenCast: a discrete tokenizer converts continuous numerical sequences into temporal tokens, enabling structural alignment with language inputs; a pre-trained large language model (LLM) maps temporal tokens and contextual text tokens into a shared semantic space, further optimized with autoregressive generative objectives; the aligned LLM is then fine-tuned with supervision in that shared space to predict future temporal tokens, which are decoded back into the original numerical space, markedly strengthening the model's ability to exploit complex context and to generalize.

Link: https://arxiv.org/abs/2508.09191
Authors: Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, an LLM-driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained large language model (LLM), further optimized with autoregressive generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on diverse real-world datasets enriched with contextual features demonstrate the effectiveness and generalizability of TokenCast.
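The tokenization step can be illustrated with simple quantile binning. Note this is an assumption for illustration: TokenCast's actual tokenizer is learned, not a fixed binning scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))            # toy time series
VOCAB = 32

# Fit bin edges on the history so tokens have (roughly) uniform frequency.
edges = np.quantile(series, np.linspace(0, 1, VOCAB + 1)[1:-1])

def encode(x):
    """Continuous values -> discrete token ids in [0, VOCAB)."""
    return np.digitize(x, edges)

def decode(tokens):
    """Token ids -> representative values (bin centers by quantile)."""
    centers = np.quantile(series, (np.arange(VOCAB) + 0.5) / VOCAB)
    return centers[tokens]

tokens = encode(series)                             # what the LLM would consume
recon = decode(tokens)                              # map predictions back to values
print(tokens[:10], float(np.abs(series - recon).mean()))
```

The round-trip error shown at the end is the quantization cost one pays for making the series consumable by a token-based model.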

[AI-70] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

【Quick Read】: This paper addresses the safety risks that domain-specific fine-tuning introduces into large language models (LLMs) by disrupting their original alignment, noting that existing post-fine-tuning defenses rely on coarse-grained safety-layer mapping and neglect the multi-scale interaction between fine-grained neurons and safety layers, making it hard to balance safety and utility efficiently. The key to the solution is Fine-Grained Safety Neurons (FGSN) with a training-free continual projection method: it localizes sparser and more precise fine-grained safety neurons by integrating the multi-scale interactions between safety layers and neurons, then projects the safety-neuron parameters onto safety directions, improving safety with minimal interference to downstream-task neurons while aligning more closely with human preferences, and providing continual defense and generalization against unforeseen emerging safety threats.

Link: https://arxiv.org/abs/2508.09190
Authors: Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Fine-tuning as service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model’s utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
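A minimal sketch of the projection step follows, assuming the safety neurons and the safety direction have already been identified by the paper's procedure (here they are random placeholders); the interpolation knob `alpha` is likewise invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                 # weights of one layer (rows = neurons)
safety_dir = rng.normal(size=16)
safety_dir /= np.linalg.norm(safety_dir)     # unit-norm safety direction
safety_neurons = [1, 4]                      # indices of fine-grained safety neurons

alpha = 0.8                                  # safety-vs-utility trade-off (assumed)
for i in safety_neurons:
    coef = W[i] @ safety_dir                 # component along the safety direction
    # Pull only the selected rows toward their safety component, leaving
    # the remaining (downstream-task) neurons untouched.
    W[i] = (1 - alpha) * W[i] + alpha * coef * safety_dir

print(np.round(W[1] @ safety_dir, 3))        # alignment with the safety direction
```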

[AI-71] HiSTM: Hierarchical Spatiotemporal Mamba for Cellular Traffic Forecasting

【Quick Read】: This paper addresses the trade-off between accuracy and computational efficiency in cellular traffic forecasting, which is complicated by the intricate spatiotemporal patterns induced by user mobility. The key to the solution is the Hierarchical SpatioTemporal Mamba (HiSTM), which combines a dual spatial encoder, a Mamba-based temporal module, and an attention mechanism, using selective state-space methods to capture spatial and temporal patterns in network traffic efficiently. On a real-world dataset, HiSTM improves MAE by 29.4% over the STN baseline while using 94% fewer parameters, generalizes well across datasets, and retains its accuracy advantage over longer prediction horizons.

Link: https://arxiv.org/abs/2508.09184
Authors: Zineddine Bettouche, Khalid Ali, Andreas Fischer, Andreas Kassler
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Cellular traffic forecasting is essential for network planning, resource allocation, or load-balancing traffic across cells. However, accurate forecasting is difficult due to intricate spatial and temporal patterns that exist due to the mobility of users. Existing AI-based traffic forecasting models often trade-off accuracy and computational efficiency. We present Hierarchical SpatioTemporal Mamba (HiSTM), which combines a dual spatial encoder with a Mamba-based temporal module and attention mechanism. HiSTM employs selective state space methods to capture spatial and temporal patterns in network traffic. In our evaluation, we use a real-world dataset to compare HiSTM against several baselines, showing a 29.4% MAE improvement over the STN baseline while using 94% fewer parameters. We show that the HiSTM generalizes well across different datasets and improves in accuracy over longer time-horizons.

[AI-72] Long-Term Client Selection for Federated Learning with Non-IID Data: A Truthful Auction Approach

【Quick Read】: This paper addresses slow convergence and degraded accuracy caused by non-IID client data in federated learning (FL) for the Internet of Vehicles (IoV), along with the resource waste, information asymmetry, and weak incentives of traditional client-selection mechanisms. The key to the solution is LCSFLA (Long-term Client-Selection Federated Learning based on Truthful Auction): it maximizes social welfare through a new assessment mechanism that accounts for long-term data quality and energy costs, and its auction mechanism with a deposit requirement incentivizes client participation and ensures truthful reporting. The incentive compatibility and individual rationality of the mechanism are proven theoretically, improving the effectiveness and trustworthiness of client selection.

Link: https://arxiv.org/abs/2508.09181
Authors: Jinghong Tan, Zhian Liu, Kun Guo, Mingxiong Zhao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Federated learning (FL) provides a decentralized framework that enables universal model training through collaborative efforts on mobile nodes, such as smart vehicles in the Internet of Vehicles (IoV). Each smart vehicle acts as a mobile client, contributing to the process without uploading local data. This method leverages non-independent and identically distributed (non-IID) training data from different vehicles, influenced by various driving patterns and environmental conditions, which can significantly impact model convergence and accuracy. Although client selection can be a feasible solution for non-IID issues, it faces challenges related to selection metrics. Traditional metrics evaluate client data quality independently per round and require client selection after all clients complete local training, leading to resource wastage from unused training results. In the IoV context, where vehicles have limited connectivity and computational resources, information asymmetry in client selection risks clients submitting false information, potentially making the selection ineffective. To tackle these challenges, we propose a novel Long-term Client-Selection Federated Learning based on Truthful Auction (LCSFLA). This scheme maximizes social welfare with consideration of long-term data quality using a new assessment mechanism and energy costs, and the advised auction mechanism with a deposit requirement incentivizes client participation and ensures information truthfulness. We theoretically prove the incentive compatibility and individual rationality of the advised incentive mechanism. Experimental results on various datasets, including those from IoV scenarios, demonstrate its effectiveness in mitigating performance degradation caused by non-IID data.

[AI-73] scAGC: Learning Adaptive Cell Graphs with Contrastive Guidance for Single-Cell Clustering

【Quick Read】: This paper addresses the accuracy of cell-type annotation in single-cell RNA sequencing (scRNA-seq) data, which is hampered by high dimensionality, pervasive zero values, and unstable modeling of cell-cell relationships. Traditional clustering struggles statistically and computationally on such data, while existing graph-neural-network methods typically rely on static graphs that are noise-sensitive and miss the long-tailed degree distribution inherent in single-cell data. The key to the solution is scAGC, which jointly optimizes feature representations and an adaptive cell graph end-to-end: a topology-adaptive graph autoencoder refines the graph dynamically during training via differentiable Gumbel-Softmax sampling, mitigating the long-tailed degree distribution by promoting a more balanced neighborhood structure; a Zero-Inflated Negative Binomial (ZINB) loss accommodates the discrete, over-dispersed, zero-inflated nature of scRNA-seq data; and a contrastive learning objective regularizes and stabilizes graph-structure evolution, improving clustering performance and convergence.

Link: https://arxiv.org/abs/2508.09180
Authors: Huifa Li, Jie Fu, Xinlin Zhuang, Haolin Yang, Xinpeng Ling, Tong Cheng, Haochen Xue, Imran Razzak, Zhili Chen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate cell type annotation is a crucial step in analyzing single-cell RNA sequencing (scRNA-seq) data, which provides valuable insights into cellular heterogeneity. However, due to the high dimensionality and prevalence of zero elements in scRNA-seq data, traditional clustering methods face significant statistical and computational challenges. While some advanced methods use graph neural networks to model cell-cell relationships, they often depend on static graph structures that are sensitive to noise and fail to capture the long-tailed distribution inherent in single-cell data. To address these limitations, we propose scAGC, a single-cell clustering method that learns adaptive cell graphs with contrastive guidance. Our approach optimizes feature representations and cell graphs simultaneously in an end-to-end manner. Specifically, we introduce a topology-adaptive graph autoencoder that leverages a differentiable Gumbel-Softmax sampling strategy to dynamically refine the graph structure during training. This adaptive mechanism mitigates the problem of a long-tailed degree distribution by promoting a more balanced neighborhood structure. To model the discrete, over-dispersed, and zero-inflated nature of scRNA-seq data, we integrate a Zero-Inflated Negative Binomial (ZINB) loss for robust feature reconstruction. Furthermore, a contrastive learning objective is incorporated to regularize the graph learning process and prevent abrupt changes in the graph topology, ensuring stability and enhancing convergence. Comprehensive experiments on 9 real scRNA-seq datasets demonstrate that scAGC consistently outperforms other state-of-the-art methods, yielding the best NMI and ARI scores on 9 and 7 datasets, respectively. The code is available at Anonymous Github.
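The differentiable graph-learning ingredient can be sketched in a few lines of PyTorch: straight-through Gumbel-Softmax turns pairwise similarities into hard 0/1 edges while keeping gradients flowing to the embeddings. The full autoencoder, ZINB loss, and contrastive objective are omitted, and the degree-variance penalty below is only an illustrative stand-in for balancing the neighborhood structure.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_cells, dim = 6, 4
z = torch.randn(n_cells, dim, requires_grad=True)      # cell embeddings

sim = z @ z.t()                                        # pairwise similarities
logits = torch.stack([sim, -sim], dim=-1)              # [edge, no-edge] logits
# Straight-through Gumbel-Softmax: hard 0/1 edges forward, soft gradients back.
adj = F.gumbel_softmax(logits, tau=0.5, hard=True)[..., 0]
adj = adj * (1 - torch.eye(n_cells))                   # drop self-loops

degree_penalty = adj.sum(1).var()                      # discourage hub-dominated graphs
degree_penalty.backward()                              # gradients reach the embeddings
print(adj.detach())
print(z.grad.norm())                                   # nonzero: the graph is learnable
```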

[AI-74] DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic

【Quick Read】: This paper addresses a key bottleneck in deploying deep neural networks on resource-constrained devices: static uniform quantization cannot adapt to varying input complexity, while existing dynamic, instance-based mixed-precision quantization is hardware-inefficient because it requires a costly dequantize-to-float and requantize-to-integer cycle to change precision. The key to the solution is the Dynamic Quantization Training (DQT) framework, whose core innovation is a nested integer representation, embedding lower-precision values bit-wise within higher-precision ones, combined with custom integer-only arithmetic, so that bit-widths can be switched at runtime via a near-zero-cost bit shift. This makes DQT the first quantization framework to enable both dequantization-free static mixed precision for the backbone network and truly efficient dynamic, instance-based quantization via a lightweight runtime controller that decides how to quantize each layer, cutting the bit-switching cost to only 28.3M simple bit-shift operations versus the 56.6M costly floating-point multiply-accumulate (MAC) operations of previous dynamic approaches.

Link: https://arxiv.org/abs/2508.09176
Authors: Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello, Manuel Roveri
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The deployment of deep neural networks on resource-constrained devices relies on quantization. While static, uniform quantization applies a fixed bit-width to all inputs, it fails to adapt to their varying complexity. Dynamic, instance-based mixed-precision quantization promises a superior accuracy-efficiency trade-off by allocating higher precision only when needed. However, a critical bottleneck remains: existing methods require a costly dequantize-to-float and requantize-to-integer cycle to change precision, breaking the integer-only hardware paradigm and compromising performance gains. This paper introduces Dynamic Quantization Training (DQT), a novel framework that removes this bottleneck. At the core of DQT is a nested integer representation where lower-precision values are bit-wise embedded within higher-precision ones. This design, coupled with custom integer-only arithmetic, allows for on-the-fly bit-width switching through a near-zero-cost bit-shift operation. This makes DQT the first quantization framework to enable both dequantization-free static mixed-precision of the backbone network, and truly efficient dynamic, instance-based quantization through a lightweight controller that decides at runtime how to quantize each layer. We demonstrate DQT state-of-the-art performance on ResNet18 on CIFAR-10 and ResNet50 on ImageNet. On ImageNet, our 4-bit dynamic ResNet50 achieves 77.00% top-1 accuracy, an improvement over leading static (LSQ, 76.70%) and dynamic (DQNET, 76.94%) methods at a comparable BitOPs budget. Crucially, DQT achieves this with a bit-width transition cost of only 28.3M simple bit-shift operations, a drastic improvement over the 56.6M costly Multiply-Accumulate (MAC) floating-point operations required by previous dynamic approaches - unlocking a new frontier in efficient, adaptive AI.
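The nested integer idea is easy to demonstrate: reading the top bits of an 8-bit weight yields its 4-bit (or 2-bit) version, so precision switching is a shift rather than a dequantize/requantize round trip. The NumPy sketch below is illustrative, not the paper's integer pipeline.

```python
import numpy as np

w8 = np.array([137, 23, 250, 64], dtype=np.uint8)   # 8-bit quantized weights

def at_precision(w, bits):
    """Read the same stored weights at a lower bit-width via a right shift."""
    return (w >> (8 - bits)).astype(np.uint8)

w4 = at_precision(w8, 4)        # 4-bit view: the top nibble of each value
w2 = at_precision(w8, 2)        # 2-bit view
print(w4, w2)

# Integer-only dot product at whichever precision a controller picked:
x = np.array([1, 2, 3, 4], dtype=np.int32)
acc8 = int(w8.astype(np.int32) @ x)                  # full 8-bit precision
acc4 = int(w4.astype(np.int32) @ x) << 4             # rescale by the shift amount
print(acc8, acc4)                                    # acc4 approximates acc8
```

Because the low-precision value is bit-wise embedded in the high-precision one, no floating-point operation is ever needed to move between bit-widths.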

[AI-75] FedMP: Tackling Medical Feature Heterogeneity in Federated Learning from a Manifold Perspective

【Quick Read】: This paper addresses the difficulty that non-IID data poses to federated learning (FL), especially in medical imaging, where shifts in image feature distributions significantly hinder the global model's convergence and performance. The key to the solution is FedMP: it enriches the training space of each client's classifier through stochastic feature manifold completion and uses class-prototypes to guide the alignment of feature manifolds across clients within semantically consistent subspaces, producing more distinct decision boundaries and improving generalization and robustness under non-IID conditions.

Link: https://arxiv.org/abs/2508.09174
Authors: Zhekai Zhou, Shudong Liu, Zhaokun Zhou, Yang Liu, Qiang Yang, Yuesheng Zhu, Guibo Luo
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated learning (FL) is a decentralized machine learning paradigm in which multiple clients collaboratively train a shared model without sharing their local private data. However, real-world applications of FL frequently encounter challenges arising from the non-identically and independently distributed (non-IID) local datasets across participating clients, which is particularly pronounced in the field of medical imaging, where shifts in image feature distributions significantly hinder the global model’s convergence and performance. To address this challenge, we propose FedMP, a novel method designed to enhance FL under non-IID scenarios. FedMP employs stochastic feature manifold completion to enrich the training space of individual client classifiers, and leverages class-prototypes to guide the alignment of feature manifolds across clients within semantically consistent subspaces, facilitating the construction of more distinct decision boundaries. We validate the effectiveness of FedMP on multiple medical imaging datasets, including those with real-world multi-center distributions, as well as on a multi-domain natural image dataset. The experimental results demonstrate that FedMP outperforms existing FL algorithms. Additionally, we analyze the impact of manifold dimensionality, communication efficiency, and privacy implications of feature exposure in our method.

[AI-76] webMCP: Efficient AI-Native Client-Side Interaction for Agent -Ready Web Design

【Quick Read】: This paper addresses the efficiency bottleneck AI agents face when assisting users with web interaction: conventional approaches must process entire HTML documents, making AI-assisted web interaction slow and computationally expensive. The key to the solution is webMCP (Web Machine Context Procedure), a client-side standard that embeds structured interaction metadata directly into web pages, giving AI agents explicit mappings between page elements and user actions so they can access pre-structured interaction data instead of repeatedly parsing HTML. Experiments show 67.6% lower processing requirements while maintaining a 97.9% task success rate, with 34-63% cost reductions and faster responses, and since no server-side modifications are required, the standard is deployable across millions of existing websites.

Link: https://arxiv.org/abs/2508.09171
Authors: D. Perera
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Current AI agents create significant barriers for users by requiring extensive processing to understand web pages, making AI-assisted web interaction slow and expensive. This paper introduces webMCP (Web Machine Context Procedure), a client-side standard that embeds structured interaction metadata directly into web pages, enabling more efficient human-AI collaboration on existing websites. webMCP transforms how AI agents understand web interfaces by providing explicit mappings between page elements and user actions. Instead of processing entire HTML documents, agents can access pre-structured interaction data, dramatically reducing computational overhead while maintaining task accuracy. A comprehensive evaluation across 1,890 real API calls spanning online shopping, authentication, and content management scenarios demonstrates webMCP reduces processing requirements by 67.6% while maintaining 97.9% task success rates compared to 98.8% for traditional approaches. Users experience significantly lower costs (34-63% reduction) and faster response times across diverse web interactions. Statistical analysis confirms these improvements are highly significant across multiple AI models. An independent WordPress deployment study validates practical applicability, showing consistent improvements across real-world content management workflows. webMCP requires no server-side modifications, making it deployable across millions of existing websites without technical barriers. These results establish webMCP as a viable solution for making AI web assistance more accessible and sustainable, addressing the critical gap between user interaction needs and AI computational requirements in production environments.
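To make the idea concrete, here is what page-embedded interaction metadata could look like. The field names and the `application/webmcp+json` script type are assumptions for illustration, not the actual webMCP schema.

```python
import json

# Hypothetical interaction metadata for a product page: explicit mappings
# between page elements (CSS selectors) and the user actions they support.
webmcp_block = {
    "version": "0.1",
    "actions": [
        {
            "name": "select_size",
            "selector": "#size-select",
            "method": "select",
            "params": {"value": "one of #size-select option values"},
        },
        {
            "name": "add_to_cart",
            "selector": "#add-to-cart-btn",
            "method": "click",
            "preconditions": ["select_size has been performed"],
        },
    ],
}

# Emit the block as it might be embedded in a page, so an agent can read the
# element-to-action mapping without parsing the full HTML document.
html_snippet = '<script type="application/webmcp+json">\n{}\n</script>'.format(
    json.dumps(webmcp_block, indent=2)
)
print(html_snippet)
```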

[AI-77] Energy-Efficient Stochastic Computing (SC) Neural Networks for Internet of Things Devices With Layer-Wise Adjustable Sequence Length (ASL)

【Quick Read】: This paper addresses the unexplored potential of layer-wise mixed-precision implementations in stochastic computing (SC) neural networks, particularly how to cut energy and latency further in resource-constrained settings such as the Internet of Things (IoT). The key to the solution is the Adjustable Sequence Length (ASL) scheme: an operator-norm-based theoretical model quantifies and predicts how truncation noise accumulates and propagates across layers, and an extended sensitivity analysis using random forest (RF) regression validates the alignment between the theoretical predictions and practical network behavior. ASL further provides coarse-grained and fine-grained truncation strategies that configure each layer's sequence length, achieving over 60% energy and latency savings with negligible accuracy loss and markedly improving the feasibility of SC neural networks for IoT applications.

Link: https://arxiv.org/abs/2508.09163
Authors: Ziheng Wang, Pedro Reviriego, Farzad Niknia, Zhen Gao, Javier Conde, Shanshan Liu, Fabrizio Lombardi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Stochastic computing (SC) has emerged as an efficient low-power alternative for deploying neural networks (NNs) in resource-limited scenarios, such as the Internet of Things (IoT). By encoding values as serial bitstreams, SC significantly reduces energy dissipation compared to conventional floating-point (FP) designs; however, further improvement of layer-wise mixed-precision implementation for SC remains unexplored. This article introduces Adjustable Sequence Length (ASL), a novel scheme that applies mixed-precision concepts specifically to SC NNs. By introducing an operator-norm-based theoretical model, this article shows that truncation noise can cumulatively propagate through the layers by the estimated amplification factors. An extended sensitivity analysis is presented, using random forest (RF) regression to evaluate multilayer truncation effects and validate the alignment of theoretical predictions with practical network behaviors. To accommodate different application scenarios, this article proposes two truncation strategies (coarse-grained and fine-grained), which apply diverse sequence length configurations at each layer. Evaluations on a pipelined SC MLP synthesized at 32nm demonstrate that ASL can reduce energy and latency overheads by up to over 60% with negligible accuracy loss. It confirms the feasibility of the ASL scheme for IoT applications and highlights the distinct advantages of mixed-precision truncation in SC designs.
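The operator-norm view can be sketched numerically: noise injected after layer i is amplified downstream by at most the product of the remaining layers' spectral norms, which suggests where longer bitstreams are worth spending. The snippet below is one reading of the abstract under stated assumptions, not the paper's exact model; the `1/sequence_length` noise scaling and the allocation heuristic are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(4)]
op_norms = [np.linalg.norm(W, 2) for W in layers]     # spectral norms per layer

# Worst-case amplification of noise injected after layer i:
# the product of the spectral norms of all later layers.
amp = [float(np.prod(op_norms[i + 1:])) for i in range(len(layers))]
print([round(a, 2) for a in amp])

# Heuristic allocation: layers whose truncation noise is amplified more get
# longer bitstreams, since SC truncation noise shrinks with sequence length.
base_len = 256
seq_lens = [max(16, int(base_len * a / max(amp))) for a in amp]
print(seq_lens)
```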
zh
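
【示例】下面用一段 Python 勾勒"以算子范数估计截断噪声放大、再据此逐层分配序列长度"的思路。其中 `assign_sequence_lengths`、`noise_budget` 等命名与分配公式均为便于说明的假设,并非论文的官方实现:

```python
import numpy as np

def amplification_factors(weights):
    """第 i 层截断噪声传播到输出端的放大倍数,
    用其后各层权重的算子范数(谱范数)累乘来估计。"""
    norms = [np.linalg.norm(W, 2) for W in weights]
    return [float(np.prod(norms[i + 1:])) for i in range(len(weights))]

def assign_sequence_lengths(weights, noise_budget=4.0, max_len=1024, min_len=16):
    """粗粒度截断策略示意:SC 截断噪声方差近似与 1/N 成正比,
    因此放大倍数 f 越大的层分配越长的比特流 N,
    使各层贡献 f^2/N 之和不超过 noise_budget(该分配式为假设)。"""
    factors = amplification_factors(weights)
    L = len(weights)
    return [int(np.clip(np.ceil(f ** 2 * L / noise_budget), min_len, max_len))
            for f in factors]

rng = np.random.default_rng(0)
ws = [rng.normal(scale=0.5, size=(64, 64)) for _ in range(4)]
print(assign_sequence_lengths(ws))   # 靠前的层放大倍数大 -> 序列更长
```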

[AI-78] Physics-Guided Memory Network for Building Energy Modeling

【速读】:该论文旨在解决建筑能耗预测中因历史数据不足或缺失而导致深度学习模型失效,以及物理模型(如EnergyPlus)依赖详尽参数且建模耗时的问题。解决方案的关键在于提出一种物理引导的记忆网络(Physics-Guided Memory Network, PgMN),其核心组件包括并行投影层(Parallel Projection Layers)用于处理不完整输入、记忆单元(Memory Unit)以捕捉持续性偏差,以及记忆经验模块(Memory Experience Module)实现对预测范围的最优外推。该架构融合了深度学习与物理模型的优势,在小时级短期预测任务中展现出高精度和强适应性,尤其适用于新建建筑、数据缺失、稀疏历史数据及动态基础设施变化等场景。

链接: https://arxiv.org/abs/2508.09161
作者: Muhammad Umair Danish,Kashif Ali,Kamran Siddiqui,Katarina Grolinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published version. 12 pages, 6 figures. Open access under CC BY-NC-ND 4.0 license. Publisher: Elsevier. Journal: Energy and AI

点击查看摘要

Abstract:Accurate energy consumption forecasting is essential for efficient resource management and sustainability in the building sector. Deep learning models are highly successful but struggle with limited historical data and become unusable when historical data are unavailable, such as in newly constructed buildings. On the other hand, physics-based models, such as EnergyPlus, simulate energy consumption without relying on historical data but require extensive building parameter specifications and considerable time to model a building. This paper introduces a Physics-Guided Memory Network (PgMN), a neural network that integrates predictions from deep learning and physics-based models to address their limitations. PgMN comprises Parallel Projection Layers to process incomplete inputs, a Memory Unit to account for persistent biases, and a Memory Experience Module to optimally extend forecasts beyond their input range and produce the output. Theoretical evaluation shows that the components of PgMN are mathematically valid for performing their respective tasks. The PgMN was evaluated on short-term energy forecasting at an hourly resolution, critical for operational decision-making in smart grid and smart building systems. Experimental validation shows the accuracy and applicability of PgMN in diverse scenarios such as newly constructed buildings, missing data, sparse historical data, and dynamic infrastructure changes. This paper provides a promising solution for energy consumption forecasting in dynamic building environments, enhancing model applicability in scenarios where historical data are limited or unavailable or when physics-based models are inadequate.
zh
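
【示例】Memory Unit"记忆持续性偏差"的思想可以用一个极简草图说明:用指数滑动平均跟踪物理模型输出与观测之间的系统性偏差,并在预测时补偿。以下代码仅为示意,`beta` 等参数为假设值,与论文的神经网络实现无关:

```python
import numpy as np

class MemoryUnit:
    """用指数滑动平均记忆物理模型的持续性偏差(beta 为假设的记忆系数)。"""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.bias = 0.0

    def update(self, y_true, y_physics):
        # 累积"观测 - 物理预测"的系统性偏差
        self.bias = self.beta * self.bias + (1 - self.beta) * (y_true - y_physics)

    def correct(self, y_physics):
        return y_physics + self.bias

mem = MemoryUnit()
rng = np.random.default_rng(1)
for t in range(48):                      # 过去 48 小时的观测
    y_phys = 100 + 10 * np.sin(t / 4)    # 物理模型(如 EnergyPlus)输出
    y_obs = y_phys + 8 + rng.normal()    # 真实能耗存在约 +8 kWh 的系统性偏差
    mem.update(y_obs, y_phys)
print(round(mem.bias, 2))                # 应接近 8
print(round(mem.correct(105.0), 2))      # 校正后的下一小时预测
```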

[AI-79] Agoran: An Agent ic Open Marketplace for 6G RAN Automation

【速读】:该论文旨在解决下一代移动网络中多服务提供商间资源调度与策略管理的冲突问题,当前网络切片控制器普遍存在僵化、策略依赖性强且缺乏业务上下文感知能力的局限。其核心解决方案是提出Agoran Service and Resource Broker (SRB),一个基于代理(agent)的市场机制,将利益相关方直接纳入运行闭环。关键创新在于构建三个自治AI分支:立法分支利用检索增强的大语言模型(Retrieval-Augmented Large Language Models, RAG-LLMs)处理合规性查询;执行分支通过监视更新的向量数据库维持实时态势感知;司法分支则借助规则驱动的信任评分系统评估代理消息,并由恶意行为检测模块实施实时激励以恢复信任。此外,SRB端的中介代理与利益相关方侧的谈判代理协同工作,基于多目标优化器生成帕累托最优报价,在单轮交互中达成共识意图并部署至Open RAN和AI RAN控制器,从而实现灵活、高效且符合标准的资源分配。

链接: https://arxiv.org/abs/2508.09159
作者: Ilias Chatzistefanidis,Navid Nikaein,Andrea Leone,Ali Maatouk,Leandros Tassioulas,Roberto Morabito,Ioannis Pitsiorlas,Marios Kountouris
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Pre-print submitted to Computer Networks AI-for-6G

点击查看摘要

Abstract:Next-generation mobile networks must reconcile the often-conflicting goals of multiple service owners. However, today’s network slice controllers remain rigid, policy-bound, and unaware of the business context. We introduce Agoran Service and Resource Broker (SRB), an agentic marketplace that brings stakeholders directly into the operational loop. Inspired by the ancient Greek agora, Agoran distributes authority across three autonomous AI branches: a Legislative branch that answers compliance queries using retrieval-augmented Large Language Models (LLMs); an Executive branch that maintains real-time situational awareness through a watcher-updated vector database; and a Judicial branch that evaluates each agent message with a rule-based Trust Score, while arbitrating LLMs detect malicious behavior and apply real-time incentives to restore trust. Stakeholder-side Negotiation Agents and the SRB-side Mediator Agent negotiate feasible, Pareto-optimal offers produced by a multi-objective optimizer, reaching a consensus intent in a single round, which is then deployed to Open and AI RAN controllers. Deployed on a private 5G testbed and evaluated with realistic traces of vehicle mobility, Agoran achieved significant gains: (i) a 37% increase in throughput of eMBB slices, (ii) a 73% reduction in latency of URLLC slices, and concurrently (iii) an end-to-end 8.3% saving in PRB usage compared to a static baseline. A 1B-parameter Llama model, fine-tuned for five minutes on 100 GPT-4 dialogues, recovers approximately 80% of GPT-4.1’s decision quality, while operating within 6 GiB of memory and converging in only 1.3 seconds. These results establish Agoran as a concrete, standards-aligned path toward ultra-flexible, stakeholder-centric 6G networks. A live demo is available at this https URL.
zh
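
【示例】司法分支"基于规则的信任评分 + 实时激励"可以用如下草图理解。规则表、分值与降权阈值均为假设,仅演示机制本身,并非 Agoran 的真实规则集:

```python
RULES = {                     # 假设的规则及对应加/扣分
    "schema_valid": +2,
    "quota_respected": +3,
    "deadline_met": +1,
    "conflicting_claims": -4,
    "unverifiable_source": -3,
}

def trust_score(history, floor=0, ceil=100, start=50):
    """按代理消息触发的规则累积信任分,并截断在 [floor, ceil] 内。"""
    score = start
    for msg_flags in history:
        for flag in msg_flags:
            score += RULES.get(flag, 0)
        score = max(floor, min(ceil, score))
    return score

history = [["schema_valid", "quota_respected"],
           ["conflicting_claims"],
           ["schema_valid", "deadline_met"]]
s = trust_score(history)
# 低于阈值时施加实时激励,这里简化为对该代理报价的降权
print(s, "-> 降权系数:", 1.0 if s >= 40 else 0.5)
```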

[AI-80] EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving

【速读】:该论文旨在解决自动驾驶中难以实现类人迭代决策的问题,即如何在轨迹生成过程中持续地提出、评估并优化候选路径,从而提升规划的灵活性与鲁棒性。当前主流方法通常将轨迹生成与质量评估分离,或通过强化学习将多维偏好压缩为单一标量奖励,导致关键权衡关系丢失且无法进行有效迭代优化。解决方案的关键在于提出EvaDrive框架,其核心是通过对抗优化建立轨迹生成与评估之间的闭环协同进化机制:利用分层生成器结合自回归意图建模(autoregressive intent modeling)和扩散模型精修(diffusion-based refinement)以生成多样化候选路径;同时引入可训练的多目标评判器(multi-objective critic)保留原始偏好结构而不进行标量化解码,并借助帕累托前沿选择机制引导多轮迭代优化,从而有效跳出局部最优并维持轨迹多样性。该方法无需外部偏好数据即可生成多样驾驶风格,实现了无标量化的轨迹优化新范式。

链接: https://arxiv.org/abs/2508.09158
作者: Siwen Jiao,Kangan Qian,Hao Ye,Yang Zhong,Ziang Luo,Sicong Jiang,Zilin Huang,Yangyi Fang,Jinyu Miao,Zheng Fu,Yunlong Wang,Kun Jiang,Diange Yang,Rui Fan,Baoyun Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias. To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization score. This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity. Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making and offering a novel scalarization-free trajectory optimization approach.
zh
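
【示例】帕累托前沿选择机制的核心是在多目标代价下筛选非支配轨迹。下面给出一个通用的非支配筛选草图(目标维度与数据均为虚构,并非 EvaDrive 的实现):

```python
import numpy as np

def pareto_front(costs):
    """在多目标代价矩阵 (n_traj, n_obj) 上筛选非支配轨迹,
    所有目标均为越小越好(如碰撞风险、不适度、偏离度)。"""
    n = costs.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue
        # 轨迹 i 被支配:存在 j 在所有目标上不劣且至少一个目标严格更优
        dominated = np.all(costs <= costs[i], axis=1) & np.any(costs < costs[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.where(keep)[0]

rng = np.random.default_rng(2)
costs = rng.random((8, 3))   # 8 条候选轨迹、3 个目标
print("进入下一轮精修的轨迹:", pareto_front(costs))
```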

[AI-81] Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems

【速读】:该论文旨在解决科学系统中存在物理约束的逆问题(inverse problems),即在数据有限或观测噪声较大的情况下,如何从生成式模型中恢复未知的物理输入(如源项、材料参数或边界条件)并确保生成结果满足偏微分方程(PDE)及其边界条件。其解决方案的关键在于提出一种可微分的后训练流程,通过最小化PDE的弱形式残差(weak-form residuals)来增强模型的物理一致性,同时不破坏原始学习到的数据分布;此外,引入一个可学习的潜在参数预测器,并设计联合优化策略,使模型能够同时输出符合物理规律的场解和对隐藏参数的合理估计,从而在数据驱动与物理先验之间取得平衡。

链接: https://arxiv.org/abs/2508.09156
作者: Jan Tauberschmidt,Sophie Fellenz,Sebastian J. Vollmer,Andrew B. Duncan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 7 pages main content, 10 pages appendices

点击查看摘要

Abstract:We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physics-aware manner. We validate our method on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems.
zh
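
【示例】弱形式残差可以用一维 Poisson 方程 u'' = f 直观说明:分部积分后残差为 r_k = ∫u'·φ_k' dx + ∫f·φ_k dx(φ_k 为零边界测试函数)。以下 NumPy 草图仅演示该惩罚如何区分"符合物理"与"偏离物理"的场,并非论文的可微训练管线:

```python
import numpy as np

def trapz(y, x):
    """梯形积分(避免依赖特定 NumPy 版本的 trapz/trapezoid 命名)。"""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

def weak_residual_penalty(u, f, x, n_test=5):
    """一维 Poisson 方程 u'' = f(零边界)的弱形式残差惩罚:
    r_k = ∫ u'·φ_k' dx + ∫ f·φ_k dx,φ_k 取正弦测试函数。"""
    L = x[-1] - x[0]
    du = np.gradient(u, x)
    penalty = 0.0
    for k in range(1, n_test + 1):
        phi = np.sin(k * np.pi * (x - x[0]) / L)
        dphi = (k * np.pi / L) * np.cos(k * np.pi * (x - x[0]) / L)
        r_k = trapz(du * dphi, x) + trapz(f * phi, x)
        penalty += r_k ** 2
    return penalty

x = np.linspace(0.0, 1.0, 201)
f = np.full_like(x, -2.0)      # 取 u'' = -2 的制造解
u_exact = x * (1 - x)          # 精确解:满足方程与零边界
u_bad = 1.2 * x * (1 - x)      # "偏离物理"的生成样本
print(weak_residual_penalty(u_exact, f, x))   # 接近 0
print(weak_residual_penalty(u_bad, f, x))     # 明显更大
```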

[AI-82] A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在多轮对话中缺乏自评估能力的问题,从而阻碍其自我改进。现有基于强化学习(Reinforcement Learning, RL)的方法因固定奖励机制在优化多个训练目标时易出现奖励黑客(reward hacking)现象,导致模型崩溃。解决方案的关键在于提出AdaPO框架,其核心创新包括:一是自适应奖励模型(Adaptive Reward Model, ARM),用于从模型生成的多轮轨迹性能分布中评估任务当前训练状态;二是基于奖励感知的动态KL正则化机制(Reward Aware Dynamic KL Regularization),以动态调整惩罚系数,该系数由不同多轮情境间的奖励差距调节,从而有效缓解奖励黑客问题。该方法可自动、平滑地根据子任务训练进度调整学习焦点,无需人工干预。

链接: https://arxiv.org/abs/2508.09155
作者: Wenkai Wang,Hongcan Guo,Zheqi Lv,Shengyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Self-evaluation, a model’s ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet it is largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting the training objective in real time according to the current training state of each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses a task’s training state from the performance distribution of model-generated multi-turn trajectories. Reward Aware Dynamic KL replaces the fixed penalty with a dynamic coefficient modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks’ training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.
zh
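
【示例】"奖励感知的动态 KL 系数"的骨架可以写成:系数随不同多轮情形间的奖励差距增大而增大。下列函数与超参(`base`、`k` 等)均为示意性假设,仅演示如何替换 PPO 式目标中的固定惩罚:

```python
def dynamic_kl_coef(reward_gap, base=0.1, k=5.0, min_c=0.01, max_c=1.0):
    """奖励感知的动态 KL 系数:不同多轮情形间的奖励差距越大,
    越可能发生 reward hacking,于是加大 KL 约束。
    base、k 等超参均为示意性假设。"""
    coef = base * (1.0 + k * max(reward_gap, 0.0))
    return max(min_c, min(max_c, coef))

def ppo_like_term(ratio, advantage, kl, reward_gap):
    """把 PPO 式目标中的固定 KL 惩罚换成动态系数(示意)。"""
    return ratio * advantage - dynamic_kl_coef(reward_gap) * kl

print(dynamic_kl_coef(0.0))   # 差距小 -> 系数接近 base
print(dynamic_kl_coef(0.8))   # 差距大 -> 系数被放大
```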

[AI-83] Peer Effect Estimation in the Presence of Simultaneous Feedback and Unobserved Confounders

【速读】:该论文旨在解决复杂现实网络(如社交网络)中同伴因果效应估计的问题,其核心挑战在于同时存在同伴间的双向反馈机制和未观测的混杂因素(unobserved confounders),而现有方法或忽略反馈机制、或在严格线性假设下建模,难以准确估计同伴效应。解决方案的关键在于提出DIG2RSI框架,融合I-G变换(I-G transformation)与两阶段工具变量法(2SRI),首先通过I-G变换解耦同伴间相互影响以消除反馈偏差,再利用网络数据构建有效工具变量(IV),在第一阶段训练神经网络预测同伴暴露并提取残差作为未观测混杂因子的代理,在第二阶段引入对抗判别器将残差作为控制函数嵌入模型,强制学习表示中不含混杂信号,从而同时处理反馈和混杂问题,且借助深度学习的强大非线性拟合能力实现高维复杂关系建模。

链接: https://arxiv.org/abs/2508.09154
作者: Xiaojing Du,Jiuyong Li,Lin Liu,Debo Cheng,Thuc.Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Estimating peer causal effects within complex real-world networks such as social networks is challenging, primarily due to simultaneous feedback between peers and unobserved confounders. Existing methods either address unobserved confounders while ignoring the simultaneous feedback, or account for feedback but under restrictive linear assumptions, thus failing to obtain accurate peer effect estimation. In this paper, we propose DIG2RSI, a novel Deep learning framework which leverages I-G transformation (a matrix operation) and 2SRI (an instrumental variable or IV technique) to address both simultaneous feedback and unobserved confounding, while accommodating complex, nonlinear and high-dimensional relationships. DIG2RSI first applies the I-G transformation to disentangle mutual peer influences and eliminate the bias due to the simultaneous feedback. To deal with unobserved confounding, we first construct valid IVs from network data. In stage 1 of 2SRI, we train a neural network on these IVs to predict peer exposure, and extract residuals as proxies for the unobserved confounders. In stage 2, we fit a separate neural network augmented by an adversarial discriminator that incorporates these residuals as a control function and enforces the learned representation to contain no residual confounding signal. The expressive power of deep learning models in capturing complex non-linear relationships and adversarial debiasing enhance the effectiveness of DIG2RSI in eliminating bias from both feedback loops and hidden confounders. We prove consistency of our estimator under standard regularity conditions, ensuring asymptotic recovery of the true peer effect. Empirical results on two semi-synthetic benchmarks and a real-world dataset demonstrate that DIG2RSI outperforms existing approaches.
zh
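
【示例】2SRI(两阶段残差包含)的思想可用线性玩具模型演示:第一阶段用工具变量回归暴露、取残差作为未观测混杂的代理;第二阶段把残差作为控制函数加入。论文中两阶段均为神经网络并带对抗去偏,此处线性版仅为原理示意:

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=n)                        # 由网络结构构造的工具变量
u = rng.normal(size=n)                        # 未观测混杂因子
x = 0.8 * z + u + 0.1 * rng.normal(size=n)    # 同伴暴露(受 z 与 u 影响)
y = 1.5 * x + 2.0 * u + rng.normal(size=n)    # 真实同伴效应 = 1.5

def ols(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    return lstsq(X1, y, rcond=None)[0]

naive = ols(x, y)[1]                          # 朴素回归:被 u 污染,估计偏高

# 2SRI:stage 1 用 IV 预测暴露并取残差;stage 2 把残差作为控制函数
stage1 = ols(z, x)
resid = x - (stage1[0] + stage1[1] * z)       # 残差近似混杂代理
stage2 = ols(np.column_stack([x, resid]), y)
print(f"naive={naive:.2f}, 2SRI={stage2[1]:.2f} (真值 1.5)")
```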

[AI-84] JustDense: Just using Dense instead of Sequence Mixer for Time Series analysis

【速读】:该论文试图解决的问题是:在时间序列分析(Time Series Analysis, TSA)中,复杂的序列混合机制(如注意力机制)是否真的必要,还是其性能优势可能源于其他架构或优化因素。为回答这一问题,作者提出了JustDense方案,其关键在于将主流TSA模型中的序列混合器(sequence mixer)统一建模为一个混合矩阵,并用简单的全连接层(dense layer)进行系统性替换,从而隔离出混合操作本身的影响。该方法基于MatrixMixer框架,不仅实现了对序列混合机制的可解释性拆解,还通过在29个基准数据集上对7种先进TSA模型的广泛实验验证了:简单密集层即可实现与复杂序列混合器相当甚至更优的性能,从而挑战了“更深更复杂的架构在TSA中必然更优”的传统认知。

链接: https://arxiv.org/abs/2508.09153
作者: TaekHyun Park,Yongjae Lee,Daesan Park,Dohee Kim,Hyerim Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages ,planning to submit to IEEE BigData 2025

点击查看摘要

Abstract:Sequence and channel mixers, the core mechanisms in sequence models, have become the de facto standard in time series analysis (TSA). However, recent studies have questioned the necessity of complex sequence mixers, such as attention mechanisms, demonstrating that simpler architectures can achieve comparable or even superior performance. This suggests that the benefits attributed to complex sequence mixers might instead emerge from other architectural or optimization factors. Based on this observation, we pose a central question: Are common sequence mixers necessary for time-series analysis? To answer it, we propose JustDense, an empirical study that systematically replaces sequence mixers in various well-established TSA models with dense layers. Grounded in the MatrixMixer framework, JustDense treats any sequence mixer as a mixing matrix and replaces it with a dense layer. This substitution isolates the mixing operation, enabling a clear theoretical foundation for understanding its role. We conducted extensive experiments on 29 benchmarks covering five representative TSA tasks using seven state-of-the-art TSA models to address our research question. The results show that replacing sequence mixers with dense layers yields comparable or even superior performance. In the cases where dedicated sequence mixers still offer benefits, JustDense challenges the assumption that “deeper and more complex architectures are inherently better” in TSA.
zh
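
【示例】"把序列混合器换成 dense 层"在代码上非常直接:对时间维做一个可学习的 L×L 线性映射即可。以下 PyTorch 草图即 MatrixMixer 视角下的最简替换(模块命名为假设):

```python
import torch
import torch.nn as nn

class DenseSequenceMixer(nn.Module):
    """用一个可学习的 L×L 混合矩阵(作用在时间维上的 dense 层)
    替换注意力等复杂序列混合器。"""
    def __init__(self, seq_len: int):
        super().__init__()
        self.mix = nn.Linear(seq_len, seq_len, bias=False)

    def forward(self, x):                 # x: (batch, seq_len, channels)
        # 交换到 (batch, channels, seq_len),在时间维上做线性混合后换回
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(8, 96, 7)                 # 96 个时间步、7 个变量
print(DenseSequenceMixer(96)(x).shape)    # torch.Size([8, 96, 7])
```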

[AI-85] 5G Core Fault Detection and Root Cause Analysis using Machine Learning and Generative AI

【速读】:该论文旨在解决5G网络中分组核心(Packet Core)流量完整性与性能保障问题,尤其针对测试过程中PCAP文件和日志文件中存在的错误难以高效识别与定位的挑战。当前人工分析方法耗时长、效率低,无法满足高复杂度网络环境下的故障诊断需求。解决方案的关键在于提出一种基于人工智能/机器学习(AI/ML)的故障分析(Fault Analysis, FA)引擎,该引擎利用自然语言处理(Natural Language Processing, NLP)技术对网络流量进行异常检测与分类,并结合生成式AI(Generative AI)通过大语言模型(Large Language Model, LLM)提供可解释的修复建议,其训练数据涵盖3GPP标准文档及用户测试文档,从而实现从故障识别到根因分析与修复建议的自动化闭环流程。

链接: https://arxiv.org/abs/2508.09152
作者: Joseph H. R. Isaac,Harish Saradagam,Nallamothu Pardhasaradhi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 3 figures and 2 tables. Accepted in Conference on Advances in Communication Networks Systems (CoaCoNS 2025)

点击查看摘要

Abstract:With the advent of 5G networks and technologies, ensuring the integrity and performance of packet core traffic is paramount. During network analysis, test files such as Packet Capture (PCAP) files and log files will contain errors if present in the system that must be resolved for better overall network performance, such as connectivity strength and handover quality. Current methods require numerous person-hours to sort through testing results and find the faults. This paper presents a novel AI/ML-driven Fault Analysis (FA) Engine designed to classify successful and faulty frames in PCAP files, specifically within the 5G packet core. The FA engine analyses network traffic using natural language processing techniques to identify anomalies and inefficiencies, significantly reducing the person-hours required and increasing efficiency. The FA Engine also suggests steps to fix the issue using Generative AI via a Large Language Model (LLM) trained on several 5G packet core documents. The engine explains the details of the error from the domain perspective using documents such as the 3GPP standards and user documents regarding the internal conditions of the tests. Test results on the ML models show high classification accuracy on the test dataset when trained with 80-20 splits for the successful and failed PCAP files. Future work includes extending the AI engine to incorporate 4G network traffic and other forms of network data, such as log text files and multimodal systems.
zh

[AI-86] Motif 2.6B Technical Report

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在追求高性能的同时难以兼顾计算效率的问题,尤其针对资源有限的研究团队难以构建高效基础模型的挑战。其解决方案的关键在于提出Motif-2.6B——一个26亿参数的基础模型,通过引入差分注意力(Differential Attention)和PolyNorm激活函数等创新架构改进,显著提升长文本理解能力、降低幻觉现象并增强上下文学习性能,从而在保证高效率的前提下实现与同类先进模型相当或更优的综合表现。

链接: https://arxiv.org/abs/2508.09148
作者: Junghwan Lim,Sungmin Lee,Dongseok Kim,Eunhwan Park,Hyunbyung Park,Junhyeok Lee,Wai Ting Cheung,Dahye Choi,Jaeheui Her,Jaeyeon Huh,Hanbin Jung,Changjin Kang,Beomgyu Kim,Jihwan Kim,Minjae Kim,Taehwan Kim,Youngrok Kim,Haesol Lee,Jeesoo Lee,Kungyu Lee,Dongpin Oh,Yeongjae Park,Bokki Ryu,Daewon Suh,Dongjoo Weon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment.
zh
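
【示例】差分注意力(Differential Attention)的核心是把两张 softmax 注意力图相减,以抵消共同的噪声分配。下面是单头简化草图(λ 取常数、省略多头与归一化等工程细节,并非 Motif-2.6B 的实际实现):

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """两张 softmax 注意力图相减(a1 - lam * a2)再作用于 v。
    lam 在论文体系中通常是可学习标量,此处取常数作示意。"""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v

B, T, D = 2, 16, 32
q1, k1, q2, k2 = (torch.randn(B, T, D) for _ in range(4))
v = torch.randn(B, T, D)
print(differential_attention(q1, k1, q2, k2, v).shape)  # (2, 16, 32)
```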

[AI-87] Agent ic TinyML for Intent-aware Handover in 6G Wireless Networks

【速读】:该论文旨在解决6G网络中传统被动式切换(handover)机制在移动边缘计算(Mobile Edge Computing, MEC)和基于自主代理(autonomous agent-based)服务场景下的适应性不足问题,尤其是在用户移动引发的服务中断和体验下降问题。解决方案的关键在于提出WAAN框架,其核心是通过在异构边缘节点嵌入轻量级TinyML代理(TinyML agents),构建具备意图感知(intent-aware)与主动协商能力的自治实体,实现跨层协同的意图传播与网络自适应;同时引入半稳定会合点(semi-stable rendezvous points)作为上下文迁移与状态保持的协调锚点,从而保障移动过程中的服务连续性。

链接: https://arxiv.org/abs/2508.09147
作者: Alaa Saleh,Roberto Morabito,Sasu Tarkoma,Anders Lindgren,Susanna Pirttikangas,Lauri Lovén
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:As 6G networks evolve into increasingly AI-driven, user-centric ecosystems, traditional reactive handover mechanisms demonstrate limitations, especially in mobile edge computing and autonomous agent-based service scenarios. This manuscript introduces WAAN, a cross-layer framework that enables intent-aware and proactive handovers by embedding lightweight TinyML agents as autonomous, negotiation-capable entities across heterogeneous edge nodes that contribute to intent propagation and network adaptation. To ensure continuity across mobility-induced disruptions, WAAN incorporates semi-stable rendezvous points that serve as coordination anchors for context transfer and state preservation. The framework’s operational capabilities are demonstrated through a multimodal environmental control case study, highlighting its effectiveness in maintaining user experience under mobility. Finally, the article discusses key challenges and future opportunities associated with the deployment and evolution of WAAN.
zh

[AI-88] o Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA

【速读】:该论文旨在解决WiFi 7中二进制指数退避(Binary Exponential Backoff, BEB)机制在动态信道环境下吞吐量性能较差的问题,以及现有基于模型的方法(如非持久和p-持久CSMA)因节点密度估计不准确而导致的吞吐量损失问题。其解决方案的关键在于首次提出基于大语言模型(Large Language Model, LLM)的上下文学习(In-Context Learning, ICL)理论框架,设计了一种基于Transformer的ICL优化器:通过预收集碰撞阈值数据样本并构造为提示(prompt),使Transformer学习信道竞争模式并预测最优竞争窗口阈值(Contention Window Threshold, CWT)。该方法在有限训练步骤内实现近优CWT预测,并进一步扩展以容忍提示中存在误差数据,理论上保证预测与吞吐量偏差最小。实验表明,该方案在NS-3仿真中具备快速收敛性和接近最优的吞吐性能,优于现有基于模型和深度强化学习(Deep Reinforcement Learning, DRL)的方法。

链接: https://arxiv.org/abs/2508.09146
作者: Shugang Hao,Hongbo Li,Lingjie Duan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and p-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer that pre-collects collision-threshold data examples and a query collision case. These are combined into a prompt that serves as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend the method to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.
zh

[AI-89] Efficient Real-Time Aircraft ETA Prediction via Feature Tokenization Transformer

【速读】:该论文旨在解决航空器实时到达时间(Estimated Time of Arrival, ETA)预测的效率与准确性问题,尤其在动态变化的空域环境中实现高频率更新(1Hz)以支持跑道排序等 arrival management 系统。其解决方案的关键在于引入基于特征标记化(feature tokenization)的 Transformer 模型:通过将原始输入数据(如位置、速度、天气、尾流类别等)映射到潜在空间,并利用 Transformer 的多头自注意力机制自动捕捉关键特征关系,从而减少对复杂特征工程的依赖;同时,Transformer 的并行计算能力显著提升了推理速度,在实验中仅需 51.7 微秒即可完成 40 架飞机的 ETA 推理,相较 XGBoost 模型精度提升 7% 且计算时间降低至 39%,展现出在实时航空管理场景下的高效性与优越性能。

链接: https://arxiv.org/abs/2508.09144
作者: Liping Huang,Yicheng Zhang,Yifang Yin,Sheng Zhang,Yi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 9 figures, published in the conference “US-Europe Air Transportation Research Development Symposium 2025”

点击查看摘要

Abstract:Estimated time of arrival (ETA) for airborne aircraft in real-time is crucial for arrival management in aviation, particularly for runway sequencing. Given the rapidly changing airspace context, the ETA prediction efficiency is as important as its accuracy in a real-time arrival aircraft management system. In this study, we utilize a feature tokenization-based Transformer model to efficiently predict aircraft ETA. Feature tokenization projects raw inputs to latent spaces, while the multi-head self-attention mechanism in the Transformer captures important aspects of the projections, alleviating the need for complex feature engineering. Moreover, the Transformer’s parallel computation capability allows it to handle ETA requests at a high frequency, i.e., 1 Hz, which is essential for a real-time arrival management system. The model inputs include raw data, such as aircraft latitude, longitude, ground speed, theta degree for the airport, day and hour from track data, the weather context, and aircraft wake turbulence category. With a data sampling rate of 1 Hz, the ETA prediction is updated every second. We apply the proposed aircraft ETA prediction approach to Singapore Changi Airport (ICAO Code: WSSS) using one-month Automatic Dependent Surveillance-Broadcast (ADS-B) data from October 1 to October 31, 2022. In the experimental evaluation, the ETA modeling covers all aircraft within a range of 10NM to 300NM from WSSS. The results show that our proposed method outperforms the commonly used boosting-tree-based model, improving accuracy by 7% compared to XGBoost while requiring only 39% of its computing time. Experimental results also indicate that, with 40 aircraft in the airspace at a given timestamp, the ETA inference time is only 51.7 microseconds, making it promising for real-time arrival management systems.
zh
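
【示例】"特征标记化 + Transformer 编码器"的骨架如下:每个原始特征各自投影为一个 token,经多头自注意力后回归 ETA。维度、层数等超参均为示意性假设:

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """把每个原始特征(经纬度、地速、航向角、天气、尾流等级等)
    各自线性投影为一个 token。"""
    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.proj = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_features, d_model))

    def forward(self, x):                     # x: (batch, n_features)
        return x.unsqueeze(-1) * self.proj + self.bias   # (batch, n_features, d_model)

class ETARegressor(nn.Module):
    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.tokenizer = FeatureTokenizer(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        tokens = self.encoder(self.tokenizer(x))
        return self.head(tokens.mean(dim=1)).squeeze(-1)  # 预测 ETA(秒)

model = ETARegressor(n_features=8)
print(model(torch.randn(40, 8)).shape)        # 同一时刻 40 架飞机并行推理
```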

[AI-90] User-Intent-Driven Semantic Communication via Adaptive Deep Understanding

【速读】:该论文旨在解决现有语义通信系统在传输任务相关语义信息时,难以深入理解并泛化用户真实意图的问题。其解决方案的关键在于构建一个以用户意图驱动的语义通信框架:首先利用多模态大模型作为语义知识库生成用户意图先验,其次引入掩码引导注意力模块以有效突出关键语义区域,最后通过信道状态感知模块实现不同信道条件下的自适应、鲁棒传输,从而显著提升系统对抽象意图的理解能力和通信性能。

链接: https://arxiv.org/abs/2508.05884
作者: Peigen Ye,Jingpu Duan,Hongyang Du,Yulan Guo
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: IEEE Globecom 2025

点击查看摘要

Abstract:Semantic communication focuses on transmitting task-relevant semantic information, aiming for intent-oriented communication. While existing systems improve efficiency by extracting key semantics, they still fail to deeply understand and generalize users’ real intentions. To overcome this, we propose a user-intention-driven semantic communication system that interprets diverse abstract intents. First, we integrate a multi-modal large model as a semantic knowledge base to generate a user-intention prior. Next, a mask-guided attention module is proposed to effectively highlight critical semantic regions. Further, a channel state awareness module ensures adaptive, robust transmission across varying channel conditions. Extensive experiments demonstrate that our system achieves deep intent understanding and outperforms DeepJSCC; e.g., under a Rayleigh channel at an SNR of 5 dB, it achieves improvements of 8%, 6%, and 19% in PSNR, SSIM, and LPIPS, respectively.
zh

[AI-91] QuickGrasp: Lightweight Antipodal Grasp Planning with Point Clouds

【速读】:该论文旨在解决机器人在复杂环境中进行抓取规划时面临的泛化能力差、计算效率低以及重复性不足的问题,尤其是在六自由度(6-DOF)空间中基于采样的方法普遍存在性能瓶颈。其解决方案的关键在于提出一种轻量级的解析式抓取规划方法,通过将问题建模为优化问题来估计物体表面的抓取点,而非直接预测末端执行器位姿;同时引入软区域生长算法实现对曲面的有效平面分割,并结合基于优化的质量评估指标确保间接力闭合(indirect force closure),从而显著减少对高维空间采样的依赖,提升抓取规划的鲁棒性与实时性。

链接: https://arxiv.org/abs/2504.19716
作者: Navin Sriram Ravie,Keerthi Vasan M,Asokan Thondiyath,Bijo Sebastian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Grasping has been a long-standing challenge in facilitating the final interface between a robot and the environment. As environments and tasks become complicated, the need to embed higher intelligence to infer from the surroundings and act on them has become necessary. Although most methods utilize techniques to estimate grasp pose by treating the problem via pure sampling-based approaches in the six-degree-of-freedom space or as a learning problem, they usually fail in real-life settings owing to poor generalization across domains. In addition, the time taken to generate the grasp plan and the lack of repeatability, owing to sampling inefficiency and the probabilistic nature of existing grasp planning approaches, severely limits their application in real-world tasks. This paper presents a lightweight analytical approach towards robotic grasp planning, particularly antipodal grasps, with little to no sampling in the six-degree-of-freedom space. The proposed grasp planning algorithm is formulated as an optimization problem towards estimating grasp points on the object surface instead of directly estimating the end-effector pose. To this end, a soft-region-growing algorithm is presented for effective plane segmentation, even in the case of curved surfaces. An optimization-based quality metric is then used for the evaluation of grasp points to ensure indirect force closure. The proposed grasp framework is compared with the existing state-of-the-art grasp planning approach, Grasp Pose Detection (GPD), as a baseline over multiple simulated objects. The effectiveness of the proposed approach in comparison to GPD is also evaluated in a real-world setting using image and point-cloud data, with the planned grasps being executed using a ROBOTIQ gripper and UR5 manipulator.
zh
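
【示例】对偶(antipodal)抓取点的几何条件是:两接触点连线须同时落在两侧摩擦锥内。可以写成如下检验函数(摩擦系数 `mu` 为假设值,仅演示条件本身,并非论文的优化式质量度量):

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, mu=0.4):
    """对偶抓取条件检验:p1/p2 为接触点,n1/n2 为指向物体外部的
    单位法向。要求连线方向与两侧摩擦锥轴夹角均不超过 arctan(mu)。"""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)
    cos_cone = np.cos(np.arctan(mu))
    ok1 = np.dot(axis, -n1) >= cos_cone     # p1 侧:连线贴近内法向
    ok2 = np.dot(axis, n2) >= cos_cone      # p2 侧对称条件
    return bool(ok1 and ok2)

# 平行两侧的理想接触:满足条件
p1, n1 = np.array([0.0, 0, 0]), np.array([-1.0, 0, 0])
p2, n2 = np.array([0.05, 0, 0]), np.array([1.0, 0, 0])
print(is_antipodal(p1, n1, p2, n2))                       # True
print(is_antipodal(p1, np.array([0.0, 1, 0]), p2, n2))    # False:法向偏转过大
```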

[AI-92] Counting Short Trajectories in Elementary Cellular Automata using the Transfer Matrix Method

【速读】:该论文旨在解决如何定量刻画一维元胞自动机(Elementary Cellular Automata, ECA)的全局动力学行为问题,特别是建立与Wolfram定性分类体系之间的量化关联。其核心挑战在于精确计算在有限时间步内收敛至短吸引子(short attractors)的所有初态配置数量,并由此推导出熵统计量以区分不同ECA规则的行为类别。解决方案的关键在于对转移矩阵法(Transfer Matrix Method, TMM)进行适配,从而能够在热力学极限下(即网格尺寸趋于无穷时)准确计算给定参数 (p,c)(p, c) 下收敛至大小为 cc 的吸引子且耗时不超过 pp 步的初态配置的熵值。这一方法使得能够基于熵的变化特征对ECA规则进行定量归类:Class 1和Class 2规则快速达到高熵稳定状态,Class 3规则熵值低且迅速饱和,而Class 4规则则表现出持续的正熵,揭示了其复杂动力学本质。

链接: https://arxiv.org/abs/2508.09768
作者: Cédric Koller,Barbora Hudcová
机构: 未知
类目: Cellular Automata and Lattice Gases (nlin.CG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Chaotic Dynamics (nlin.CD)
备注: 10 pages, 8 figures, 1 table, accepted to ALife 2025

点击查看摘要

Abstract:Elementary Cellular Automata (ECAs) exhibit diverse behaviours often categorized by Wolfram’s qualitative classification. To provide a quantitative basis for understanding these behaviours, we investigate the global dynamics of such automata and we describe a method that allows us to compute the number of all configurations leading to short attractors in a limited number of time steps. This computation yields exact results in the thermodynamic limit (as the CA grid size grows to infinity), and is based on the Transfer Matrix Method (TMM) that we adapt for our purposes. Specifically, given two parameters (p, c) we are able to compute the entropy of all initial configurations converging to an attractor of size c after p time-steps. By calculating such statistics for various ECA rules, we establish a quantitative connection between the entropy and the qualitative Wolfram classification scheme. Class 1 rules rapidly converge to maximal entropy for stationary states ( c=1 ) as p increases. Class 2 rules also approach maximal entropy quickly for appropriate cycle lengths c , potentially requiring consideration of translations. Class 3 rules exhibit zero or low finite entropy that saturates after a short transient. Class 4 rules show finite positive entropy, similar to some Class 3 rules. This method provides a precise framework for quantifying trajectory statistics, although its exponential computational cost in p+c restricts practical analysis to short trajectories.
zh
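
【示例】以最简单的 p=0、c=1(静止点)情形演示转移矩阵法:状态取相邻细胞对 (a,b),允许转移 (a,b)→(b,c) 当且仅当局部规则满足 f(a,b,c)=b;环上不动点个数等于 trace(T^n),熵密度即最大特征值的对数。以下草图可直接运行(仅为该特例,非论文的全部 (p,c) 计算):

```python
import numpy as np

def rule_table(rule_number):
    """Wolfram 编号 -> 局部规则 f(l, m, r)。"""
    bits = [(rule_number >> i) & 1 for i in range(8)]
    return lambda l, m, r: bits[(l << 2) | (m << 1) | r]

def fixed_point_entropy(rule_number):
    """静止点(c=1, p=0)情形的转移矩阵法:
    T[(a,b) -> (b,c)] = 1 当且仅当 f(a,b,c) = b。"""
    f = rule_table(rule_number)
    T = np.zeros((4, 4))
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if f(a, b, c) == b:
                    T[(a << 1) | b, (b << 1) | c] = 1
    lam = max(abs(np.linalg.eigvals(T)))
    return np.log(lam) if lam > 0 else float("-inf")

for r in (204, 110, 30):   # 204 为恒等规则:所有配置都是不动点,熵 = log 2
    print(r, round(fixed_point_entropy(r), 4))
```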

[AI-93] NEUBORN: The Neurodevelopmental Evolution framework Using BiOmechanical RemodelliNg

【速读】:该论文旨在解决当前规范建模框架难以捕捉皮层发育精细解剖细节的问题,其根源在于这些方法依赖于群体平均参考空间进行数据建模,导致个体生长轨迹的生物合理性不足。解决方案的关键在于提出一种基于生物力学约束的纵向微分同胚图像配准框架,通过分层网络架构实现个体生长轨迹的学习;该方法在新生儿MRI数据(来自发育中的人类连接组计划)上训练,生成更符合生物学规律的形变场,显著改善了配准结果的平滑性与负雅可比行列式数量,从而提供更具可解释性和生物学基础的个体发育映射。

链接: https://arxiv.org/abs/2508.09757
作者: Nashira Baena,Mariana da Silva,Irina Grigorescu,Aakash Saboo,Saga Masui,Jaques-Donald Tournier,Emma C. Robinson
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding individual cortical development is essential for identifying deviations linked to neurodevelopmental disorders. However, current normative modelling frameworks struggle to capture fine-scale anatomical details due to their reliance on modelling data within a population-average reference space. Here, we present a novel framework for learning individual growth trajectories from biomechanically constrained, longitudinal, diffeomorphic image registration, implemented via a hierarchical network architecture. Trained on neonatal MRI data from the Developing Human Connectome Project, the method improves the biological plausibility of warps, generating growth trajectories that better follow population-level trends while producing smoother warps, with fewer negative Jacobians, than state-of-the-art baselines. The resulting subject-specific deformations provide interpretable, biologically grounded mappings of development. This framework opens new possibilities for predictive modeling of brain maturation and early identification of malformations of cortical development.
zh

[AI-94] Cross-BCI A Cross-BCI-Paradigm Classifica-tion Model Towards Universal BCI Applications

【速读】:该论文旨在解决脑机接口(Brain-Computer Interface, BCI)分类模型在不同BCI范式间缺乏通用性的问题,即当前模型通常仅适用于单一BCI范式,导致在新范式应用时需重复开发,增加成本与工作量。同时,研究也关注轻量化深度学习模型的构建,以适应便携设备部署需求。解决方案的关键在于提出一种轻量级且统一的解码模型,其核心结构包括:时空卷积模块用于初步特征提取;多尺度局部特征选择模块以提取跨范式共享的局部特征并生成加权特征;以及多维全局特征提取模块,将加权特征与多维全局特征融合,形成与BCI范式相关的高层特征表示。该方法在三种经典BCI范式(运动想象MI、稳态视觉诱发电位SSVEP和P300)混合数据集上显著优于对比模型,验证了其在跨范式分类中的有效性与通用性。

链接: https://arxiv.org/abs/2508.09242
作者: Gaojie Zhou,Junhua Li
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Classification models used in brain-computer interface (BCI) are usually designed for a single BCI paradigm. This requires the redevelopment of the model when applying it to a new BCI paradigm, resulting in repeated costs and effort. Moreover, less complex deep learning models are desired for practical usage, as well as for deployment on portable devices. In order to fill the above gaps, we, in this study, proposed a lightweight and unified decoding model for cross-BCI-paradigm classification. The proposed model starts with a tempo-spatial convolution. It is followed by a multi-scale local feature selection module, aiming to extract local features shared across BCI paradigms and generate weighted features. Finally, a multi-dimensional global feature extraction module is designed, in which multi-dimensional global features are extracted from the weighted features and fused with the weighted features to form high-level feature representations associated with BCI paradigms. The results, evaluated on a mixture of three classical BCI paradigms (i.e., MI, SSVEP, and P300), demonstrate that the proposed model achieves 88.39%, 82.36%, 80.01%, and 0.8092 for accuracy, macro-precision, macro-recall, and macro-F1-score, respectively, significantly outperforming the compared models. This study provides a feasible solution for cross-BCI-paradigm classification. It lays a technological foundation for developing a new generation of unified decoding systems, paving the way for low-cost and universal practical applications.
zh

[AI-95] Deep Generative Models for Discrete Genotype Simulation

【速读】:该论文旨在解决基因型数据(genotype data)生成中的挑战,尤其是如何在不泄露隐私的前提下模拟真实且具有遗传结构的基因型数据,并保留基因型-表型关联(genotype-phenotype association)。传统方法多集中于表达谱或单倍型数据的生成,而基因型数据因其离散性更具复杂性。解决方案的关键在于对主流生成模型(包括变分自编码器 VAE、扩散模型 Diffusion Models 和生成对抗网络 GAN)进行针对性适配,使其能够有效处理离散型基因型特征,并通过大规模牛和人类染色体数据集上的系统评估验证其性能,从而实现对遗传模式和表型关联的有效捕捉。

链接: https://arxiv.org/abs/2508.09212
作者: Sihan Xie(GABI),Thierry Tribout(GABI),Didier Boichard(GABI),Blaise Hanczar(IBISC),Julien Chiquet(MIA Paris-Saclay),Eric Barrey(GABI)
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cow and multiple chromosomes from human. Model performance was assessed using a well-established set of metrics drawn from both deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at this https URL.
zh

[AI-96] Quantum-Enhanced Generative Adversarial Networks: Comparative Analysis of Classical and Hybrid Quantum-Classical Generative Adversarial Networks

【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)中潜在表示质量受限于经典噪声分布的问题,从而影响生成数据的保真度。其解决方案的关键在于引入混合量子-经典GAN架构(Hybrid Quantum-Classical GANs, HQCGANs),其中量子生成器通过参数化量子电路(parameterised quantum circuits)生成潜向量供经典判别器使用,利用量子硬件在低维潜空间中的潜在优势,探索在当前含噪中等规模量子(Noisy Intermediate-Scale Quantum, NISQ)设备条件下提升生成模型性能的可能性。

链接: https://arxiv.org/abs/2508.09209
作者: Kun Ming Goh
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 9 figures, 3 tables

点击查看摘要

Abstract:Generative adversarial networks (GANs) have emerged as a powerful paradigm for producing high-fidelity data samples, yet their performance is constrained by the quality of latent representations, typically sampled from classical noise distributions. This study investigates hybrid quantum-classical GANs (HQCGANs) in which a quantum generator, implemented via parameterised quantum circuits, produces latent vectors for a classical discriminator. We evaluate a classical GAN alongside three HQCGAN variants with 3, 5, and 7 qubits, using Qiskit’s AerSimulator with realistic noise models to emulate near-term quantum devices. The binary MNIST dataset (digits 0 and 1) is used to align with the low-dimensional latent spaces imposed by current quantum hardware. Models are trained for 150 epochs and assessed with Frechet Inception Distance (FID) and Kernel Inception Distance (KID). Results show that while the classical GAN achieved the best scores, the 7-qubit HQCGAN produced competitive performance, narrowing the gap in later epochs, whereas the 3-qubit model exhibited earlier convergence limitations. Efficiency analysis indicates only moderate training time increases despite quantum sampling overhead. These findings validate the feasibility of noisy quantum circuits as latent priors in GAN architectures, highlighting their potential to enhance generative modelling within the constraints of the noisy intermediate-scale quantum (NISQ) era.
zh
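
【示例】"量子生成器输出潜向量"的最小可运行版本:用 NumPy 态矢量模拟一个 RY 变分层加 CNOT 纠缠环的 PQC,取各比特的 ⟨Z⟩ 期望作为 GAN 潜向量。这只是无噪声的玩具模拟,与论文基于 Qiskit AerSimulator 并含噪声模型的实现不同:

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q, n):
    """对第 q 个量子比特施加单比特门(态矢量视为 n 维张量)。"""
    state = state.reshape([2] * n)
    state = np.tensordot(gate, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q).reshape(-1)

def apply_cnot(state, ctrl, tgt, n):
    """CNOT:在 ctrl=1 的子空间里翻转 tgt 轴。"""
    state = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[ctrl] = 1
    sub = state[tuple(idx)]
    state[tuple(idx)] = np.flip(sub, axis=tgt if tgt < ctrl else tgt - 1)
    return state.reshape(-1)

def pqc_latent(thetas):
    """RY 变分层 + CNOT 纠缠环,输出各比特 <Z> 期望作为潜向量。"""
    n = len(thetas)
    state = np.zeros(2 ** n)
    state[0] = 1.0
    for q, t in enumerate(thetas):
        state = apply_1q(state, ry(t), q, n)
    for q in range(n):
        state = apply_cnot(state, q, (q + 1) % n, n)
    probs = np.abs(state) ** 2
    z = []
    for q in range(n):
        bits = (np.arange(2 ** n) >> (n - 1 - q)) & 1
        z.append(np.sum(probs * (1 - 2 * bits)))   # <Z> = P(0) - P(1)
    return np.array(z)

print(pqc_latent(np.array([0.3, 1.1, 2.0])))   # 3 比特 -> 3 维潜向量
```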

[AI-97] Quantum-Efficient Reinforcement Learning Solutions for Last-Mile On-Demand Delivery

【速读】:该论文旨在解决大规模带时间窗的容量限制取送货问题(Capacitated Pickup and Delivery Problem with Time Windows, CPDPTW),这是一个典型的NP-hard组合优化问题,在实际物流配送场景中具有重要意义。为应对传统经典算法在大规模解空间下计算复杂度高、难以收敛的问题,作者提出了一种基于强化学习(Reinforcement Learning, RL)框架并融合参数化量子电路(Parametrized Quantum Circuit, PQC)的混合量子-经典算法。其解决方案的关键在于设计了一种新颖的问题特异性编码量子电路,包含纠缠层(entangling layer)与变分层(variational layer),结合近端策略优化(Proximal Policy Optimization, PPO)和量子奇异值变换(Quantum Singular Value Transformation, QSVT)进行对比实验,验证了所提方法在解规模扩展性和训练复杂度方面的优势,同时有效嵌入现实约束条件以提升实用性。

链接: https://arxiv.org/abs/2508.09183
作者: Farzan Moosavi,Bilal Farooq
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Quantum computation has demonstrated a promising alternative to solving the NP-hard combinatorial problems. In particular, when it comes to optimization, classical approaches become intractable for large-scale solutions. We therefore investigate quantum computing to solve the large-scale Capacitated Pickup and Delivery Problem with Time Windows (CPDPTW). In this regard, a Reinforcement Learning (RL) framework augmented with a Parametrized Quantum Circuit (PQC) is designed to minimize the travel time in realistic last-mile on-demand delivery. A novel problem-specific encoding quantum circuit with an entangling and variational layer is proposed. Moreover, Proximal Policy Optimization (PPO) and Quantum Singular Value Transformation (QSVT) are designed for comparison through numerical experiments, highlighting the superiority of the proposed method in terms of the scale of the solution and training complexity while incorporating real-world constraints.
zh

[AI-98] Bayesian-Driven Graph Reasoning for Active Radio Map Construction

【速读】:该论文旨在解决低空经济背景下,无人机等空中平台在有限电池容量下进行无线覆盖数据采集时的效率与覆盖范围受限问题。现有基于航点(waypoint)导航的方法难以兼顾信息获取的充分性与能耗的优化。其解决方案的关键在于提出了一种不确定性感知的无线电地图(URAM)重建框架,该框架通过两个核心深度学习模块实现:一是基于贝叶斯神经网络实时估计空间不确定性,二是采用注意力机制增强的强化学习策略,在概率路网(probabilistic roadmap)上进行全局推理,利用不确定性信息规划具有信息增益且节能的轨迹。该图结构推理机制实现了非贪婪(non-myopic)的智能路径规划,有效引导代理向最具信息价值区域移动,同时满足安全约束,实验表明URAM相比现有基线方法可将重建精度提升最高达34%。

链接: https://arxiv.org/abs/2508.09142
作者: Wenlihan Lu,Shijian Gao,Miaowen Wen,Yuxuan Liang,Chan-Byoung Chae,H. Vincent Poor
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the emergence of the low-altitude economy, radio maps have become essential for ensuring reliable wireless connectivity to aerial platforms. Autonomous aerial agents are commonly deployed for data collection using waypoint-based navigation; however, their limited battery capacity significantly constrains coverage and efficiency. To address this, we propose an uncertainty-aware radio map (URAM) reconstruction framework that explicitly leverages graph-based reasoning tailored for waypoint navigation. Our approach integrates two key deep learning components: (1) a Bayesian neural network that estimates spatial uncertainty in real time, and (2) an attention-based reinforcement learning policy that performs global reasoning over a probabilistic roadmap, using uncertainty estimates to plan informative and energy-efficient trajectories. This graph-based reasoning enables intelligent, non-myopic trajectory planning, guiding agents toward the most informative regions while satisfying safety constraints. Experimental results show that URAM improves reconstruction accuracy by up to 34% over existing baselines.
zh
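
【示例】贝叶斯不确定性估计最常见的轻量实现之一是 MC-Dropout:推理时保持 dropout 开启、多次前向取方差,其标准差图可用于引导代理飞向最不确定的区域。以下 PyTorch 草图仅演示这种不确定性图的获取方式,网络结构为假设,并非 URAM 的贝叶斯神经网络本身:

```python
import torch
import torch.nn as nn

class RadioMapNet(nn.Module):
    """把平面坐标映射为路径损耗估计的小网络(结构为假设)。"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1),
        )

    def forward(self, xy):
        return self.net(xy)

@torch.no_grad()
def mc_dropout_uncertainty(model, xy, n_samples=32):
    """推理时保持 dropout 激活,多次采样取均值与标准差。"""
    model.train()                      # train 模式使 dropout 保持开启
    preds = torch.stack([model(xy) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)

model = RadioMapNet()
grid = torch.rand(1000, 2)             # 1000 个空间采样点
mean, std = mc_dropout_uncertainty(model, grid)
print(mean.shape, std.shape, float(std.mean()))
```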

机器学习

[LG-0] Dynamic Mixture-of-Experts for Incremental Graph Learning

链接: https://arxiv.org/abs/2508.09974
作者: Lecheng Kong,Theodore Vasiloudis,Seongjun Yun,Han Xie,Xiang Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph incremental learning is a learning paradigm that aims to adapt trained models to continuously incremented graphs and data over time without the need for retraining on the full dataset. However, regular graph machine learning methods suffer from catastrophic forgetting when applied to incremental learning settings, where previously learned knowledge is overridden by new knowledge. Previous approaches have tried to address this by treating the previously trained model as an inseparable unit and using techniques to maintain old behaviors while learning new knowledge. These approaches, however, do not account for the fact that previously acquired knowledge at different timestamps contributes differently to learning new tasks. Some prior patterns can be transferred to help learn new data, while others may deviate from the new data distribution and be detrimental. To address this, we propose a dynamic mixture-of-experts (DyMoE) approach for incremental learning. Specifically, a DyMoE GNN layer adds new expert networks specialized in modeling the incoming data blocks. We design a customized regularization loss that utilizes data sequence information so existing experts can maintain their ability to solve old tasks while helping the new expert learn the new data effectively. As the number of data blocks grows over time, the computational cost of the full mixture-of-experts (MoE) model increases. To address this, we introduce a sparse MoE approach, where only the top-k most relevant experts make predictions, significantly reducing the computation time. Our model achieved a 4.92% relative accuracy increase compared to the best baselines on class incremental learning, showing the model’s exceptional power.
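
【示例】稀疏 top-k MoE 的要点是:门控只激活少数专家,且可随新数据块追加专家。以下 PyTorch 草图演示这一机制(追加专家时的门控扩展方式为假设,并非 DyMoE 原文细节):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """随数据块增长追加专家,推理时仅激活 top-k 个最相关专家。"""
    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def add_expert(self, dim: int):
        """新数据块到来时追加一个专家,并扩展门控输出维度(旧权重保留)。"""
        self.experts.append(nn.Linear(dim, dim))
        old = self.gate
        self.gate = nn.Linear(dim, len(self.experts))
        with torch.no_grad():
            self.gate.weight[: old.out_features] = old.weight
            self.gate.bias[: old.out_features] = old.bias

    def forward(self, x):                          # x: (batch, dim)
        scores = self.gate(x)
        topv, topi = scores.topk(self.k, dim=-1)
        w = F.softmax(topv, dim=-1)                # 仅在 top-k 上归一化
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

moe = SparseMoE(dim=32, n_experts=3)
moe.add_expert(32)                                 # 模拟新数据块到来
print(moe(torch.randn(16, 32)).shape)              # torch.Size([16, 32])
```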

[LG-1] Prototype-Guided Diffusion: Visual Conditioning without External Memory

链接: https://arxiv.org/abs/2508.09922
作者: Bilal Faye,Hanane Azzag,Mustapha Lebbah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.

[LG-2] Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?

链接: https://arxiv.org/abs/2508.09888
作者: Viacheslav Barkov,Jonas Schmidinger,Robin Gebbers,Martin Atzmueller
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of pedometrics, tabular machine learning is the predominant method for predicting soil properties from remote and proximal soil sensing data, forming a central component of digital soil mapping. At the field-scale, this predictive soil modeling (PSM) task is typically constrained by small training sample sizes and high feature-to-sample ratios in soil spectroscopy. Traditionally, these conditions have proven challenging for conventional deep learning methods. Classical machine learning algorithms, particularly tree-based models like Random Forest and linear models such as Partial Least Squares Regression, have long been the default choice for field-scale PSM. Recent advances in artificial neural networks (ANN) for tabular data challenge this view, yet their suitability for field-scale PSM has not been proven. We introduce a comprehensive benchmark that evaluates state-of-the-art ANN architectures, including the latest multilayer perceptron (MLP)-based models (TabM, RealMLP), attention-based transformer variants (FT-Transformer, ExcelFormer, T2G-Former, AMFormer), retrieval-augmented approaches (TabR, ModernNCA), and an in-context learning foundation model (TabPFN). Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties: soil organic matter or soil organic carbon, pH, and clay content. Our results reveal that modern ANNs consistently outperform classical methods on the majority of tasks, demonstrating that deep learning has matured sufficiently to overcome the long-standing dominance of classical machine learning for PSM. Notably, TabPFN delivers the strongest overall performance, showing robustness across varying conditions. We therefore recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.

[LG-3] FedShard: Federated Unlearning with Efficiency Fairness and Performance Fairness

链接: https://arxiv.org/abs/2508.09866
作者: Siyuan Wen,Meng Zhang,Yang Yang,Ningning Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To protect clients’ right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard’s fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard accelerates the data unlearning process 1.3-6.2 times faster than retraining from scratch and 4.9 times faster than the state-of-the-art exact unlearning methods.

[LG-4] RankList – A Listwise Preference Learning Framework for Predicting Subjective Preferences

链接: https://arxiv.org/abs/2508.09826
作者: Abinay Reddy Naini,Fernando Diaz,Carlos Busso
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures

点击查看摘要

Abstract:Preference learning has gained significant attention in tasks involving subjective human judgments, such as speech emotion recognition (SER) and image aesthetic assessment. While pairwise frameworks such as RankNet offer robust modeling of relative preferences, they are inherently limited to local comparisons and struggle to capture global ranking consistency. To address these limitations, we propose RankList, a novel listwise preference learning framework that generalizes RankNet to structured list-level supervision. Our formulation explicitly models local and non-local ranking constraints within a probabilistic framework. The paper introduces a log-sum-exp approximation to improve training efficiency. We further extend RankList with skip-wise comparisons, enabling progressive exposure to complex list structures and enhancing global ranking fidelity. Extensive experiments demonstrate the superiority of our method across diverse modalities. On benchmark SER datasets (MSP-Podcast, IEMOCAP, BIIC Podcast), RankList achieves consistent improvements in Kendall’s Tau and ranking accuracy compared to standard listwise baselines. We also validate our approach on aesthetic image ranking using the Artistic Image Aesthetics dataset, highlighting its broad applicability. Through ablation and cross-domain studies, we show that RankList not only improves in-domain ranking but also generalizes better across datasets. Our framework offers a unified, extensible approach for modeling ordered preferences in subjective learning scenarios.
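
【示例】列表级偏好损失配合 log-sum-exp 的一种常见写法是 Plackett–Luce 式负对数似然,数值上恰好由 `logsumexp` 稳定实现。以下草图仅演示这一思想,并非 RankList 的完整目标函数:

```python
import torch

def listwise_loss(scores):
    """对按真实偏好从高到低排好序的得分序列 (batch, L),
    取 Plackett-Luce 风格的列表级负对数似然:
    sum_i [ logsumexp(scores[i:]) - scores[i] ]。"""
    loss = 0.0
    L = scores.shape[1]
    for i in range(L - 1):
        loss = loss + (torch.logsumexp(scores[:, i:], dim=1) - scores[:, i])
    return loss.mean()

good = torch.tensor([[3.0, 2.0, 1.0, 0.0]])   # 与真实顺序一致 -> 损失小
bad = torch.tensor([[0.0, 1.0, 2.0, 3.0]])    # 完全逆序 -> 损失大
print(listwise_loss(good).item(), listwise_loss(bad).item())
```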

[LG-5] Feature Impact Analysis on Top Long-Jump Performances with Quantile Random Forest and Explainable AI Techniques

链接: https://arxiv.org/abs/2508.09810
作者: Qi Gan,Stephan Clémençon,Mounîm A.El-Yacoubi,Sao Mai Nguyen,Eric Fenaux,Ons Jelassi
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Biomechanical features have become important indicators for evaluating athletes’ techniques. Traditionally, experts propose significant features and evaluate them using physics equations. However, the complexity of the human body and its movements makes it challenging to explicitly analyze the relationships between some features and athletes’ final performance. With advancements in modern machine learning and statistics, data analytics methods have gained increasing importance in sports analytics. In this study, we leverage machine learning models to analyze expert-proposed biomechanical features from the finals of long jump competitions in the World Championships. The objectives of the analysis include identifying the most important features contributing to top-performing jumps and exploring the combined effects of these key features. Using quantile regression, we model the relationship between the biomechanical feature set and the target variable (effective distance), with a particular focus on elite-level jumps. To interpret the model, we apply SHapley Additive exPlanations (SHAP) alongside Partial Dependence Plots (PDPs) and Individual Conditional Expectation (ICE) plots. The findings reveal that, beyond the well-documented velocity-related features, specific technical aspects also play a pivotal role. For male athletes, the angle of the knee of the supporting leg before take-off is identified as a key factor for achieving top 10% performance in our dataset, with angles greater than 169° contributing significantly to jump performance. In contrast, for female athletes, the landing pose and approach step technique emerge as the most critical features influencing top 10% performances, alongside velocity. This study establishes a framework for analyzing the impact of various features on athletic performance, with a particular emphasis on top-performing events.

[LG-6] Bayesian autoregression to optimize temporal Matérn kernel Gaussian process hyperparameters

链接: https://arxiv.org/abs/2508.09792
作者: Wouter M. Kouw
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 9 pages, 4 figures, accepted to the International Conference on Probabilistic Numerics 2025

点击查看摘要

Abstract:Gaussian processes are important models in the field of probabilistic numerics. We present a procedure for optimizing Matérn kernel temporal Gaussian processes with respect to the kernel covariance function’s hyperparameters. It is based on casting the optimization problem as a recursive Bayesian estimation procedure for the parameters of an autoregressive model. We demonstrate that the proposed procedure outperforms maximizing the marginal likelihood as well as Hamiltonian Monte Carlo sampling, both in terms of runtime and ultimate root mean square error in Gaussian process regression.

[LG-7] TriForecaster: A Mixture of Experts Framework for Multi-Region Electric Load Forecasting with Tri-dimensional Specialization

链接: https://arxiv.org/abs/2508.09753
作者: Zhaoyang Zhu,Zhipeng Zeng,Qiming Chen,Linxiao Yang,Peiyuan Liu,Weiqi Chen,Liang Sun
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Electric load forecasting is pivotal for power system operation, planning and decision-making. The rise of smart grids and meters has provided more detailed and high-quality load data at multiple levels of granularity, from home to bus and cities. Motivated by similar patterns of loads across different cities in a province in eastern China, in this paper we focus on the Multi-Region Electric Load Forecasting (MRELF) problem, targeting accurate short-term load forecasting for multiple sub-regions within a large region. We identify three challenges for MRELF, including regional variation, contextual variation, and temporal variation. To address them, we propose TriForecaster, a new framework leveraging the Mixture of Experts (MoE) approach within a Multi-Task Learning (MTL) paradigm to overcome these challenges. TriForecaster features RegionMixer and Context-Time Specializer (CTSpecializer) layers, enabling dynamic cooperation and specialization of expert models across regional, contextual, and temporal dimensions. Based on evaluation on four real-world MRELF datasets with varied granularity, TriForecaster outperforms state-of-the-art models by achieving an average forecast error reduction of 22.4%, thereby demonstrating its flexibility and broad applicability. In particular, the deployment of TriForecaster on the eForecaster platform in eastern China exemplifies its practical utility, effectively providing city-level, short-term load forecasts for 17 cities, supporting a population exceeding 110 million and daily electricity usage over 100 gigawatt-hours.
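
The RegionMixer and CTSpecializer layers are specific to the paper, but the underlying pattern, a gate that softly weights expert forecasts per input, can be sketched generically. The module below is a hypothetical simplification in PyTorch, not TriForecaster itself; in the paper's setting the input features would encode regional, contextual, and temporal information.

```python
# Generic soft mixture-of-experts layer for short-term load forecasting (sketch).
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, in_dim: int, n_experts: int = 4, horizon: int = 24):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, horizon))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)  # context decides expert weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, H)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (B, H)

# Usage: one row per sub-region, forecasting 24 steps ahead.
model = SoftMoE(in_dim=32)
forecast = model(torch.randn(8, 32))
print(forecast.shape)  # torch.Size([8, 24])
```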

[LG-8] μ-Parametrization for Mixture of Experts

链接: https://arxiv.org/abs/2508.09752
作者: Jan Małaśnicki,Kamil Ciebiera,Mateusz Boruń,Maciej Pióro,Jan Ludziejewski,Maciej Stefaniak,Michał Krutul,Sebastian Jaszczur,Marek Cygan,Kamil Adamczewski,Jakub Krajewski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent years have seen growing interest in and adoption of LLMs, with μTransfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advancements has remained unexplored. In this work, we derive a μ-Parameterization (μP) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and experts. We empirically validate our parameterization and further investigate how scaling the number of experts and granularity affects the optimal learning rate.

[LG-9] A Machine Learning Approach to Predict Biological Age and its Longitudinal Drivers

链接: https://arxiv.org/abs/2508.09747
作者: Nazira Dunbayeva,Yulong Li,Yutong Xie,Imran Razzak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting an individual’s aging trajectory is a central challenge in preventative medicine and bioinformatics. While machine learning models can predict chronological age from biomarkers, they often fail to capture the dynamic, longitudinal nature of the aging process. In this work, we developed and validated a machine learning pipeline to predict age using a longitudinal cohort with data from two distinct time periods (2019-2020 and 2021-2022). We demonstrate that a model using only static, cross-sectional biomarkers has limited predictive power when generalizing to future time points. However, by engineering novel features that explicitly capture the rate of change (slope) of key biomarkers over time, we significantly improved model performance. Our final LightGBM model, trained on the initial wave of data, successfully predicted age in the subsequent wave with high accuracy (R² = 0.515 for males, R² = 0.498 for females), significantly outperforming both traditional linear models and other tree-based ensembles. SHAP analysis of our successful model revealed that the engineered slope features were among the most important predictors, highlighting that an individual’s health trajectory, not just their static health snapshot, is a key determinant of biological age. Our framework paves the way for clinical tools that dynamically track patient health trajectories, enabling early intervention and personalized prevention strategies for age-related diseases.
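
The paper's key feature-engineering step, turning two measurement waves into rate-of-change features, is easy to reproduce in outline. A toy sketch (assuming the lightgbm package; the biomarker columns and target are synthetic stand-ins, not the study's data):

```python
# Sketch: engineer biomarker slopes across two waves, then fit LightGBM.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

rng = np.random.default_rng(2)
n = 1000
wave1 = pd.DataFrame({"hdl": rng.normal(55, 10, n), "crp": rng.normal(2, 1, n)})
wave2 = wave1 + rng.normal(0, 2, size=wave1.shape)   # follow-up measurements
years_between = 2.0

# Rate-of-change (slope) features capture each person's trajectory.
slopes = (wave2 - wave1) / years_between
X = pd.concat([wave2.add_suffix("_static"), slopes.add_suffix("_slope")], axis=1)
age = 50 + 3.0 * slopes["crp"] + rng.normal(0, 3, n)  # toy target

model = LGBMRegressor(n_estimators=200).fit(X, age)
print(model.score(X, age))  # in-sample R^2 on the toy data
```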

[LG-10] HKT: A Biologically Inspired Framework for Modular Hereditary Knowledge Transfer in Neural Networks

链接: https://arxiv.org/abs/2508.09743
作者: Yanick Chistian Tchenko,Felix Mohr,Hicham Hadj Abdelkader,Hedi Tabia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A prevailing trend in neural network research suggests that model performance improves with increasing depth and capacity - often at the cost of integrability and efficiency. In this paper, we propose a strategy to optimize small, deployable models by enhancing their capabilities through structured knowledge inheritance. We introduce Hereditary Knowledge Transfer (HKT), a biologically inspired framework for modular and selective transfer of task-relevant features from a larger, pretrained parent network to a smaller child model. Unlike standard knowledge distillation, which enforces uniform imitation of teacher outputs, HKT draws inspiration from biological inheritance mechanisms - such as memory RNA transfer in planarians - to guide a multi-stage process of feature transfer. Neural network blocks are treated as functional carriers, and knowledge is transmitted through three biologically motivated components: Extraction, Transfer, and Mixture (ETM). A novel Genetic Attention (GA) mechanism governs the integration of inherited and native representations, ensuring both alignment and selectivity. We evaluate HKT across diverse vision tasks, including optical flow (Sintel, KITTI), image classification (CIFAR-10), and semantic segmentation (LiTS), demonstrating that it significantly improves child model performance while preserving its compactness. The results show that HKT consistently outperforms conventional distillation approaches, offering a general-purpose, interpretable, and scalable solution for deploying high-performance neural networks in resource-constrained environments.

[LG-11] Generative Modeling with Multi-Instance Reward Learning for E-commerce Creative Optimization

链接: https://arxiv.org/abs/2508.09730
作者: Qiaolei Gu,Yu Li,DingYi Zeng,Lu Wang,Ming Pang,Changping Peng,Zhangang Lin,Ching Law,Jingping Shao
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, conference paper

点击查看摘要

Abstract:In e-commerce advertising, selecting the most compelling combination of creative elements – such as titles, images, and highlights – is critical for capturing user attention and driving conversions. However, existing methods often evaluate creative components individually, failing to navigate the exponentially large search space of possible combinations. To address this challenge, we propose a novel framework named GenCO that integrates generative modeling with multi-instance reward learning. Our unified two-stage architecture first employs a generative model to efficiently produce a diverse set of creative combinations. This generative process is optimized with reinforcement learning, enabling the model to effectively explore and refine its selections. Next, to overcome the challenge of sparse user feedback, a multi-instance learning model attributes combination-level rewards, such as clicks, to the individual creative elements. This allows the reward model to provide a more accurate feedback signal, which in turn guides the generative model toward creating more effective combinations. Deployed on a leading e-commerce platform, our approach has significantly increased advertising revenue, demonstrating its practical value. Additionally, we are releasing a large-scale industrial dataset to facilitate further research in this important domain.

[LG-12] GraphTreeGen: Subtree-Centric Approach to Efficient and Supervised Graph Generation

链接: https://arxiv.org/abs/2508.09710
作者: Yitong Luo,Islem Rekik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Brain connectomes, representing neural connectivity as graphs, are crucial for understanding brain organization but costly and time-consuming to acquire, motivating generative approaches. Recent advances in graph generative modeling offer a data-driven alternative, enabling synthetic connectome generation and reducing dependence on large neuroimaging datasets. However, current models face key limitations: (i) compressing the whole graph into a single latent code (e.g., VGAEs) blurs fine-grained local motifs; (ii) relying on rich node attributes rarely available in connectomes reduces reconstruction quality; (iii) edge-centric models emphasize topology but overlook accurate edge-weight prediction, harming quantitative fidelity; and (iv) computationally expensive designs (e.g., edge-conditioned convolutions) impose high memory demands, limiting scalability. We propose GraphTreeGen (GTG), a subtree-centric generative framework for efficient, accurate connectome synthesis. GTG decomposes each connectome into entropy-guided k-hop trees capturing informative local structure, encoded by a shared GCN. A bipartite message-passing layer fuses subtree embeddings with global node features, while a dual-branch decoder jointly predicts edge existence and weights to reconstruct the adjacency matrix. GTG outperforms state-of-the-art baselines in self-supervised tasks and remains competitive in supervised settings, delivering higher structural fidelity and more precise weights with far less memory. Its modular design enables extensions to connectome super-resolution and cross-modality synthesis. Code: this https URL

[LG-13] Temporal Anchoring in Deepening Embedding Spaces: Event-Indexed Projections, Drift Convergence, and an Internal Computational Architecture

链接: https://arxiv.org/abs/2508.09693
作者: Faruk Alpay,Bugra Kilictas,Hamdi Alakkad
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 16 pages, 2 figures, 2 tables

点击查看摘要

Abstract:We develop an operator-theoretic framework for temporal anchoring in embedding spaces, modeled as drift maps interleaved with event-indexed blocks culminating in affine projections. We provide complete proofs for a variable-block contraction lemma (products of Lipschitz factors), a drift–projection convergence theorem with explicit uniform-gap envelopes, and ontological convergence under nested affine anchors with a robustness variant. We formalize an internal Manuscript Computer (MC) whose computations are defined purely by these operators and prove a rigorous finite-run equivalence theorem (with perturbation bounds). For attention layers, we give a self-contained proof that softmax is 1/2-Lipschitz in ℓ2 and derive sufficient layer-contraction conditions (orthogonal/non-orthogonal heads). All floats are placed exactly where written; the manuscript uses only in-paper pseudocode and appendix figures.
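
The 1/2-Lipschitz claim for softmax can be sanity-checked numerically: the Jacobian of softmax at probability vector p is diag(p) − ppᵀ, whose spectral norm never exceeds 1/2 (the bound is attained at p = (1/2, 1/2)). A quick check:

```python
# Numerical check that ||J_softmax||_2 <= 1/2, where J = diag(p) - p p^T.
import numpy as np

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(10_000):
    z = rng.normal(0, 3, size=5)
    p = np.exp(z - z.max())
    p /= p.sum()                                 # numerically stable softmax
    J = np.diag(p) - np.outer(p, p)              # softmax Jacobian at p
    worst = max(worst, np.linalg.norm(J, 2))     # spectral norm
print(worst)  # stays below 0.5; approaches it when mass concentrates on two logits
```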

[LG-14] Global Convergence Analysis of Vanilla Gradient Descent for Asymmetric Matrix Completion

链接: https://arxiv.org/abs/2508.09685
作者: Xu Zhang,Shuo Chen,Jinsheng Li,Xiangying Pang,Maoguo Gong
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:This paper investigates the asymmetric low-rank matrix completion problem, which can be formulated as an unconstrained non-convex optimization problem with a nonlinear least-squares objective function, and is solved via gradient descent methods. Previous gradient descent approaches typically incorporate regularization terms into the objective function to guarantee convergence. However, numerical experiments and theoretical analysis of the gradient flow both demonstrate that the elimination of regularization terms in gradient descent algorithms does not adversely affect convergence performance. By introducing the leave-one-out technique, we inductively prove that the vanilla gradient descent with spectral initialization achieves a linear convergence rate with high probability. Besides, we demonstrate that the balancing regularization term exhibits a small norm during iterations, which reveals the implicit regularization property of gradient descent. Empirical results show that our algorithm has a lower computational cost while maintaining comparable completion performance compared to other gradient descent algorithms.
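
A minimal version of the vanilla (unregularized) gradient descent analyzed here, factorize M ≈ XYᵀ, take gradients of the squared loss over observed entries only, and initialize spectrally, looks like the following toy sketch; the step size and iteration count are illustrative, not tuned.

```python
# Vanilla gradient descent for asymmetric matrix completion (no balancing term).
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r, p = 100, 80, 5, 0.3
M = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))   # ground-truth rank-r
mask = rng.random((n1, n2)) < p                           # observed entries

# Spectral initialization from the rescaled zero-filled observations.
U, s, Vt = np.linalg.svd(np.where(mask, M, 0.0) / p, full_matrices=False)
X = U[:, :r] * np.sqrt(s[:r])
Y = Vt[:r].T * np.sqrt(s[:r])

eta = 0.3 / s[0]  # step size scaled by the top singular value
for _ in range(500):
    R = np.where(mask, X @ Y.T - M, 0.0)    # residual on observed entries only
    X, Y = X - eta * R @ Y, Y - eta * R.T @ X
print(np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M))  # relative error -> small
```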

[LG-15] DeputyDev – AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity

链接: https://arxiv.org/abs/2508.09676
作者: Vishal Khare,Vijay Saini,Deepak Sharma,Anand Kumar,Ankit Rana,Anshul Yadav
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 6 pages of supplementary materials

点击查看摘要

Abstract:This study investigates the implementation and efficacy of DeputyDev, an AI-powered code review assistant developed to address inefficiencies in the software development process. The process of code review is highly inefficient for several reasons: it is time-consuming, feedback is inconsistent, and review quality is often subpar. Using our telemetry data, we observed that at TATA 1mg, pull request (PR) processing exhibits significant inefficiencies, with average pick-up and review times of 73 and 82 hours, respectively, resulting in a 6.2 day closure cycle. The review cycle was marked by prolonged iterative communication between the reviewing and submitting parties. Research from the University of California, Irvine indicates that interruptions can lead to an average of 23 minutes of lost focus, critically affecting code quality and timely delivery. To address these challenges, we developed DeputyDev’s PR review capabilities by providing automated, contextual code reviews. We conducted a rigorous double-controlled A/B experiment involving over 200 engineers to evaluate DeputyDev’s impact on review times. The results demonstrated a statistically significant reduction in both average per PR (23.09%) and average per-line-of-code (40.13%) review durations. After implementing safeguards to exclude outliers, DeputyDev has been effectively rolled out across the entire organisation. Additionally, it has been made available to external companies as a Software-as-a-Service (SaaS) solution, currently supporting the daily work of numerous engineering professionals. This study explores the implementation and effectiveness of AI-assisted code reviews in improving development workflow timelines and code quality.

[LG-16] Social-Sensor Identity Cloning Detection Using Weakly Supervised Deep Forest and Cryptographic Authentication

链接: https://arxiv.org/abs/2508.09665
作者: Ahmed Alharbi,Hai Dong,Xun Yi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 23 pages

点击查看摘要

Abstract:Recent years have witnessed a rising trend in social-sensor cloud identity cloning incidents. However, existing approaches suffer from unsatisfactory performance, a lack of solutions for detecting duplicated accounts, and a lack of large-scale evaluations on real-world datasets. We introduce a novel method for detecting identity cloning in social-sensor cloud service providers. Our proposed technique consists of two primary components: 1) a similar identity detection method and 2) a cryptography-based authentication protocol. Initially, we developed a weakly supervised deep forest model to identify similar identities using non-privacy-sensitive user profile features provided by the service. Subsequently, we designed a cryptography-based authentication protocol to verify whether similar identities were generated by the same provider. Our extensive experiments on a large real-world dataset demonstrate the feasibility and superior performance of our technique compared to current state-of-the-art identity clone detection methods.

[LG-17] Thermal Tracks: A Gaussian process-based framework for universal melting curve analysis enabling unconstrained hit identification in thermal proteome profiling experiments

链接: https://arxiv.org/abs/2508.09659
作者: Johannes F. Hevler,Shivam Verma,Mirat Soijtra,Carolyn R. Bertozzi
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 5 pages, 2 figures, short communication

点击查看摘要

Abstract:Thermal Tracks is a Python-based statistical framework for analyzing protein thermal stability data that overcomes key limitations of existing thermal proteome profiling (TPP) workflows. Unlike standard approaches that assume sigmoidal melting curves and are constrained by empirical null distributions (limiting significant hits to approximately 5 % of data), Thermal Tracks uses Gaussian Process (GP) models with squared-exponential kernels to flexibly model any melting curve shape while generating unbiased null distributions through kernel priors. This framework is particularly valuable for analyzing proteome-wide perturbations that significantly alter protein thermal stability, such as pathway inhibitions, genetic modifications, or environmental stresses, where conventional TPP methods may miss biologically relevant changes due to their statistical constraints. Furthermore, Thermal Tracks excels at analyzing proteins with unconventional melting profiles, including phase-separating proteins and membrane proteins, which often exhibit complex, non-sigmoidal thermal stability behaviors. Thermal Tracks is freely available from GitHub and is implemented in Python, providing an accessible and flexible tool for proteome-wide thermal profiling studies.

[LG-18] Personalized Product Search Ranking: A Multi-Task Learning Approach with Tabular and Non-Tabular Data PRICAI-2025

链接: https://arxiv.org/abs/2508.09636
作者: Lalitesh Morishetti,Abhay Kumar,Jonathan Scott,Kaushiki Nag,Gunjan Sharma,Shanu Vashishtha,Rahul Sridhar,Rohit Chatter,Kannan Achan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, The Pacific Rim International Conference on Artificial Intelligence (PRICAI-2025) Conference

点击查看摘要

Abstract:In this paper, we present a novel model architecture for optimizing personalized product search ranking using a multi-task learning (MTL) framework. Our approach uniquely integrates tabular and non-tabular data, leveraging a pre-trained TinyBERT model for semantic embeddings and a novel sampling technique to capture diverse customer behaviors. We evaluate our model against several baselines, including XGBoost, TabNet, FT-Transformer, DCN-V2, and MMoE, focusing on their ability to handle mixed data types and optimize personalized ranking. Additionally, we propose a scalable relevance labeling mechanism based on click-through rates, click positions, and semantic similarity, offering an alternative to traditional human-annotated labels. Experimental results show that combining non-tabular data with advanced embedding techniques in multi-task learning paradigm significantly enhances model performance. Ablation studies further underscore the benefits of incorporating relevance labels, fine-tuning TinyBERT layers, and TinyBERT query-product embedding interactions. These results demonstrate the effectiveness of our approach in achieving improved personalized product search ranking.

[LG-19] Physics- and geometry-aware spatio-spectral graph neural operator for time-independent and time-dependent PDEs

链接: https://arxiv.org/abs/2508.09627
作者: Subhankar Sarkar,Souvik Chakraborty
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving partial differential equations (PDEs) efficiently and accurately remains a cornerstone challenge in science and engineering, especially for problems involving complex geometries and limited labeled data. We introduce a Physics- and Geometry-Aware Spatio-Spectral Graph Neural Operator (πG-Sp²GNO) for learning the solution operators of time-independent and time-dependent PDEs. The proposed approach first improves upon the recently developed Sp²GNO by enabling geometry awareness and subsequently exploits the governing physics to learn the underlying solution operator in a simulation-free setup. While the spatio-spectral structure present in the proposed architecture allows multiscale learning, two separate strategies for enabling geometry awareness are introduced in this paper. For time-dependent problems, we also introduce a novel hybrid physics-informed loss function that combines a higher-order time-marching scheme with an upscaled-theory-inspired stochastic projection scheme. This allows accurate integration of the physics information into the loss function. The performance of the proposed approach is illustrated on a number of benchmark examples involving regular and complex domains, variation in geometry during inference, and time-independent and time-dependent problems. The results obtained illustrate the efficacy of the proposed approach as compared to the state-of-the-art physics-informed neural operator algorithms in the literature.

[LG-20] Online Prediction with Limited Selectivity

链接: https://arxiv.org/abs/2508.09592
作者: Licheng Liu,Mingda Qiao
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Selective prediction [Dru13, QV19] models the scenario where a forecaster freely decides on the prediction window that their forecast spans. Many data statistics can be predicted to a non-trivial error rate without any distributional assumptions or expert advice, yet these results rely on the forecaster being free to predict at any time. We introduce a model of Prediction with Limited Selectivity (PLS) where the forecaster can start the prediction only on a subset of the time horizon. We study the optimal prediction error both on an instance-by-instance basis and via an average-case analysis. We introduce a complexity measure that gives instance-dependent bounds on the optimal error. For a randomly-generated PLS instance, these bounds match with high probability.

[LG-21] HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

链接: https://arxiv.org/abs/2508.09591
作者: Wenxiang Lin,Xinglin Pan,Lin Zhang,Shaohuai Shi,Xuan Wang,Xiaowen Chu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) due to its sparsity, which requires fewer computational demands while easily scaling the model size. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, while the activated experts may not be located on the same device or GPU as the token. However, this leads to substantial communication overhead and load imbalance across all GPUs, which obstructs the scalability of distributed systems within a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models by two topology-aware techniques: 1) token deduplication to reduce the communication traffic, and 2) expert swap to balance the workloads among all GPUs. To enable the above two proposed approaches to be more general, we build theoretical models aimed at achieving the best token deduplication and expert swap strategy under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that our HierMoE achieves 1.55× to 3.32× faster communication and delivers 1.18× to 1.27× faster end-to-end training compared to state-of-the-art MoE training systems, Tutel-2DH, SmartMoE, and Megatron-LM.
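
The first of the two techniques, deduplicating repeated tokens before the all-to-all dispatch and restoring them afterwards, can be illustrated in a few lines. This is a conceptual single-device sketch, not HierMoE's hierarchical, topology-aware implementation:

```python
# Sketch: send each unique routed token once, then scatter results back.
import torch

token_ids = torch.tensor([3, 7, 3, 3, 9, 7])        # routed token ids (duplicates)
embeddings = torch.randn(10, 16)                     # embedding table

unique_ids, inverse = torch.unique(token_ids, return_inverse=True)
payload = embeddings[unique_ids]                     # what would cross the network
print(f"traffic reduced from {len(token_ids)} to {len(unique_ids)} rows")

expert_out = payload * 2.0                           # stand-in for expert compute
restored = expert_out[inverse]                       # undo the deduplication
assert restored.shape[0] == len(token_ids)
```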

[LG-22] Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges

链接: https://arxiv.org/abs/2508.09561
作者: Changyuan Zhao,Guangyuan Liu,Ruichen Zhang,Yinqiu Liu,Jiacheng Wang,Jiawen Kang,Dusit Niyato,Zan Li,Xuemin (Sherman) Shen,Zhu Han,Sumei Sun,Chau Yuen,Dong In Kim
类目: Machine Learning (cs.LG)
*备注: 21 pages. 9 figures

点击查看摘要

Abstract:Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously across diverse, dynamic environments. Central to this vision are world models, which act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi-step actions with foresight. This proactive nature allows agents to anticipate potential outcomes and optimize decisions ahead of real-world interactions. While prior works in robotics and gaming have showcased the potential of world models, their integration into the wireless edge for EGI remains underexplored. This survey bridges this gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge. We first examine the architectural foundations of world models, including latent representation learning, dynamics modeling, and imagination-based planning. Building on these core capabilities, we illustrate their proactive applications across EGI scenarios such as vehicular networks, unmanned aerial vehicle (UAV) networks, the Internet of Things (IoT) systems, and network functions virtualization, thereby highlighting how they can enhance optimization under latency, energy, and privacy constraints. We then explore their synergy with foundation models and digital twins, positioning world models as the cognitive backbone of EGI. Finally, we highlight open challenges, such as safety guarantees, efficient training, and constrained deployment, and outline future research directions. This survey provides both a conceptual foundation and a practical roadmap for realizing the next generation of intelligent, autonomous edge systems.

[LG-23] SYNAPSE-G: Bridging Large Language Models and Graph Learning for Rare Event Classification

链接: https://arxiv.org/abs/2508.09544
作者: Sasan Tavakkol,Lin Chen,Max Springer,Abigail Schantz,Blaž Bratanič,Vincent Cohen-Addad,MohammadHossein Bateni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scarcity of labeled data, especially for rare events, hinders training effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline leveraging Large Language Models (LLMs) to generate synthetic training data for rare event classification, addressing the cold-start problem. This synthetic data serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. This identifies candidate positive examples, subsequently labeled by an oracle (human or LLM). The expanded dataset then trains/fine-tunes a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G’s effectiveness in finding positive labels, outperforming baselines including nearest neighbor search.
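
The expansion step, seeding a similarity graph with synthetic positives and propagating labels to surface candidates for the oracle, maps onto off-the-shelf semi-supervised tools. A toy sketch with scikit-learn's LabelSpreading; the embeddings are synthetic, and the handful of labeled negatives is an assumption of this sketch, not necessarily the paper's setup:

```python
# Sketch: expand rare-event positives via label propagation on a kNN graph.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(5)
unlabeled = rng.normal(0, 1, size=(500, 8))          # large unlabeled pool
seeds = rng.normal(3, 0.3, size=(10, 8))             # LLM-generated positive seeds

X = np.vstack([seeds, unlabeled])
y = np.full(len(X), -1)                              # -1 marks unlabeled points
y[: len(seeds)] = 1                                  # synthetic positives
y[len(seeds) : len(seeds) + 10] = 0                  # assume a few known negatives

prop = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
pos_prob = prop.label_distributions_[:, 1].copy()    # positive probability
pos_prob[y != -1] = 0.0                              # skip already-labeled points
candidates = np.argsort(-pos_prob)[:20]              # send top-20 to the oracle
print(candidates)
```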

[LG-24] Emergence of Hierarchies in Multi-Agent Self-Organizing Systems Pursuing a Joint Objective

链接: https://arxiv.org/abs/2508.09541
作者: Gang Chen,Guoxin Wang,Anton van Beek,Zhenjun Ming,Yan Yan
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 34 pages,17 figures

点击查看摘要

Abstract:Multi-agent self-organizing systems (MASOS) exhibit key characteristics including scalability, adaptability, flexibility, and robustness, which have contributed to their extensive application across various fields. However, the self-organizing nature of MASOS also introduces elements of unpredictability in their emergent behaviors. This paper focuses on the emergence of dependency hierarchies during task execution, aiming to understand how such hierarchies arise from agents’ collective pursuit of the joint objective, how they evolve dynamically, and what factors govern their development. To investigate this phenomenon, multi-agent reinforcement learning (MARL) is employed to train MASOS for a collaborative box-pushing task. By calculating the gradients of each agent’s actions in relation to the states of other agents, the inter-agent dependencies are quantified, and the emergence of hierarchies is analyzed through the aggregation of these dependencies. Our results demonstrate that hierarchies emerge dynamically as agents work towards a joint objective, with these hierarchies evolving in response to changing task requirements. Notably, these dependency hierarchies emerge organically in response to the shared objective, rather than being a consequence of pre-configured rules or parameters that can be fine-tuned to achieve specific results. Furthermore, the emergence of hierarchies is influenced by the task environment and network initialization conditions. Additionally, hierarchies in MASOS emerge from the dynamic interplay between agents’ “Talent” and “Effort” within the “Environment.” “Talent” determines an agent’s initial influence on collective decision-making, while continuous “Effort” within the “Environment” enables agents to shift their roles and positions within the system.

[LG-25] Time-Aware and Transition-Semantic Graph Neural Networks for Interpretable Predictive Business Process Monitoring

链接: https://arxiv.org/abs/2508.09527
作者: Fang Wang,Ernesto Damiani
类目: Machine Learning (cs.LG)
*备注: 32 pages

点击查看摘要

Abstract:Predictive Business Process Monitoring (PBPM) aims to forecast future events in ongoing cases based on historical event logs. While Graph Neural Networks (GNNs) are well suited to capture structural dependencies in process data, existing GNN-based PBPM models remain underdeveloped. Most rely either on short prefix subgraphs or global architectures that overlook temporal relevance and transition semantics. We propose a unified, interpretable GNN framework that advances the state of the art along three key axes. First, we compare prefix-based Graph Convolutional Networks (GCNs) and full-trace Graph Attention Networks (GATs) to quantify the performance gap between localized and global modeling. Second, we introduce a novel time decay attention mechanism that constructs dynamic, prediction-centered windows, emphasizing temporally relevant history and suppressing noise. Third, we embed transition type semantics into edge features to enable fine grained reasoning over structurally ambiguous traces. Our architecture includes multilevel interpretability modules, offering diverse visualizations of attention behavior. Evaluated on five benchmarks, the proposed models achieve competitive Top-k accuracy and DL scores without per-dataset tuning. By addressing architectural, temporal, and semantic gaps, this work presents a robust, generalizable, and explainable solution for next event prediction in PBPM.
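
One way to read the time decay attention idea: subtract a penalty proportional to elapsed time from the attention scores before the softmax, so that history far from the prediction point is suppressed. A minimal sketch; the decay form and λ are assumptions of this illustration, not the paper's exact mechanism:

```python
# Sketch: attention scores penalized by elapsed time before softmax.
import torch

def time_decay_attention(q, k, v, timestamps, t_now, lam=0.1):
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5  # (1, L)
    decay = lam * (t_now - timestamps)       # older events get larger penalties
    weights = torch.softmax(scores - decay, dim=-1)
    return weights @ v

L, d = 12, 16
q, k, v = torch.randn(1, d), torch.randn(L, d), torch.randn(L, d)
timestamps = torch.arange(L, dtype=torch.float32)            # event times
out = time_decay_attention(q, k, v, timestamps, t_now=torch.tensor(12.0))
print(out.shape)  # torch.Size([1, 16])
```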

[LG-26] Enhancing Memory Recall in LLMs with Gauss-Tin: A Hybrid Instructional and Gaussian Replay Approach

链接: https://arxiv.org/abs/2508.09510
作者: Iing Muttakhiroh,Thomas Fevens
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the significant advancements in Large Language Models (LLMs), catastrophic forgetting remains a substantial challenge, where models lose previously acquired knowledge upon learning new information. Continual learning (CL) strategies have emerged as a potential solution to this problem, with replay-based techniques demonstrating superior performance in preserving learned knowledge. In this context, we introduce Gauss-Tin, a novel approach that integrates the replay strategy with a Gaussian mixture model to enhance the quality of sample selection during training, supplemented by instructional guidance to facilitate the generation of past learning. This method aims to improve LLMs’ retention capabilities by strategically reinforcing important past learnings while accommodating new information. Our experimental results indicate a promising 6% improvement in retention metrics over traditional methods, suggesting that Gauss-Tin is an effective strategy for mitigating catastrophic forgetting in LLMs. This study underscores the potential of hybrid models in enhancing the robustness and adaptability of LLMs in dynamic learning environments.
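
The Gaussian-mixture half of Gauss-Tin, scoring stored examples with a GMM so that replay draws representative rather than arbitrary samples, can be sketched with scikit-learn. The instructional-guidance half is omitted, and the selection rule below (top responsibility per component) is one plausible reading, not the paper's exact criterion:

```python
# Sketch: GMM-guided selection of replay samples from a memory buffer.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
buffer_embeddings = rng.normal(size=(2000, 32))      # embeddings of past examples

gmm = GaussianMixture(n_components=8, random_state=0).fit(buffer_embeddings)
resp = gmm.predict_proba(buffer_embeddings)          # responsibilities (N, K)

# Pick the most representative examples of each mixture component.
replay_idx = np.concatenate(
    [np.argsort(-resp[:, k])[:16] for k in range(gmm.n_components)]
)
print(replay_idx.shape)  # 8 components x 16 samples = 128 replay examples
```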

[LG-27] Causal Graph Profiling via Structural Divergence for Robust Anomaly Detection in Cyber-Physical Systems KDD

链接: https://arxiv.org/abs/2508.09504
作者: Arun Vignesh Malarkkan,Haoyue Bai,Dongjie Wang,Yanjie Fu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 7 Pages, 5 figures, Submission for ACM TKDD

点击查看摘要

Abstract:With the growing complexity of cyberattacks targeting critical infrastructures such as water treatment networks, there is a pressing need for robust anomaly detection strategies that account for both system vulnerabilities and evolving attack patterns. Traditional methods – statistical, density-based, and graph-based models – struggle with distribution shifts and class imbalance in multivariate time series, often leading to high false positive rates. To address these challenges, we propose CGAD, a Causal Graph-based Anomaly Detection framework designed for reliable cyberattack detection in public infrastructure systems. CGAD follows a two-phase supervised framework – causal profiling and anomaly scoring. First, it learns causal invariant graph structures representing the system’s behavior under “Normal” and “Attack” states using Dynamic Bayesian Networks. Second, it employs structural divergence to detect anomalies via causal graph comparison by evaluating topological deviations in causal graphs over time. By leveraging causal structures, CGAD achieves superior adaptability and accuracy in non-stationary and imbalanced time series environments compared to conventional machine learning approaches. By uncovering causal structures beneath volatile sensor data, our framework not only detects cyberattacks with markedly higher precision but also redefines robustness in anomaly detection, proving resilience where traditional models falter under imbalance and drift. Our framework achieves substantial gains in F1 and ROC-AUC scores over best-performing baselines across four industrial datasets, demonstrating robust detection of delayed and structurally complex anomalies.

[LG-28] MiCo: End-to-End Mixed Precision Neural Network Co-Exploration Framework for Edge AI

链接: https://arxiv.org/abs/2508.09500
作者: Zijun Jiang,Yangdi Lyu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 9 pages, 6 figures, accepted by ICCAD’25

点击查看摘要

Abstract:Quantized Neural Networks (QNN) with extremely low-bitwidth data have proven promising in efficient storage and computation on edge devices. To further reduce the accuracy drop while increasing speedup, layer-wise mixed-precision quantization (MPQ) becomes a popular solution. However, existing algorithms for exploring MPQ schemes are limited in flexibility and efficiency. Comprehending the complex impacts of different MPQ schemes on post-training quantization and quantization-aware training results is a challenge for conventional methods. Furthermore, an end-to-end framework for the optimization and deployment of MPQ models is missing in existing work. In this paper, we propose the MiCo framework, a holistic MPQ exploration and deployment framework for edge AI applications. The framework adopts a novel optimization algorithm to search for optimal quantization schemes with the highest accuracies while meeting latency constraints. Hardware-aware latency models are built for different hardware targets to enable fast explorations. After the exploration, the framework enables direct deployment from PyTorch MPQ models to bare-metal C codes, leading to end-to-end speedup with minimal accuracy drops.

[LG-29] EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models

链接: https://arxiv.org/abs/2508.09471
作者: Omar Bazarbachi,Zijun Sun,Yanning Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become more widely adopted and scale up in size, the computational and memory challenges involved in deploying these massive foundation models have grown increasingly severe. This underscores the urgent need to develop more efficient model variants. Faced with this challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided Structured Post-training Pruning method. The proposed approach leverages graph theory to guide the design of N:M structured pruning, effectively reducing model size and computational demands. By incorporating concepts from expander graphs, EGGS-PTP ensures information flow within the pruned network, preserving essential model functionality. Extensive numerical experiments demonstrate that EGGS-PTP not only achieves significant acceleration and memory savings due to structured sparsity but also outperforms existing structured pruning techniques in terms of accuracy across various LLMs.
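
N:M structured sparsity itself, independent of the expander-graph guidance EGGS-PTP adds on top, keeps N weights out of every M consecutive ones. A plain magnitude-based 2:4 mask in PyTorch:

```python
# Sketch: 2:4 structured pruning -- keep the 2 largest-magnitude weights per group of 4.
import torch

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)                        # assumes numel divisible by 4
    keep = groups.abs().topk(k=2, dim=1).indices     # top-2 magnitudes per group
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return mask.reshape(w.shape)

w = torch.randn(8, 8)
mask = two_four_mask(w)
print(mask.reshape(-1, 4).sum(dim=1))  # every group of 4 keeps exactly 2 weights
w_pruned = w * mask
```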

[LG-30] Learn to Explore: Meta NAS via Bayesian Optimization Guided Graph Generation

链接: https://arxiv.org/abs/2508.09467
作者: Zijun Sun,Yanning Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) automates the design of high-performing neural networks but typically targets a single predefined task, thereby restricting its real-world applicability. To address this, Meta Neural Architecture Search (Meta-NAS) has emerged as a promising paradigm that leverages prior knowledge across tasks to enable rapid adaptation to new ones. Nevertheless, existing Meta-NAS methods often struggle with poor generalization, limited search spaces, or high computational costs. In this paper, we propose a novel Meta-NAS framework, GraB-NAS. Specifically, GraB-NAS first models neural architectures as graphs, and then a hybrid search strategy is developed to find and generate new graphs that lead to promising neural architectures. The search strategy combines global architecture search via Bayesian Optimization in the search space with local exploration for novel neural networks via gradient ascent in the latent space. Such a hybrid search strategy allows GraB-NAS to discover task-aware architectures with strong performance, even beyond the predefined search space. Extensive experiments demonstrate that GraB-NAS outperforms state-of-the-art Meta-NAS baselines, achieving better generalization and search effectiveness.

[LG-31] Open-Set Fault Diagnosis in Multimode Processes via Fine-Grained Deep Feature Representation

链接: https://arxiv.org/abs/2508.09462
作者: Guangqiang Li,M. Amine Atoui,Xiangshun Li
类目: Machine Learning (cs.LG)
*备注: 34 pages, 12 figures

点击查看摘要

Abstract:A reliable fault diagnosis system should not only accurately classify known health states but also effectively identify unknown faults. In multimode processes, samples belonging to the same health state often show multiple cluster distributions, making it difficult to construct compact and accurate decision boundaries for that state. To address this challenge, a novel open-set fault diagnosis model named fine-grained clustering and rejection network (FGCRN) is proposed. It combines multiscale depthwise convolution, bidirectional gated recurrent unit and temporal attention mechanism to capture discriminative features. A distance-based loss function is designed to enhance the intra-class compactness. Fine-grained feature representations are constructed through unsupervised learning to uncover the intrinsic structures of each health state. Extreme value theory is employed to model the distance between sample features and their corresponding fine-grained representations, enabling effective identification of unknown faults. Extensive experiments demonstrate the superior performance of the proposed method.
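
The rejection step, fitting an extreme value model to the distances between sample features and their fine-grained representations and flagging extreme-tail samples as unknown faults, can be sketched with scipy's generalized Pareto distribution, a common peaks-over-threshold choice. The paper's exact EVT model and thresholds may differ:

```python
# Sketch: EVT-based unknown-fault rejection from feature-to-representation distances.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(9)
train_dist = np.abs(rng.normal(0, 1, 5000))          # distances for known states

u = np.quantile(train_dist, 0.95)                    # peaks-over-threshold cutoff
excesses = train_dist[train_dist > u] - u
c, loc, scale = genpareto.fit(excesses, floc=0.0)    # tail model

def is_unknown(d: float, alpha: float = 1e-3) -> bool:
    if d <= u:
        return False                                 # well inside known behavior
    tail_prob = 0.05 * genpareto.sf(d - u, c, loc=loc, scale=scale)
    return tail_prob < alpha                         # extreme => reject as unknown

print(is_unknown(1.5), is_unknown(6.0))              # likely False, True
```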

[LG-32] NEXICA: Discovering Road Traffic Causality (Extended arXiv Version)

链接: https://arxiv.org/abs/2508.09447
作者: Siddharth Srikanth,John Krumm,Jonathan Qin
类目: Machine Learning (cs.LG)
*备注: Extended version of short paper in 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2024)

点击查看摘要

Abstract:Road traffic congestion is a persistent problem. Focusing resources on the causes of congestion is a potentially efficient strategy for reducing slowdowns. We present NEXICA, an algorithm to discover which parts of the highway system tend to cause slowdowns on other parts of the highway. We use time series of road speeds as inputs to our causal discovery algorithm. Finding other algorithms inadequate, we develop a new approach that is novel in three ways. First, it concentrates on just the presence or absence of events in the time series, where an event indicates the temporal beginning of a traffic slowdown. Second, we develop a probabilistic model using maximum likelihood estimation to compute the probabilities of spontaneous and caused slowdowns between two locations on the highway. Third, we train a binary classifier to identify pairs of cause/effect locations trained on pairs of road locations where we are reasonably certain a priori of their causal connections, both positive and negative. We test our approach on six months of road speed data from 195 different highway speed sensors in the Los Angeles area, showing that our approach is superior to state-of-the-art baselines in both accuracy and computation speed.

[LG-33] Graph Neural Network and Transformer Integration for Unsupervised System Anomaly Discovery

链接: https://arxiv.org/abs/2508.09401
作者: Yun Zi,Ming Gong,Zhihao Xue,Yujun Zou,Nia Qi,Yingnan Deng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes an unsupervised anomaly detection method for distributed backend service systems, addressing practical challenges such as complex structural dependencies, diverse behavioral evolution, and the absence of labeled data. The method constructs a dynamic graph based on service invocation relationships and applies graph convolution to extract high-order structural representations from multi-hop topologies. A Transformer is used to model the temporal behavior of each node, capturing long-term dependencies and local fluctuations. During the feature fusion stage, a learnable joint embedding mechanism integrates structural and behavioral representations into a unified anomaly vector. A nonlinear mapping is then applied to compute anomaly scores, enabling an end-to-end detection process without supervision. Experiments on real-world cloud monitoring data include sensitivity analyses across different graph depths, sequence lengths, and data perturbations. Results show that the proposed method outperforms existing models on several key metrics, demonstrating stronger expressiveness and stability in capturing anomaly propagation paths and modeling dynamic behavior sequences, with high potential for practical deployment.

[LG-34] Integrating Feature Attention and Temporal Modeling for Collaborative Financial Risk Assessment

链接: https://arxiv.org/abs/2508.09399
作者: Yue Yao,Zhen Xu,Youzhu Liu,Kunyuan Ma,Yuxiu Lin,Mohan Jiang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This paper addresses the challenges of data privacy and collaborative modeling in cross-institution financial risk analysis. It proposes a risk assessment framework based on federated learning. Without sharing raw data, the method enables joint modeling and risk identification across multiple institutions. This is achieved by incorporating a feature attention mechanism and temporal modeling structure. Specifically, the model adopts a distributed optimization strategy. Each financial institution trains a local sub-model. The model parameters are protected using differential privacy and noise injection before being uploaded. A central server then aggregates these parameters to generate a global model. This global model is used for systemic risk identification. To validate the effectiveness of the proposed method, multiple experiments are conducted. These evaluate communication efficiency, model accuracy, systemic risk detection, and cross-market generalization. The results show that the proposed model outperforms both traditional centralized methods and existing federated learning variants across all evaluation metrics. It demonstrates strong modeling capabilities and practical value in sensitive financial environments. The method enhances the scope and efficiency of risk identification while preserving data sovereignty. It offers a secure and efficient solution for intelligent financial risk analysis.
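
The protection mechanism described, clipping each institution's local update, adding calibrated noise, and aggregating centrally, is standard differentially private federated averaging. A toy single-round sketch; the clip norm and noise scale are illustrative, not the paper's settings:

```python
# Sketch: one round of federated averaging with clipped, noised client updates.
import numpy as np

rng = np.random.default_rng(7)
global_weights = np.zeros(100)
clip_norm, noise_sigma = 1.0, 0.1

client_updates = []
for _ in range(5):                                   # five institutions
    update = rng.normal(0, 0.5, size=100)            # stand-in for local training
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip_norm / norm)     # clip to bound sensitivity
    update += rng.normal(0, noise_sigma * clip_norm, size=100)  # DP noise
    client_updates.append(update)

global_weights += np.mean(client_updates, axis=0)    # server-side aggregation
print(np.linalg.norm(global_weights))
```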

[LG-35] Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

链接: https://arxiv.org/abs/2508.09363
作者: Charles O’Neill,Mudith Jayasekara,Max Kirkby
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear “dark matter” in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and human evaluations confirm that learned features align with clinically meaningful concepts (e.g., “taste sensations” or “infectious mononucleosis”), rather than frequent but uninformative tokens. These domain-specific SAEs capture relevant linear structure, leaving a smaller, more purely nonlinear residual. We conclude that domain-confinement mitigates key limitations of broad-domain SAEs, enabling more complete and interpretable latent decompositions, and suggesting the field may need to question “foundation-model” scaling for general-purpose SAEs.
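
For readers unfamiliar with the JumpReLU SAE used here: its activation zeroes out latents below a per-latent learned threshold θ and passes values above it unchanged. A compact forward-pass sketch (dimensions arbitrary; training the thresholds requires straight-through estimators, omitted here):

```python
# Sketch of a JumpReLU sparse autoencoder forward pass.
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int = 64, d_sae: int = 256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # learned thresholds

    def forward(self, x: torch.Tensor):
        pre = self.enc(x)
        theta = self.log_theta.exp()
        z = pre * (pre > theta)          # JumpReLU: pass value iff above threshold
        return self.dec(z), z            # reconstruction and sparse latents

sae = JumpReLUSAE()
x_hat, z = sae(torch.randn(10, 64))
print((z != 0).float().mean())           # fraction of active latents
```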

[LG-36] Teaching Code Refactoring Using LLMs

链接: https://arxiv.org/abs/2508.09332
作者: Anshul Khairnar,Aarya Rajoju,Edward F. Gehringer
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted for presentation at the Frontiers in Education Conference, Nashville, Tennessee, USA, 2-5 November 2025

点击查看摘要

Abstract:This Innovative Practice full paper explores how Large Language Models (LLMs) can enhance the teaching of code refactoring in software engineering courses through real-time, context-aware feedback. Refactoring improves code quality but is difficult to teach, especially with complex, real-world codebases. Traditional methods like code reviews and static analysis tools offer limited, inconsistent feedback. Our approach integrates LLM-assisted refactoring into a course project using structured prompts to help students identify and address code smells such as long methods and low cohesion. Implemented in Spring 2025 in a long-lived OSS project, the intervention is evaluated through student feedback and planned analysis of code quality improvements. Findings suggest that LLMs can bridge theoretical and practical learning, supporting a deeper understanding of maintainability and refactoring principles.

[LG-37] Distilling Reinforcement Learning into Single-Batch Datasets ECAI2025

链接: https://arxiv.org/abs/2508.09283
作者: Connor Wilhelm,Dan Ventura
类目: Machine Learning (cs.LG)
*备注: to be published in ECAI 2025 (appendix in arXiv version only), 11 pages (7 content, 4 appendix), 6 figures

点击查看摘要

Abstract:Dataset distillation compresses a large dataset into a small synthetic dataset such that learning on the synthetic dataset approximates learning on the original. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation’s ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of a multi-dimensional extension of the classic cart-pole problem, all MuJoCo environments, and several Atari games. We demonstrate distillation’s ability to compress complex RL environments into one-step supervised learning, explore RL distillation’s generalizability across learner architectures, and demonstrate distilling an environment into the smallest-possible synthetic dataset.

[LG-38] Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning

链接: https://arxiv.org/abs/2508.09281
作者: Muntasir Hoq,Griffin Pitts,Andrew Lan,Peter Brusilovsky,Bita Akram
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.

[LG-39] Constrained Black-Box Attacks Against Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2508.09275
作者: Amine Andam,Jamal Bentahar,Mustapha Hedabou
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Under review in TNNLS

点击查看摘要

Abstract:Collaborative multi-agent reinforcement learning (c-MARL) has rapidly evolved, offering state-of-the-art algorithms for real-world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training-time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more realistic and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all. We propose simple yet highly effective algorithms for generating adversarial perturbations designed to misalign how victim agents perceive their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample-efficient, requiring only 1,000 samples compared to the millions needed by previous methods.

[LG-40] Over-Squashing in GNNs and Causal Inference of Rewiring Strategies

链接: https://arxiv.org/abs/2508.09265
作者: Danial Saber,Amirali Salehi-Abari
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 14 pages, 2 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) have exhibited state-of-the-art performance across wide-range of domains such as recommender systems, material design, and drug repurposing. Yet message-passing GNNs suffer from over-squashing – exponential compression of long-range information from distant nodes – which limits expressivity. Rewiring techniques can ease this bottleneck; but their practical impacts are unclear due to the lack of a direct empirical over-squashing metric. We propose a rigorous, topology-focused method for assessing over-squashing between node pairs using the decay rate of their mutual sensitivity. We then extend these pairwise assessments to four graph-level statistics (prevalence, intensity, variability, extremity). Coupling these metrics with a within-graph causal design, we quantify how rewiring strategies affect over-squashing on diverse graph- and node-classification benchmarks. Our extensive empirical analyses show that most graph classification datasets suffer from over-squashing (but to various extents), and rewiring effectively mitigates it – though the degree of mitigation, and its translation into performance gains, varies by dataset and method. We also found that over-squashing is less notable in node classification datasets, where rewiring often increases over-squashing, and performance variations are uncorrelated with over-squashing changes. These findings suggest that rewiring is most beneficial when over-squashing is both substantial and corrected with restraint – while overly aggressive rewiring, or rewiring applied to minimally over-squashed graphs, is unlikely to help and may even harm performance. Our plug-and-play diagnostic tool lets practitioners decide – before any training – whether rewiring is likely to pay off.

[LG-41] LLM Empowered Prototype Learning for Zero and Few-Shot Tasks on Tabular Data

链接: https://arxiv.org/abs/2508.09263
作者: Peng Wang,Dongsheng Wang,He Zhao,Hangting Ye,Dandan Guo,Yi Chang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in large language models (LLMs) have opened the door to in-depth investigation of their potential in tabular data modeling. However, effectively utilizing advanced LLMs in few-shot and even zero-shot scenarios is still challenging. To this end, we propose a novel LLM-based prototype estimation framework for tabular learning. Our key idea is to query the LLM to generate feature values from an example-free prompt, which relies solely on task and feature descriptions. With the feature values generated by the LLM, we can build a zero-shot prototype in a training-free manner, which can be further enhanced by fusing few-shot samples, avoiding training a classifier or finetuning the LLMs. Thanks to the example-free prompt and prototype estimation, our method bypasses the constraints of example-based prompts, providing a scalable and robust framework. Extensive experiments demonstrate the effectiveness of our method in zero- and few-shot tabular learning.
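
Once the LLM has produced per-class feature values, the training-free prototype idea reduces to simple vector arithmetic: average the generated rows into a class prototype, optionally blend in few-shot means, and classify by nearest prototype. A numpy sketch in which the LLM output is mocked and the fusion weight is an arbitrary choice:

```python
# Sketch: zero/few-shot tabular classification via LLM-estimated prototypes.
import numpy as np

rng = np.random.default_rng(8)
n_features = 6

# Stand-in for LLM output: several generated feature rows per class.
llm_rows = {0: rng.normal(0, 1, (5, n_features)),
            1: rng.normal(2, 1, (5, n_features))}
prototypes = {c: rows.mean(axis=0) for c, rows in llm_rows.items()}

# Optional: fuse a few real labeled samples into the zero-shot prototype.
few_shot = {1: rng.normal(2.2, 1, (3, n_features))}
alpha = 0.5
for c, rows in few_shot.items():
    prototypes[c] = alpha * prototypes[c] + (1 - alpha) * rows.mean(axis=0)

def predict(x: np.ndarray) -> int:
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

print(predict(rng.normal(2, 1, n_features)))  # likely class 1
```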

[LG-42] Blockchain Network Analysis using Quantum Inspired Graph Neural Networks Ensemble Models

链接: https://arxiv.org/abs/2508.09237
作者: Luigi D’Amico,Daniel De Rosso,Ninad Dixit,Raul Salles de Padua,Samuel Palmer,Samuel Mugel,Román Orús,Holger Eble,Ali Abedi
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:In the rapidly evolving domain of financial technology, the detection of illicit transactions within blockchain networks remains a critical challenge, necessitating robust and innovative solutions. This work proposes a novel approach that combines Quantum Inspired Graph Neural Networks (QI-GNN) with a flexible choice of ensemble model, using either QBoost or a classical model such as a Random Forest classifier. This system is tailored specifically for blockchain network analysis in anti-money laundering (AML) efforts. Our system design incorporates a novel component, a Canonical Polyadic (CP) decomposition layer within the graph neural network framework, enhancing its capability to process and analyze complex data structures efficiently. Our technical approach has undergone rigorous evaluation against classical machine learning implementations, achieving an F2 score of 74.8% in detecting fraudulent transactions. These results highlight the potential of quantum-inspired techniques, supplemented by the structural advancements of the CP layer, to not only match but potentially exceed traditional methods in complex network analysis for financial security. The findings advocate for a broader adoption and further exploration of quantum-inspired algorithms within the financial sector to effectively combat fraud.

[LG-43] The First Differentiable Transfer-Based Algorithm for Discrete MicroLED Repair MICRO

链接: https://arxiv.org/abs/2508.09206
作者: Ning-Yuan Lue
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 15 pages, 7 figures. Presents a differentiable optimization method for laser-enabled MicroLED repair planning, modeling discrete stage shifts in a manufacturing physics context. Includes loss landscape and gradient analyses, with large-array simulation results

点击查看摘要

Abstract:Laser-enabled selective transfer, a key process in high-throughput microLED fabrication, requires computational models that can plan shift sequences to minimize motion of XY stages and adapt to varying optimization objectives across the substrate. We propose the first repair algorithm based on a differentiable transfer module designed to model discrete shifts of transfer platforms, while remaining trainable via gradient-based optimization. Compared to local proximity searching algorithms, our approach achieves superior repair performance and enables more flexible objective designs, such as minimizing the number of steps. Unlike reinforcement learning (RL)-based approaches, our method eliminates the need for handcrafted feature extractors and trains significantly faster, allowing scalability to large arrays. Experiments show a 50% reduction in transfer steps and sub-2-minute planning time on 2000x2000 arrays. This method provides a practical and adaptable solution for accelerating microLED repair in AR/VR and next-generation display fabrication.

[LG-44] Building Safer Sites: A Large-Scale Multi-Level Dataset for Construction Safety Research CIKM2025

链接: https://arxiv.org/abs/2508.09203
作者: Zhenhui Ou,Dawei Li,Zhen Tan,Wenlin Li,Huan Liu,Siyuan Song
类目: Machine Learning (cs.LG)
*备注: The paper was accepted at CIKM 2025

点击查看摘要

Abstract:Construction safety research is a critical field in civil engineering, aiming to mitigate risks and prevent injuries through the analysis of site conditions and human factors. However, the limited volume and lack of diversity in existing construction safety datasets pose significant challenges to conducting in-depth analyses. To address this research gap, this paper introduces the Construction Safety Dataset (CSDataset), a well-organized, comprehensive multi-level dataset that encompasses incident, inspection, and violation records sourced from the Occupational Safety and Health Administration (OSHA). This dataset uniquely integrates structured attributes with unstructured narratives, facilitating a wide range of approaches driven by machine learning and large language models. We also conduct preliminary benchmarking of various approaches and cross-level analyses using our dataset, offering insights to inform and enhance future efforts in construction safety. For example, we found that complaint-driven inspections were associated with a 17.3% reduction in the likelihood of subsequent incidents. Our dataset and code are released at this https URL.

[LG-45] Breath as a biomarker: A survey of contact and contactless applications and approaches in respiratory monitoring

链接: https://arxiv.org/abs/2508.09187
作者: Almustapha A. Wakili,Babajide J. Asaju,Woosub Jung
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Breath analysis has emerged as a critical tool in health monitoring, offering insights into respiratory function, disease detection, and continuous health assessment. While traditional contact-based methods are reliable, they often pose challenges in comfort and practicality, particularly for long-term monitoring. This survey comprehensively examines contact-based and contactless approaches, emphasizing recent advances in machine learning and deep learning techniques applied to breath analysis. Contactless methods, including Wi-Fi Channel State Information and acoustic sensing, are analyzed for their ability to provide accurate, noninvasive respiratory monitoring. We explore a broad range of applications, from single-user respiratory rate detection to multi-user scenarios, user identification, and respiratory disease detection. Furthermore, this survey details essential data preprocessing, feature extraction, and classification techniques, offering comparative insights into machine learning/deep learning models suited to each approach. Key challenges like dataset scarcity, multi-user interference, and data privacy are also discussed, along with emerging trends like Explainable AI, federated learning, transfer learning, and hybrid modeling. By synthesizing current methodologies and identifying open research directions, this survey offers a comprehensive framework to guide future innovations in breath analysis, bridging advanced technological capabilities with practical healthcare applications.

[LG-46] Generating Feasible and Diverse Synthetic Populations Using Diffusion Models

链接: https://arxiv.org/abs/2508.09164
作者: Min Tang,Peng Lu,Qing Feng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. It is a fundamental problem in agent-based modeling (ABM), which has become the standard approach to analyzing intelligent transportation systems. The synthetic population serves as the primary input for ABM transportation simulation, with traveling agents represented by population members. However, when the number of attributes describing agents becomes large, survey data often cannot densely support the joint distribution of the attributes in the population due to the curse of dimensionality. This sparsity makes it difficult to accurately model and produce the population. Interestingly, deep generative models trained from available sample data can potentially synthesize attribute combinations that are present in the actual population but do not exist in the sample data (called sampling zeros). Nevertheless, this comes at the cost of falsely generating infeasible attribute combinations that do not exist in the population (called structural zeros). In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population. This approach enables the recovery of numerous missing sampling zeros while keeping the generated structural zeros minimal. Our method is compared with other recently proposed approaches such as Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) approaches, which have shown success in high-dimensional tabular population synthesis. We assess the performance of the synthesized outputs using a range of metrics, including marginal distribution similarity, feasibility, and diversity. The results demonstrate that our proposed method outperforms previous approaches in achieving a better balance between the feasibility and diversity of the synthesized population.
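The sampling-zero/structural-zero vocabulary suggests simple evaluation metrics. The sketch below assumes the full population support is available for evaluation; the exact metric definitions in the paper may differ.

```python
import numpy as np

def synthesis_metrics(generated, sample, population):
    """Score synthetic attribute combinations (rows of integer-coded arrays).
    Feasibility penalizes structural zeros; recovery rewards sampling zeros found."""
    gen = set(map(tuple, np.asarray(generated)))
    smp = set(map(tuple, np.asarray(sample)))
    pop = set(map(tuple, np.asarray(population)))
    sampling_zeros = pop - smp        # real combinations absent from the sample
    structural_zeros = gen - pop      # generated combinations that are infeasible
    return {
        "feasibility": 1 - len(structural_zeros) / max(len(gen), 1),
        "sampling_zero_recovery": len(gen & sampling_zeros) / max(len(sampling_zeros), 1),
    }
```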

[LG-47] An Unsupervised Deep XAI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals

链接: https://arxiv.org/abs/2508.09162
作者: Konstantinos Vasili,Zachery T. Dahm,William Richards,Stylianos Chatzidakis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Next generation advanced nuclear reactors are expected to be smaller in both size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, conveying simultaneously various nonlinear cyber-physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and is a requirement for safe and reliable operation. Current efforts to address replay attacks almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback, or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear time-invariant state-space models, which may fail to capture unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on the characterization of replay attacks and the explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of an autoencoder and a customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., their detection, source identification, timing, and type, of increasing complexity during a dynamic time-evolving reactor process. The proposed XAI framework was benchmarked on several real-world datasets from Purdue's nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.
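As a rough illustration of the detection stage, here is a reconstruction-error autoencoder over sliding windows of multivariate reactor signals; the architecture and window layout are assumptions, and the paper's windowSHAP attribution step is only gestured at in the comments.

```python
import torch
import torch.nn as nn

class WindowAE(nn.Module):
    """Autoencoder over flattened (n_signals x window) slices of sensor data."""
    def __init__(self, n_signals, window, hidden=32):
        super().__init__()
        d = n_signals * window
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, d)

    def forward(self, x):
        return self.dec(self.enc(x))

def per_signal_error(model, windows, n_signals, window):
    """Mean reconstruction error per channel (assumes signal-major flattening):
    unusually high error flags a candidate replayed signal. The paper localizes
    source, timing, and type with a customized windowSHAP attribution on top
    of scores like these."""
    with torch.no_grad():
        recon = model(windows)
    err = (windows - recon) ** 2
    return err.view(-1, n_signals, window).mean(dim=2)
```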

[LG-48] Presenting DiaData for Research on Type 1 Diabetes

链接: https://arxiv.org/abs/2508.09160
作者: Beyza Cinar,Maria Maleshkova
类目: Machine Learning (cs.LG); Databases (cs.DB); Quantitative Methods (q-bio.QM)
*备注: 11 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Type 1 diabetes (T1D) is an autoimmune disorder that leads to the destruction of insulin-producing cells, resulting in insulin deficiency, which is why affected individuals depend on external insulin injections. However, insulin lowers blood glucose levels and can cause hypoglycemia. Hypoglycemia is a severe event of low blood glucose ( \le 70 mg/dL) with dangerous side effects such as dizziness, coma, or death. Data analysis can significantly enhance diabetes care by identifying personal patterns and trends leading to adverse events. In particular, machine learning (ML) models can predict glucose levels and provide early alarms. However, diabetes and hypoglycemia research is limited by the unavailability of large datasets. Thus, this work systematically integrates 15 datasets to provide a large database of 2510 subjects with glucose measurements recorded every 5 minutes. In total, 149 million measurements are included, of which 4% represent values in the hypoglycemic range. Moreover, two sub-databases are extracted. Sub-database I includes demographics, and sub-database II includes heart rate data. The integrated dataset provides an equal distribution of sex and different age levels. As a further contribution, data quality is assessed, revealing that data imbalance and missing values present a significant challenge. Moreover, a correlation study on glucose levels and heart rate data is conducted, showing a relationship between the two signals 15 to 55 minutes before hypoglycemia.
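For readers who want to work with series like these, a minimal sketch of hypoglycemia labeling on a 5-minute CGM signal follows; the three-sample minimum episode length is an assumed noise-filtering heuristic, not a rule from the dataset.

```python
import numpy as np

def hypoglycemia_mask(glucose_mg_dl, threshold=70.0):
    # Flag samples at or below the 70 mg/dL hypoglycemia threshold.
    return np.asarray(glucose_mg_dl, dtype=float) <= threshold

def episode_spans(mask, min_len=3):
    """Group consecutive flagged samples into episodes; min_len=3 samples
    (~15 minutes at a 5-minute cadence) filters single-point sensor noise."""
    spans, start = [], None
    for i, flagged in enumerate(mask):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_len:
        spans.append((start, len(mask)))
    return spans
```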

[LG-49] On the Generalization Limits of Quantum Generative Adversarial Networks with Pure State Generators

链接: https://arxiv.org/abs/2508.09844
作者: Jasmin Frkatovic,Akash Malemath,Ivan Kankeu,Yannick Werner,Matthias Tschöpe,Vitor Fortes Rey,Sungho Suh,Paul Lukowicz,Nikolaos Palaiodimopoulos,Maximilian Kiefer-Emmanouilidis
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:We investigate the capabilities of Quantum Generative Adversarial Networks (QGANs) in image generation tasks. Our analysis centers on fully quantum implementations of both the generator and discriminator. Through extensive numerical testing of current main architectures, we find that QGANs struggle to generalize across datasets, converging on merely the average representation of the training data. When the output of the generator is a pure state, we analytically derive a lower bound for the discriminator quality, given by the fidelity between the pure-state output of the generator and the target data distribution, thereby providing a theoretical explanation for the limitations observed in current models. Our findings reveal fundamental challenges in the generalization capabilities of existing quantum generative models. While our analysis focuses on QGANs, the results carry broader implications for the performance of related quantum generative models.

[LG-50] Improving the Speaker Anonymization Evaluations Robustness to Target Speakers with Adversarial Learning

链接: https://arxiv.org/abs/2508.09803
作者: Carlos Franzreb,Arnab Das,Tim Polzehl,Sebastian Möller
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The current privacy evaluation for speaker anonymization often overestimates privacy when a same-gender target selection algorithm (TSA) is used, although this TSA leaks the speaker’s gender and should hence be more vulnerable. We hypothesize that this occurs because the evaluation does not account for the fact that anonymized speech contains information from both the source and target speakers. To address this, we propose to add a target classifier that measures the influence of target speaker information in the evaluation, which can also be removed with adversarial learning. Experiments demonstrate that this approach is effective for multiple anonymizers, particularly when using a same-gender TSA, leading to a more reliable assessment.

[LG-51] Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA

链接: https://arxiv.org/abs/2508.09721
作者: Yuan-Hao Wei,Fu-Hao Deng,Lin-Yong Cui,Yan-Jie Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The interpretability of generative models is considered a key factor in demonstrating their effectiveness and controllability. The generated data are believed to be determined by latent variables that are not directly observable. Therefore, disentangling, decoupling, decomposing, causal inference, or performing Independent Component Analysis (ICA) in the latent variable space helps uncover the independent factors that influence the attributes or features of the generated outputs, thereby enhancing the interpretability of generative models. As generative models, Variational Autoencoders (VAEs) are combined with variational Bayesian inference algorithms. Using VAEs, the inverse process of ICA can be equivalently framed as a variational inference process. In some studies, Gaussian processes (GPs) have been introduced as priors for each dimension of the latent variables in VAEs, structuring and separating each dimension from temporal or spatial perspectives and encouraging different dimensions to control different attributes of the generated data. However, GPs impose a significant computational burden, resulting in substantial resource consumption when handling large datasets. Essentially, GPs model different temporal or spatial structures through various kernel functions. Structuring the priors of latent variables via kernel functions, so that different kernel functions model the correlations among sequence points within different latent dimensions, is at the core of achieving disentanglement in VAEs. The proposed Structured Kernel Regression VAE (SKR-VAE) leverages this core idea in a more efficient way, avoiding the costly kernel matrix inversion required by GPs. This research demonstrates that, while maintaining ICA performance, SKR-VAE achieves greater computational efficiency and significantly reduced computational burden compared to GP-VAE.
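The efficiency argument is easiest to see in code: a GP posterior needs a kernel-matrix solve, while kernel regression only needs row-normalized weights. The sketch below is a generic Nadaraya-Watson smoother over a latent sequence, not the paper's exact SKR-VAE layer.

```python
import numpy as np

def kernel_regression_smooth(z, lengthscale):
    """Smooth a latent sequence z (shape (T,) or (T, d)) with RBF kernel
    regression. This is a weighted average: O(T^2) multiplies but no
    O(T^3) kernel-matrix inversion as in a GP posterior. Different latent
    dimensions can use different lengthscales to encode distinct structures."""
    t = np.arange(len(z), dtype=float)[:, None]
    K = np.exp(-0.5 * ((t - t.T) / lengthscale) ** 2)
    W = K / K.sum(axis=1, keepdims=True)   # Nadaraya-Watson weights
    return W @ z
```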

[LG-52] Scalable h-adaptive probabilistic solver for time-independent and time-dependent systems

链接: https://arxiv.org/abs/2508.09623
作者: Akshay Thakur,Sawan Kumar,Matthew Zahr,Souvik Chakraborty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving partial differential equations (PDEs) within the framework of probabilistic numerics offers a principled approach to quantifying epistemic uncertainty arising from discretization. By leveraging Gaussian process regression and imposing the governing PDE as a constraint at a finite set of collocation points, probabilistic numerics delivers mesh-free solutions at arbitrary locations. However, the high computational cost, which scales cubically with the number of collocation points, remains a critical bottleneck, particularly for large-scale or high-dimensional problems. We propose a scalable enhancement to this paradigm through two key innovations. First, we develop a stochastic dual descent algorithm that reduces the per-iteration complexity from cubic to linear in the number of collocation points, enabling tractable inference. Second, we exploit a clustering-based active learning strategy that adaptively selects collocation points to maximize information gain while minimizing computational expense. Together, these contributions result in an h-adaptive probabilistic solver that can scale to a large number of collocation points. We demonstrate the efficacy of the proposed solver on benchmark PDEs, including two- and three-dimensional steady-state elliptic problems, as well as a time-dependent parabolic PDE formulated in a space-time setting.
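One plausible reading of the clustering-based selection step is sketched below: cluster candidate locations, then keep the highest-residual point per cluster so selections are both informative and spatially spread out. The exact criterion in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_collocation_points(candidates, residual_fn, k):
    """candidates: (N, dim) locations; residual_fn: returns the PDE residual
    at each point. Returns k points, one per cluster, maximizing |residual|
    within each cluster (an assumed selection rule for illustration)."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(candidates)
    res = np.abs(residual_fn(candidates))
    picks = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        picks.append(candidates[idx[np.argmax(res[idx])]])
    return np.stack(picks)
```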

[LG-53] DeepWKB: Learning WKB Expansions of Invariant Distributions for Stochastic Systems

链接: https://arxiv.org/abs/2508.09529
作者: Yao Li,Yicheng Liu,Shirou Wang
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: 29 pages, 7 figures

点击查看摘要

Abstract:This paper introduces a novel deep learning method, called DeepWKB, for estimating the invariant distribution of randomly perturbed systems via its Wentzel-Kramers-Brillouin (WKB) approximation u_\epsilon(x) = Q(\epsilon)^{-1} Z_\epsilon(x) \exp\{-V(x)/\epsilon\} , where V is known as the quasi-potential, \epsilon denotes the noise strength, and Q(\epsilon) is the normalization factor. By utilizing both Monte Carlo data and the partial differential equations satisfied by V and Z_\epsilon , the DeepWKB method computes V and Z_\epsilon separately. This enables an approximation of the invariant distribution in the singular regime where \epsilon is sufficiently small, which remains a significant challenge for most existing methods. Moreover, the DeepWKB method is applicable to higher-dimensional stochastic systems whose deterministic counterparts admit non-trivial attractors. In particular, it provides a scalable and flexible alternative for computing the quasi-potential, which plays a key role in the analysis of rare events, metastability, and the stochastic stability of complex systems.
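Given the two learned components, evaluating the WKB form is direct; a one-line sketch (normalization by Q(\epsilon) omitted, with V and Z standing for the trained networks):

```python
import numpy as np

def wkb_density_unnormalized(x, V, Z, eps):
    """Z_eps(x) * exp(-V(x)/eps): the separately learned prefactor and
    quasi-potential recombined into the invariant-distribution approximation."""
    return Z(x) * np.exp(-V(x) / eps)
```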

[LG-54] A pseudo-inverse of a line graph

链接: https://arxiv.org/abs/2508.09412
作者: Sevvandi Kandanaarachchi,Philip Kilby,Cheng Soon Ong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Line graphs are an alternative representation of graphs in which each edge of the original (root) graph becomes a vertex. However, not all graphs have a corresponding root graph, hence the transformation from graphs to line graphs is not invertible. We investigate the case when there is a small perturbation in the space of line graphs, and try to recover the corresponding root graph, essentially defining the inverse of the line graph operation. We propose a linear integer program that edits the smallest number of edges in the line graph so that a root graph can be found. We use the spectral norm to theoretically prove that such a pseudo-inverse operation is well behaved. Illustrative empirical experiments on Erdős-Rényi graphs show that our theoretical results work in practice.
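The forward and (exact) inverse operations are available in NetworkX, which makes the failure mode the paper addresses easy to reproduce; the snippet below shows the clean round trip, while a perturbed line graph is where the proposed edge-editing integer program would come in.

```python
import networkx as nx

G = nx.path_graph(5)                 # root graph
L = nx.line_graph(G)                 # vertices of L are the edges of G
root = nx.inverse_line_graph(L)      # an exact inverse exists here
assert nx.is_isomorphic(root, G)

# Perturbing L (adding or removing an edge) generally makes it a non-line
# graph, and nx.inverse_line_graph raises NetworkXError; the paper's
# integer program edits the fewest edges of L so a root graph exists again.
```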

[LG-55] Classifying Cool Dwarfs: Comprehensive Spectral Typing of Field and Peculiar Dwarfs Using Machine Learning

链接: https://arxiv.org/abs/2508.09370
作者: Tianxing Zhou,Christopher A. Theissen,S. Jean Feeser,William M. J. Best,Adam J. Burgasser,Kelle L. Cruz,Lexu Zhao
类目: olar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 35 pages, 24 figures, 9 tables, accepted for publication in The Astrophysical Journal

点击查看摘要

Abstract:Low-mass stars and brown dwarfs, spectral types (SpTs) M0 and later, play a significant role in studying stellar and substellar processes and demographics, reaching down to planetary-mass objects. Currently, the classification of these sources remains heavily reliant on visual inspection of spectral features, equivalent width measurements, or narrow-/wide-band spectral indices. Recent advances in machine learning (ML) methods offer automated approaches for spectral typing, which are becoming increasingly important as large spectroscopic surveys such as Gaia, SDSS, and SPHEREx generate datasets containing millions of spectra. We investigate the application of ML to spectral type classification on low-resolution (R \sim 120) near-infrared spectra of M0–T9 dwarfs obtained with the SpeX instrument on the NASA Infrared Telescope Facility. We specifically aim to classify the gravity- and metallicity-dependent subclasses for late-type dwarfs. We used binned fluxes as input features and compared the efficacy of spectral type estimators built using Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN) models. We tested the influence of different normalizations and analyzed the relative importance of different spectral regions for surface gravity and metallicity subclass classification. Our best-performing model (using KNN) classifies 95.5 \pm 0.6% of sources to within \pm 1 SpT, and assigns surface gravity and metallicity subclasses with 89.5 \pm 0.9% accuracy. We test the dependence of classification accuracy on signal-to-noise ratio (SNR) and find sources with SNR \gtrsim 60 have \gtrsim 95% accuracy. We also find that the zy-band plays the most prominent role in the RF model, with FeH and TiO having the highest feature importance.

[LG-56] A Generative Imputation Method for Multimodal Alzheimers Disease Diagnosis

链接: https://arxiv.org/abs/2508.09271
作者: Reihaneh Hassanzadeh,Anees Abrol,Hamid Reza Hassanzadeh,Vince D. Calhoun
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal data analysis can lead to more accurate diagnoses of brain disorders due to the complementary information that each modality adds. However, a major challenge of using multimodal datasets in the neuroimaging field is incomplete data, where some of the modalities are missing for certain subjects. Hence, effective strategies are needed for completing the data. Traditional methods, such as subsampling or zero-filling, may reduce the accuracy of predictions or introduce unintended biases. In contrast, advanced methods such as generative models have emerged as promising solutions without these limitations. In this study, we proposed a generative adversarial network method designed to reconstruct missing modalities from existing ones while preserving the disease patterns. We used T1-weighted structural magnetic resonance imaging and functional network connectivity as two modalities. Our findings showed a 9% improvement in classification accuracy for Alzheimer's disease versus cognitively normal groups when using our generative imputation method compared to the traditional approaches.

[LG-57] Forecasting Binary Economic Events in Modern Mercantilism: Traditional methodologies coupled with PCA and K-means Quantitative Analysis of Qualitative Sentimental Data

链接: https://arxiv.org/abs/2508.09243
作者: Sebastian Kot
类目: General Economics (econ.GN); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper examines Modern Mercantilism, characterized by rising economic nationalism, strategic technological decoupling, and geopolitical fragmentation, as a disruptive shift from the post-1945 globalization paradigm. It applies Principal Component Analysis (PCA) to 768-dimensional SBERT-generated semantic embeddings of curated news articles to extract orthogonal latent factors that discriminate binary event outcomes linked to protectionism, technological sovereignty, and bloc realignments. Analysis of principal component loadings identifies key semantic features driving classification performance, enhancing interpretability and predictive accuracy. This methodology provides a scalable, data-driven framework for quantitatively tracking emergent mercantilist dynamics through high-dimensional text analytics.
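A bare-bones version of the pipeline described above might look as follows; the SBERT checkpoint name, the number of components, the example texts and labels, and the logistic head are all assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

texts = ["Tariffs raised on imported chips.",
         "New export controls announced on lithography tools.",
         "Trade pact expands regional market access.",
         "Joint venture deepens cross-border supply chains."]
labels = [1, 1, 0, 0]                           # hypothetical binary outcomes

emb = SentenceTransformer("all-mpnet-base-v2").encode(texts)  # 768-d vectors
Z = PCA(n_components=2).fit_transform(emb)      # orthogonal latent factors
clf = LogisticRegression().fit(Z, labels)       # event-outcome classifier
```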

[LG-58] Objective Soups: Multilingual Multi-Task Modeling for Speech Processing

链接: https://arxiv.org/abs/2508.09228
作者: A F M Saif,Lisha Chen,Xiaodong Cui,Songtao Lu,Brian Kingsbury,Tianyi Chen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks like speech recognition and translation. While multi-objective optimization (MOO) aims to align gradient updates, its effectiveness diminishes as the number of tasks grows, making it difficult to find a common descent direction. This raises a fundamental question: should highly conflicting objectives be optimized jointly or separated into a hierarchical structure? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as "objective soup recipes". These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To ensure efficiency, we introduce a lightweight layer-selection mechanism that computes the conflict-avoiding gradient using only the most problematic layers, minimizing computational and memory overhead. Extensive experiments on CoVoST v2, LibriSpeech, and AISHELL-1 reveal that a bi-level recipe separating recognition and translation tasks consistently outperforms standard flat optimization. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models. Our code has been released at this https URL.
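To illustrate the layer-selection idea, one can rank layers by gradient conflict between tasks and apply MOO only to the worst offenders; the selection rule below (cosine similarity, keep the three most conflicting layers) is an assumed simplification of the paper's mechanism.

```python
import torch
import torch.nn.functional as F

def most_conflicting_layers(grads_a, grads_b, keep=3):
    """grads_a/grads_b: dicts mapping layer name -> gradient tensor for two
    tasks. Returns the layer names with the most negative cosine similarity,
    i.e. where computing a conflict-avoiding gradient is most worthwhile."""
    sims = {name: F.cosine_similarity(grads_a[name].flatten(),
                                      grads_b[name].flatten(), dim=0).item()
            for name in grads_a}
    return sorted(sims, key=sims.get)[:keep]
```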

[LG-59] Exploring Molecular Odor Taxonomies for Structure-based Odor Predictions using Machine Learning

链接: https://arxiv.org/abs/2508.09217
作者: Akshay Sajan,Stijn Sluis,Reza Haydarlou,Sanne Abeln,Pasquale Lisena,Raphael Troncy,Caro Verbeek,Inger Leemans,Halima Mouhib
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 24 pages (58 pages including supporting information), 9 Figures, 4 Tables; additional Tables and Figures in the supporting information

点击查看摘要

Abstract:One of the key challenges to predict odor from molecular structure is unarguably our limited understanding of the odor space and the complexity of the underlying structure-odor relationships. Here, we show that the predictive performance of machine learning models for structure-based odor predictions can be improved using both, an expert and a data-driven odor taxonomy. The expert taxonomy is based on semantic and perceptual similarities, while the data-driven taxonomy is based on clustering co-occurrence patterns of odor descriptors directly from the prepared dataset. Both taxonomies improve the predictions of different machine learning models and outperform random groupings of descriptors that do not reflect existing relations between odor descriptors. We assess the quality of both taxonomies through their predictive performance across different odor classes and perform an in-depth error analysis highlighting the complexity of odor-structure relationships and identifying potential inconsistencies within the taxonomies by showcasing pear odorants used in perfumery. The data-driven taxonomy allows us to critically evaluate our expert taxonomy and better understand the molecular odor space. Both taxonomies as well as a full dataset are made available to the community, providing a stepping stone for a future community-driven exploration of the molecular basis of smell. In addition, we provide a detailed multi-layer expert taxonomy including a total of 777 different descriptors from the Pyrfume repository.

[LG-60] RadioMamba: Breaking the Accuracy-Efficiency Trade-off in Radio Map Construction via a Hybrid Mamba-UNet

链接: https://arxiv.org/abs/2508.09140
作者: Honggang Jia,Nan Cheng,Xiucheng Wang,Conghao Zhou,Ruijin Sun,Xuemin (Sherman) Shen
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Radio map (RM) construction has recently attracted much attention since it can provide real-time and accurate spatial channel information for 6G services and applications. However, current deep learning-based methods for RM construction exhibit a well-known accuracy-efficiency trade-off. In this paper, we introduce RadioMamba, a hybrid Mamba-UNet architecture for RM construction that addresses this trade-off. Generally, accurate RM construction requires modeling long-range spatial dependencies, reflecting the global nature of wave propagation physics. RadioMamba utilizes a Mamba-Convolutional block in which the Mamba branch captures these global dependencies with linear complexity, while a parallel convolutional branch extracts local features. This hybrid design generates feature representations that capture both global context and local detail. Experiments show that RadioMamba achieves higher accuracy than existing methods, including diffusion models, while operating nearly 20 times faster and using only 2.9% of the model parameters. By improving both accuracy and efficiency, RadioMamba presents a viable approach for real-time intelligent optimization in next-generation wireless systems.

信息检索

[IR-0] On the Consistency and Performance of the Iterative Bayesian Update

链接: https://arxiv.org/abs/2508.09980
作者: Ehab ElSalamouny,Catuscia Palamidessi
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:For many social, scientific, and commercial purposes, it is often important to estimate the distribution of the users’ data regarding a sensitive attribute, e.g., their ages, locations, etc. To allow this estimation while protecting the users’ privacy, every user applies a local privacy protection mechanism that releases a noisy (sanitized) version of their original datum to the data collector; then the original distribution is estimated using one of the known methods, such as the matrix inversion (INV), RAPPOR’s estimator, and the iterative Bayesian update (IBU). Unlike the other estimators, the consistency of IBU, i.e., the convergence of its estimate to the real distribution as the amount of noisy data grows, has been either ignored or incorrectly proved in the literature. In this article, we use the fact that IBU is a maximum likelihood estimator to prove that IBU is consistent. We also show, through experiments on real datasets, that IBU significantly outperforms the other methods when the users’ data are sanitized by geometric, Laplace, and exponential mechanisms, whereas it is comparable to the other methods in the case of the k-RR and RAPPOR mechanisms. Finally, we consider the case when the alphabet of the sensitive data is infinite, and we show a technique that allows IBU to operate in this case too.
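IBU itself is short enough to state in full. The sketch below is the standard EM-style update whose consistency the paper proves; M is the privacy mechanism's channel matrix and q_obs the empirical distribution of sanitized reports.

```python
import numpy as np

def ibu(q_obs, M, iters=500, tol=1e-10):
    """Iterative Bayesian Update. M[x, y] = P(report y | secret x);
    q_obs[y] = empirical frequency of report y. Returns an estimate of
    the true distribution over secrets."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)                   # uniform initialization
    for _ in range(iters):
        denom = p @ M                         # predicted report distribution
        p_new = p * (M @ (q_obs / denom))     # Bayes posterior, re-averaged
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p
```

As the abstract notes, this iteration is a maximum likelihood estimator, which is what makes the consistency argument go through.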

[IR-1] Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation

链接: https://arxiv.org/abs/2508.09664
作者: Yongrui Fu,Jian Liu,Tao Li,Zonggang Wu,Shouke Qin,Hanmeng Liu
类目: Information Retrieval (cs.IR)
*备注: 10 pages

点击查看摘要

Abstract:Recent advances in multimodal recommendation enable richer item understanding, while modeling users’ multi-scale interests across temporal horizons has attracted growing attention. However, effectively exploiting multimodal item sequences and mining multi-grained user interests to substantially bridge the gap between content comprehension and recommendation remain challenging. To address these issues, we propose MUFASA, a MUltimodal Fusion And Sparse Attention-based Alignment model for long sequential recommendation. Our model comprises two core components. First, the Multimodal Fusion Layer (MFL) leverages item titles as a cross-genre semantic anchor and is trained with a joint objective of four tailored losses that promote: (i) cross-genre semantic alignment, (ii) alignment to the collaborative space for recommendation, (iii) preserving the similarity structure defined by titles and preventing modality representation collapse, and (iv) distributional regularization of the fusion space. This yields high-quality fused item representations for further preference alignment. Second, the Sparse Attention-guided Alignment Layer (SAL) scales to long user-behavior sequences via a multi-granularity sparse attention mechanism, which incorporates windowed attention, block-level attention, and selective attention, to capture user interests hierarchically and across temporal horizons. SAL explicitly models both the evolution of coherent interest blocks and fine-grained intra-block variations, producing robust user and item representations. Extensive experiments on real-world benchmarks show that MUFASA consistently surpasses state-of-the-art baselines. Moreover, online A/B tests demonstrate significant gains in production, confirming MUFASA’s effectiveness in leveraging multimodal cues and accurately capturing diverse user preferences.
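Of the three sparse-attention granularities mentioned, the windowed component is the simplest to write down; here is a sketch of its mask (block-level and selective attention omitted, window size assumed).

```python
import torch

def windowed_attention_mask(seq_len, window):
    """Boolean (seq_len, seq_len) mask letting each behavior attend only to
    neighbors within +/- window positions, one ingredient of a
    multi-granularity sparse attention over long user sequences."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window
```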

[IR-2] TFRank: Think-Free Reasoning Enables Practical Pointwise LLM Ranking

链接: https://arxiv.org/abs/2508.09539
作者: Yongqi Fan,Xiaoyang Chen,Dezhi Ye,Jie Liu,Haijin Liang,Jin Ma,Ben He,Yingfei Sun,Tong Ruan
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Reasoning-intensive ranking models built on Large Language Models (LLMs) have made notable progress, but existing approaches often rely on large-scale LLMs and explicit Chain-of-Thought (CoT) reasoning, resulting in high computational cost and latency that limit real-world use. To address this, we propose TFRank, an efficient pointwise reasoning ranker based on small-scale LLMs. To improve ranking performance, TFRank effectively integrates CoT data, fine-grained score supervision, and multi-task training. Furthermore, it achieves an efficient "Think-Free" reasoning capability by employing a "think-mode switch" and pointwise format constraints. Specifically, this allows the model to leverage explicit reasoning during training while delivering precise relevance scores for complex queries at inference without generating any reasoning chains. Experiments show that TFRank (e.g., 1.7B) achieves performance comparable to models with four times more parameters on the BRIGHT benchmark, and demonstrates strong competitiveness on the BEIR benchmark. Further analysis shows that TFRank achieves an effective balance between performance and efficiency, providing a practical solution for integrating advanced reasoning into real-world systems. Our code and data are released in the repository: this https URL.

[IR-3] Improving Dense Passage Retrieval with Multiple Positive Passages

链接: https://arxiv.org/abs/2508.09534
作者: Shuai Chang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:By leveraging a dual encoder architecture, Dense Passage Retrieval (DPR) has outperformed traditional sparse retrieval algorithms such as BM25 in terms of passage retrieval accuracy. Recently proposed methods have further enhanced DPR’s performance. However, these models typically pair each question with only one positive passage during training, and the effect of associating multiple positive passages has not been examined. In this paper, we explore the performance of DPR when additional positive passages are incorporated during training. Experimental results show that equipping each question with multiple positive passages consistently improves retrieval accuracy, even when using a significantly smaller batch size, which enables training on a single GPU.
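A common way to realize "multiple positives per question" in a DPR-style objective is to spread the softmax target mass over all positives; the variant below sums their probabilities, though the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def multi_positive_nll(q_emb, p_emb, pos_mask, tau=1.0):
    """q_emb: (n_q, d) question embeddings; p_emb: (n_p, d) passage embeddings
    (in-batch negatives included); pos_mask: (n_q, n_p) bool, True at each
    question's positives. Loss = -log of the softmax mass on the positives."""
    scores = (q_emb @ p_emb.T) / tau
    log_probs = F.log_softmax(scores, dim=1)
    pos_logprob = torch.logsumexp(
        log_probs.masked_fill(~pos_mask, float("-inf")), dim=1)
    return -pos_logprob.mean()
```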

[IR-4] Towards Self-cognitive Exploration: Metacognitive Knowledge Graph Retrieval Augmented Generation

链接: https://arxiv.org/abs/2508.09460
作者: Xujie Yuan,Shimin Di,Jielong Tang,Libin Zheng,Jian Yin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) significantly enhances the reasoning capabilities of Large Language Models by leveraging structured knowledge. However, existing KG-RAG frameworks typically operate as open-loop systems, suffering from cognitive blindness: an inability to recognize their exploration deficiencies. This leads to relevance drift and incomplete evidence, which existing self-refinement methods, designed for unstructured text-based RAG, cannot effectively resolve due to the path-dependent nature of graph exploration. To address this challenge, we propose Metacognitive Knowledge Graph Retrieval Augmented Generation (MetaKGRAG), a novel framework inspired by the human metacognition process, which introduces a Perceive-Evaluate-Adjust cycle to enable path-aware, closed-loop refinement. This cycle empowers the system to self-assess exploration quality, identify deficiencies in coverage or relevance, and perform trajectory-connected corrections from precise pivot points. Extensive experiments across five datasets in the medical, legal, and commonsense reasoning domains demonstrate that MetaKGRAG consistently outperforms strong KG-RAG and self-refinement baselines. Our results validate the superiority of our approach and highlight the critical need for path-aware refinement in structured knowledge retrieval.
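The Perceive-Evaluate-Adjust cycle reduces to a small control loop; the sketch below treats the explorer, evaluator, and adjuster as assumed callables and only fixes the closed-loop structure.

```python
def metakgrag_cycle(question, explore, evaluate, adjust, max_cycles=3):
    """Closed-loop KG retrieval: explore paths, self-assess coverage and
    relevance, and re-explore from a pivot point while deficiencies remain.
    explore/evaluate/adjust are placeholders for the framework's components."""
    paths = explore(question)
    for _ in range(max_cycles):
        verdict = evaluate(question, paths)       # perceive + evaluate
        if verdict.get("sufficient"):
            break
        paths = adjust(question, paths, verdict)  # trajectory-connected fix
    return paths
```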

附件下载

点击下载今日全部论文列表