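Today's arXiv Papers | 2025-07-21

This post lists the latest papers retrieved from arXiv.org on 2025-07-21. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.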

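[Quick read]: This paper addresses the scarcity of high-quality training data for generative image editing, in particular how to automatically obtain large-scale, pixel-accurate triplets (original image, natural-language instruction, edited image) to support supervised training without human intervention. The key to the solution is a fully automated, modular data-mining pipeline that uses public generative models and a task-tuned Gemini validator to directly score instruction adherence and aesthetic quality, avoiding any dependence on segmentation or grounding models. Inversion and compositional bootstrapping then enlarge the dataset by roughly 2.2x, enabling large-scale automatic expansion of high-fidelity training data and substantially reducing reliance on human annotation.

Link: https://arxiv.org/abs/2507.14119
Authors: Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: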

Abstract:Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.

[NLP-1] Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

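[Quick read]: This paper addresses how large language models (LLMs) can be used to adapt professional biomedical literature into plain language, improving patients' and caregivers' understanding of and access to medical information. The core challenge is ensuring that the adapted text is of high quality in factual accuracy, completeness, conciseness, and readability, while avoiding the potential harm caused by model unpredictability. The key to the solution is a systematic evaluation framework: the Plain Language Adaptation of Biomedical Abstracts (PLABA) track, comprising two tasks (full rewriting and term replacement), rigorously evaluates different models by combining manual review by experts with multi-dimensional automatic metrics. The results show that top LLMs approach human levels of factual accuracy and completeness but still fall short on conciseness, and that automatic metrics correlate only weakly with human judgments, which highlights the limitations of current evaluation tools and directions for future improvement.

Link: https://arxiv.org/abs/2507.14096
Authors: Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Hoa Dang, Ian Soboroff, Dina Demner-Fushman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: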

Abstract:Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools. 
Cite as: arXiv:2507.14096 [cs.CL] (this version: arXiv:2507.14096v1). DOI: https://doi.org/10.48550/arXiv.2507.14096. Submitted by Brian Ondov; v1: Fri, 18 Jul 2025 17:23:52 UTC (11,278 KB).

[NLP-2] DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits

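[Quick read]: This paper addresses the severe underrepresentation of progress notes in Electronic Health Records (EHRs). Progress notes are a core basis for clinical decision-making and for analyzing how a patient's condition evolves, yet in existing large-scale EHR datasets such as MIMIC-III only about 8.56% of hospital visits include them, which fragments longitudinal patient narratives. To address this, the authors propose DENSE, whose core innovation is a fine-grained note categorization and temporal alignment mechanism that organizes heterogeneous notes across visits chronologically. A clinically informed retrieval strategy then extracts temporally and semantically relevant evidence from current and prior visits, and a large language model (LLM) uses it to generate clinically coherent, temporally consistent progress notes. The generated notes show strong longitudinal fidelity (a temporal alignment ratio of 1.089), exceeding the original notes, and so provide a more complete narrative basis for downstream tasks such as summarization, predictive modeling, and clinical decision support.

Link: https://arxiv.org/abs/2507.14079
Authors: Garapati Keerthana, Manik Gupta
Affiliations: Birla Institute of Technology and Science, Pilani, Hyderabad Campus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: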

Abstract: Progress notes are among the most clinically meaningful artifacts in an Electronic Health Record (EHR), offering temporally grounded insights into a patient's evolving condition, treatments, and care decisions. Despite their importance, they are severely underrepresented in large-scale EHR datasets. For instance, in the widely used Medical Information Mart for Intensive Care III (MIMIC-III) dataset, only about 8.56% of hospital visits include progress notes, leaving gaps in longitudinal patient narratives. In contrast, the dataset contains a diverse array of other note types, each capturing different aspects of care. We present DENSE (Documenting Evolving Progress Notes from Scattered Evidence), a system designed to align with clinical documentation workflows by simulating how physicians reference past encounters while drafting progress notes. The system introduces a fine-grained note categorization and a temporal alignment mechanism that organizes heterogeneous notes across visits into structured, chronological inputs. At its core, DENSE leverages a clinically informed retrieval strategy to identify temporally and semantically relevant content from both current and prior visits. This retrieved evidence is used to prompt a large language model (LLM) to generate clinically coherent and temporally aware progress notes. We evaluate DENSE on a curated cohort of patients with multiple visits and complete progress note documentation. The generated notes demonstrate strong longitudinal fidelity, achieving a temporal alignment ratio of 1.089, surpassing the continuity observed in original notes. By restoring narrative coherence across fragmented documentation, our system supports improved downstream tasks such as summarization, predictive modeling, and clinical decision support, offering a scalable solution for LLM-driven note synthesis in real-world healthcare settings.
Cite as: arXiv:2507.14079 [cs.CL] (this version: arXiv:2507.14079v1). DOI: https://doi.org/10.48550/arXiv.2507.14079.

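[NLP-3] Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog

[Quick read]: This paper addresses the fact that current generative AI lacks sound reasoning about shared goals and beliefs in multi-turn collaborative dialog. Existing extensions of the Rational Speech Act (RSA) framework are hard to scale to multi-turn, collaborative settings, so language agents behave less consistently and less interpretably in complex interactions. The key to the solution is Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension that models dialog by optimizing a gain function derived from rate-distortion theory. This gain function inherits the optimal information-transfer principle of the original RSA and also explicitly accounts for the private information held by each party and for utterances being conditioned on the dialog history, yielding more consistent, interpretable, and collaborative dialog behavior.

Link: https://arxiv.org/abs/2507.14063
Authors: Lautaro Estienne, Gabriel Ben Zenou, Nona Naderi, Jackie Cheung, Pablo Piantanida
Affiliations: International Laboratory on Learning Systems (ILLS); Laboratoire Interdisciplinaire des Sciences du Numérique (LISN); Mila - Quebec AI Institute; CNRS; CentraleSupélec; Université Paris-Saclay; McGill University, Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments: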

Abstract: As AI systems take on collaborative roles, they must reason about shared goals and beliefs, not just generate fluent language. The Rational Speech Act (RSA) framework offers a principled approach to pragmatic reasoning, but existing extensions face challenges in scaling to multi-turn, collaborative scenarios. In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain is an extension of the gain model that is maximized in the original RSA model but takes into account the scenario in which both agents in a conversation have private information and produce utterances conditioned on the dialog. We demonstrate the effectiveness of CRSA on referential games and template-based doctor-patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines, paving the way for more pragmatic and socially aware language agents.

[NLP-4] EdgeVLA: Efficient Vision-Language-Action Models

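[Quick read]: This paper addresses the slow inference and high computational cost of deploying large Vision-Language-Action (VLA) models on resource-constrained mobile manipulation systems. The proposed Edge VLA (EVLA) rests on two techniques: first, removing the autoregressive requirement for end-effector position prediction, which gives a 7x inference speedup; second, using Small Language Models (SLMs) in place of larger models, which keeps training performance comparable while significantly reducing compute and memory demands. Together these enable efficient real-time inference on edge devices.

Link: https://arxiv.org/abs/2507.14049
Authors: Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, Benjamin Bolte
Affiliations: K-Scale Labs; McGill University
Categories: Robotics (cs.RO); Computation and Language (cs.CL)
Comments: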

Abstract: Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training codebase (this https URL) to foster further research.

[NLP-5] Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks

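[Quick read]: This paper addresses how to select and apply cost-efficient large language models (LLMs) in the biomedical domain, in particular how their performance and suitability differ across text and image tasks. The key to the solution is a systematic evaluation of multiple closed-source and open-source LLMs on biomedical text classification, generation, question answering, and multimodal image processing. It finds that no single model leads consistently across all tasks and that different models excel at different tasks. It also finds that open-source models can match or even surpass closed-source models in some settings, with the added benefits of faster inference and better privacy, which gives a fine-grained basis for model selection in biomedical applications.

Link: https://arxiv.org/abs/2507.14045
Authors: Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at Canadian AI 2025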

Abstract:This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as biomedical text classification and generation, question answering, and multimodal image processing. Our experimental findings indicate that there is no single LLM that can consistently outperform others across all tasks. Instead, different LLMs excel in different tasks. While some closed-source LLMs demonstrate strong performance on specific tasks, their open-source counterparts achieve comparable results (sometimes even better), with additional benefits like faster inference and enhanced privacy. Our experimental results offer valuable insights for selecting models that are optimally suited for specific biomedical applications.

[NLP-6] CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis

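[Quick read]: This paper addresses the problem of selecting a classification model for document-level sentiment analysis, that is, how to choose the best model in a principled way among several candidates (Naive Bayes, Linear Support Vector Classification, Random Forest, Logistic Regression, XGBoost, LSTM, and ALBERT). The key to the solution is a model selection framework (CPC-CMS) based on Cognitive Pairwise Comparison (CPC). The framework uses expert knowledge judgments to set the weights of the evaluation criteria (accuracy, precision, recall, F1-score, specificity, Matthews Correlation Coefficient (MCC), Cohen's Kappa, and efficiency) and builds a weighted decision matrix that quantifies each model's overall performance across the criteria, giving an objective and interpretable model choice.

Link: https://arxiv.org/abs/2507.14022
Authors: Jianfei Li, Kevin Kam Fung Yuen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 35 pages, 33 tables, 6 Figures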

Abstract:This study proposes the Cognitive Pairwise Comparison Classification Model Selection (CPC-CMS) framework for document-level sentiment analysis. The CPC, based on expert knowledge judgment, is used to calculate the weights of evaluation criteria, including accuracy, precision, recall, F1-score, specificity, Matthews Correlation Coefficient (MCC), Cohen’s Kappa (Kappa), and efficiency. Naive Bayes, Linear Support Vector Classification (LSVC), Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), and A Lite Bidirectional Encoder Representations from Transformers (ALBERT) are chosen as classification baseline models. A weighted decision matrix consisting of classification evaluation scores with respect to criteria weights, is formed to select the best classification model for a classification problem. Three open datasets of social media are used to demonstrate the feasibility of the proposed CPC-CMS. Based on our simulation, for evaluation results excluding the time factor, ALBERT is the best for the three datasets; if time consumption is included, no single model always performs better than the other models. The CPC-CMS can be applied to the other classification applications in different areas.
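
The weighted decision matrix described above can be illustrated with a minimal sketch. It assumes scores are already normalized to [0, 1] and uses a plain weighted average; the weights, scores, and function names below are hypothetical and are not taken from the paper, where the weights come from expert pairwise comparisons.

```python
from typing import Dict, List


def weighted_score(model_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of one model's criterion scores."""
    total_weight = sum(weights.values())
    return sum(weights[c] * model_scores[c] for c in weights) / total_weight


def rank_models(scores: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return model names sorted from best to worst weighted score."""
    return sorted(scores, key=lambda m: weighted_score(scores[m], weights), reverse=True)


# Hypothetical, already-normalized scores in [0, 1]; not results from the paper.
scores = {
    "ALBERT": {"accuracy": 0.92, "f1": 0.91, "efficiency": 0.30},
    "LogisticRegression": {"accuracy": 0.85, "f1": 0.84, "efficiency": 0.95},
}

# Hypothetical criteria weights, with and without the efficiency (time) criterion.
with_time = {"accuracy": 0.4, "f1": 0.4, "efficiency": 0.2}
without_time = {"accuracy": 0.5, "f1": 0.5}

print(rank_models(scores, with_time))
print(rank_models(scores, without_time))
```

With these made-up numbers the top-ranked model changes once efficiency is weighted, which is the kind of sensitivity the abstract reports when time consumption is included.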

[NLP-7] Efficient Temporal Tokenization for Mobility Prediction with Large Language Models

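[Quick read]: This paper addresses the difficulty of modeling long-range dependencies and the low computational efficiency in human mobility prediction. Existing methods often face a sharp rise in computational cost when sequences become long, and they struggle to capture multi-scale temporal patterns such as daily and weekly cycles. The key to the solution is the RHYTHM framework, which uses hierarchical temporal tokenization to partition trajectories into daily segments encoded as discrete tokens and applies hierarchical attention to model daily and weekly dependencies together. Token representations are further enriched with prompt embeddings from a frozen pretrained large language model (LLM), improving the model's ability to capture complex interdependencies without adding training overhead, for high-accuracy, low-latency trajectory prediction.

Link: https://arxiv.org/abs/2507.14017
Authors: Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: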

Abstract:We introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a framework that leverages large language models (LLMs) as spatio-temporal predictors and trajectory reasoners. RHYTHM partitions trajectories into daily segments encoded as discrete tokens with hierarchical attention, capturing both daily and weekly dependencies while substantially reducing the sequence length. Token representations are enriched with pre-computed prompt embeddings via a frozen LLM, enhancing the model’s ability to capture interdependencies without extensive computational overhead. By freezing the LLM backbone, RHYTHM achieves significant computational efficiency. Evaluation on three real-world datasets demonstrates a 2.4% improvement in accuracy, 5.0% increase on weekends, and 24.6% reduction in training time compared to state-of-the-art methods.
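
To see why daily segmentation shortens the sequence, here is a minimal sketch of the partitioning step only. It assumes one location ID per hour (24 steps per day), which is an assumption for illustration; how RHYTHM encodes each segment as a token and applies hierarchical attention is not shown.

```python
from typing import List


def split_into_daily_segments(locations: List[int], steps_per_day: int = 24) -> List[List[int]]:
    """Group a flat sequence of location IDs into consecutive daily segments.

    A trailing partial day is kept as a shorter final segment.
    """
    return [locations[i:i + steps_per_day] for i in range(0, len(locations), steps_per_day)]


# Hypothetical trajectory: one location ID per hour for one week.
week = list(range(7 * 24))
segments = split_into_daily_segments(week)

# 168 hourly steps become 7 daily units for a sequence model to attend over.
print(len(week), len(segments))
```

Under this assumption the week-level sequence has one element per day rather than one per hour, so attention across days operates on a much shorter input.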

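[NLP-8] Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic [ICASSP 2025]

[Quick read]: This paper addresses key challenges in developing Arabic automatic speech recognition (ASR) systems, especially the resource scarcity and dialect diversity that stem from the complexity of the language. Existing work focuses mostly on Modern Standard Arabic (MSA), pays little attention to variants such as Classical Arabic (CA), and lacks public models that handle multiple Arabic variants together. The key to the solution is a universal methodology for Arabic speech and text processing, used to train two new models on the FastConformer architecture: a high-performance model optimized for MSA, and the first public unified model for both MSA and CA, which reaches state-of-the-art (SOTA) accuracy on diacritized Classical Arabic while keeping strong MSA performance. Standardized preprocessing, joint modeling, and open-sourced training recipes significantly improve generalization and reproducibility.

Link: https://arxiv.org/abs/2507.13977
Authors: Lilit Grigoryan, Nikolay Karpov, Enas Albasiri, Vitaly Lavrukhin, Boris Ginsburg
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to ICASSP 2025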

Abstract:Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language’s complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.

[NLP-9] Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

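[Quick read]: This paper addresses the problem that current language models, although they show task-specific reasoning in cross-domain generalization, struggle to acquire deep domain expertise through top-down training on general corpora. The key to the solution is a bottom-up approach: the compositional structure of a knowledge graph (KG) is used to build basic domain concepts (primitives) into more complex higher-level concepts, and a task generation pipeline synthesizes reasoning tasks directly from KG primitive edges. Language models are then fine-tuned on these tasks to achieve superintelligence in a specific domain such as medicine. The core innovation is using the KG's path encoding to build a composable body of domain knowledge, which improves performance on complex reasoning tasks.

Link: https://arxiv.org/abs/2507.13966
Authors: Bhishma Dedhia, Yuval Kansal, Niraj K. Jha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: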

Abstract:Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model’s performance. While the industry’s approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.

[NLP-10] Exploiting Primacy Effect To Improve Large Language Models

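[Quick read]: This paper addresses the drop in prediction accuracy that large language models (LLMs) suffer in Multiple Choice Question Answering (MCQA) because of positional bias, especially the primacy effect. It shows that fine-tuning amplifies this bias, possibly because the model learns human behavioral patterns. The key to the solution is to reorder the answer options by their semantic similarity to the query, without using any information about the correct answer, so that highly relevant options are more likely to appear first and the primacy effect works in the model's favor. Experiments show that this strategy significantly improves MCQA accuracy, which suggests that bias is both a challenge and an opportunity that can be exploited strategically, and offers a new direction for designing more robust bias-aware NLP models.

Link: https://arxiv.org/abs/2507.13949
Authors: Bianca Raimondi, Maurizio Gabbrielli
Affiliations: University of Bologna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by RANLP 2025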

Abstract: Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect, where items presented first are more likely to be remembered or selected, plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
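
The reordering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a bag-of-words cosine similarity stands in for whatever semantic similarity measure the paper uses, and the function names, query, and options are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (a stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def reorder_options(query: str, options: List[str],
                    similarity: Callable[[str, str], float] = bow_cosine) -> List[str]:
    """Place the options most similar to the query first; ties keep their original order."""
    return sorted(options, key=lambda o: similarity(query, o), reverse=True)


# Hypothetical question and answer options; the correct answer is never consulted.
query = "which planet is known as the red planet"
options = ["venus", "the red planet mars", "jupiter"]
print(reorder_options(query, options))
```

Swapping `bow_cosine` for a sentence-embedding similarity would only change the `similarity` argument; the reordering logic stays the same.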

[NLP-11] Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support

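[Quick read]: This paper addresses slow responses, high staff cost, and the difficulty students have in getting personalized answers in university admissions inquiries. The key to the solution is Marcel, a lightweight, open-source conversational agent. Its core design uses retrieval-augmented generation (RAG) to ground answers in the university's official knowledge base, so that information is verifiable and contextually relevant. It also introduces an FAQ retriever that maps user questions to knowledge-base entries to improve retrieval quality and lets administrators steer retrieval, improving over standard dense or hybrid retrieval strategies. The design is tailored to resource-constrained academic settings and balances practicality with ease of deployment.

Link: https://arxiv.org/abs/2507.13937
Authors: Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: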

Abstract:We present Marcel, a lightweight and open-source conversational agent designed to support prospective students with admission-related inquiries. The system aims to provide fast and personalized responses, while reducing workload of university staff. We employ retrieval-augmented generation to ground answers in university resources and to provide users with verifiable, contextually relevant information. To improve retrieval quality, we introduce an FAQ retriever that maps user questions to knowledge-base entries, allowing administrators to steer retrieval, and improving over standard dense/hybrid retrieval strategies. The system is engineered for easy deployment in resource-constrained academic settings. We detail the system architecture, provide a technical evaluation of its components, and report insights from a real-world deployment.

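[NLP-12] Preprint: Did I Just Browse A Website Written by LLMs?

[Quick read]: This paper addresses the growing amount of web content generated mainly by large language models (LLMs), called "LLM-dominant content", which is hard to identify. Such content can be inaccurate or even unethical because of plagiarism and hallucination, yet websites rarely disclose it and ordinary users struggle to recognize it, so reliable detection tools are needed. Existing LLM detectors work well mainly on clean, prose-like text and fall short on complex page structures and diverse content types. The paper proposes a highly reliable, scalable detection pipeline. Its key idea is not to classify the text extracted from each page directly, but to judge a whole website from an LLM text detector's outputs on multiple prose-like pages. Trained and evaluated on two ground-truth datasets totaling 120 sites, the method reaches 100% accuracy in testing, and in the wild (search engine results and Common Crawl data) it identifies a sizable number of LLM-dominant sites, revealing their growth and their high search ranking and providing a basis for assessing their impact on users and the web ecosystem.

Link: https://arxiv.org/abs/2507.13933
Authors: Sichang “Steven” He, Ramesh Govindan, Harsha V. Madhyastha
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: In submission. 2 pages. 3 figures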

Abstract: Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this “LLM-dominant” content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are insufficient, because they perform well mainly on clean, prose-like text, while web content has complex markup and diverse genres. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector’s outputs of multiple prose-like pages. We train and evaluate our detector by collecting 2 distinct ground truth datasets totaling 120 sites, and obtain 100% accuracies testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
Cite as: arXiv:2507.13933 [cs.NI] (this version: arXiv:2507.13933v1). DOI: https://doi.org/10.48550/arXiv.2507.13933.
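
The idea of classifying a site from several page-level detector outputs, rather than from any single page, can be sketched as below. The abstract does not say how the outputs are combined, so the thresholded fraction used here, the parameter names, and the example scores are all assumptions for illustration.

```python
from typing import Sequence


def classify_site(page_scores: Sequence[float],
                  page_threshold: float = 0.5,
                  site_fraction: float = 0.5) -> bool:
    """Return True if the site looks LLM-dominant.

    page_scores: hypothetical per-page detector outputs in [0, 1], where a
    higher value means "more likely LLM-written".
    A page is flagged when its score reaches page_threshold; the site is
    flagged when more than site_fraction of its pages are flagged.
    """
    if not page_scores:
        raise ValueError("need at least one page score")
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    return flagged / len(page_scores) > site_fraction


# Hypothetical detector outputs for the prose-like pages of two sites.
print(classify_site([0.9, 0.8, 0.2]))
print(classify_site([0.1, 0.6, 0.3]))
```

Aggregating over pages means one noisy page-level score does not decide the site-level label on its own.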

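[NLP-13] The Levers of Political Persuasion with Conversational AI

[Quick read]: This paper addresses the widespread concern that conversational AI could exert unprecedented influence over human beliefs, using large-scale empirical studies to evaluate the persuasiveness and factual accuracy of large language models (LLMs) on political issues. Its key finding is that the persuasive power of current and near-future AI comes mainly from post-training and prompting methods, which raise persuasiveness by as much as 51% and 27% respectively, rather than from larger model scale or personalization. These methods increase persuasion by efficiently accessing and strategically deploying information, but they also systematically reduce factual accuracy, indicating a trade-off between persuasiveness and truthfulness.

Link: https://arxiv.org/abs/2507.13919
Authors: Kobi Hackenburg, Ben M. Tappin, Luke Hewitt, Ed Saunders, Sid Black, Hause Lin, Catherine Fist, Helen Margetts, David G. Rand, Christopher Summerfield
Affiliations: UK AI Security Institute; University of Oxford; The London School of Economics and Political Science; Stanford University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 4 figures. Our supplementary materials file can be found at this https URL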

Abstract: There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs (including some post-trained explicitly for persuasion) to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods (which boosted persuasiveness by as much as 51% and 27% respectively) than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs’ unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.

[NLP-14] Political Leaning and Politicalness Classification of Texts

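[Quick read]: This paper addresses the poor generalization of existing models for automatically classifying the political leaning and politicalness of texts, in particular the marked drop in performance on out-of-distribution data. The key to the solution is a diverse, comprehensive dataset built by combining 12 existing datasets for political leaning classification and extending 18 existing datasets with a politicalness label, giving a more representative benchmark with broader coverage for training and evaluation. On this basis, systematic benchmarking with leave-one-in and leave-one-out methodologies is used both to evaluate existing models and to train new models that generalize better.

Link: https://arxiv.org/abs/2507.13913
Authors: Matous Volf (1), Jakub Simko (2) ((1) DELTA High school of computer science and economics, Pardubice, Czechia; (2) Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia)
Affiliations: DELTA – High school of computer science and economics; Kempelen Institute of Intelligent Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: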

Abstract:This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
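
The two benchmarking schemes can be sketched as dataset-level splits. The abstract does not define them, so this sketch assumes the usual reading: leave-one-out trains on all datasets except one and tests on the held-out one, while leave-one-in trains on a single dataset and tests on all the others. The dataset names are placeholders.

```python
from typing import List, Tuple

Split = Tuple[List[str], List[str]]  # (training datasets, test datasets)


def leave_one_out_splits(names: List[str]) -> List[Split]:
    """Train on every dataset except one; test on the held-out dataset."""
    return [([n for n in names if n != held], [held]) for held in names]


def leave_one_in_splits(names: List[str]) -> List[Split]:
    """Train on a single dataset; test on all remaining datasets."""
    return [([kept], [n for n in names if n != kept]) for kept in names]


# Placeholder dataset names, not the datasets used in the paper.
datasets = ["news_a", "tweets_b", "speeches_c"]
for train, test in leave_one_out_splits(datasets):
    print("train:", train, "test:", test)
```

Both schemes test on data the model never saw in training, so they probe out-of-distribution behaviour rather than in-dataset accuracy.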

[NLP-15] Using LLMs to identify features of personal and professional skills in an open-response situational judgment test

[Quick read]: This paper addresses the scalability problem of traditional open-response Situational Judgment Tests (SJTs), which rely on human raters, and the weak construct validity of earlier automated scoring systems based on Natural Language Processing (NLP). The key to the solution is to use large language models (LLMs) to extract construct-relevant features from SJT responses, enabling more accurate and reliable automated scoring and offering a new technical path for the standardized assessment of personal and professional skills.

Link: https://arxiv.org/abs/2507.13881
Authors: Cole Walsh, Rodica Ivan, Muhammad Zafar Iqbal, Colleen Robb
Affiliations: Acuity Insights Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 2 figures, 4 tables; this work was accepted for presentation at the 2025 Artificial Intelligence in Measurement and Education Conference in Pittsburgh, Pennsylvania, United States

Abstract:Academic programs are increasingly recognizing the importance of personal and professional skills and their critical role alongside technical expertise in preparing students for future success in diverse career paths. With this growing demand comes the need for scalable systems to measure, evaluate, and develop these skills. Situational Judgment Tests (SJTs) offer one potential avenue for measuring these skills in a standardized and reliable way, but open-response SJTs have traditionally relied on trained human raters for evaluation, presenting operational challenges to delivering SJTs at scale. Past attempts at developing NLP-based scoring systems for SJTs have fallen short due to issues with construct validity of these systems. In this article, we explore a novel approach to extracting construct-relevant features from SJT responses using large language models (LLMs). We use the Casper SJT to demonstrate the efficacy of this approach. This study sets the foundation for future developments in automated scoring for personal and professional skills.

[NLP-16] Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies INTERSPEECH2025

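[Quick read]: This paper addresses the limited performance of automatic speech recognition (ASR) under code-switching (CS), in particular the drop in recognition accuracy for mixed Catalan-Spanish speech caused by scarce training data and the similarity of the two languages. The solution improves how ASR models capture real CS patterns through three strategies: (1) generating synthetic CS data to enlarge the training set, (2) concatenating monolingual audio into pseudo-CS samples, and (3) using real CS data together with language tokens for fine-grained control. Experiments show that combining a modest amount of synthetic CS data with the dominant language token gives the best transcription performance.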

Link: https://arxiv.org/abs/2507.13875
Authors: Carlos Mena,Pol Serra,Jacobo Romero,Abir Messaoudi,Jose Giraldo,Carme Armentano-Oller,Rodolfo Zevallos,Ivan Meza,Javier Hernando
Affiliation: LangTech Lab; IIMAS; BSC; Turing Institute; UPC
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2025

Abstract:Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI’s Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
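
Strategy (2), concatenating monolingual audio, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the dictionary keys, the function name `concat_pseudo_cs`, and the rule that the longer utterance counts as the dominant language are all hypothetical. The language token follows the Whisper convention of tokens such as `<|ca|>` and `<|es|>`.

```python
def concat_pseudo_cs(utt_a, utt_b):
    """Join two monolingual utterances into one pseudo code-switched sample.

    Each utterance is a dict with hypothetical keys:
    "samples" (list of audio samples), "text" (transcript), "lang" ("ca" or "es").
    """
    # Assumption: the longer utterance is treated as the dominant language.
    dominant = utt_a if len(utt_a["samples"]) >= len(utt_b["samples"]) else utt_b
    return {
        "samples": utt_a["samples"] + utt_b["samples"],
        "text": utt_a["text"] + " " + utt_b["text"],
        # Whisper-style language token for the dominant language.
        "lang_token": "<|" + dominant["lang"] + "|>",
    }


catalan = {"samples": [0.1, 0.2, 0.3], "text": "bon dia a tothom", "lang": "ca"}
spanish = {"samples": [0.4, 0.5], "text": "gracias por venir", "lang": "es"}
sample = concat_pseudo_cs(catalan, spanish)
print(sample["lang_token"], sample["text"])
```

In a real setup the samples would be waveforms at a shared sampling rate, and the choice of dominant language would follow whatever rule the training recipe defines.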

[NLP-17] Label Unification for Cross-Dataset Generalization in Cybersecurity NER

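[Quick read]: This paper addresses the lack of standardized labels in cybersecurity Named Entity Recognition (NER), which makes datasets hard to combine, with the goal of making multi-dataset resources more usable. The approach unifies four cybersecurity datasets under a coarse-grained label scheme and then proposes two alternative architectures: a multihead model with weight sharing and a graph-based transfer model. Experiments show that models trained on the unified data generalize poorly across datasets; the multihead model yields only marginal gains over unified training, and the graph-based transfer model shows no significant gain over its BERT-base-NER baseline, which exposes the practical limits of current label unification.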

Link: https://arxiv.org/abs/2507.13870
Authors: Maciej Jalocha,Johan Hausted Schmidt,William Michelseen
Affiliation: IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 5 figures

Abstract:The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.
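
Coarse-grained label unification amounts to mapping each dataset's fine-grained tags onto one shared label set. The sketch below shows the idea on BIO tags. The tag names and the mapping are hypothetical, since the abstract does not list the actual label sets, and sending unmapped entity types to `O` is an assumption made here for simplicity.

```python
# Hypothetical mapping from dataset-specific entity types to a shared coarse set.
UNIFIED = {
    "Malware": "MALWARE",
    "MAL": "MALWARE",
    "Vulnerability": "VULNERABILITY",
    "VULID": "VULNERABILITY",
    "Organization": "ORG",
    "ORG": "ORG",
}


def unify_tag(tag, mapping=UNIFIED):
    """Map one BIO tag to the coarse label set; unmapped types become 'O'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    coarse = mapping.get(label)
    return prefix + "-" + coarse if coarse else "O"


def unify_sequence(tags):
    """Apply the mapping to every tag of a sentence."""
    return [unify_tag(t) for t in tags]


print(unify_sequence(["B-MAL", "I-MAL", "O", "B-Tool"]))
```

Dropping unmapped types loses information, which is one plausible reason a unified model might transfer poorly between datasets; the paper's qualitative analysis would be the place to check how it actually handles such cases.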

[NLP-18] SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection

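[Quick read]: This paper addresses evaluation bias in Knowledge Graph Question Answering (KGQA) systems built on Large Language Models (LLMs): because the training data is outside researchers' control, a model may appear to perform well simply because it has memorized a benchmark. The key contribution is a new evaluation method that measures an LLM's actual ability by generating SPARQL queries under three conditions: (1) zero-shot SPARQL generation, (2) knowledge injection, and (3) "anonymized" knowledge injection. According to the authors, this makes it possible for the first time to estimate how much the training data contributes to the QA quality gained from LLMs, separating genuine generalization from memorization. The method is portable and robust and works with any knowledge graph and LLM, giving consistent insight into actual LLM capabilities.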

Link: https://arxiv.org/abs/2507.13859
Authors: Aleksandr Gashkov,Aleksandr Perevalov,Maria Eltsova,Andreas Both
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Winner of Best Paper Award at the 25th International Conference on Web Engineering (ICWE 2025)

Abstract:Nowadays, the importance of software with natural-language user interfaces should not be underestimated. In particular, in Question Answering (QA) systems, generating a SPARQL query for a given natural-language question (often named Query Building) from the information retrieved from the same question is the central task of QA systems working over Knowledge Graphs (KGQA). Due to the rise of Large Language Models (LLMs), they are considered a well-suited method to increase the quality of the question-answering functionality, as there is still a lot of room for improvement, aiming for enhanced quality and trustworthiness. However, LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question under various conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with “anonymized” knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality improved by LLMs. Ultimately, this will help to identify how portable a method is or whether good results might mostly be achieved because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; therefore, it could be easily applied to any KGQA or LLM, so that consistent insights into the actual LLM capabilities can be generated.

[NLP-19] InTraVisTo: Inside Transformer Visualisation Tool

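[Quick read]: This paper addresses the difficulty of deploying Large Language Models (LLMs) in practice, which stems from their unpredictable behavior and the gap between desired behavior and actual output. The key contribution is a new tool, InTraVisTo (Inside Transformer Visualisation Tool), which visualizes both the internal state of a Transformer as each token is generated (by decoding token embeddings at every layer) and the information flow between components across layers (using a Sankey diagram), helping researchers understand the computations and reasoning paths inside an LLM.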

Link: https://arxiv.org/abs/2507.13858
Authors: Nicolò Brunello,Davide Rigamonti,Andrea Sassella,Vincenzo Scotti,Mark James Carman
Affiliation: DEIB, Politecnico di Milano, Milan, Italy; KASTEL, Karlsruhe Institute of Technology, Karlsruhe, Germany
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:The reasoning capabilities of Large Language Models (LLMs) have increased greatly over the last few years, as have their size and complexity. Nonetheless, the use of LLMs in production remains challenging due to their unpredictable nature and discrepancies that can exist between their desired behavior and their actual model output. In this paper, we introduce a new tool, InTraVisTo (Inside Transformer Visualisation Tool), designed to enable researchers to investigate and trace the computational process that generates each token in a Transformer-based LLM. InTraVisTo provides a visualization of both the internal state of the Transformer model (by decoding token embeddings at each layer of the model) and the information flow between the various components across the different layers of the model (using a Sankey diagram). With InTraVisTo, we aim to help researchers and practitioners better understand the computations being performed within the Transformer model and thus to shed some light on internal patterns and reasoning processes employed by LLMs.

[NLP-20] Modeling Fair Play in Detective Stories with Language Models

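[Quick read]: This paper addresses the difficulty generative AI has in reconciling "fair play" with story coherence when writing detective fiction, that is, meeting the reader's prior expectations while still introducing plausible surprises. The key contribution is a probabilistic framework in which fair play is formally defined and matching quantitative metrics are designed. The framework exposes an inherent tension between a story's coherence and the surprise it induces, and so provides both a theoretical basis and practical tools for evaluating and improving AI-generated detective stories.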

Link: https://arxiv.org/abs/2507.13841
Authors: Eitan Wagner,Renana Keydar,Omri Abend
Affiliation: Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Effective storytelling relies on a delicate balance between meeting the reader’s prior expectations and introducing unexpected developments. In the domain of detective fiction, this tension is known as fair play, which includes the implicit agreement between the writer and the reader as to the range of possible resolutions the mystery story may have. In this work, we present a probabilistic framework for detective fiction that allows us to define desired qualities. Using this framework, we formally define fair play and design appropriate metrics for it. Stemming from these definitions is an inherent tension between the coherence of the story, which measures how much it “makes sense”, and the surprise it induces. We validate the framework by applying it to LLM-generated detective stories. This domain is appealing since we have an abundance of data, we can sample from the distribution generating the story, and the story-writing capabilities of LLMs are interesting in their own right. Results show that while LLM-generated stories may be unpredictable, they generally fail to balance the trade-off between surprise and fair play, which greatly contributes to their poor quality.

[NLP-21] The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words

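[Quick read]: This paper examines how language use in Chinese psycho-counseling relates to depression and anxiety, and in particular whether the frequency of first-person singular pronouns and negative emotional words reflects a client's mental state. The approach builds a corpus from 735 online counseling sessions, extracts linguistic measures with the Linguistic Inquiry and Word Count (LIWC) software, and analyzes them with a general linear mixed-effect model. The frequency of negative emotional words correlates significantly and positively with the severity of depression and anxiety, whereas the use of first-person singular pronouns does not vary significantly with psychological state. This departs from earlier findings drawn mainly from individualistic Western, English-language contexts and highlights the role of cultural differences and the counseling interaction itself in psycholinguistic markers.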

Link: https://arxiv.org/abs/2507.13839
Authors: Lizhi Ma,Tong Zhao,Shuai Zhang,Nirui Song,Hongliang He,Anqi Li,Ran Feng,Huachuan Qiu,Jingsong Ma,Zhenzhong Lan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This study explores the relationship between linguistic expressions and psychological states of depression and anxiety within Chinese psycho-counseling interactions, focusing specifically on the usage of first-person singular pronouns and negative emotional words. Utilizing a corpus derived from 735 online counseling sessions, the analysis employed a general linear mixed-effect model to assess linguistic patterns quantified by the Linguistic Inquiry and Word Count (LIWC) software. Results indicate a significant positive correlation between the frequency of negative emotional words and the severity of both depressive and anxious states among clients. However, contrary to prior findings predominantly derived from English-language contexts, the usage frequency of first-person singular pronouns did not vary significantly with the clients’ psychological conditions. These outcomes are discussed within the framework of cultural distinctions between collectivist Chinese contexts and individualistic Western settings, as well as the interactive dynamics unique to psycho-counseling conversations. The findings highlight the nuanced influence of cultural and conversational contexts on language use in mental health communications, providing insights into psycholinguistic markers relevant to therapeutic practices in Chinese-speaking populations.

[NLP-22] Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models SIGIR2025

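[Quick read]: This paper addresses the difficulty researchers have in quickly identifying and understanding the core concepts and contributions of an article they are deciding whether to read or cite. The authors propose two ways to generate question-answer (QA) pairs that capture an article's key content. The first uses only the article's own text: a large language model (LLM) generates questions from salient paragraphs, the questions are ranked, and answers are then generated. The second brings in a Knowledge Graph (KG): an Entity Relationship (ER) extraction model fine-tuned on scientific articles builds the KG, and a triplet TF-IDF-like measure selects the most relevant triplets for each article. The study finds that the KG-based approach captures the main ideas of the articles effectively, and that fine-tuning the ER extraction model on scientific text is crucial for obtaining high-quality triplets.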

Link: https://arxiv.org/abs/2507.13827
Authors: Hosein Azarbonyad,Zi Long Zhu,Georgios Cheirmpos,Zubair Afzal,Vikrant Yadav,Georgios Tsatsaronis
Affiliation: Elsevier
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: SIGIR 2025

Abstract:When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article’s novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct a KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and using it to build the graph. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
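
The "triplet TF-IDF-like measure" can be read as scoring a triplet by how often it occurs in one article relative to how common it is across the literature. The sketch below is one plausible form, assuming a smoothed inverse document frequency. The exact formula, the function name `triplet_saliency`, and the toy triplets are assumptions, not the authors' definition.

```python
import math
from collections import Counter


def triplet_saliency(article_triplets, corpus):
    """Rank the triplets of one article by a TF-IDF-like score.

    article_triplets: list of (subject, relation, object) tuples from the article.
    corpus: list of triplet collections, one per article in the literature.
    """
    tf = Counter(article_triplets)
    n_docs = len(corpus)
    scores = {}
    for triplet, count in tf.items():
        # Document frequency: number of articles that contain this triplet.
        df = sum(1 for doc in corpus if triplet in doc)
        # Smoothed IDF (an assumed form), always positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[triplet] = (count / len(article_triplets)) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


novel = ("model X", "improves", "task Y")
common = ("neural network", "is a", "model")
article = [novel, novel, common]
corpus = [set(article), {common}, {common}]
ranked = triplet_saliency(article, corpus)
print(ranked[0][0])
```

A triplet that is frequent in the article but rare elsewhere ranks highest, which matches the stated intent of comparing importance within the article to prevalence in the literature.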

[NLP-23] RAG-based Architectures for Drug Side Effect Retrieval in LLMs

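[Quick read]: This paper addresses the unreliability of Large Language Models (LLMs) in detecting and analyzing drug side effects, which stems from their reliance on black-box training data, their tendency to hallucinate, and their lack of domain-specific knowledge. The key contribution is two augmented architectures, Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B model. GraphRAG, which retrieves from a structured knowledge graph, achieves near-perfect retrieval accuracy in an evaluation on 19,520 drug side effect associations, which the authors present as a highly accurate and scalable solution for pharmacovigilance applications.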

Link: https://arxiv.org/abs/2507.13822
Authors: Shad Nygren,Pinar Avci,Andre Daniels,Reza Rassol,Afshin Beheshti,Diego Galeano
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, signifying a significant advancement in leveraging LLMs for critical pharmacovigilance applications.

[NLP-24] An Enhanced Model-based Approach for Short Text Clustering

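[Quick read]: This paper addresses short text clustering, which is hard because the data is sparse, high-dimensional, and large-scale, and because existing representation learning methods are computationally expensive and slow to run. The core contribution is GSDMM+, an improved collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model. It reduces initialization noise, adaptively adjusts word weights based on entropy to obtain finer-grained, more topic-related clusters, and applies strategic cluster merging to refine clustering granularity so that the predicted distribution better matches the true category distribution, improving clustering quality while remaining efficient.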

Link: https://arxiv.org/abs/2507.13793
Authors: Enhao Cheng,Shoujia Zhang,Jianhua Yin,Xuemeng Song,Tian Gan,Liqiang Nie
Affiliation: Shandong University; City University of Hong Kong; Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at this https URL.
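
One way to picture entropy-based word weighting: a word spread evenly over all clusters carries little topical signal, while a word concentrated in one cluster carries a lot. The sketch below uses the weight 1 - H / log K, where H is the entropy of the word's distribution over K clusters. This particular formula and the function name are assumptions for illustration; the abstract does not state the weighting that GSDMM+ actually uses.

```python
import math


def entropy_word_weights(word_cluster_counts, num_clusters):
    """Weight each word by how concentrated it is across clusters.

    word_cluster_counts: {word: [count in cluster 0, ..., count in cluster K-1]}
    Returns weights in [0, 1]; 1 means the word occurs in a single cluster.
    """
    if num_clusters < 2:
        raise ValueError("need at least two clusters")
    max_entropy = math.log(num_clusters)
    weights = {}
    for word, counts in word_cluster_counts.items():
        total = sum(counts)
        if total == 0:
            continue  # word never observed; no weight assigned
        probs = [c / total for c in counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        weights[word] = 1.0 - entropy / max_entropy
    return weights


counts = {"the": [5, 5], "goal": [10, 0], "match": [9, 1]}
weights = entropy_word_weights(counts, num_clusters=2)
print(weights)
```

Because cluster assignments change during Gibbs sampling, such weights would be recomputed as the counts evolve, which is one reading of "adaptively" in the abstract.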

[NLP-25] Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions ACL2025

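[Quick read]: This paper addresses the lack of interactive clarification in Visual Question Answering (VQA) when a user's question is ambiguous. Existing methods mainly rephrase the question and overlook the interactive nature of exchanges between users and Visual Language Models (VLMs), in which ambiguity can be resolved through user feedback. In response, the authors introduce the ClearVQA benchmark, which targets three common categories of ambiguity in VQA and covers a variety of VQA scenarios, so that the ability of VLMs to resolve ambiguity through interaction can be assessed. The aim is to move VLMs from preferring to answer directly toward actively asking for clarification.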

Link: https://arxiv.org/abs/2507.13773
Authors: Pu Jian,Donglei Yu,Wen Yang,Shuo Ren,Jiajun Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ACL 2025 Main

Abstract:In visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in VQA context, and encompasses various VQA scenarios.

[NLP-26] Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

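[Quick read]: This paper addresses the safety problem that Visual Language Models (VLMs) can produce inappropriate content when given carefully crafted prompts, and in particular how discrete components of prompt design affect jailbreak success. The study identifies three factors that can each trigger a jailbreak on their own: the inclusion of detailed visual information, the presence of adversarial examples, and the use of positively framed beginning phrases. It also proposes a framework that uses a skip connection between two internal layers of the VLM, which substantially raises jailbreak success rates even with benign images. Finally, it finds that memes can be as effective as toxic images in eliciting harmful content, revealing subtle vulnerabilities of multimodal models.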

Link: https://arxiv.org/abs/2507.13761
Authors: Palash Nandi,Maithili Joshi,Tanmoy Chakraborty
Affiliation: Indian Institute of Technology Delhi
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models are highly sensitive to prompt formulations - small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.

[NLP-27] PRIDE – Parameter-Efficient Reduction of Identity Discrimination for Equality in LLMs

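[Quick read]: This paper addresses the gender- and sexual-identity bias that Large Language Models (LLMs) show toward LGBTQIA+ people, which originates in stereotypes in the training data and leads to outputs that marginalize these users. To mitigate it, the study evaluates parameter-efficient fine-tuning (PEFT) as a lightweight alternative to full-model fine-tuning, and its main finding is that Low-Rank Adaptation (LoRA) is effective: LoRA fine-tuning on a curated QueerNews corpus with only 0.1% additional parameters lowers WinoQueer bias scores from as high as 98 by up to 50 points and raises the share of neutral outputs from nearly 0% to as much as 36%, improving fairness at minimal computational cost.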

Link: https://arxiv.org/abs/2507.13743
Authors: Maluna Menke,Thilo Hagendorff
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Large Language Models (LLMs) frequently reproduce the gender- and sexual-identity prejudices embedded in their training corpora, leading to outputs that marginalize LGBTQIA+ users. Hence, reducing such biases is of great importance. To achieve this, we evaluate two parameter-efficient fine-tuning (PEFT) techniques - Low-Rank Adaptation (LoRA) and soft-prompt tuning - as lightweight alternatives to full-model fine-tuning for mitigating such biases. Using the WinoQueer benchmark, we quantify bias in three open-source LLMs and observe baseline bias scores reaching up to 98 (out of 100) across a range of queer identities defined by gender and/or sexual orientation, where 50 would indicate neutrality. Fine-tuning with LoRA (0.1% additional parameters) on a curated QueerNews corpus reduces those scores by up to 50 points and raises neutrality from virtually 0% to as much as 36%. Soft-prompt tuning (10 virtual tokens) delivers only marginal improvements. These findings show that LoRA can deliver meaningful fairness gains with minimal computation. We advocate broader adoption of community-informed PEFT, the creation of larger queer-authored corpora, and richer evaluation suites beyond WinoQueer, coupled with ongoing audits to keep LLMs inclusive.
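
To see why LoRA adds so few parameters, recall that it keeps a weight matrix W frozen and learns a low-rank update, so the layer computes W x + (alpha / r) B (A x) with A of shape r x d_in and B of shape d_out x r. The sketch below implements that forward pass in plain Python and counts the added parameters for one layer. The matrix sizes and rank are hypothetical and are not the configuration used in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    r = len(A)  # rank of the update = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_fraction(d_out, d_in, r):
    """Added LoRA parameters relative to the frozen d_out x d_in matrix."""
    return r * (d_in + d_out) / (d_out * d_in)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
A = [[1.0, 1.0]]               # r = 1, d_in = 2
B = [[1.0], [2.0]]             # d_out = 2, r = 1
y = lora_forward(W, A, B, [1.0, 2.0], alpha=2.0)
print(y)
print(lora_param_fraction(4096, 4096, 8))
```

With B initialised to zeros, as is common for LoRA, the adapted layer starts out identical to the frozen one. The printed fraction is for a single hypothetical 4096 x 4096 layer at rank 8; the share over a whole model depends on which layers receive adapters.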

[NLP-28] DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs

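[Quick read]: This paper addresses the notable limitations of existing activity log generation methods in accuracy, efficiency, and semantic richness. The authors propose DailyLLM, a lightweight framework based on a Large Language Model (LLM) that, to their knowledge, is the first to integrate contextual information across four dimensions (location, motion, environment, and physiology) using only sensors commonly found on smartphones and smartwatches. It combines structured prompting with efficient feature extraction for high-level activity understanding and log generation. Using only a 1.5B-parameter model, it improves BERTScore precision by 17% over a 70B-parameter state-of-the-art baseline while running inference nearly 10x faster.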

Link: https://arxiv.org/abs/2507.13737
Authors: Ye Tian,Xiaoyuan Ren,Zihao Wang,Onat Gungor,Xiaofan Yu,Tajana Rosing
Affiliation: University of California San Diego, Computer Science and Engineering Department
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:

Abstract:Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.

[NLP-29] The Judge Variable: Challenging Judge-Agnostic Legal Judgment Prediction

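[Quick read]: This paper takes up the long-standing debate between legal realism and formalism, namely whether individual judges' decision-making patterns significantly influence case outcomes, and so challenges the traditional assumption that judges are neutral variables who apply the law uniformly. The approach is a machine learning prediction framework that compares "specialist models" trained on a particular judge's past rulings with "generalist models" trained on aggregated data. Specialist models predict child physical custody outcomes markedly more accurately (F1 up to 92.85%) than the generalist model (82.63%), even though the generalist model is trained on 20x to 100x more samples. The results indicate that judicial identity has a measurable effect on outcomes and provide empirical support for legal realism.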

Link: https://arxiv.org/abs/2507.13732
Authors: Guillaume Zambrano
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 24 figures; shorter version submitted to JURIX 2025

Abstract:This study examines the role of human judges in legal decision-making by using machine learning to predict child physical custody outcomes in French appellate courts. Building on the legal realism-formalism debate, we test whether individual judges’ decision-making patterns significantly influence case outcomes, challenging the assumption that judges are neutral variables that apply the law uniformly. To ensure compliance with French privacy laws, we implement a strict pseudonymization process. Our analysis uses 18,937 living arrangements rulings extracted from 10,306 cases. We compare models trained on individual judges’ past rulings (specialist models) with a judge-agnostic model trained on aggregated data (generalist models). The prediction pipeline is a hybrid approach combining large language models (LLMs) for structured feature extraction and ML models for outcome prediction (RF, XGB and SVC). Our results show that specialist models consistently achieve higher predictive accuracy than the general model, with top-performing models reaching F1 scores as high as 92.85%, compared to the generalist model’s 82.63% trained on 20x to 100x more samples. Specialist models capture stable individual patterns that are not transferable to other judges. In-Domain and Cross-Domain validity tests provide empirical support for legal realism, demonstrating that judicial identity plays a measurable role in legal outcomes. All data and code used will be made available.

[NLP-30] Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations RECSYS’25

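[Quick read]: This paper examines whether, when Large Language Models (LLMs) act as both decision-makers and explanation generators in Group Recommender Systems (GRS), their recommendations and explanations match the aggregation strategies of social choice theory, and whether they lack transparency. By comparing LLM-generated recommendations with social choice-based strategies such as Additive Utilitarian (ADD) aggregation, the study finds that the recommendations often resemble ADD, but that the explanations frequently invoke vague or undefined criteria (such as user similarity, diversity, or unspecified popularity metrics) and are inconsistent. This undermines explainability and trust, and suggests that explanation logic needs to improve and that standard aggregation methods may need to be adapted for larger item sets.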

Link: https://arxiv.org/abs/2507.13705
Authors: Cedric Waterschoot,Nava Tintarev,Francesco Barile
Affiliation: Maastricht University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Short paper accepted at the Nineteenth ACM Conference on Recommender Systems (RecSys '25). Cedric Waterschoot, Nava Tintarev, and Francesco Barile. 2025. Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations. Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys '25), Prague, Czech Republic. doi: https://doi.org/10.1145/3705328.3748015

Abstract:Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
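
The remark that averaging ratings resembles but is not identical to ADD aggregation is easiest to see when some group members have not rated an item. Additive Utilitarian sums the available ratings, whereas averaging divides by the number of raters, so the two can rank items differently. The toy ratings below are hypothetical and only illustrate that difference.

```python
def additive_utilitarian(ratings):
    """Score each item by the sum of the ratings it received (ADD)."""
    totals = {}
    for user_ratings in ratings.values():
        for item, rating in user_ratings.items():
            totals[item] = totals.get(item, 0) + rating
    return totals


def average_rating(ratings):
    """Score each item by the mean of the ratings it received."""
    totals = additive_utilitarian(ratings)
    counts = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}


def top_item(scores):
    """Return the item with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical group: everyone rated item A, only one member rated item B.
ratings = {
    "ann": {"A": 3, "B": 5},
    "bob": {"A": 3},
    "cat": {"A": 3},
}
print(top_item(additive_utilitarian(ratings)), top_item(average_rating(ratings)))
```

When every member rates every item, the sum and the mean differ only by a constant factor and produce the same ranking, which may be why the two are easy to conflate in an explanation.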

[NLP-31] LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

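[Quick read]: This paper addresses the computational and memory bottlenecks that Large Language Models (LLMs) face as context grows in long multi-turn dialogues, which limit responsiveness in practice. The key contribution is LoopServe, an adaptive dual-phase inference acceleration framework. In the first phase, during prefilling, it performs online sparsification by dynamically selecting the most important parts of the attention matrix. In the second phase, during decoding, it applies progressive Key Value (KV) compression, adaptively maintaining an efficient and relevant cache based on the most recently generated output tokens. According to the authors, this improves both the speed and the effectiveness of LLM inference on long-dialogue tasks.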

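Link: https://arxiv.org/abs/2507.13681
Authors: Haoyang Li,Zhanchao Xu,Yiming Li,Xuejia Chen,Darian Li,Anxin Tian,Qingfa Xiao,Cheng Deng,Jun Wang,Qing Li,Lei Chen,Mingxuan Yuan
Affiliation: PolyU (The Hong Kong Polytechnic University); Huawei; HKUST (The Hong Kong University of Science and Technology); HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou)); Edin (University of Edinburgh); UCL (University College London)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: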

Abstract:Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
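
The second phase, keeping a cache that is relevant to the most recently generated tokens, can be sketched as a top-k selection over cached positions. This is a minimal illustration under assumptions: summing attention weights as the importance score, the fixed `budget`, and the function name are all hypothetical, and the abstract does not specify LoopServe's actual selection rule.

```python
def select_kv_to_keep(recent_attention, budget):
    """Choose which cached positions to retain.

    recent_attention: one row per recently generated token; each row holds that
    token's attention weights over the cached positions.
    budget: number of cached positions to keep.
    Returns the retained positions in their original order.
    """
    num_positions = len(recent_attention[0])
    # Importance of a cached position = total attention it received recently.
    importance = [
        sum(row[j] for row in recent_attention) for j in range(num_positions)
    ]
    ranked = sorted(range(num_positions), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget])


attention = [
    [0.10, 0.60, 0.10, 0.20],  # attention of the latest output token
    [0.05, 0.50, 0.05, 0.40],  # attention of the token before it
]
keep = select_kv_to_keep(attention, budget=2)
print(keep)
```

Re-running the selection as new tokens are produced makes the cache track the current turn rather than a fixed positional heuristic, which is the contrast the abstract draws with earlier methods.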

[NLP-32] KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs

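[Quick read]: This paper addresses the high inference cost of Large Language Models (LLMs) in free-form text generation, where strong models are reached through expensive API calls. Existing cascade methods decide whether to escalate to a stronger model by exact text matching, so they struggle to pick a representative output and to judge the overall reliability of free-form outputs. The key contribution is the Keyword-inspired Cascade (KiC) framework, which identifies the most representative answer among several outputs of the weaker model and decides, based on how well the other outputs align with it semantically, whether to accept that output or escalate to the stronger model, cutting API costs while preserving performance.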

Link: https://arxiv.org/abs/2507.13666
Authors: Woo-Chan Kim, Ji-Hoon Park, Seong-Whan Lee
Affiliations: Korea University; Institute of Information & Communications Technology Planning & Evaluation
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated state-of-the-art performance across a wide range of natural language processing tasks. However, high-performing models are typically accessible only via APIs, incurring substantial inference costs. Cascade methods address this by initially employing a cheaper model and escalating to a stronger one only when necessary. Nevertheless, existing cascade approaches struggle to select a reliable representative response and assess the overall reliability of free-form outputs, as they rely on exact text matching. To overcome these limitations, we propose Keyword-inspired Cascade (KiC), a novel framework for cost-efficient free-form text generation. KiC identifies the most representative answer among multiple outputs from a weaker model and evaluates the semantic alignment of other responses with it. Based on the degree of alignment, KiC determines whether to accept the weaker model’s output or escalate to a stronger model. Experiments on three free-form text generation benchmarks show that KiC achieves 97.53 percent of GPT-4’s accuracy while reducing API costs by 28.81 percent on average, and even outperforms GPT-4 in a specific benchmark.
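
The accept-or-escalate decision described above can be sketched in a few lines of Python. The abstract does not say how KiC extracts keywords or measures semantic alignment, so everything here is an assumption: `keywords` is a crude word-set stand-in, alignment is Jaccard overlap, and the `threshold` value is arbitrary.

```python
import re


def keywords(text):
    """Lowercased word set: a crude, hypothetical stand-in for keyword extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cascade_decision(weak_outputs, threshold=0.5):
    """Return (representative, alignment, escalate) for >= 2 weak-model outputs."""
    if len(weak_outputs) < 2:
        raise ValueError("need at least two weak-model outputs to compare")
    kw = [keywords(o) for o in weak_outputs]
    n = len(kw)
    # Mean overlap of each output with all the others.
    mean_sim = [
        sum(jaccard(kw[i], kw[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: mean_sim[i])
    alignment = mean_sim[best]
    # Escalate to the stronger model when the samples disagree too much.
    return weak_outputs[best], alignment, alignment < threshold
```

When the weak model's samples agree, the representative answer is returned and the strong model is never called; when they diverge, `escalate` is true and the caller would pay for the stronger model.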

[NLP-33] CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer

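[Quick read]: This paper addresses two challenges in applying large language models (LLMs) to specialized clinical settings such as the intensive care unit (ICU): weak domain adaptation and scarce labeled data. The key to the solution is CU-ICU, a method built on the Text-to-Text Transfer Transformer (T5) architecture that uses sparse fine-tuning, combining few-shot prompting with selective parameter updates, to adapt efficiently and accurately while updating fewer than 1% of model parameters. Experiments show that CU-ICU clearly outperforms standard fine-tuning on early sepsis detection, mortality prediction, and clinical note generation, with up to a 15% gain in sepsis detection accuracy and a 20% improvement in generating clinically relevant explanations at very low computational overhead, offering a feasible path to scalable, low-latency, interpretable clinical decision support in real ICU environments.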

Link: https://arxiv.org/abs/2507.13655
Authors: Teerapong Panboonyuen
Affiliations: Chulalongkorn University
Categories: Computation and Language (cs.CL)
Comments: 12 pages

Abstract:Integrating large language models into specialized domains like healthcare presents unique challenges, including domain adaptation and limited labeled data. We introduce CU-ICU, a method for customizing unsupervised instruction-finetuned language models for ICU datasets by leveraging the Text-to-Text Transfer Transformer (T5) architecture. CU-ICU employs a sparse fine-tuning approach that combines few-shot prompting with selective parameter updates, enabling efficient adaptation with minimal supervision. Our evaluation across critical ICU tasks–early sepsis detection, mortality prediction, and clinical note generation–demonstrates that CU-ICU consistently improves predictive accuracy and interpretability over standard fine-tuning methods. Notably, CU-ICU achieves up to a 15% increase in sepsis detection accuracy and a 20% enhancement in generating clinically relevant explanations while updating fewer than 1% of model parameters in its most efficient configuration. These results establish CU-ICU as a scalable, low-overhead solution for delivering accurate and interpretable clinical decision support in real-world ICU environments.

[NLP-34] Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

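[Quick read]: This paper addresses the challenges LLMs face in multilingual translation, including handling intricate language patterns and the stilted phrasing that appears in automated translation. The key to the solution is Seed-X, a family of open-source LLMs with 7B parameters. The base model is pre-trained on high-quality monolingual and bilingual data covering 28 languages; the instruct model is then fine-tuned with Chain-of-Thought (CoT) reasoning and further improved with reinforcement learning (RL), which markedly improves generalization across language pairs and translation quality. Experiments show that Seed-X matches leading closed-source models such as Gemini-2.5 and GPT-4o in both automatic metrics and human evaluation, and significantly outperforms larger open-source models.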

Link: https://arxiv.org/abs/2507.13618
Authors: Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Zhichao Huang, Tao Li, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Affiliations: ByteDance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Multilingual translation remains a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted phrasing that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability at the 7B-parameter scale. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available to advance translation research and applications.

[NLP-35] Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

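[Quick read]: This paper asks how multi-level linguistic features (morphological, syntactic, and semantic) can distinguish human-written text from text produced by generative AI, and how the two differ in style across models, sampling strategies, and model release dates. The key to the solution is a dataset spanning 8 domains with text generated by 11 different large language models (LLMs), on which the authors compute and analyze linguistic features such as dependency length and emotionality. Using statistical analysis and style embeddings, they find that human text tends to have simpler syntactic structure and richer semantic content, and that human text varies more in these features across domains, whereas machine-generated text becomes more homogeneous as models are updated, pointing to stylistic convergence.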

Link: https://arxiv.org/abs/2507.13614
Authors: Sergio E. Zanotto, Segun Aroyehun
Affiliations: University of Konstanz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2412.03025

Abstract: The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.

[NLP-36] CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks

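[Quick read]: This paper addresses the limited ability of video large language models (VideoLLMs) to perform chain-of-thought (CoT) reasoning grounded in fine-grained, object-level video understanding. Existing instruction-tuned models (such as the Qwen and LLaVA series) are typically trained on high-level video-text pairs and lack the structured annotations needed for compositional, step-by-step reasoning. The key to the solution is the CoTasks framework, which decomposes complex video questions (for example, from NeXT-QA and STAR) into four foundational entity-level tasks: frame localization, entity tracking, and spatial and temporal relation extraction. Embedding these intermediate CoT-style reasoning steps into the input lets the model explicitly carry out object-centric spatiotemporal reasoning. Experiments show clear gains on the causal, temporal, and descriptive subcategories, supporting CoTasks as a structured CoT supervision framework.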

Link: https://arxiv.org/abs/2507.13609
Authors: Yanan Wang, Julio Vizcarra, Zhi Li, Hao Niu, Mori Kurokawa
Affiliations: KDDI Research, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs, often lacking structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions of existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly enhance inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4, with large boosts in causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured CoT-style supervision framework for improving compositional video reasoning.

[NLP-37] TexGS-VolVis: Expressive Scene Editing for Volume Visualization via Textured Gaussian Splatting IEEE-VIS2025

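[Quick read]: This paper addresses the limited flexibility of stylized rendering in traditional volume visualization (VolVis): existing non-photorealistic rendering (NPR) techniques rely on complex predefined rules, support only single-style transfer, and make flexible, controllable scene editing difficult. The key to the solution is the TexGS-VolVis framework, which introduces 2D Gaussian primitives extended with texture and shading attributes to decouple geometry from appearance, giving high-quality, geometry-consistent stylized rendering. Combined with image- and text-driven non-photorealistic scene editing, it offers fine-grained control over partial edits and improves rendering efficiency, visual quality, and editing flexibility.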

Link: https://arxiv.org/abs/2507.13586
Authors: Kaiyuan Tang, Kuangshi Ai, Jun Han, Chaoli Wang
Affiliations: unknown
Categories: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE VIS 2025

Abstract:Advancements in volume visualization (VolVis) focus on extracting insights from 3D volumetric data by generating visually compelling renderings that reveal complex internal structures. Existing VolVis approaches have explored non-photorealistic rendering techniques to enhance the clarity, expressiveness, and informativeness of visual communication. While effective, these methods often rely on complex predefined rules and are limited to transferring a single style, restricting their flexibility. To overcome these limitations, we advocate the representation of VolVis scenes using differentiable Gaussian primitives combined with pretrained large models to enable arbitrary style transfer and real-time rendering. However, conventional 3D Gaussian primitives tightly couple geometry and appearance, leading to suboptimal stylization results. To address this, we introduce TexGS-VolVis, a textured Gaussian splatting framework for VolVis. TexGS-VolVis employs 2D Gaussian primitives, extending each Gaussian with additional texture and shading attributes, resulting in higher-quality, geometry-consistent stylization and enhanced lighting control during inference. Despite these improvements, achieving flexible and controllable scene editing remains challenging. To further enhance stylization, we develop image- and text-driven non-photorealistic scene editing tailored for TexGS-VolVis and 2D-lift-3D segmentation to enable partial editing with fine-grained control. We evaluate TexGS-VolVis both qualitatively and quantitatively across various volume rendering scenes, demonstrating its superiority over existing methods in terms of efficiency, visual quality, and editing flexibility.

[NLP-38] A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

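[Quick read]: This paper addresses several challenges in Russian speech synthesis, including vowel reduction, consonant devoicing, variable stress placement, homograph ambiguity, and unnatural intonation. The key to the solution is a new dataset, Balalaika, containing more than 2,000 hours of high-quality Russian speech with detailed textual annotations (including punctuation and stress marks); models trained on it significantly outperform models trained on existing datasets on speech synthesis and enhancement tasks.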

Link: https://arxiv.org/abs/2507.13563
Authors: Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian
Affiliations: Moscow Technical University of Communication and Informatics; Artificial Intelligence Research Institute
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: The work is still in progress

Abstract:Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.

[NLP-39] Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder

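[Quick read]: This paper addresses the difficulty of clinically assessing formal thought disorder (FTD) in schizophrenia spectrum disorders, where traditional rating scales are resource-intensive and hard to scale. The key to the solution is to extract pause features from speech using automatic speech recognition (ASR), combine them with semantic coherence metrics, and predict FTD severity with support vector regression (SVR). The study shows that pause features alone robustly predict FTD severity, and that combining pause and semantic features improves predictive performance (for example, ρ = 0.649 and AUC = 83.71% on the TOPSY dataset), indicating that joint analysis of temporal and semantic dimensions offers an effective route to objectively quantifying disorganized speech.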

Link: https://arxiv.org/abs/2507.13551
Authors: Feng Chen, Weizhe Xu, Changye Li, Serguei Pakhomov, Alex Cohen, Simran Bhola, Sandy Yin, Sunny X Tang, Michael Mackinley, Lena Palaniyappan, Dror Ben-Zeev, Trevor Cohen
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Formal thought disorder (FTD), a hallmark of schizophrenia spectrum disorders, manifests as incoherent speech and poses challenges for clinical assessment. Traditional clinical rating scales, though validated, are resource-intensive and lack scalability. Automated speech analysis with automatic speech recognition (ASR) allows for objective quantification of linguistic and temporal features of speech, offering scalable alternatives. The use of utterance timestamps in ASR captures pause dynamics, which are thought to reflect the cognitive processes underlying speech production. However, the utility of integrating these ASR-derived features for assessing FTD severity requires further evaluation. This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). We evaluated pause-related features alongside established coherence measures, using support vector regression (SVR) to predict clinical FTD scores. Key findings demonstrate that pause features alone robustly predict the severity of FTD. Integrating pause features with semantic coherence metrics enhanced predictive performance compared to semantic-only models, with integration of independent models achieving correlations up to ρ = 0.649 and AUC = 83.71% for severe-case detection (TOPSY, with best ρ = 0.584 and AUC = 79.23% for semantic-only models). The performance gains from integrating semantic and pause features held consistently across all contexts, though the nature of pause patterns was dataset-dependent. These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech and advance automated speech analysis in psychosis.
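
To make "pause features from ASR timestamps" concrete, here is a minimal Python sketch that turns word-level start and end times into a few summary statistics. The abstract does not list the study's actual features, so the feature names, the `pause_features` helper, and the 0.5 s `long_pause` threshold are illustrative assumptions; the SVR step is omitted.

```python
def pause_features(word_times, long_pause=0.5):
    """Summarize silent gaps between consecutive words.

    word_times: list of (start, end) times in seconds, in spoken order,
    as an ASR system with word-level timestamps might provide.
    long_pause: hypothetical threshold (seconds) for counting a long pause.
    """
    pauses = [
        max(0.0, next_start - prev_end)  # overlapping words count as no pause
        for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:])
    ]
    if not pauses:
        return {
            "n_pauses": 0,
            "mean_pause": 0.0,
            "max_pause": 0.0,
            "long_pause_count": 0,
        }
    return {
        "n_pauses": len(pauses),
        "mean_pause": sum(pauses) / len(pauses),
        "max_pause": max(pauses),
        "long_pause_count": sum(p >= long_pause for p in pauses),
    }
```

A feature dictionary like this, computed per recording, could then be concatenated with coherence scores and passed to a regressor.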

[NLP-40] GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models

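[Quick read]: This paper addresses hallucination in large language models (LLMs) used to build knowledge-based systems, that is, the confident generation of incorrect or unverifiable facts, which undermines reliability and interpretability. The key to the solution is to restrict the application domain and use a structured prompt-based extraction method that converts LLM-generated content into a verifiable symbolic knowledge representation (expressed in Prolog), which human experts can validate and correct, supporting the accuracy, interpretability, scalability, and reliability of the knowledge base. The approach combines the recall capacity of LLMs with the precision of symbolic systems in a transparent hybrid architecture, laying the groundwork for trustworthy AI in sensitive application domains.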

Link: https://arxiv.org/abs/2507.13550
Authors: Eduardo C. Garrido-Merchán, Cristina Puente
Affiliations: Comillas Pontifical University; Institute of Research in Technology (IIT)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments:

Abstract: The development of large language models (LLMs) has successfully transformed knowledge-based systems such as open-domain question answering, which can automatically produce vast amounts of seemingly coherent information. Yet, those models have several disadvantages, such as hallucinations or the confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.

[NLP-41] A Computational Approach to Modeling Conversational Systems: Analyzing Large-Scale Quasi-Patterned Dialogue Flows

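[Quick read]: This paper addresses the difficulty of modeling the structure of loosely organized dialogues (quasi-patterned conversations) in large-scale conversational data, whose semantic flow and underlying structure traditional methods capture poorly. The key to the solution is a new computational framework that constructs conversational graphs, together with the "Filter Reconnect" graph simplification technique, which minimizes noise while preserving semantic coherence and structural integrity. Experiments show that combining large language models (LLMs) with this simplification raises the semantic metric S to 2.06 times that of previous approaches and yields a tree-like structure with zero δ-hyperbolicity, improving the clarity and interpretability of conversation modeling.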

Link: https://arxiv.org/abs/2507.13544
Authors: Mohamed Achref Ben Ammar, Mohamed Taha Bennani
Affiliations: National Institute of Applied Science and Technology; University of Tunis El Manar
Categories: Computation and Language (cs.CL)
Comments:

Abstract: The analysis of conversational dynamics has gained increasing importance with the rise of large language model-based systems, which interact with users across diverse contexts. In this work, we propose a novel computational framework for constructing conversational graphs that capture the flow and structure of loosely organized dialogues, referred to as quasi-patterned conversations. We introduce the Filter Reconnect method, a novel graph simplification technique that minimizes noise while preserving semantic coherence and structural integrity of conversational graphs. Through comparative analysis, we demonstrate that combining large language models with our graph simplification technique increases the semantic metric S by a factor of 2.06 compared to previous approaches while simultaneously enforcing a tree-like structure with zero δ-hyperbolicity, ensuring optimal clarity in conversation modeling. This work provides a computational method for analyzing large-scale dialogue datasets, with practical applications related to monitoring automated systems such as chatbots, dialogue management tools, and user behavior analytics.
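
The claim that a tree-like graph has zero δ-hyperbolicity can be checked on small graphs with Gromov's four-point condition: for every four nodes, take the three pairwise distance sums and halve the gap between the two largest; δ is the maximum over all quadruples. The Python sketch below is a brute-force illustration, not the paper's code; the adjacency-dict input format and function names are assumptions.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path hop counts from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def delta_hyperbolicity(adj):
    """Four-point delta of a connected, undirected, unweighted graph.

    adj: dict mapping each node to a list of its neighbours.
    Brute force over all quadruples, so only suitable for small graphs.
    """
    d = {u: bfs_distances(adj, u) for u in adj}
    delta = 0.0
    for x, y, z, w in combinations(adj, 4):
        sums = sorted(
            [d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]]
        )
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta
```

Under this definition any tree gives 0, while a 4-cycle gives 1, so a simplified conversational graph that is a tree necessarily reports zero.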

[NLP-42] Encoding syntactic objects and Merge operations in function spaces

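[Quick read]: This paper asks how to model mathematically, within a neurocomputational framework, Merge, the core operation of syntactic structure: how to embed syntactic objects (such as phrases and sentences) computably in a function space while faithfully representing their algebraic structure and operations. The solution has three parts. First, lexical items are represented as elements of a function space (for example, wavelets), and the space is given a commutative non-associative semiring structure defined through the second Rényi entropy. Second, the algebraic structure on this space (specifically, an algebra over an operad) models the syntactic operations, where the operad's operations correspond to circuits that transform input waveforms and thereby encode syntactic structure. Third, the action of Merge on workspaces is implemented through a coproduct and a Hopf algebra Markov chain, so that Merge can be expressed in terms of the successor function of a semiring, and in a particular case it is realized as cross-frequency phase synchronization on sinusoidal waves. The construction shows in principle that the computational structure of syntax admits a constructive neurocomputational realization.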

Link: https://arxiv.org/abs/2507.13501
Authors: Matilde Marcolli, Robert C. Berwick
Affiliations: California Institute of Technology; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Rings and Algebras (math.RA); Neurons and Cognition (q-bio.NC)
Comments: 40 pages, LaTeX, 4 png figures

Abstract:We provide a mathematical argument showing that, given a representation of lexical items as functions (wavelets, for instance) in some function space, it is possible to construct a faithful representation of arbitrary syntactic objects in the same function space. This space can be endowed with a commutative non-associative semiring structure built using the second Renyi entropy. The resulting representation of syntactic objects is compatible with the magma structure. The resulting set of functions is an algebra over an operad, where the operations in the operad model circuits that transform the input wave forms into a combined output that encodes the syntactic structure. The action of Merge on workspaces is faithfully implemented as action on these circuits, through a coproduct and a Hopf algebra Markov chain. The results obtained here provide a constructive argument showing the theoretical possibility of a neurocomputational realization of the core computational structure of syntax. We also present a particular case of this general construction where this type of realization of Merge is implemented as a cross frequency phase synchronization on sinusoidal waves. This also shows that Merge can be expressed in terms of the successor function of a semiring, thus clarifying the well known observation of its similarities with the successor function of arithmetic.

[NLP-43] Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?

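[Quick read]: This paper addresses two core problems in probing the value orientation of large language models (LLMs): existing probing methods are not robust to input perturbations and have not been systematically compared, and it is unclear whether the probed values truly reflect the model's preferences for real-world actions and its responsiveness to different demographic contexts. The key to the solution is a comparative study of three mainstream probing strategies that evaluates robustness using variations in prompts and options, plus two new tasks that test, respectively, how sensitive the value representations are to demographic context and how consistent they are with the model's actual behavior. The results expose the limits of current probing methods and call for more careful interpretation and use of LLM value probing.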

Link: https://arxiv.org/abs/2507.13490
Authors: Siqi Shen, Mehar Singh, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Rada Mihalcea
Affiliations: University of Michigan; LG AI Research; University of Illinois at Chicago
Categories: Computation and Language (cs.CL)
Comments:

Abstract:There has been extensive research on assessing the value orientation of Large Language Models (LLMs) as it can shape user experiences across demographic groups. However, several challenges remain. First, while the Multiple Choice Question (MCQ) setting has been shown to be vulnerable to perturbations, there is no systematic comparison of probing methods for value probing. Second, it is unclear to what extent the probed values capture in-context information and reflect models’ preferences for real-world actions. In this paper, we evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We use variations in prompts and options, showing that all methods exhibit large variances under input perturbations. We also introduce two tasks studying whether the values are responsive to demographic context, and how well they align with the models’ behaviors in value-related scenarios. We show that the demographic context has little effect on the free-text generation, and the models’ values only weakly correlate with their preference for value-based actions. Our work highlights the need for a more careful examination of LLM value probing and awareness of its limitations.

[NLP-44] Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

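[Quick read]: This paper addresses a potential vulnerability in the safety protections of large language models (LLMs): their tendency to over-trust information from authoritative sources, which can be exploited maliciously. The key contribution is a new jailbreak method, Paper Summary Attack (PSA), which systematically synthesizes content from attack-focused or defense-focused safety papers into an adversarial prompt template and embeds harmful queries as adversarial payloads in predefined subsections, triggering the model to produce policy-violating content. Experiments show very high success rates (up to 98%) on both base models and advanced reasoning models such as Deepseek-R1, and reveal markedly opposed vulnerability biases between model versions, offering leads for future work on adversarial attacks and model safety.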

Link: https://arxiv.org/abs/2507.13474
Authors: Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Renmin University of China
Categories: Computation and Language (cs.CL)
Comments:

Abstract: The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work has further revealed diametrically opposed vulnerability bias across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at this https URL

[NLP-45] Aligning Knowledge Graphs and Language Models for Factual Accuracy

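[Quick read]: This paper addresses hallucination, the generation of content that contradicts facts, which is widespread when large language models (LLMs) are used for natural language processing tasks. The core solution is to inject knowledge graph (KG) information into the latent space of the language model to improve factual accuracy. The key innovation is ALIGNed-LLM: it obtains entity embeddings from a pre-trained Knowledge Graph Embedding (KGE) model such as TransE and aligns them with text embeddings through a trainable projection layer, which improves the model's ability to distinguish similar entities, yields more reliable semantic grounding, and significantly reduces hallucination.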

Link: https://arxiv.org/abs/2507.13411
Authors: Nur A Zarin Nishat, Andrea Coletta, Luigi Bellomarini, Kossi Amouzouvi, Jens Lehmann, Sahar Vahdati
Affiliations: TIB – Leibniz Information Centre for Science and Technology; Banca d’Italia; ScaDS.AI Dresden/Leipzig, Technische Universität Dresden; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models like GPT-4, Gemini, and Claude have transformed natural language processing (NLP) tasks such as question answering, dialogue generation, summarization, and so forth; yet their susceptibility to hallucination stands as one of the major challenges. Among numerous approaches to overcome this challenge, integration of Knowledge Graphs (KGs) into language models has emerged as a promising solution as it provides structured, reliable, domain-specific, and up-to-date external information to the language models. In this paper, we introduce ALIGNed-LLM, a simple yet effective approach to improve language models’ factuality via a lean strategy to infuse KGs into the latent space of language models, inspired by LLaVA, where visual and textual information is infused. We use embeddings from a pre-trained Knowledge Graph Embedding (KGE) model, such as TransE, and a trainable projection layer to align entity and text embeddings. This alignment enables the language model to distinguish between similar entities, improving factual grounding and reducing hallucination. We tested our approach on three popular question-answering benchmark datasets alongside language models of varying sizes, showing significant improvement. Furthermore, we applied our approach to a real-world financial use case from a large central bank in Europe, which demands high accuracy and precision, demonstrating a substantial improvement of the LLM answers.
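
As a rough illustration of the two ingredients named above, the Python sketch below shows the standard TransE idea (a true triple satisfies head + relation ≈ tail, scored by a distance) and a plain linear projection of an entity embedding into another vector space. It is only a sketch: the L2 norm, the tiny dimensions, and the `project` helper with its weight and bias are assumptions, and no training of the projection layer is shown.

```python
import math


def transe_distance(head, relation, tail):
    """TransE models a true triple as head + relation ≈ tail.

    Returns the L2 distance ||head + relation - tail||; smaller means the
    triple is considered more plausible.
    """
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def project(entity_embedding, weight, bias):
    """Linear map from KG-embedding space into a (hypothetical) LLM space.

    weight: list of rows, one per output dimension.
    bias: one value per output dimension.
    """
    return [
        sum(w * e for w, e in zip(row, entity_embedding)) + b
        for row, b in zip(weight, bias)
    ]
```

In a setup like the one the abstract describes, the projected entity vector would sit alongside the text token embeddings at the model input, so that two similarly named entities with different graph neighbourhoods receive different vectors.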

[NLP-46] Causal Language Control in Multilingual Transformers via Sparse Feature Steering

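[Quick read]: This paper addresses the difficulty of deterministically controlling the output language of large language models (LLMs) in zero-shot settings, without explicit language prompts or fine-tuning. The key to the solution is to use sparse autoencoder (SAE) features, which have been shown to correlate with interpretable model behaviors. By analyzing the activations of pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, the authors identify SAE features whose activations differ most between English and four target languages (Chinese, Japanese, Spanish, and French). Modifying the activation of a single SAE feature at one transformer layer switches the output language with up to 90% success while preserving semantic fidelity (measured by LaBSE sentence-embedding similarity). Language steering works best in mid-to-late transformer layers and is amplified by specific attention heads strongly associated with language-sensitive SAE features, showing the promise of sparse feature steering as a lightweight, interpretable mechanism for controllable multilingual generation.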

Link: https://arxiv.org/abs/2507.13410
Authors: Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien
Affiliations: University of Illinois Urbana-Champaign; University of Maryland, College Park; Barnard College; Algoverse AI Research; Meta FAIR Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
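
One common way to "modify a single SAE feature" is to shift the residual-stream vector along that feature's decoder direction until the feature's contribution matches a chosen target value. The Python sketch below shows that arithmetic on plain lists. The abstract does not specify the authors' exact intervention, so the ReLU encoder, the `steer` helper, and the target value are assumptions for illustration.

```python
def feature_activation(residual, enc_weight, enc_bias):
    """ReLU activation of one SAE feature on a residual-stream vector."""
    pre = sum(w * x for w, x in zip(enc_weight, residual)) + enc_bias
    return max(0.0, pre)


def steer(residual, enc_weight, enc_bias, dec_direction, target):
    """Shift the residual along one feature's decoder direction.

    The shift equals (target - current activation), so the feature's
    contribution to the reconstruction moves from its current value to
    `target`. All arguments describe a single, hypothetical SAE feature.
    """
    current = feature_activation(residual, enc_weight, enc_bias)
    shift = target - current
    return [x + shift * d for x, d in zip(residual, dec_direction)]
```

In a real model the edited vector would replace the residual stream at the chosen layer for every generated token, and the choice of layer matters, as the abstract's mid-to-late-layer finding indicates.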

[NLP-47] DyG-RAG: Dynamic Graph Retrieval-Augmented Generation with Event-Centric Reasoning

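[Quick read]: This paper addresses a limitation of existing Graph Retrieval-Augmented Generation (Graph RAG) methods on temporal reasoning tasks: they cannot effectively model the evolving structure and temporal order of real-world events. The core solution is DyG-RAG, an event-centric dynamic graph retrieval-augmented generation framework. Its key innovations are: (1) Dynamic Event Units (DEUs), which explicitly encode semantic content together with precise temporal anchors and remove the temporal ambiguity of traditional retrieval units; (2) an event graph built from shared entities and temporal proximity, which captures temporal and causal dependencies across events and supports efficient multi-hop reasoning; and (3) an event timeline retrieval pipeline with a Time Chain-of-Thought strategy, which keeps generated answers temporally consistent and interpretable and enables accurate answers to complex time-sensitive questions.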

Link: https://arxiv.org/abs/2507.13396
Authors: Qingyun Sun, Jiaqi Yuan, Shan He, Xiao Guan, Haonan Yuan, Xingcheng Fu, Jianxin Li, Philip S. Yu
Affiliations: Beihang University; Guangxi Normal University; University of Illinois Chicago
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Graph Retrieval-Augmented Generation has emerged as a powerful paradigm for grounding large language models with external structured knowledge. However, existing Graph RAG methods struggle with temporal reasoning, due to their inability to model the evolving structure and order of real-world events. In this work, we introduce DyG-RAG, a novel event-centric dynamic graph retrieval-augmented generation framework designed to capture and reason over temporal knowledge embedded in unstructured text. To eliminate temporal ambiguity in traditional retrieval units, DyG-RAG proposes Dynamic Event Units (DEUs) that explicitly encode both semantic content and precise temporal anchors, enabling accurate and interpretable time-aware retrieval. To capture temporal and causal dependencies across events, DyG-RAG constructs an event graph by linking DEUs that share entities and occur close in time, supporting efficient and meaningful multi-hop reasoning. To ensure temporally consistent generation, DyG-RAG introduces an event timeline retrieval pipeline that retrieves event sequences via time-aware traversal, and proposes a Time Chain-of-Thought strategy for temporally grounded answer generation. This unified pipeline enables DyG-RAG to retrieve coherent, temporally ordered event sequences and to answer complex, time-sensitive queries that standard RAG systems cannot resolve. Extensive experiments on temporal QA benchmarks demonstrate that DyG-RAG significantly improves the accuracy and recall of three typical types of temporal reasoning questions, paving the way for more faithful and temporal-aware generation. DyG-RAG is available at this https URL.
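
The event-graph construction described above (link event units that share entities and occur close in time, then read off a time-ordered sequence) can be sketched in Python as follows. This is a toy illustration, not DyG-RAG's implementation: the `Event` fields, the 30-day `max_gap_days` window, and the plain reachability traversal are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """A toy event unit: text plus a time anchor and the entities it mentions."""
    event_id: str
    text: str
    when: date
    entities: frozenset


def build_event_graph(events, max_gap_days=30):
    """Link events that share an entity and occur within max_gap_days."""
    edges = {e.event_id: set() for e in events}
    for a, b in combinations(events, 2):
        close = abs((a.when - b.when).days) <= max_gap_days
        if close and a.entities & b.entities:
            edges[a.event_id].add(b.event_id)
            edges[b.event_id].add(a.event_id)
    return edges


def timeline(events, edges, start_id):
    """Events reachable from start_id, returned in chronological order."""
    by_id = {e.event_id: e for e in events}
    seen, stack = {start_id}, [start_id]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted((by_id[i] for i in seen), key=lambda e: e.when)
```

A retrieved timeline like this could be serialized in date order into the prompt, which is the kind of input a time-aware chain-of-thought step would reason over.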

[NLP-48] Mitigating Stylistic Biases of Machine Translation Systems via Monolingual Corpora Only

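Quick read: This paper addresses the difficulty of style preservation in neural machine translation (NMT), namely that the stylistic features of the source text are hard to retain in cross-lingual translation. Existing approaches usually rely on parallel corpora for style preservation; this paper proposes Babel, a framework whose core innovation is improving stylistic fidelity in translation using only monolingual corpora. It has two components: a style detector based on contextual embeddings that identifies stylistic differences between source and target texts, and a diffusion-based style applicator that corrects stylistic inconsistencies without damaging semantic integrity. The framework plugs into existing NMT systems as a post-processing module, with no changes to model architecture and no parallel stylistic data. Experiments across five domains show a 150% improvement in stylistic preservation while maintaining a high semantic similarity score of 0.92.

Link: https://arxiv.org/abs/2507.13395
Authors: Xuanqi Gao, Weipeng Jiang, Juan Zhai, Shiqing Ma, Siyi Xie, Xinyang Yin, Chao Shen
Affiliations: Xi’an Jiaotong University; University of Massachusetts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: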

Abstract:The advent of neural machine translation (NMT) has revolutionized cross-lingual communication, yet preserving stylistic nuances remains a significant challenge. While existing approaches often require parallel corpora for style preservation, we introduce Babel, a novel framework that enhances stylistic fidelity in NMT using only monolingual corpora. Babel employs two key components: (1) a style detector based on contextual embeddings that identifies stylistic disparities between source and target texts, and (2) a diffusion-based style applicator that rectifies stylistic inconsistencies while maintaining semantic integrity. Our framework integrates with existing NMT systems as a post-processing module, enabling style-aware translation without requiring architectural modifications or parallel stylistic data. Extensive experiments on five diverse domains (law, literature, scientific writing, medicine, and educational content) demonstrate Babel’s effectiveness: it identifies stylistic inconsistencies with 88.21% precision and improves stylistic preservation by 150% while maintaining a high semantic similarity score of 0.92. Human evaluation confirms that translations refined by Babel better preserve source text style while maintaining fluency and adequacy.

[NLP-49] TopicImpact: Improving Customer Feedback Analysis with Opinion Units for Topic Modeling and Star-Rating Prediction

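Quick read: This paper addresses the difficulty traditional topic modeling has in capturing both semantic topics and sentiment when analyzing customer reviews, which limits the insights extracted and makes them hard to tie to concrete business metrics such as ratings. The key idea is to restructure the topic modeling pipeline so that it operates on opinion units rather than raw text fragments: distinct statements that pair a relevant text excerpt with an associated sentiment score. These units can be reliably extracted with large language models, which improves the coherence and interpretability of the resulting topics and allows topics and sentiment to be modeled jointly, so that customer concerns can be linked more precisely to business metrics such as star ratings.

Link: https://arxiv.org/abs/2507.13392
Authors: Emil Häglund, Johanna Björklund
Affiliations: Umeå University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: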

Abstract:We improve the extraction of insights from customer reviews by restructuring the topic modelling pipeline to operate on opinion units - distinct statements that include relevant text excerpts and associated sentiment scores. Prior work has demonstrated that such units can be reliably extracted using large language models. The result is a heightened performance of the subsequent topic modeling, leading to coherent and interpretable topics while also capturing the sentiment associated with each topic. By correlating the topics and sentiments with business metrics, such as star ratings, we can gain insights on how specific customer concerns impact business outcomes. We present our system’s implementation, use cases, and advantages over other topic modeling and classification solutions. We also evaluate its effectiveness in creating coherent topics and assess methods for integrating topic and sentiment modalities for accurate star-rating prediction.

[NLP-50] PARAM-1 BharatGen 2.9B Model

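Quick read: This paper addresses the heavy dependence of Large Language Models (LLMs) on English in training data, architecture design, and optimization paradigms, which leaves multilingual, multi-dialect regions such as India structurally under-represented. The key idea is to embed linguistic diversity from the pretraining stage: the model is trained on a bilingual corpus of only Hindi and English and follows three core principles: (1) a 25% corpus allocation for Indic-language content to ensure representation; (2) a SentencePiece tokenizer adapted to Indian morphological structures for tokenization fairness; (3) culturally aligned evaluation benchmarks covering IndicQA, code-mixed reasoning, and socio-linguistic robustness. The approach treats diversity as a primary design goal rather than a post-hoc alignment task, offering a blueprint for foundation modeling in the Indian context.

Link: https://arxiv.org/abs/2507.13390
Authors: Kundeshwar Pundalik, Piyush Sawarkar, Nihar Sahoo, Abhishek Shinde, Prateek Chanda, Vedant Goswami, Ajay Nagpal, Atul Singh, Viraj Thakur, Vijay Dewane, Aamod Thakur, Bhargav Patel, Smita Gautam, Bhagwan Panditi, Shyam Pawar, Madhav Kotcha, Suraj Racha, Saral Sureka, Pankaj Singh, Rishi Bal, Rohit Saluja, Ganesh Ramakrishnan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: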

Abstract:Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level-rather than deferring it to post-hoc alignment-PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.

[NLP-51] Context-Based Fake News Detection using Graph Based Approach: A COVID-19 Use-case

Quick read: This paper addresses the threat to information credibility posed by the rapid spread of fake news online. The key idea is a contextual graph-based detection approach: Natural Language Processing (NLP) techniques first convert news articles into graph structures that carry contextual information, and the Minimum Description Length (MDL)-based Graph-Based Anomaly Detection (GBAD) algorithm is then applied for graph mining to identify anomalous news items that deviate from normative patterns. By capturing complex contextual relationships, the approach aims to improve fake news detection.

Link: https://arxiv.org/abs/2507.13382
Authors: Chandrashekar Muniyappa, Sirisha Velampalli
Affiliations: Independent Researcher; UCEK, JNTU Kakinada; LTIMindTree
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: CSAIDE '25: Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy

Abstract:In today's digital world, fake news is spreading with immense speed. It's a significant concern to address. In this work, we addressed that challenge using a novel graph-based approach. We took a dataset from Kaggle that contains real and fake news articles. To test our approach, we incorporated recent COVID-19-related news articles that contain both genuine and fake news relevant to this problem. This further enhances the dataset instead of relying completely on the original dataset. We propose a contextual graph-based approach to detect fake news articles. We need to convert news articles into an appropriate schema, so we leverage Natural Language Processing (NLP) techniques to transform news articles into contextual graph structures. We then apply the Minimum Description Length (MDL)-based Graph-Based Anomaly Detection (GBAD) algorithm for graph mining. Graph-based methods are particularly effective for handling rich contextual data, as they enable the discovery of complex patterns that traditional query-based or statistical techniques might overlook. Our proposed approach identifies normative patterns within the dataset and subsequently uncovers anomalous patterns that deviate from these established norms.

[NLP-52] SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation KDD2025

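Quick read: This paper addresses the loss of structural information when Large Language Models (LLMs) process structured inputs such as graphs, caused by arbitrary linearization or by architectures incompatible with standard LLMs. The key idea is SAFT (Structure-Aware Fine-Tuning): direction-sensitive positional encodings are computed from the magnetic Laplacian of transformed Abstract Meaning Representations (AMRs) and projected into the embedding space of a pretrained LLM, injecting graph topology without architectural changes. The method improves AMR-to-text generation by 3.5 BLEU over baselines on AMR 3.0, with larger gains as graph complexity grows, which supports the value of structure-aware representations for LLMs.

Link: https://arxiv.org/abs/2507.13381
Authors: Rafiq Kamel, Filippo Guerranti, Simon Geisler, Stephan Günnemann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the KDD2025 Workshop on Structured Knowledge for LLMs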

Abstract:Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.
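For readers unfamiliar with the magnetic Laplacian mentioned in the abstract, the following is a minimal sketch of its standard textbook construction for a small directed graph. It is not the SAFT implementation: the potential `q = 0.25`, the toy graph, and the function names are assumptions, and the paper's AMR transformation and projection into the LLM embedding space are not reproduced. The point is only that edge direction ends up in the complex phase while the matrix stays Hermitian, so its eigenvectors (not computed here) can serve as direction-sensitive positional encodings.

```python
import cmath
import math


def magnetic_laplacian(n, directed_edges, q=0.25):
    """Return L = D_s - A_s * exp(i * 2*pi*q * (A - A^T)) as nested lists.

    A_s is the symmetrized adjacency (A + A^T) / 2 and D_s its degree matrix.
    The potential q is an assumed hyperparameter.
    """
    adj = [[0.0] * n for _ in range(n)]
    for u, v in directed_edges:
        adj[u][v] = 1.0
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = sum(0.5 * (adj[u][w] + adj[w][u]) for w in range(n))
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])
            phase = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])
            lap[u][v] = (degree if u == v else 0.0) - sym * cmath.exp(1j * phase)
    return lap


def is_hermitian(matrix, tol=1e-12):
    """Check that the matrix equals its own conjugate transpose."""
    n = len(matrix)
    return all(
        abs(matrix[u][v] - matrix[v][u].conjugate()) < tol
        for u in range(n)
        for v in range(n)
    )


# Toy directed path 0 -> 1 -> 2, invented for this example.
L = magnetic_laplacian(3, [(0, 1), (1, 2)])
```

With `q = 0.25`, the entry for edge 0 -> 1 is purely imaginary and its transpose entry is the complex conjugate, which is how the matrix distinguishes the two directions.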

[NLP-53] Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition

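Quick read: This paper addresses the scarcity of high-quality, diverse emotional datasets in emotion recognition. Emotional expression is subjective and shaped by individual personality traits, socio-cultural background, and situational factors, so large-scale, generalizable data collection is difficult both ethically and practically. The key idea is PersonaGen, a framework that uses a Large Language Model (LLM) with multi-stage persona-based conditioning to generate emotionally rich text. Its core innovation is constructing layered virtual personas that combine demographic attributes, socio-cultural backgrounds, and detailed situational contexts, which then guide the generation of emotional expressions. The resulting synthetic data is semantically diverse, human-like, and usable for downstream emotion classification.

Link: https://arxiv.org/abs/2507.13380
Authors: Keito Inoshita, Rushia Harada
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: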

Abstract:In the field of emotion recognition, the development of high-performance models remains a challenge due to the scarcity of high-quality, diverse emotional datasets. Emotional expressions are inherently subjective, shaped by individual personality traits, socio-cultural backgrounds, and contextual factors, making large-scale, generalizable data collection both ethically and practically difficult. To address this issue, we introduce PersonaGen, a novel framework for generating emotionally rich text using a Large Language Model (LLM) through multi-stage persona-based conditioning. PersonaGen constructs layered virtual personas by combining demographic attributes, socio-cultural backgrounds, and detailed situational contexts, which are then used to guide emotion expression generation. We conduct comprehensive evaluations of the generated synthetic data, assessing semantic diversity through clustering and distributional metrics, human-likeness via LLM-based quality scoring, realism through comparison with real-world emotion corpora, and practical utility in downstream emotion classification tasks. Experimental results show that PersonaGen significantly outperforms baseline methods in generating diverse, coherent, and discriminative emotion expressions, demonstrating its potential as a robust alternative for augmenting or replacing real-world emotional datasets.

[NLP-54] Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

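Quick read: This paper addresses the limited spatial reasoning ability of Vision-Language Models (VLMs), in particular their sensitivity to prompting strategy and their weak generalization. The solution has two parts. First, structured multi-stage Chain-of-Thought prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy compared with simple Chain-of-Thought (CoT) prompting. Second, reinforcement learning fine-tuning with Group Relative Policy Optimization (GRPO) is more robust and stable than Supervised Fine-Tuning (SFT) under out-of-distribution (OOD) conditions, mitigating the performance drop that SFT suffers when it overfits to surface-level linguistic patterns.

Link: https://arxiv.org/abs/2507.13362
Authors: Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu
Affiliations: Courant Institute of Mathematical Sciences; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, submitted to a conference (IEEE format). Authored by students from the Courant Institute, NYU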

Abstract:This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model’s original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from “closer to” to “farther from”). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: this https URL
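The "group relative" part of GRPO can be illustrated with a short sketch of how advantages are commonly computed: several answers are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is a generic illustration, not the authors' training code; the function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled answers to one spatial question, reward 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, no separate value network is needed. Correct answers receive positive advantages and incorrect ones negative advantages of equal magnitude in this toy case.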

[NLP-55] Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models ACL2025

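Quick read: This paper addresses the cybersecurity threat posed by increasingly sophisticated phishing attacks, which traditional detection methods struggle to keep up with. The key idea is few-shot Adaptive Linguistic Prompting (ALP), a structured semantic reasoning method that guides Large Language Models (LLMs) to analyze deceptive linguistic patterns, urgency cues, and manipulative diction in text, combined with multimodal fusion of textual, visual, and URL-based features to identify sophisticated phishing webpages. Experiments show that ALP significantly improves detection, reaching an F1-score of 0.93 and surpassing traditional approaches, which indicates its potential for building interpretable, adaptive, and robust phishing detection systems.

Link: https://arxiv.org/abs/2507.13357
Authors: Atharva Bhargude, Ishan Gonehal, Chandler Haney, Dave Yoon, Kevin Zhu, Aaron Sandoval, Sean O’Brien, Kaustubh Vinnakota
Affiliations: Algoverse AI Research
Categories: Computation and Language (cs.CL)
Comments: Published at ACL 2025 SRW, 9 pages, 3 figures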

Abstract:Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93, surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.

[NLP-56] Physical models realizing the transformer architecture of large language models

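Quick read: This paper addresses the lack of a theoretical, physical-level understanding of how the transformer architecture works, in particular the physical reason it models language dependencies so effectively. The key idea is to take the physical perspective of modern chips and construct physical models based on open quantum systems in Fock space, formalizing transformer-based Large Language Models (LLMs) as open quantum systems over the Hilbert space of tokens and thereby describing their underlying physical realization.

Link: https://arxiv.org/abs/2507.13354
Authors: Zeqian Chen
Affiliations: Wuhan Institute of Physics and Mathematics, IAPM, Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Mathematical Physics (math-ph)
Comments: 6 pages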

Abstract:The introduction of the transformer architecture in 2017 (cf. Vaswani et al., 2017) marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is, and why it works physically. In this paper, from a physical perspective on modern chips, we construct physical models in the Fock space over the Hilbert space of tokens realizing large language models based on a transformer architecture as open quantum systems. Our physical models underlie the transformer architecture for large language models.

Computer Vision

[CV-0] Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

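Quick read: This paper addresses the trade-off between performance and transparency in current Self-Supervised Learning (SSL) vision foundation models: existing methods depend on large closed-source datasets and opaque training pipelines, and their clustering step does not handle semantic ambiguity well, which biases the learned features. The key idea is a parameter-efficient multi-head clustering projector based on nested Matryoshka representations, which progressively refines features into increasingly fine-grained clusters without increasing model size, improving both performance and memory efficiency. The paper also introduces a positional disentanglement strategy that explicitly removes positional bias from dense features, which strengthens the encoding of semantic content and yields consistent gains on several downstream tasks.

Link: https://arxiv.org/abs/2507.14137
Authors: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
Affiliations: Valeo; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: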

Abstract:We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at this https URL.

[CV-1] Generative AI-Driven High-Fidelity Human Motion Simulation

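Quick read: This paper addresses the low motion fidelity of current Human Motion Simulation (HMS), which limits the accuracy of evaluating worker behavior, safety, and productivity in industrial tasks. The key idea is Generative-AI-Enabled HMS (G-AI-HMS), which combines text-to-text and text-to-motion models to improve simulation quality for physical tasks. First, Large Language Models (LLMs) aligned with MotionGPT's training vocabulary translate task descriptions into motion-aware language. Second, computer vision techniques such as posture estimation extract joint landmarks from real human movements, and motion similarity metrics compare the AI-enhanced motion sequences against the real ones to validate fidelity.

Link: https://arxiv.org/abs/2507.14097
Authors: Hari Iyer, Neel Macwan, Atharva Jitendra Hude, Heejin Jeong, Shenghan Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: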

Abstract:Human motion simulation (HMS) supports cost-effective evaluation of worker behavior, safety, and productivity in industrial tasks. However, existing methods often suffer from low motion fidelity. This study introduces Generative-AI-Enabled HMS (G-AI-HMS), which integrates text-to-text and text-to-motion models to enhance simulation quality for physical tasks. G-AI-HMS tackles two key challenges: (1) translating task descriptions into motion-aware language using Large Language Models aligned with MotionGPT’s training vocabulary, and (2) validating AI-enhanced motions against real human movements using computer vision. Posture estimation algorithms are applied to real-time videos to extract joint landmarks, and motion similarity metrics are used to compare them with AI-enhanced sequences. In a case study involving eight tasks, the AI-enhanced motions showed lower error than human created descriptions in most scenarios, performing better in six tasks based on spatial accuracy, four tasks based on alignment after pose normalization, and seven tasks based on overall temporal similarity. Statistical analysis showed that AI-enhanced prompts significantly (p < 0.0001) reduced joint error and temporal misalignment while retaining comparable posture accuracy.

[CV-2] C-DOG: Training-Free Multi-View Multi-Object Association in Dense Scenes Without Visual Feature via Connected δ-Overlap Graphs

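Quick read: This paper addresses multi-view multi-object association, that is, how to consistently associate object instances across multiple camera views in a 3D reconstruction pipeline, especially when visual features are unreliable (for example, objects look alike) or observations are noisy. The key idea is C-DOG, a training-free framework that combines connected delta-overlap graph modeling with epipolar geometry constraints: each 2D detection is a graph node, and edges are weighted by epipolar consistency. A delta-neighbor-overlap clustering step identifies strongly consistent groups, while Interquartile Range (IQR)-based filtering and a 3D back-projection error criterion add robustness against noisy and inconsistent observations. The result is stable, scalable multi-view association that does not depend on visual features.

Link: https://arxiv.org/abs/2507.14095
Authors: Yung-Hong Sun, Ting-Hung Lin, Jiangang Chen, Hongrui Jiang, Yu Hen Hu
Affiliations: University of Wisconsin - Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: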

Abstract:Multi-view multi-object association is a fundamental step in 3D reconstruction pipelines, enabling consistent grouping of object instances across multiple camera views. Existing methods often rely on appearance features or geometric constraints such as epipolar consistency. However, these approaches can fail when objects are visually indistinguishable or observations are corrupted by noise. We propose C-DOG, a training-free framework that serves as an intermediate module bridging object detection (or pose estimation) and 3D reconstruction, without relying on visual features. It combines connected delta-overlap graph modeling with epipolar geometry to robustly associate detections across views. Each 2D observation is represented as a graph node, with edges weighted by epipolar consistency. A delta-neighbor-overlap clustering step identifies strongly consistent groups while tolerating noise and partial connectivity. To further improve robustness, we incorporate Interquartile Range (IQR)-based filtering and a 3D back-projection error criterion to eliminate inconsistent observations. Extensive experiments on synthetic benchmarks demonstrate that C-DOG outperforms geometry-based baselines and remains robust under challenging conditions, including high object density, without visual features, and limited camera overlap, making it well-suited for scalable 3D reconstruction in real-world scenarios.
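The IQR-based filtering step named in the abstract can be sketched as follows. This is only the generic interquartile-range rule, not C-DOG's code: the function name, the conventional fence multiplier `k = 1.5`, and the toy reprojection errors are assumptions, and the abstract does not say which quantity the paper filters on or how the quartiles are estimated.

```python
from statistics import quantiles


def iqr_inliers(errors, k=1.5):
    """Return indices of values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(errors, n=4)  # quartile cut points of the data
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, e in enumerate(errors) if low <= e <= high]


# Hypothetical per-observation errors (in pixels) within one candidate group;
# the last observation is an obvious mismatch.
reprojection_errors = [0.8, 1.0, 1.1, 0.9, 1.2, 9.5]
kept = iqr_inliers(reprojection_errors)
```

The rule needs no training and no fixed error threshold, which fits the training-free setting: the cut-off adapts to the spread of each group.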

[CV-3] Multi-Centre Validation of a Deep Learning Model for Scoliosis Assessment

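Quick read: This paper addresses the time cost and inter-observer variability of manual Cobb angle measurement in patients with adolescent idiopathic scoliosis (AIS). The key idea is to validate a fully automated deep learning software (Carebot AI Bones, Spine Measurement functionality) in a retrospective multi-centre study on 103 standing anteroposterior whole-spine radiographs. The AI agreed closely with two musculoskeletal radiologists (mean absolute error of 3.89 to 3.90 degrees, Pearson correlation r of 0.88 to 0.91) and reached acceptable Cohen kappa values (0.51 to 0.64) on four-grade severity classification, indicating expert-level Cobb angle measurement and grading that could streamline scoliosis reporting and triage in clinical workflows.

Link: https://arxiv.org/abs/2507.14093
Authors: Šimon Kubov, Simon Klíčník, Jakub Dandár, Zdeněk Straka, Karolína Kvaková, Daniel Kvak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: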

Abstract:Scoliosis affects roughly 2 to 4 percent of adolescents, and treatment decisions depend on precise Cobb angle measurement. Manual assessment is time consuming and subject to inter observer variation. We conducted a retrospective, multi centre evaluation of a fully automated deep learning software (Carebot AI Bones, Spine Measurement functionality; Carebot s.r.o.) on 103 standing anteroposterior whole spine radiographs collected from ten hospitals. Two musculoskeletal radiologists independently measured each study and served as reference readers. Agreement between the AI and each radiologist was assessed with Bland Altman analysis, mean absolute error (MAE), root mean squared error (RMSE), Pearson correlation coefficient, and Cohen kappa for four grade severity classification. Against Radiologist 1 the AI achieved an MAE of 3.89 degrees (RMSE 4.77 degrees) with a bias of 0.70 degrees and limits of agreement from minus 8.59 to plus 9.99 degrees. Against Radiologist 2 the AI achieved an MAE of 3.90 degrees (RMSE 5.68 degrees) with a bias of 2.14 degrees and limits from minus 8.23 to plus 12.50 degrees. Pearson correlations were r equals 0.906 and r equals 0.880 (inter reader r equals 0.928), while Cohen kappa for severity grading reached 0.51 and 0.64 (inter reader kappa 0.59). These results demonstrate that the proposed software reproduces expert level Cobb angle measurements and categorical grading across multiple centres, suggesting its utility for streamlining scoliosis reporting and triage in clinical workflows.
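The agreement statistics reported in the abstract (MAE, RMSE, Bland-Altman bias and limits of agreement) are straightforward to compute from paired measurements. The sketch below shows the standard definitions, assuming the usual 1.96 standard-deviation limits; the function name and the four paired Cobb angles are made up for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def agreement_stats(ai_angles, reader_angles):
    """Compute MAE, RMSE, Bland-Altman bias, and 95% limits of agreement."""
    diffs = [a - r for a, r in zip(ai_angles, reader_angles)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return {
        "mae": mean([abs(d) for d in diffs]),
        "rmse": sqrt(mean([d * d for d in diffs])),
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
    }


# Invented Cobb angles in degrees: AI output vs. one reference reader.
ai = [12.0, 25.0, 40.0, 18.0]
reader = [10.0, 27.0, 37.0, 18.0]
stats = agreement_stats(ai, reader)
```

Bias captures systematic over- or under-measurement, while the limits of agreement describe how far an individual AI reading may plausibly fall from the reader's.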

[CV-4] Unmasking Performance Gaps: A Comparative Study of Human Anonymization and Its Effects on Video Anomaly Detection

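Quick read: This paper addresses the tension between anomaly detection in surveillance video and the protection of human privacy, that is, how to maintain or improve detection performance while keeping sensitive personal information from being exposed. The key idea is a systematic evaluation of how four common human anonymization techniques (blurring, masking, encryption, and avatar replacement) affect four representative anomaly detection methods (MGFN, UR-DMU, BN-WVAD, and PEL4VAD), revealing how each method responds to each kind of anonymization noise. The study finds that some techniques, such as encryption and masking, can even raise detection performance (AUC) for certain models because specific algorithmic components respond strongly to these noise patterns. This shows that algorithm design and privacy strategy are tightly coupled and provides a basis for trade-offs in future privacy-first anomaly detection systems.

Link: https://arxiv.org/abs/2507.14083
Authors: Sara Abdulaziz, Egor Bondarev
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACIVS 2025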

Abstract:Advancements in deep learning have improved anomaly detection in surveillance videos, yet they raise urgent privacy concerns due to the collection of sensitive human data. In this paper, we present a comprehensive analysis of anomaly detection performance under four human anonymization techniques, including blurring, masking, encryption, and avatar replacement, applied to the UCF-Crime dataset. We evaluate four anomaly detection methods, MGFN, UR-DMU, BN-WVAD, and PEL4VAD, on the anonymized UCF-Crime to reveal how each method responds to different obfuscation techniques. Experimental results demonstrate that anomaly detection remains viable under anonymized data and is dependent on the algorithmic design and the learning strategy. For instance, under certain anonymization patterns, such as encryption and masking, some models inadvertently achieve higher AUC performance compared to raw data, due to the strong responsiveness of their algorithmic components to these noise patterns. These results highlight the algorithm-specific sensitivities to anonymization and emphasize the trade-off between preserving privacy and maintaining detection utility. Furthermore, we compare these conventional anonymization techniques with the emerging privacy-by-design solutions, highlighting an often overlooked trade-off between robust privacy protection and utility flexibility. Through comprehensive experiments and analyses, this study provides a compelling benchmark and insights into balancing human privacy with the demands of anomaly detection.

[CV-5] VLA-Mark: A cross modal watermark for large vision-language alignment model

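Quick read: This paper addresses the conflict between intellectual property protection and multimodal coherence in Vision-Language Models (VLMs): existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantically critical concepts vulnerable. The key idea is VLA-Mark, a framework that balances watermark embedding against semantic fidelity through cross-modal coordination. Multiscale visual-textual alignment metrics (localized patch affinity, global semantic coherence, and contextual attention patterns) guide watermark injection without model retraining, and an entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show that the method maintains text-visual consistency while outperforming conventional methods, with 98.8% AUC detection and 96.1% resilience against attacks such as paraphrasing and synonym substitution.

Link: https://arxiv.org/abs/2507.14067
Authors: Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; University of Toronto; Ant Group, Alibaba; New York University Shanghai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: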

Abstract:Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% attack resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.

[CV-6] Foundation Models as Class-Incremental Learners for Dermatological Image Classification MICCAI

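Quick read: This paper addresses the forgetting of previously acquired knowledge in Class-Incremental Learning (CIL) as new classes are learned over time, with a focus on dermatological image classification. The key idea is to freeze the backbone of a pretrained Foundation Model (FM) and incrementally train only a lightweight multi-layer perceptron (MLP) for each task, which gives efficient and stable incremental learning without forgetting. The study also explores a zero-training scenario using nearest-mean classification with prototypes derived from the frozen embeddings, further showing that frozen FMs support continual learning without fine-tuning.

Link: https://arxiv.org/abs/2507.14050
Authors: Mohamed Elkhayat, Mohamed Mahmoud, Jamil Fayyad, Nourhan Bayasi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the MICCAI EMERGE 2025 workshop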

Abstract:Class-Incremental Learning (CIL) aims to learn new classes over time without forgetting previously acquired knowledge. The emergence of foundation models (FM) pretrained on large datasets presents new opportunities for CIL by offering rich, transferable representations. However, their potential for enabling incremental learning in dermatology remains largely unexplored. In this paper, we systematically evaluate frozen FMs pretrained on large-scale skin lesion datasets for CIL in dermatological disease classification. We propose a simple yet effective approach where the backbone remains frozen, and a lightweight MLP is trained incrementally for each task. This setup achieves state-of-the-art performance without forgetting, outperforming regularization, replay, and architecture based methods. To further explore the capabilities of frozen FMs, we examine zero training scenarios using nearest mean classifiers with prototypes derived from their embeddings. Through extensive ablation studies, we demonstrate that this prototype based variant can also achieve competitive results. Our findings highlight the strength of frozen FMs for continual learning in dermatology and support their broader adoption in real world medical applications. Our code and datasets are available here.
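The zero-training variant described in the abstract (nearest-mean classification over prototypes derived from frozen embeddings) can be sketched in a few lines. This is a generic illustration rather than the authors' code: the class name, the Euclidean distance, the two-dimensional vectors standing in for foundation-model embeddings, and the class labels are all assumptions.

```python
from math import dist


class NearestMeanClassifier:
    """Keeps one running-mean prototype per class; earlier classes are never updated."""

    def __init__(self):
        self._sums = {}
        self._counts = {}

    def add_task(self, embeddings, labels):
        """Accumulate per-class sums for a new task without touching other classes."""
        for vec, label in zip(embeddings, labels):
            if label not in self._sums:
                self._sums[label] = [0.0] * len(vec)
                self._counts[label] = 0
            self._sums[label] = [s + x for s, x in zip(self._sums[label], vec)]
            self._counts[label] += 1

    def prototypes(self):
        """Return the mean embedding of every class seen so far."""
        return {c: [s / self._counts[c] for s in self._sums[c]] for c in self._sums}

    def predict(self, vec):
        """Assign the class whose prototype is closest in Euclidean distance."""
        protos = self.prototypes()
        return min(protos, key=lambda c: dist(protos[c], vec))


clf = NearestMeanClassifier()
# Task 1: two classes (toy embeddings assumed to come from a frozen backbone).
clf.add_task([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]],
             ["nevus", "nevus", "melanoma", "melanoma"])
before = clf.predict([0.5, 0.5])
# Task 2: a new class arrives later; earlier prototypes stay unchanged.
clf.add_task([[-5.0, 5.0], [-5.0, 7.0]], ["keratosis", "keratosis"])
after = clf.predict([0.5, 0.5])
new_class = clf.predict([-4.0, 6.0])
```

Since the backbone is frozen and each prototype depends only on its own class's samples, adding a task cannot alter what was stored for earlier classes, which is why this setup avoids forgetting by construction.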

[CV-7] Training-free Token Reduction for Vision Mamba

【速读】：该论文旨在解决Vision Mamba在应用中效率不足的问题，特别是针对现有视觉任务中广泛使用的token reduction（令牌压缩）技术难以直接迁移至Mamba架构所带来的性能下降问题。其关键在于：由于Mamba是一种无注意力机制的序列模型，而传统ViT中的token reduction方法依赖于注意力机制来衡量token重要性并忽略压缩后token的顺序信息，因此直接套用会导致显著性能损失。为此，论文提出了一种结构感知的重要性评分机制（Mamba structure-aware importance score），用于更准确地评估token的重要性；在此基础上进一步设计了无需训练的MTR（Mamba Token Reduction）框架，通过简单有效的策略实现跨多种Mamba模型的即插即用式token压缩，在大幅降低计算量（如Vim-B骨干网络减少约40% FLOPs）的同时，仅带来极小的性能损失（ImageNet精度下降1.6%，无需重新训练）。

Link: https://arxiv.org/abs/2507.14042
Authors: Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao
Affiliations: 1. Tsinghua University; 2. Institute of Automation, Chinese Academy of Sciences; 3. Peking University; 4. Beijing Academy of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. However, token reduction, an effective compression technique in ViTs, has rarely been explored in Vision Mamba. Exploring Vision Mamba’s efficiency is essential for enabling broader applications. We find that directly applying existing token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention mechanisms for importance measurement and overlook the order of compressed tokens. In this paper, we investigate a Mamba structure-aware importance score to evaluate token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free Mamba Token Reduction framework. Without the need for training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component across various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.
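
The point about token order can be made concrete with a small sketch. It assumes that some per-token importance score is already available; the `scores` values below are placeholders and are not the paper's Mamba structure-aware score, and the function name `reduce_tokens` is hypothetical. The only idea shown is that the kept tokens are returned in their original sequence order, which matters for an order-sensitive sequence model.

```python
def reduce_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens while preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    num_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept positions so the sequence order seen by the
    # order-sensitive model is unchanged.
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept


tokens = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2]  # placeholder importance scores
kept_tokens, kept_idx = reduce_tokens(tokens, scores, keep_ratio=0.6)
print(kept_tokens, kept_idx)
```

A pruning routine that returned tokens in score order instead would scramble the scan order, which is the failure mode the abstract attributes to ViT-style methods.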

[CV-8] QuantEIT: Ultra-Lightweight Quantum-Assisted Inference for Chest Electrical Impedance Tomography

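[Quick read]: This paper addresses the limited image reconstruction accuracy in Electrical Impedance Tomography (EIT) caused by its ill-posed inverse problem, as well as the high model complexity, large parameter counts, and difficult deployment of existing deep learning (DL) methods. The key to the solution is an Ultra-Lightweight Quantum-Assisted Inference framework (QuantEIT). It introduces a Quantum-Assisted Network (QA-Net) in which parallel 2-qubit quantum circuits generate expressive latent representations that act as implicit nonlinear priors, and a single linear layer then reconstructs the conductivity. This greatly reduces model complexity and parameter count (the method uses only 0.2% of the parameters) and allows unsupervised inference without training data. The authors describe it as the first integration of quantum circuits into EIT image reconstruction, with improved reconstruction accuracy and robustness to noise.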

Link: https://arxiv.org/abs/2507.14031
Authors: Hao Fang, Sihao Teng, Hao Yu, Siyi Yuan, Huaiwu He, Zhe Liu, Yunjie Yang
Affiliations: The University of Edinburgh; Chinese Academy of Medical Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 10 pages, 12 figures


Abstract:Electrical Impedance Tomography (EIT) is a non-invasive, low-cost imaging modality with high temporal resolution, making it suitable for bedside monitoring. However, its inherently ill-posed inverse problem poses significant challenges for accurate image reconstruction. Deep learning (DL)-based approaches have shown promise but often rely on complex network architectures with a large number of parameters, limiting efficiency and scalability. Here, we propose an Ultra-Lightweight Quantum-Assisted Inference (QuantEIT) framework for EIT image reconstruction. QuantEIT leverages a Quantum-Assisted Network (QA-Net), combining parallel 2-qubit quantum circuits to generate expressive latent representations that serve as implicit nonlinear priors, followed by a single linear layer for conductivity reconstruction. This design drastically reduces model complexity and parameter count. Uniquely, QuantEIT operates in an unsupervised, training-data-free manner and represents the first integration of quantum circuits into EIT image reconstruction. Extensive experiments on simulated and real-world 2D and 3D EIT lung imaging data demonstrate that QuantEIT outperforms conventional methods, achieving comparable or superior reconstruction accuracy using only 0.2% of the parameters, with enhanced robustness to noise.
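
The overall shape of "parallel 2-qubit circuits followed by a single linear layer" can be sketched with a tiny classical state-vector simulation. Everything specific here is an assumption made for illustration: the gate choice (one RY rotation per qubit followed by a CNOT), the Z-expectation readout, and the function names are hypothetical, since the abstract does not describe the circuit ansatz or how the parameters are optimized.

```python
import math


def ry(theta):
    """2x2 matrix of a single-qubit RY rotation (real-valued)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def apply_single(state, gate, qubit):
    """Apply a 1-qubit gate to a 2-qubit state indexed as 2*q0 + q1."""
    new = [0.0] * 4
    for other in (0, 1):
        # Indices of the two basis states where `qubit` is 0 and 1.
        i0 = other if qubit == 0 else 2 * other
        i1 = i0 + (2 if qubit == 0 else 1)
        new[i0] = gate[0][0] * state[i0] + gate[0][1] * state[i1]
        new[i1] = gate[1][0] * state[i0] + gate[1][1] * state[i1]
    return new


def circuit_features(theta0, theta1):
    """RY on each qubit, then CNOT (control q0, target q1); return <Z0>, <Z1>."""
    state = [1.0, 0.0, 0.0, 0.0]              # start in |00>
    state = apply_single(state, ry(theta0), 0)
    state = apply_single(state, ry(theta1), 1)
    state[2], state[3] = state[3], state[2]   # CNOT swaps |10> and |11>
    p = [a * a for a in state]                # amplitudes stay real in this circuit
    return [p[0] + p[1] - p[2] - p[3], p[0] - p[1] + p[2] - p[3]]


def reconstruct(thetas, weights, bias):
    """Concatenate the features of parallel circuits, then apply one linear layer."""
    feats = [f for t0, t1 in thetas for f in circuit_features(t0, t1)]
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]


# Two parallel circuits give four latent features; the linear layer maps
# them to two (hypothetical) conductivity values.
thetas = [(0.0, 0.0), (math.pi, 0.0)]
weights = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
print(reconstruct(thetas, weights, bias=[0.1, 0.1]))
```

In this reading the trainable quantities are the rotation angles plus one weight matrix, which gives a feel for why the parameter count can be so small; the real model may differ in every detail.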

[CV-9] Moodifier: MLLM-Enhanced Emotion-Driven Image Editing

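[Quick read]: This paper addresses the difficulty of mapping abstract emotions precisely onto visual content in emotion-driven image editing, especially given that emotions manifest differently across contexts and are hard to model uniformly. The solution is an integrated three-part system. First, the MoodArchive dataset is created, containing more than 8 million images with hierarchical emotional annotations (generated by LLaVA and partially validated by human evaluators). Second, MoodifyCLIP, a vision-language model fine-tuned on MoodArchive, translates abstract emotions into specific visual attributes. Third, the training-free Moodifier editing framework uses MoodifyCLIP and multimodal large language models (MLLMs) to perform precise emotional transformations while preserving the structure and identity of the original content. The authors report better emotional accuracy and content preservation than existing methods across domains such as character expressions, fashion design, jewelry, and home décor.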

Link: https://arxiv.org/abs/2507.14024
Authors: Jiarong Ye, Sharon X. Huang
Affiliations: The Pennsylvania State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Bridging emotions and visual content for emotion-driven image editing holds great potential in creative industries, yet precise manipulation remains challenging due to the abstract nature of emotions and their varied manifestations across different contexts. We tackle this challenge with an integrated approach consisting of three complementary components. First, we introduce MoodArchive, an 8M+ image dataset with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. Second, we develop MoodifyCLIP, a vision-language model fine-tuned on MoodArchive to translate abstract emotions into specific visual attributes. Third, we propose Moodifier, a training-free editing model leveraging MoodifyCLIP and multimodal large language models (MLLMs) to enable precise emotional transformations while preserving content integrity. Our system works across diverse domains such as character expressions, fashion design, jewelry, and home décor, enabling creators to quickly visualize emotional variations while preserving identity and structure. Extensive experimental evaluations show that Moodifier outperforms existing methods in both emotional accuracy and content preservation, providing contextually appropriate edits. By linking abstract emotions to concrete visual changes, our solution unlocks new possibilities for emotional content creation in real-world applications. We will release the MoodArchive dataset, MoodifyCLIP model, and make the Moodifier code and demo publicly available upon acceptance.

[CV-10] Analysis of Plant Nutrient Deficiencies Using Multi-Spectral Imaging and Optimized Segmentation Model

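[Quick read]: This paper addresses the accurate detection of nutrient deficiency symptoms in plant leaves, a key step for early intervention in fertilization, disease, and stress management in precision agriculture. The core of the solution is a deep learning framework that combines multispectral imaging with an enhanced YOLOv5 model. A transformer-based attention head is added to better process nine-channel multispectral input and capture subtle, spatially distributed symptoms such as chlorosis and pigment accumulation. Experiments show an average improvement of about 12% in Dice score and Intersection over Union (IoU) over the baseline YOLOv5.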

Link: https://arxiv.org/abs/2507.14013
Authors: Ji-Yan Wu, Zheng Yong Poh, Anoop C. Patil, Bongsoo Park, Giovanni Volpe, Daisuke Urano
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Accurate detection of nutrient deficiency in plant leaves is essential for precision agriculture, enabling early intervention in fertilization, disease, and stress management. This study presents a deep learning framework for leaf anomaly segmentation using multispectral imaging and an enhanced YOLOv5 model with a transformer-based attention head. The model is tailored for processing nine-channel multispectral input and uses self-attention mechanisms to better capture subtle, spatially distributed symptoms. The plants in the experiments were grown under controlled nutrient stress conditions for evaluation. We benchmark the proposed model against the baseline YOLOv5 in extensive experiments. The results show that the proposed model significantly outperforms the baseline, with an average Dice score and IoU (Intersection over Union) improvement of about 12%. In particular, this model is effective in detecting challenging symptoms like chlorosis and pigment accumulation. These results highlight the promise of combining multispectral imaging with spectral-spatial feature learning for advancing plant phenotyping and precision agriculture.
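
For readers unfamiliar with the two segmentation metrics quoted above, the following minimal sketch computes the Dice score and IoU for a pair of binary masks. The function name `dice_and_iou` and the flat-list mask format are choices made for this example only; the paper's evaluation code is not shown in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    dice = 2 * intersection / (pred_area + target_area)
    return dice, intersection / union


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
dice, iou = dice_and_iou(pred_mask, true_mask)
print(dice, iou)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so a gain in one implies a gain in the other, although the size of the gain differs.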

[CV-11] Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations

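[Quick read]: This paper addresses accurate classification and segmentation of tunnel lining cracks, with the aim of improving the efficiency and accuracy of tunnel safety assessment. The solution is a two-step deep learning method. The first step builds an automatic image classification model with DenseNet-169 to select images that contain cracks. The second step segments the cracks with a model based on DeepLabV3+, whose internal logic is examined with a score-weighted visual explanation technique that makes the model more interpretable. The classification model reaches 92.23% accuracy at 39.80 frames per second (FPS), higher than other CNN- and Transformer-based models, and the segmentation model reaches an IoU of 57.01% and an F1 score of 67.44%, outperforming other state-of-the-art models. The method provides a basis for fast, quantitative assessment of tunnel health.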

Link: https://arxiv.org/abs/2507.14010
Authors: Yong Feng, Xiaolei Zhang, Shijin Feng, Yong Zhao, Yihan Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures, 3 tables


Abstract:Tunnel lining cracks are a crucial indicator of a tunnel’s safety status. Aiming to classify and segment tunnel cracks with enhanced accuracy and efficiency, this study proposes a two-step deep learning-based method. An automatic tunnel image classification model is developed using DenseNet-169 in the first step. The crack segmentation model proposed in the second step is based on DeepLabV3+, and its internal logic is evaluated via a score-weighted visual explanation technique. The proposed method combines tunnel image classification and segmentation, so that the images selected in the first step as containing cracks are segmented in the second step, improving detection accuracy and efficiency. The superior performance of the two-step method is validated by experiments. The results show that the accuracy and frames per second (FPS) of the tunnel crack classification model are 92.23% and 39.80, respectively, which are higher than those of other convolutional neural network (CNN)-based and Transformer-based models. Also, the intersection over union (IoU) and F1 score of the tunnel crack segmentation model are 57.01% and 67.44%, respectively, outperforming other state-of-the-art models. Moreover, the visual explanations provided in this study are conducive to understanding the “black box” of deep learning-based models. The developed two-step deep learning-based method integrating visual explanations provides a basis for fast and accurate quantitative assessment of tunnel health status.

[CV-12] DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation ECCV2024

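[Quick read]: This paper addresses the limited automation, poor 3D consistency, and weak fine-grained control of existing methods for generating high-quality, editable 3D scenes from natural language. The solution is DreamScene, an end-to-end framework with four core components. A GPT-4 agent first plans the scene, inferring object semantics and spatial constraints to build a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Next, Formation Pattern Sampling (FPS) generates object geometry quickly and realistically using multi-timestep sampling and reconstructive optimization. Finally, a progressive camera sampling strategy tailored to indoor and outdoor settings ensures global consistency. The system also supports fine-grained editing, including object movement, appearance changes, and 4D dynamic motion.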

Link: https://arxiv.org/abs/2507.13985
Authors: Haoran Li, Yuli Tian, Kun Lan, Yong Liao, Lin Wang, Pan Hui, Peng Yuan Zhou
Affiliations: University of Science and Technology of China; Nanyang Technological University; Hong Kong University of Science and Technology (Guangzhou); University of Helsinki; Aarhus University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended version of ECCV 2024 paper “DreamScene”


Abstract:Generating 3D scenes from natural language holds great promise for applications in gaming, film, and design. However, existing methods struggle with automation, 3D consistency, and fine-grained control. We present DreamScene, an end-to-end framework for high-quality and editable 3D scene generation from text or dialogue. DreamScene begins with a scene planning module, where a GPT-4 agent infers object semantics and spatial constraints to construct a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Based on this layout, Formation Pattern Sampling (FPS) generates object geometry using multi-timestep sampling and reconstructive optimization, enabling fast and realistic synthesis. To ensure global consistency, DreamScene employs a progressive camera sampling strategy tailored to both indoor and outdoor settings. Finally, the system supports fine-grained scene editing, including object movement, appearance changes, and 4D dynamic motion. Experiments demonstrate that DreamScene surpasses prior methods in quality, consistency, and flexibility, offering a practical solution for open-domain 3D content creation. Code and demos are available at this https URL.

[CV-13] CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models ICCV2025

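[Quick read]: This paper addresses content-style decomposition (CSD), the separation of content and style from a single image, which enables recontextualization of the content and reuse of the style and gives more creative flexibility in visual synthesis. Existing methods are mostly designed for diffusion models and do not carry over to the emerging Visual Autoregressive Modeling (VAR) framework. The authors propose CSD-VAR with three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective generation scales to enhance separation; (2) a rectification method based on singular value decomposition (SVD) that reduces leakage of content into the style representation; and (3) an Augmented Key-Value (K-V) memory that improves preservation of content identity. Experiments show that the method outperforms prior approaches in both content preservation and stylization fidelity.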

Link: https://arxiv.org/abs/2507.13984
Authors: Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, Khoi Nguyen
Affiliations: Qualcomm AI Research; MovianAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025


Abstract:Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored the decomposition of explicit content style, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory enhancing content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.

[CV-14] Evaluation of Human Visual Privacy Protection: A Three-Dimensional Framework and Benchmark Dataset ICCV’25

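[Quick read]: This paper addresses the privacy concerns raised by the collection and processing of sensitive personal data in AI-powered surveillance, and in particular the lack of a systematic framework for objectively evaluating visual privacy-protection methods. The solution is a comprehensive evaluation framework covering three dimensions (privacy, utility, and practicality), together with HR-VISPR, a publicly available human-centric dataset with biometric, soft-biometric, and non-biometric labels for training an interpretable privacy metric. The framework differentiates privacy levels in line with human visual perception and quantifies the trade-offs that different privacy-protection methods make among the three dimensions, giving a structured evaluation tool for diverse application contexts.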

Link: https://arxiv.org/abs/2507.13981
Authors: Sara Abdulaziz, Giacomo D’Amicantonio, Egor Bondarev
Affiliations: Eindhoven University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV’25 workshop CV4BIOM


Abstract:Recent advances in AI-powered surveillance have intensified concerns over the collection and processing of sensitive personal data. In response, research has increasingly focused on privacy-by-design solutions, raising the need for objective techniques to evaluate privacy protection. This paper presents a comprehensive framework for evaluating visual privacy-protection methods across three dimensions: privacy, utility, and practicality. In addition, it introduces HR-VISPR, a publicly available human-centric dataset with biometric, soft-biometric, and non-biometric labels to train an interpretable privacy metric. We evaluate 11 privacy protection methods, ranging from conventional techniques to advanced deep-learning methods, through the proposed framework. The framework differentiates privacy levels in alignment with human visual perception, while highlighting trade-offs between privacy, utility, and practicality. This study, along with the HR-VISPR dataset, serves as an insightful tool and offers a structured evaluation framework applicable across diverse contexts.

[CV-15] Cross-modal Causal Intervention for Alzheimer’s Disease Prediction

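[Quick read]: This paper addresses confounders in the early diagnosis of Alzheimer’s Disease (AD), which arise mainly from selection bias in multimodal data and complex relationships between variables. They often lead non-causal models to capture spurious correlations, which reduces diagnostic reliability. The solution is a visual-language causal intervention framework named ADPC (Alzheimer’s Disease Prediction with Cross-modal Causal Intervention). It uses a large language model (LLM) to turn clinical data into structured text, classifies participants from Magnetic Resonance Imaging (MRI), functional MRI (fMRI), and the LLM-generated text, and applies causal intervention to implicitly eliminate confounders such as neuroimaging artifacts and age-related biomarkers. This improves accuracy and robustness in distinguishing Cognitively Normal (CN), Mild Cognitive Impairment (MCI), and AD participants.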

Link: https://arxiv.org/abs/2507.13956
Authors: Yutao Jin, Haowen Xiao, Jielei Chu, Fengmao Lv, Yuxiao Li, Tianrui Li
Affiliations: Southwest Jiaotong University; Sichuan University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:


Abstract:Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer’s Disease (AD), where early identification and intervention can effectively slow the progression to dementia. However, diagnosing AD remains a significant challenge in neurology due to the confounders caused mainly by the selection bias of multimodal data and the complex relationships between variables. To address these issues, we propose a novel visual-language causal intervention framework named Alzheimer’s Disease Prediction with Cross-modal Causal Intervention (ADPC) for diagnostic assistance. Our ADPC employs a large language model (LLM) to summarize clinical data under strict templates, maintaining structured text outputs even with incomplete or unevenly distributed datasets. The ADPC model utilizes Magnetic Resonance Imaging (MRI), functional MRI (fMRI) images and textual data generated by the LLM to classify participants into Cognitively Normal (CN), MCI, and AD categories. Because of the presence of confounders, such as neuroimaging artifacts and age-related biomarkers, non-causal models are likely to capture spurious input-output correlations, generating less reliable results. Our framework implicitly eliminates confounders through causal intervention. Experimental results demonstrate the outstanding performance of our method in distinguishing CN/MCI/AD cases, achieving state-of-the-art (SOTA) results on most evaluation metrics. The study showcases the potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis.

[CV-16] Generalist Forecasting with Frozen Video Models via Latent Diffusion

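[Quick read]: This paper studies how well general-purpose systems can forecast the near future at different levels of abstraction, that is, how to predict future states of a video sequence accurately. The central challenges are using the perceptual ability of pretrained vision models to improve generalist forecasting and evaluating consistently across tasks and models. The solution is a generalist forecasting framework that works on any frozen vision backbone: latent diffusion models are trained to forecast future features in the frozen representation space, and lightweight, task-specific readouts then decode those features into task outputs. Using this framework, the paper finds a strong correlation between a vision model’s perceptual ability and its short-horizon forecasting performance, and argues for bridging representation learning and generative modeling.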

Link: https://arxiv.org/abs/2507.13942
Authors: Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar
Affiliations: Google DeepMind; TTIC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model’s perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models (including those trained generatively) and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.

[CV-17] DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

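[Quick read]: This paper addresses unsupervised disentanglement of static appearance and dynamic motion in video, which existing methods based on variational autoencoders (VAEs) and generative adversarial networks (GANs) struggle with because of information leakage and blurry reconstructions. The solution is DiViD, described as the first end-to-end video diffusion framework for explicit static-dynamic factorization, built on three core designs: (1) a sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code; (2) a conditional DDPM decoder incorporates three inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific; (3) an orthogonality regularizer prevents residual static-dynamic leakage. On real-world benchmarks the method outperforms state-of-the-art approaches in swap-based accuracy, static fidelity, and dynamic transfer.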

Link: https://arxiv.org/abs/2507.13934
Authors: Marzieh Gheisari, Auguste Genovesio
Affiliations: École Normale Supérieure PSL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD’s sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
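
One common way to express an orthogonality regularizer between a static code and per-frame dynamic codes is to penalize their squared cosine similarity. The sketch below shows that generic form only; the abstract does not give DiViD's exact regularizer, so the formula and the function names `cosine` and `orthogonality_penalty` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def orthogonality_penalty(static_token, dynamic_tokens):
    """Mean squared cosine similarity between the static token and each
    frame's dynamic token; 0 means the codes are fully orthogonal."""
    sims = [cosine(static_token, d) ** 2 for d in dynamic_tokens]
    return sum(sims) / len(sims)


static_token = [1.0, 0.0]
dynamic_tokens = [[0.0, 2.0], [3.0, 0.0]]  # frame 1 orthogonal, frame 2 aligned
print(orthogonality_penalty(static_token, dynamic_tokens))
```

Adding such a term to the training loss pushes the dynamic tokens away from the direction of the static token, which is the intuition behind preventing residual leakage between the two codes.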

[CV-18] TimeNeRF: Building Generalizable Neural Radiance Fields across Time from Few-Shot Input Views

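[Quick read]: This paper addresses the limited ability of existing Neural Radiance Fields (NeRF) methods to model 3D scenes over time, along with practical obstacles such as the cost of collecting multi-view data, the need to re-optimize for each new scene, and the lack of datasets dedicated to temporal 3D modeling. The solution is TimeNeRF, which combines multi-view stereo, neural radiance fields, and disentanglement strategies across datasets. From only a few input views it builds an implicit content radiance field that generalizes to arbitrary times, and it synthesizes novel views at any given time via volume rendering, capturing smooth and realistic changes in natural scenes across times of day (for example, from dawn to dusk).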

Link: https://arxiv.org/abs/2507.13929
Authors: Hsiang-Hui Hung, Huu-Phu Do, Yung-Hui Li, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University; Hon Hai Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by MM 2024


Abstract:We present TimeNeRF, a generalizable neural rendering approach for rendering novel views at arbitrary viewpoints and at arbitrary times, even with few input views. For real-world applications, it is expensive to collect multiple views and inefficient to re-optimize for unseen scenes. Moreover, as the digital realm, particularly the metaverse, strives for increasingly immersive experiences, the ability to model 3D environments that naturally transition between day and night becomes paramount. While current techniques based on Neural Radiance Fields (NeRF) have shown remarkable proficiency in synthesizing novel views, the exploration of NeRF’s potential for temporal 3D scene modeling remains limited, with no dedicated datasets available for this purpose. To this end, our approach harnesses the strengths of multi-view stereo, neural radiance fields, and disentanglement strategies across diverse datasets. This equips our model with the capability for generalizability in a few-shot setting, allows us to construct an implicit content radiance field for scene representation, and further enables the building of neural radiance fields at any arbitrary time. Finally, we synthesize novel views of that time via volume rendering. Experiments show that TimeNeRF can render novel views in a few-shot setting without per-scene optimization. Most notably, it excels in creating realistic novel views that transition smoothly across different times, adeptly capturing intricate natural scene changes from dawn to dusk.

[CV-19] Enhancing LiDAR Point Features with Foundation Model Priors for 3D Object Detection

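[Quick read]: This paper addresses the limited expressiveness of LiDAR point features, especially the weak discriminative power of the reflectance attribute, which limits 3D object detection performance. The key to the solution is to introduce depth priors predicted by DepthAnything from monocular RGB images and fuse them with the original LiDAR attributes to enrich each point’s representation. The authors further design a point-wise feature extraction module and a dual-path RoI feature extraction framework (a voxel-based branch for global semantic context and a point-based branch for fine-grained structural details), and integrate the complementary local and global features with a bidirectional gated RoI feature fusion module, which improves the accuracy of LiDAR-based 3D object detection.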

Link: https://arxiv.org/abs/2507.13899
Authors: Yujian Mo, Yan Wu, Junqiao Zhao, Jijun Wang, Yinghao Hu, Jun Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Recent advances in foundation models have opened up new possibilities for enhancing 3D perception. In particular, DepthAnything offers dense and reliable geometric priors from monocular RGB images, which can complement sparse LiDAR data in autonomous driving scenarios. However, such priors remain underutilized in LiDAR-based 3D object detection. In this paper, we address the limited expressiveness of raw LiDAR point features, especially the weak discriminative capability of the reflectance attribute, by introducing depth priors predicted by DepthAnything. These priors are fused with the original LiDAR attributes to enrich each point’s representation. To leverage the enhanced point features, we propose a point-wise feature extraction module. Then, a Dual-Path RoI feature extraction framework is employed, comprising a voxel-based branch for global semantic context and a point-based branch for fine-grained structural details. To effectively integrate the complementary RoI features, we introduce a bidirectional gated RoI feature fusion module that balances global and local cues. Extensive experiments on the KITTI benchmark show that our method consistently improves detection accuracy, demonstrating the value of incorporating visual foundation model priors into LiDAR-based 3D object detection.
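
The step of fusing depth priors with raw LiDAR attributes can be pictured as sampling a dense depth map at each point's image location and appending the value as an extra channel. The sketch below makes several assumptions that are not stated in the abstract: points are already expressed in the camera frame, a simple pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` is used, sampling is nearest-pixel, and fusion is plain concatenation. In practice the depth map would come from a monocular depth model; here it is a small hand-written grid.

```python
def augment_points_with_depth(points, depth_map, fx, fy, cx, cy):
    """Append a sampled image-depth prior to each LiDAR point.

    points: iterable of (x, y, z, reflectance) in the camera frame
            (x right, y down, z forward), an assumption of this sketch.
    depth_map: 2-D list indexed as depth_map[row][col].
    """
    height, width = len(depth_map), len(depth_map[0])
    augmented = []
    for x, y, z, reflectance in points:
        if z <= 0:
            continue  # behind the camera: no image evidence for this point
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            augmented.append((x, y, z, reflectance, depth_map[v][u]))
    return augmented


depth_map = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0]]
points = [
    (0.0, 0.0, 10.0, 0.3),    # projects to the image centre
    (10.0, 0.0, 10.0, 0.5),   # projects one pixel to the right
    (0.0, 0.0, -1.0, 0.1),    # behind the camera, skipped
    (100.0, 0.0, 10.0, 0.2),  # outside the image, skipped
]
print(augment_points_with_depth(points, depth_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0))
```

This sketch drops points that fall outside the image; a real pipeline might instead keep them with a default or masked depth channel so that the point cloud stays complete.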

[CV-20] PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations ICCV2025

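[Quick read]: This paper addresses a weakness of COLMAP-free 3D Gaussian Splatting (3D-GS): in scenes with complex camera trajectories (for example, drastic rotation and translation between adjacent views), camera pose estimation degrades and the joint optimization of poses and 3D-GS falls into local minima. The solution is PCR-GS, which obtains more robust pose estimation through camera pose co-regularization. On one side, feature reprojection regularization extracts view-robust DINO features from adjacent views and aligns their semantic information to constrain the poses. On the other, wavelet-based frequency regularization exploits discrepancies in high-frequency details to further optimize the rotation matrix of the camera poses. Together they improve 3D scene reconstruction under extreme camera motion.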

Link: https://arxiv.org/abs/2507.13891
Authors: Yu Wei, Jiahui Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu
Affiliations: Nanyang Technological University; Zhejiang University of Technology; UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025


Abstract:COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories characterized by drastic rotation and translation across adjacent camera views, leading to degraded estimation of camera poses and further local minima in joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization which exploits discrepancy in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
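
To give a feel for what a "discrepancy in high-frequency details" can look like, the sketch below applies a single-level 2-D Haar transform and compares the three detail sub-bands of two images. This is a generic illustration, not the paper's loss: the choice of the Haar wavelet, the single decomposition level, the mean-absolute-difference measure, and the function names are all assumptions.

```python
def haar_high_freq(image):
    """Single-level 2-D Haar transform; return the three detail sub-bands.

    image: 2-D list of numbers with even height and width.
    """
    lh, hl, hh = [], [], []
    for r in range(0, len(image), 2):
        row_lh, row_hl, row_hh = [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            row_lh.append((a + b - d - e) / 2)  # top vs bottom difference
            row_hl.append((a - b + d - e) / 2)  # left vs right difference
            row_hh.append((a - b - d + e) / 2)  # diagonal difference
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return lh, hl, hh


def high_freq_discrepancy(img_a, img_b):
    """Mean absolute difference between the detail sub-bands of two images."""
    total, count = 0.0, 0
    for band_a, band_b in zip(haar_high_freq(img_a), haar_high_freq(img_b)):
        for row_a, row_b in zip(band_a, band_b):
            for x, y in zip(row_a, row_b):
                total += abs(x - y)
                count += 1
    return total / count


checker = [[1.0, 0.0], [0.0, 1.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
print(high_freq_discrepancy(checker, flat))
```

Because the detail sub-bands ignore constant brightness offsets, a measure like this responds to misaligned edges and texture rather than to global intensity changes, which is a plausible reason for using it as a pose-sensitive signal.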

[CV-21] Real-Time Fusion of Visual and Chart Data for Enhanced Maritime Vision

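[Quick read]: This paper addresses inaccurate association between detected objects and chart information in dynamic, challenging maritime environments, in particular the precise matching of navigational aids (such as buoys) detected in a live video feed with their counterparts in nautical chart data. The key to the solution is a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, which allows image-domain detections to be matched directly with world-space chart markers. The method significantly outperforms a ray-casting baseline and a YOLOv7 network extended with a distance estimation module.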

Link: https://arxiv.org/abs/2507.13880
Authors: Marten Kreis, Benjamin Kiefer
Affiliations: University of Tuebingen; LOOKOUT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:This paper presents a novel approach to enhancing marine vision by fusing real-time visual data with chart information. Our system overlays nautical chart data onto live video feeds by accurately matching detected navigational aids, such as buoys, with their corresponding representations in chart data. To achieve robust association, we introduce a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, enabling the direct matching of image-domain detections with world-space chart markers. The proposed method is compared against baseline approaches, including a ray-casting model that estimates buoy positions via camera projection and a YOLOv7-based network extended with a distance estimation module. Experimental results on a dataset of real-world maritime scenes demonstrate that our approach significantly improves object localization and association accuracy in dynamic and challenging environments.
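
The ray-casting baseline mentioned in the abstract estimates a buoy's position from camera projection. A minimal version of that idea is to cast a ray through a pixel (for example, the bottom centre of a detected bounding box) and intersect it with the water surface. The sketch below assumes a level pinhole camera at a known height above a flat sea; the intrinsics, the height, and the function name `pixel_to_water_position` are hypothetical, and the paper's baseline may handle camera pitch, roll, and lens distortion differently.

```python
def pixel_to_water_position(u, v, fx, fy, cx, cy, camera_height):
    """Intersect the viewing ray through pixel (u, v) with a flat water plane.

    Assumes a level pinhole camera (x right, y down, z forward) mounted
    `camera_height` metres above the water. Returns (lateral, forward)
    offsets in metres, or None if the ray does not hit the water.
    """
    dir_x = (u - cx) / fx
    dir_y = (v - cy) / fy  # positive values point below the horizon
    if dir_y <= 0:
        return None        # at or above the horizon: no intersection
    scale = camera_height / dir_y  # ray parameter where the ray reaches the water
    return dir_x * scale, scale


# A buoy whose waterline appears 20 px below the horizon line, straight ahead.
print(pixel_to_water_position(u=320, v=260, fx=500.0, fy=500.0,
                              cx=320.0, cy=240.0, camera_height=2.0))
```

The estimate becomes very sensitive near the horizon, where a one-pixel error in `v` changes the range by many metres, which helps explain why a purely geometric baseline can struggle in rough conditions.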

[CV-22] Safety Certification in the Latent space using Control Barrier Functions and World Models

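[Quick read]: This paper addresses the reliance on large amounts of labelled safety-critical data when synthesising safe controllers from visual data, which is often impractical in real-world settings. The key to the solution is a semi-supervised framework that learns control barrier certificates (CBCs) in the latent space of a world model. A neural barrier function and a safe controller are trained jointly with limited labelled data, and the predictive power of modern vision transformers is used to model the latent dynamics, aiming for both data efficiency and safety.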

Link: https://arxiv.org/abs/2507.13871
Authors: Mehul Anand, Shishir Kolathaya
Affiliations: Indian Institute of Technology, Roorkee; Center for Cyber-Physical Systems, Indian Institute of Science (IISc), Bengaluru
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 6 pages, 6 figures. arXiv admin note: text overlap with arXiv:2409.12616


Abstract:Synthesising safe controllers from visual data typically requires extensive supervised labelling of safety-critical data, which is often impractical in real-world settings. Recent advances in world models enable reliable prediction in latent spaces, opening new avenues for scalable and data-efficient safe control. In this work, we introduce a semi-supervised framework that leverages control barrier certificates (CBCs) learned in the latent space of a world model to synthesise safe visuomotor policies. Our approach jointly learns a neural barrier function and a safe controller using limited labelled data, while exploiting the predictive power of modern vision transformers for latent dynamics modelling.
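
A barrier function is usually trained by penalizing violations of a few sign and decrease conditions. The sketch below shows one generic hinge-style loss for a discrete-time barrier evaluated on latent states. The sign convention (non-negative on safe states), the decay condition with rate `alpha`, the `margin`, and the function name `barrier_loss` are assumptions for illustration; the abstract does not specify the certificate conditions used in the paper.

```python
def barrier_loss(b_safe, b_unsafe, b_now, b_next, alpha=0.1, margin=0.0):
    """Hinge-style loss for a discrete-time barrier function B.

    Convention assumed here: B(z) >= 0 on labelled safe states, B(z) <= 0 on
    labelled unsafe states, and along predicted latent transitions
    B(z_next) - B(z_now) >= -alpha * B(z_now).
    Inputs are lists of already-evaluated barrier values.
    """
    def hinge(x):
        return max(0.0, x)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    safe_term = mean([hinge(margin - b) for b in b_safe])
    unsafe_term = mean([hinge(margin + b) for b in b_unsafe])
    decay_term = mean([
        hinge(margin - (nxt - (1.0 - alpha) * now))
        for now, nxt in zip(b_now, b_next)
    ])
    return safe_term + unsafe_term + decay_term


# All three conditions hold, so the loss is zero.
print(barrier_loss(b_safe=[1.0], b_unsafe=[-1.0], b_now=[1.0], b_next=[0.95]))
# Each condition is violated, so every term contributes.
print(barrier_loss(b_safe=[-0.5], b_unsafe=[0.25], b_now=[1.0], b_next=[0.5]))
```

In a semi-supervised reading, only the first two terms need safety labels, while the transition term can be evaluated on unlabelled rollouts predicted by the world model; a positive `margin` makes the separation between safe and unsafe states strict.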

[CV-23] When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

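[Quick read]: This paper addresses knowledge conflicts in Vision-Language Models (VLMs) that arise when internal parametric knowledge is combined with external information, which often lead to hallucinations and unreliable outputs. The key to the solution is to identify and manipulate the specific attention heads that govern cross-modal conflicts. Using a dataset of multimodal counterfactual queries that deliberately contradict commonsense knowledge, the authors localize, via logit inspection, a small set of heads that control the conflict. Modifying these heads steers the model toward either its internal knowledge or the visual input. The attention of these heads also pinpoints the image regions that drive visual overrides, with higher precision than gradient-based attribution.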

Link: https://arxiv.org/abs/2507.13868
Authors: Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:Vision-language models (VLMs) increasingly leverage diverse knowledge sources to address complex tasks, often encountering conflicts between their internal parametric knowledge and external information. Knowledge conflicts can result in hallucinations and unreliable responses, but the mechanisms governing such interactions remain unknown. To address this gap, we analyze the mechanisms that VLMs use to resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. We localize with logit inspection a small set of heads that control the conflict. Moreover, by modifying these heads, we can steer the model towards its internal knowledge or the visual inputs. Finally, we show that attention from such heads pinpoints localized image regions driving visual overrides, outperforming gradient-based attribution in precision.

[CV-24] PositionIC: Unified Position and Identity Consistency for Image Customization

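[Quick Read]: This paper addresses the difficulty of fine-grained, entity-level spatial control in current subject-driven image customization, a limitation that stems mainly from the lack of scalable datasets binding identity to precise positional cues. The key is the PositionIC framework. A scalable data synthesis pipeline built on a bidirectional generation paradigm eliminates subject drift and preserves semantic coherence, and a lightweight positional modulation layer decouples the spatial embeddings of different subjects, enabling independent, accurate placement while preserving visual fidelity.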

Link: https://arxiv.org/abs/2507.13861
Authors: Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Hao Dou, Song Yang, Xianhua He, Jianhui Zhang, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent subject-driven image customization has achieved significant advancements in fidelity, yet fine-grained entity-level spatial control remains elusive, hindering broader real-world application. This limitation is mainly attributed to the absence of scalable datasets that bind identity with precise positional cues. To this end, we introduce PositionIC, a unified framework that enforces position and identity consistency for multi-subject customization. We construct a scalable synthesis pipeline that employs a bidirectional generation paradigm to eliminate subject drift and maintain semantic coherence. On top of these data, we design a lightweight positional modulation layer that decouples spatial embeddings among subjects, enabling independent, accurate placement while preserving visual fidelity. Extensive experiments demonstrate that our approach can achieve precise spatial control while maintaining high consistency in image customization tasks. PositionIC paves the way for controllable, high-fidelity image customization in open-world, multi-entity scenarios and will be released to foster further research.

[CV-25] Depth3DLane: Fusing Monocular 3D Lane Detection with Self-Supervised Monocular Depth Estimation

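[Quick Read]: This paper addresses the challenges that the lack of explicit spatial information poses for monocular 3D lane detection: existing methods depend on expensive depth sensors or on large amounts of ground-truth depth data, and they assume known camera parameters, which limits their use in practical settings such as crowdsourced high-definition (HD) lane mapping. The key is Depth3DLane, a dual-pathway framework. The front view pathway extracts rich semantic information, while the bird's-eye view pathway obtains a point cloud representation through self-supervised monocular depth estimation to provide explicit spatial structure. 3D lane anchors sample features from both pathways to infer accurate 3D lane geometry. The framework is further extended to predict camera parameters per frame, with a theoretically motivated fitting procedure that improves stability per segment, so 3D lanes can be detected without extra depth annotations or known camera parameters.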

Link: https://arxiv.org/abs/2507.13857
Authors: Max van den Hoven, Kishaan Jeeveswaran, Pieter Piscaer, Thijs Wensveen, Elahe Arani, Bahram Zonooz
Affiliations: Eindhoven University of Technology; TNO
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Monocular 3D lane detection is essential for autonomous driving, but challenging due to the inherent lack of explicit spatial information. Multi-modal approaches rely on expensive depth sensors, while methods incorporating fully-supervised depth networks rely on ground-truth depth data that is impractical to collect at scale. Additionally, existing methods assume that camera parameters are available, limiting their applicability in scenarios like crowdsourced high-definition (HD) lane mapping. To address these limitations, we propose Depth3DLane, a novel dual-pathway framework that integrates self-supervised monocular depth estimation to provide explicit structural information, without the need for expensive sensors or additional ground-truth depth data. Leveraging a self-supervised depth network to obtain a point cloud representation of the scene, our bird’s-eye view pathway extracts explicit spatial information, while our front view pathway simultaneously extracts rich semantic information. Depth3DLane then uses 3D lane anchors to sample features from both pathways and infer accurate 3D lane geometry. Furthermore, we extend the framework to predict camera parameters on a per-frame basis and introduce a theoretically motivated fitting procedure to enhance stability on a per-segment basis. Extensive experiments demonstrate that Depth3DLane achieves competitive performance on the OpenLane benchmark dataset. Furthermore, experimental results show that using learned parameters instead of ground-truth parameters allows Depth3DLane to be applied in scenarios where camera calibration is infeasible, unlike previous methods.
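Turning a predicted depth map into a point cloud, and then into a bird's-eye view, can be sketched with the standard pinhole camera model. This is an illustrative sketch only: the intrinsics `fx`, `fy`, `cx`, `cy`, the cell size, and both function names are assumptions, not details taken from the paper.

```python
import math


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into camera-frame 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels without a valid depth estimate
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def to_bev_cells(points, cell_size=0.5):
    """Count points per bird's-eye-view cell, keyed by (lateral, forward) index."""
    cells = {}
    for x, _, z in points:
        key = (math.floor(x / cell_size), math.floor(z / cell_size))
        cells[key] = cells.get(key, 0) + 1
    return cells


depth = [[2.0, 0.0], [4.0, 2.0]]  # toy 2x2 depth map in metres
pts = unproject_depth(depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(pts)
print(to_bev_cells(pts, cell_size=1.0))
```

When camera parameters are predicted per frame, as the abstract describes, the same back-projection would simply consume the predicted intrinsics instead of calibrated ones.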

[CV-26] A Quantum-assisted Attention U-Net for Building Segmentation over Tunis using Sentinel-1 Data

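[Quick Read]: This paper addresses building segmentation in dense urban areas from satellite imagery, where large image sizes and complex structures make conventional deep learning models computationally expensive and parameter-heavy. The key is a quantum convolution (Quanvolution) pre-processing module that strengthens the feature extraction of an Attention U-Net on Sentinel-1 Synthetic Aperture Radar (SAR) imagery, producing feature maps that carry more structural information. Preliminary results show test accuracy comparable to the standard Attention U-Net with significantly fewer network parameters, supporting the feasibility of quantum-assisted deep learning for large-scale urban building segmentation.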

Link: https://arxiv.org/abs/2507.13852
Authors: Luigi Russo, Francesco Mauro, Babak Memar, Alessandro Sebastianelli, Silvia Liberata Ullo, Paolo Gamba
Affiliations: University of Pavia; University of Sannio; Sapienza University of Rome; European Space Agency
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at IEEE Joint Urban Remote Sensing Event (JURSE) 2025

Abstract:Building segmentation in urban areas is essential in fields such as urban planning, disaster response, and population mapping. Yet accurately segmenting buildings in dense urban regions presents challenges due to the large size and high resolution of satellite images. This study investigates the use of Quanvolutional pre-processing to enhance the capability of the Attention U-Net model in building segmentation. Specifically, this paper focuses on the urban landscape of Tunis, utilizing Sentinel-1 Synthetic Aperture Radar (SAR) imagery. In this work, Quanvolution was used to extract more informative feature maps that capture essential structural details in radar imagery, proving beneficial for accurate building segmentation. Preliminary results indicate that the proposed methodology achieves comparable test accuracy to the standard Attention U-Net model while significantly reducing network parameters. This result aligns with findings from previous works, confirming that Quanvolution not only maintains model accuracy but also increases computational efficiency. These promising outcomes highlight the potential of quantum-assisted Deep Learning frameworks for large-scale building segmentation in urban environments.

[CV-27] Team of One: Cracking Complex Video QA with Model Synergy

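[Quick Read]: This paper addresses common limitations of current Video-Large Multimodal Models (Video-LMMs) on open-ended video question answering: limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. The key is a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) through structured chains of thought, each tailored to a distinct reasoning pathway, with an external Large Language Model (LLM) acting as evaluator and integrator that selects and fuses the most reliable responses. The method improves the depth and robustness of multimodal reasoning without retraining any model, and it is lightweight and extensible.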

Link: https://arxiv.org/abs/2507.13820
Authors: Jun Xie, Zhaoran Zhao, Xiongjun Guan, Yingjian Zhu, Hongzhu Yi, Xinming Wang, Feng Chen, Zhepeng Wang
Affiliations: Lenovo Research; Tsinghua University; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); Institute of Automation, Chinese Academy of Sciences (CAS); Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.

[CV-28] SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing ICCV25

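[Quick Read]: This paper addresses two problems in multi-modal remote sensing foundation models (MM-RSFMs): training a separate backbone for each data modality leads to redundancy and inefficient parameter use, and prevalent pre-training methods borrow self-supervised learning (SSL) strategies from natural images without accounting for the complicated semantic distribution of remote sensing images. The key is SkySense V2, an MM-RSFM built on a single unified transformer backbone and pre-trained with a novel SSL strategy tailored to remote sensing data. An adaptive patch merging module and learnable modality prompt tokens handle the varying resolutions and limited feature diversity across modalities, and a mixture of experts (MoE) module further improves performance.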

Link: https://arxiv.org/abs/2507.13812
Authors: Yingying Zhang, Lixiang Ru, Kang Wu, Lei Yu, Lei Liang, Yansheng Li, Jingdong Chen
Affiliations: Ant Group; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV25

Abstract:The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches generally require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.

[CV-29] GRAM-MAMBA: Holistic Feature Alignment for Wireless Perception with Adaptive Low-Rank Compensation

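[Quick Read]: This paper addresses three challenges for multi-modal fusion in Internet of Things (IoT) perception: high model complexity that hinders deployment in resource-constrained environments, unidirectional modality alignment that neglects inter-modal relationships, and reduced robustness when sensor data is missing. The key is the GRAM-MAMBA framework, which uses the linear-complexity Mamba model to process sensor time series efficiently and an optimized GRAM matrix strategy for pairwise alignment among modalities, overcoming the limits of traditional single-modality alignment. Inspired by Low-Rank Adaptation (LoRA), an adaptive low-rank layer compensation strategy keeps the core parameters of the pre-trained model frozen and fine-tunes only the small set of parameters related to the available modalities and the fusion process, which markedly improves adaptation and performance when modalities are missing.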

Link: https://arxiv.org/abs/2507.13803
Authors: Weiqi Yang, Xu Zhou, Jingfu Guan, Hao Du, Tianyu Bai
Affiliations: Hefei University of Technology; Sun Yat-sen University; Jilin University; Durham University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-modal fusion is crucial for Internet of Things (IoT) perception, widely deployed in smart homes, intelligent transport, industrial automation, and healthcare. However, existing systems often face challenges: high model complexity hinders deployment in resource-constrained environments, unidirectional modal alignment neglects inter-modal relationships, and robustness suffers when sensor data is missing. These issues impede efficient and robust multimodal perception in real-world IoT settings. To overcome these limitations, we propose GRAM-MAMBA. This framework utilizes the linear-complexity Mamba model for efficient sensor time-series processing, combined with an optimized GRAM matrix strategy for pairwise alignment among modalities, addressing the shortcomings of traditional single-modality alignment. Inspired by Low-Rank Adaptation (LoRA), we introduce an adaptive low-rank layer compensation strategy to handle missing modalities post-training. This strategy freezes the pre-trained model core and irrelevant adaptive layers, fine-tuning only those related to available modalities and the fusion process. Extensive experiments validate GRAM-MAMBA’s effectiveness. On the SPAWC2021 indoor positioning dataset, the pre-trained model shows lower error than baselines; adapting to missing modalities yields a 24.5% performance boost by training less than 0.2% of parameters. On the USC-HAD human activity recognition dataset, it achieves 93.55% F1 and 93.81% Overall Accuracy (OA), outperforming prior work; the update strategy increases F1 by 23% while training less than 0.3% of parameters. These results highlight GRAM-MAMBA’s potential for achieving efficient and robust multimodal perception in resource-constrained environments.
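The low-rank idea the compensation strategy borrows from LoRA can be sketched as follows. This shows generic LoRA on a single linear layer, not the paper's adaptive layer; the function names, `alpha`, and the toy shapes are assumptions for illustration.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w_frozen, a, b, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(a)  # rank of the update = number of rows of A
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def trainable_fraction(d_out, d_in, r):
    """Share of parameters that are trainable when only A (r x d_in) and B (d_out x r) update."""
    low_rank = r * (d_in + d_out)
    return low_rank / (d_out * d_in + low_rank)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weights
a = [[1.0, 1.0]]               # rank-1 down-projection
b_zero = [[0.0], [0.0]]        # B starts at zero, so the layer is unchanged
print(lora_forward(w, a, b_zero, [1.0, 2.0]))
print(trainable_fraction(1000, 1000, 1))
```

Initialising `B` to zero keeps the adapted layer identical to the frozen one before fine-tuning, and the parameter count shows why a low-rank update touches only a small share of the weights.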

[CV-30] Food safety trends across Europe: insights from the 392-million-entry CompreHensive European Food Safety (CHEFS) database

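[Quick Read]: This paper addresses the poor accessibility and difficulty of analysis caused by European Union food safety monitoring data being scattered across about 1000 files totaling several hundred gigabytes. The key is the CompreHensive European Food Safety (CHEFS) database, which consolidates European Food Safety Authority (EFSA) monitoring data on pesticide residues, veterinary medicinal product residues, and chemical contaminants into a unified, structured dataset. This improves the usability of the data and the efficiency of analysis, and provides an evidence base for food safety policy, research, and regulation.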

Link: https://arxiv.org/abs/2507.13802
Authors: Nehir Kizililsoley, Floor van Meer, Osman Mutlu, Wouter F Hoenderdaal, Rosan G. Hobé, Wenjuan Mu, Arjen Gerssen, H.J. van der Fels-Klerx, Ákos Jóźwiak, Ioannis Manikas, Ali Hürriyetoǧlu, Bas H.M. van der Velden
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the European Union, official food safety monitoring data collected by member states are submitted to the European Food Safety Authority (EFSA) and published on Zenodo. This data includes 392 million analytical results derived from over 15.2 million samples covering more than 4,000 different types of food products, offering great opportunities for artificial intelligence to analyze trends, predict hazards, and support early warning systems. However, the current format with data distributed across approximately 1000 files totaling several hundred gigabytes hinders accessibility and analysis. To address this, we introduce the CompreHensive European Food Safety (CHEFS) database, which consolidates EFSA monitoring data on pesticide residues, veterinary medicinal product residues, and chemical contaminants into a unified and structured dataset. We describe the creation and structure of the CHEFS database and demonstrate its potential by analyzing trends in European food safety monitoring data from 2000 to 2024. Our analyses explore changes in monitoring activities, the most frequently tested products, which products were most often non-compliant and which contaminants were most often found, and differences across countries. These findings highlight the CHEFS database as both a centralized data source and a strategic tool for guiding food safety policy, research, and regulation.

[CV-31] One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion

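[Quick Read]: This paper addresses the limited perceptual range of monocular 3D Semantic Scene Completion (SSC) in real-world traffic scenes, where occlusion and a restricted field of view leave much of the scene unobserved. Existing methods handle occluded regions and areas outside the camera's view poorly, which hurts the completeness and accuracy of the reconstruction. The key is Creating the Future SSC (CF-SSC), a temporal SSC framework that expands the model's effective perceptual range through pseudo-future frame prediction. Its core innovation is to combine poses and depths to establish accurate 3D correspondences, enabling geometrically consistent fusion of past, present, and predicted future frames in 3D space. By modeling spatial-temporal relationships explicitly rather than simply stacking features, it improves occlusion reasoning and 3D scene completion accuracy.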

Link: https://arxiv.org/abs/2507.13801
Authors: Haoang Lu, Yuanqi Su, Xiaoning Zhang, Hao Hu
Affiliations: Xi’an Jiaotong University; China Academy of Railway Sciences Corporation; National Railway Intelligent Transportation System Engineering and Technology Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, visual 3D Semantic Scene Completion (SSC) has emerged as a critical perception task for autonomous driving due to its ability to infer complete 3D scene layouts and semantics from single 2D images. However, in real-world traffic scenarios, a significant portion of the scene remains occluded or outside the camera’s field of view, a fundamental challenge that existing monocular SSC methods fail to address adequately. To overcome these limitations, we propose Creating the Future SSC (CF-SSC), a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model’s effective perceptual range. Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically-consistent fusion of past, present, and predicted future frames in 3D space. Unlike conventional methods that rely on simple feature stacking, our 3D-aware architecture achieves more robust scene completion by explicitly modeling spatial-temporal relationships. Comprehensive experiments on SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate state-of-the-art performance, validating the effectiveness of our approach and highlighting its ability to improve occlusion reasoning and 3D scene completion accuracy.

[CV-32] DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance ICCV2025

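[Quick Read]: This paper addresses the difficulty of balancing fidelity and detail quality in blind face restoration, where the degradation is unknown. Existing methods typically use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation; combined with imperfect degradation kernel estimation, this often causes under- or over-diffusion, which damages detail and identity consistency. The key is DynFaceRestore. It first learns to map any blindly degraded input to a Gaussian-blurred image with a corresponding Gaussian kernel, then dynamically selects the starting diffusion timestep for each blurry image and applies closed-form guidance during sampling to maintain fidelity. A dynamic guidance scaling adjuster adapts the guidance strength across local regions, enhancing detail generation in complex areas while preserving contour structure.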

Link: https://arxiv.org/abs/2507.13797
Authors: Huu-Phu Do, Yu-Wei Chen, Yi-Cheng Liao, Chi-Wei Hsiao, Han-Yang Wang, Wei-Chen Chiu, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University; MediaTek Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:Blind Face Restoration aims to recover high-fidelity, detail-rich facial images from unknown degraded inputs, presenting significant challenges in preserving both identity and detail. Pre-trained diffusion models have been increasingly used as image priors to generate fine details. Still, existing methods often use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation. This limitation and potentially imperfect degradation kernel estimation frequently lead to under- or over-diffusion, resulting in an imbalance between fidelity and quality. We propose DynFaceRestore, a novel blind face restoration approach that learns to map any blindly degraded input to Gaussian blurry images. By leveraging these blurry images and their respective Gaussian kernels, we dynamically select the starting timesteps for each blurry image and apply closed-form guidance during the diffusion sampling process to maintain fidelity. Additionally, we introduce a dynamic guidance scaling adjuster that modulates the guidance strength across local regions, enhancing detail generation in complex areas while preserving structural fidelity in contours. This strategy effectively balances the trade-off between fidelity and quality. DynFaceRestore achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating robustness and effectiveness in blind face restoration.

[CV-33] Localized FNO for Spatiotemporal Hemodynamic Upsampling in Aneurysm MRI

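[Quick Read]: This paper addresses the limits that the low spatiotemporal resolution and poor signal-to-noise ratio of magnetic resonance flow imaging place on hemodynamic analysis for predicting aneurysm rupture and guiding treatment. The core solution is the Localized Fourier Neural Operator (LoFNO), a new 3D neural operator architecture. It uses Laplacian eigenvectors as geometric priors to improve structural awareness on irregular, unseen geometries, and an Enhanced Deep Super-Resolution Network (EDSR) layer for robust upsampling. LoFNO de-noises the flow data while raising its spatial and temporal resolution and predicts wall shear stress (WSS) directly, outperforming interpolation and alternative deep learning methods and enabling more precise cerebrovascular diagnostics.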

Link: https://arxiv.org/abs/2507.13789
Authors: Kyriakos Flouris, Moritz Halter, Yolanne Y. R. Lee, Samuel Castonguay, Luuk Jacobs, Pietro Dirix, Jonathan Nestmann, Sebastian Kozerke, Ender Konukoglu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:

Abstract:Hemodynamic analysis is essential for predicting aneurysm rupture and guiding treatment. While magnetic resonance flow imaging enables time-resolved volumetric blood velocity measurements, its low spatiotemporal resolution and signal-to-noise ratio limit its diagnostic utility. To address this, we propose the Localized Fourier Neural Operator (LoFNO), a novel 3D architecture that enhances both spatial and temporal resolution with the ability to predict wall shear stress (WSS) directly from clinical imaging data. LoFNO integrates Laplacian eigenvectors as geometric priors for improved structural awareness on irregular, unseen geometries and employs an Enhanced Deep Super-Resolution Network (EDSR) layer for robust upsampling. By combining geometric priors with neural operator frameworks, LoFNO de-noises and spatiotemporally upsamples flow data, achieving superior velocity and WSS predictions compared to interpolation and alternative deep learning methods, enabling more precise cerebrovascular diagnostics.

[CV-34] SuperCM: Improving Semi-Supervised Learning and Domain Adaptation through differentiable clustering

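[Quick Read]: This paper addresses how Semi-Supervised Learning (SSL) and Unsupervised Domain Adaptation (UDA) can make better use of limited labeled data together with large amounts of unlabeled data. The core challenge is improving generalization under low supervision while keeping classes discriminable when labels are scarce. The key is a differentiable clustering module that explicitly uses the supervised data to compute its centroids, directly enforcing the clustering assumption: data points that belong to the same cluster in a high-dimensional space should be assigned to the same category. Trained end to end, the module works both as a standalone model and as a regularizer that improves existing methods in low-supervision regimes.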

Link: https://arxiv.org/abs/2507.13779
Authors: Durgesh Singh, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer
Affiliations: UiT Machine Learning Group (machine-learning.uit.no); Visual Intelligence (visual-intelligence.no)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-Supervised Learning (SSL) and Unsupervised Domain Adaptation (UDA) enhance model performance by exploiting information from labeled and unlabeled data. The clustering assumption has proven advantageous for learning with limited supervision and states that data points belonging to the same cluster in a high-dimensional space should be assigned to the same category. Recent works have utilized different training mechanisms to implicitly enforce this assumption for SSL and UDA. In this work, we take a different approach by explicitly involving a differentiable clustering module, which is extended to leverage the supervised data to compute its centroids. We demonstrate the effectiveness of our straightforward end-to-end training strategy for SSL and UDA over extensive experiments and highlight its benefits, especially in low supervision regimes, both as a standalone model and as a regularizer for existing approaches.
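One simple way to picture centroids computed from supervised data with differentiable assignments is the sketch below. It is an assumption-laden illustration, not the SuperCM module itself: class centroids are plain means of labeled features, and soft assignments are a softmax over negative squared distances with a hypothetical `temperature`.

```python
import math


def class_centroids(features, labels):
    """Mean feature vector per class, computed from labeled samples only."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def soft_assign(feature, centroids, temperature=1.0):
    """Softmax over negative squared distances to each centroid."""
    logits = {
        y: -sum((v - c) ** 2 for v, c in zip(feature, mu)) / temperature
        for y, mu in centroids.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {y: math.exp(l - top) for y, l in logits.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


labeled = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
centroids = class_centroids(labeled, [0, 0, 1])
print(centroids)
print(soft_assign([1.0, 1.0], centroids))  # an unlabeled point near class 0
```

Because the assignment is a smooth function of the features, a loss on it can pull unlabeled points toward the labeled centroids, which is the spirit of the clustering assumption described above.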

[CV-35] Feature Engineering is Not Dead: Reviving Classical Machine Learning with Entropy HOG and LBP Feature Fusion for Image Classification

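[Quick Read]: This paper addresses the high computational cost and poor interpretability of deep learning models for image classification, particularly in applications that must balance efficiency and transparency. The key is a multiscale, multi-orientation feature extraction method based on Permutation Entropy (PE), fused with the classical descriptors HOG (Histogram of Oriented Gradients) and LBP (Local Binary Patterns) to form a 780-dimensional hand-crafted feature set for training Support Vector Machine (SVM) classifiers. Extending PE to two-dimensional images characterizes spatial order and local complexity, while HOG and LBP add shape structure and micro-texture, yielding competitive, lightweight, and interpretable classification without deep neural networks.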

Link: https://arxiv.org/abs/2507.13772
Authors: Abhijit Sen, Giridas Maiti, Bikram K. Parida, Bhanu P. Mishra, Mahima Arya, Denys I. Bondar
Affiliations: Tulane University; Karlsruhe Institute of Technology; Cure and Care Wellness Pvt Ltd; Amritha Vidya Vishwapeetham
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Feature engineering continues to play a critical role in image classification, particularly when interpretability and computational efficiency are prioritized over deep learning models with millions of parameters. In this study, we revisit classical machine learning based image classification through a novel approach centered on Permutation Entropy (PE), a robust and computationally lightweight measure traditionally used in time series analysis but rarely applied to image data. We extend PE to two-dimensional images and propose a multiscale, multi-orientation entropy-based feature extraction approach that characterizes spatial order and complexity along rows, columns, diagonals, anti-diagonals, and local patches of the image. To enhance the discriminatory power of the entropy features, we integrate two classic image descriptors: the Histogram of Oriented Gradients (HOG) to capture shape and edge structure, and Local Binary Patterns (LBP) to encode the micro-texture of an image. The resulting hand-crafted feature set, comprising 780 dimensions, is used to train Support Vector Machine (SVM) classifiers optimized through grid search. The proposed approach is evaluated on multiple benchmark datasets, including Fashion-MNIST, KMNIST, EMNIST, and CIFAR-10, where it delivers competitive classification performance without relying on deep architectures. Our results demonstrate that the fusion of PE with HOG and LBP provides a compact, interpretable, and effective alternative to computationally expensive and less interpretable deep learning models. This shows the potential of entropy-based descriptors in image classification and contributes a lightweight and generalizable solution to interpretable machine learning in image classification and computer vision.
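Permutation entropy itself is compact enough to sketch. The code below follows the standard ordinal-pattern definition for a 1D sequence and applies it along image rows and columns only; the `order` and `delay` defaults, the tie-breaking rule, and the helper names are assumptions, and the paper's diagonal, anti-diagonal, patch, and multiscale variants are not reproduced here.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy of ordinal patterns of length `order` in a 1D sequence."""
    n = len(series) - (order - 1) * delay
    if n <= 0:
        raise ValueError("series too short for the chosen order and delay")
    patterns = Counter()
    for i in range(n):
        window = [series[i + j * delay] for j in range(order)]
        # Ordinal pattern: indices that sort the window (ties broken by position).
        patterns[tuple(sorted(range(order), key=lambda k: window[k]))] += 1
    entropy = -sum((c / n) * math.log(c / n) for c in patterns.values())
    if normalize:
        entropy /= math.log(math.factorial(order))  # scale to [0, 1]
    return entropy


def row_column_entropies(image, order=3):
    """Permutation entropy of every row and every column of a 2D image."""
    rows = [permutation_entropy(list(r), order) for r in image]
    cols = [permutation_entropy(list(c), order) for c in zip(*image)]
    return rows, cols


print(permutation_entropy([1, 2, 3, 4, 5]))   # perfectly ordered ramp
print(permutation_entropy([1, 2, 3, 2, 1]))   # three distinct patterns
```

A monotone ramp produces a single ordinal pattern and therefore zero entropy, while irregular textures spread probability over many patterns; concatenating such values over rows, columns, and other scan directions gives a fixed-length descriptor that can sit alongside HOG and LBP features.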

[CV-36] Learning Spectral Diffusion Prior for Hyperspectral Image Reconstruction

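[Quick Read]: This paper addresses the weakness of deep learning-based hyperspectral image (HSI) reconstruction methods at recovering high-frequency detail. The key is a Spectral Diffusion Prior (SDP) learned implicitly from HSIs with a diffusion model, which significantly improves detail recovery when injected into a reconstruction model. A Spectral Prior Injector Module (SPIM) dynamically guides the model to recover HSI details. Applied to two representative methods, MST and BISRNet, the approach yields gains of about 0.5 dB.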

Link: https://arxiv.org/abs/2507.13769
Authors: Mingyang Yu, Zhijian Wu, Dingjiang Huang
Affiliations: East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hyperspectral image (HSI) reconstruction aims to recover 3D HSI from its degraded 2D measurements. Great progress has recently been made in deep learning-based methods; however, these methods often struggle to accurately capture high-frequency details of the HSI. To address this issue, this paper proposes a Spectral Diffusion Prior (SDP) that is implicitly learned from hyperspectral images using a diffusion model. Leveraging the powerful ability of the diffusion model to reconstruct details, this learned prior can significantly improve the performance when injected into the HSI model. To further improve the effectiveness of the learned prior, we also propose the Spectral Prior Injector Module (SPIM) to dynamically guide the model to recover the HSI details. We evaluate our method on two representative HSI methods: MST and BISRNet. Experimental results show that our method outperforms existing networks by about 0.5 dB, effectively improving the performance of HSI reconstruction.

[CV-37] Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

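[Quick Read]: This paper addresses the tension in current large text-to-video (T2V) models between high imaging quality and effective motion representation. In particular, existing methods that adapt pre-trained text-to-image (T2I) models to refine individual frames often produce flickering and artifacts because of inconsistencies across frames. The key is EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models: the T2I model refines low-quality video frames by treating them as out-of-distribution samples and applying noising and denoising steps, while the temporal prior of the T2V model keeps motion consistent. This improves visual fidelity and motion smoothness without any training cost and gives a 1.6x-4.5x inference speedup.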

Link: https://arxiv.org/abs/2507.13753
Authors: Tongtong Su, Chengyu Wang, Bingyan Liu, Jun Huang, Dongming Lu
Affiliations: Zhejiang University; Alibaba Cloud Computing; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also yields a 1.6x-4.5x speedup in inference time. Source code: this https URL.
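The abstract does not give equations for the noising and denoising steps. Under the standard DDPM formulation, which is an assumption here rather than something the paper confirms, partially noising a frame $x_0$ to an intermediate step $t$ can be written as below, where $\bar\alpha_t$ is the cumulative noise schedule and $\epsilon$ is Gaussian noise (both symbols are introduced for illustration).

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I), \qquad 0 < t < T
```

Running the reverse (denoising) process of the T2I model from $x_t$ back to step 0 then regenerates fine detail. Choosing $t$ well below $T$ keeps the coarse content of the original frame, because $\sqrt{\bar\alpha_t}$ is still large at small $t$.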

[CV-38] Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning ICCV

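[Quick Read]: This paper addresses catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL), where training data is extremely limited, while still learning new classes effectively. The core solution is Diffusion-FSCIL, which uses a frozen text-to-image diffusion model as the backbone and draws on three advantages: 1) strong generation ability from large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize representation capability, multiple complementary diffusion features are extracted and used as latent replay, with light support from feature distillation to prevent generative biases. Experiments on CUB-200, miniImageNet, and CIFAR-100 show that the method preserves performance on previously learned classes and adapts effectively to new ones.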

Link: https://arxiv.org/abs/2507.13739
Authors: Junsu Kim, Yunhoe Ku, Seungryul Baek
Affiliations: UNIST; DeepBrain AI; NVIDIA Foundation Models LAB, MODULABS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6th CLVISION ICCV Workshop accepted

Abstract:Few-shot class-incremental learning (FSCIL) is challenging due to extremely limited training data, while aiming to reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model’s capabilities benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features to play roles as latent replay with slight support from feature distillation for preventing generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

[CV-39] Tackling fake images in cybersecurity – Interpretation of a StyleGAN and lifting its black-box

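[Quick read]: This paper addresses the interpretability and potential misuse of StyleGAN models in generative AI, in particular the internal mechanisms and controllability behind their highly realistic face images. The key to the solution is a direct analysis and pruning of the model weights in PyTorch, which shows that many weights can be removed without noticeably degrading the output, reducing computational cost. The paper also studies how the latent vector shapes the generated image: global adjustments mainly change the overall color tone, whereas targeted changes to individual dimensions precisely control specific facial features. This improves controllability and also highlights the ethical risk that the technique could be used to fabricate fake identities.

Link: https://arxiv.org/abs/2507.13722
Authors: Julia Laubmann,Johannes Reschke
Institutions: Ostbayerische Technische Hochschule Regensburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: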

Abstract:In today’s digital age, concerns about the dangers of AI-generated images are increasingly common. One powerful tool in this domain is StyleGAN (style-based generative adversarial networks), a generative adversarial network capable of producing highly realistic synthetic faces. To gain a deeper understanding of how such a model operates, this work focuses on analyzing the inner workings of StyleGAN’s generator component. Key architectural elements and techniques, such as the Equalized Learning Rate, are explored in detail to shed light on the model’s behavior. A StyleGAN model is trained using the PyTorch framework, enabling direct inspection of its learned weights. Through pruning, it is revealed that a significant number of these weights can be removed without drastically affecting the output, leading to reduced computational requirements. Moreover, the role of the latent vector – which heavily influences the appearance of the generated faces – is closely examined. Global alterations to this vector primarily affect aspects like color tones, while targeted changes to individual dimensions allow for precise manipulation of specific facial features. This ability to finetune visual traits is not only of academic interest but also highlights a serious ethical concern: the potential misuse of such technology. Malicious actors could exploit this capability to fabricate convincing fake identities, posing significant risks in the context of digital deception and cybercrime.
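
The abstract does not say which pruning criterion the authors use. As a rough illustration only, the sketch below applies plain magnitude pruning, a common generic choice that is swapped in here as an assumption: the weights with the smallest absolute value are set to zero. The function name `magnitude_prune` and the NumPy array standing in for a layer's weights are hypothetical.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))   # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.sort(flat)[k - 1]       # k-th smallest magnitude
    # Weights tied with the threshold are removed too, so slightly more
    # than k entries can be zeroed when magnitudes repeat.
    return np.where(np.abs(weights) > threshold, weights, 0.0)
```

Comparing generated images before and after such a step, at increasing sparsity, is one simple way to probe how much of a generator's capacity is actually used.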

[CV-40] Augmented Reality in Cultural Heritage: A Dual-Model Pipeline for 3D Artwork Reconstruction

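[Quick read]: This paper addresses artwork recognition and high-quality 3D modeling in museum environments, in particular the reconstruction challenges posed by the irregular contours and complex textures of artworks. The key to the solution is an augmented reality (AR) pipeline that fuses two complementary pre-trained depth estimation models: GLPN captures the global scene structure and Depth-Anything refines local detail, producing optimized depth maps. These depth maps are then converted into high-fidelity point clouds and meshes for immersive AR experiences, improving reconstruction accuracy and visual realism.

Link: https://arxiv.org/abs/2507.13719
Authors: Daniele Pannone,Alessia Castronovo,Maurizio Mancini,Gian Luca Foresti,Claudio Piciarelli,Rossana Gabrieli,Muhammad Yasir Bilal,Danilo Avola
Institutions: University of Bologna; University of Naples Federico II
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: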

Abstract:This paper presents an innovative augmented reality pipeline tailored for museum environments, aimed at recognizing artworks and generating accurate 3D models from single images. By integrating two complementary pre-trained depth estimation models, i.e., GLPN for capturing global scene structure and Depth-Anything for detailed local reconstruction, the proposed approach produces optimized depth maps that effectively represent complex artistic features. These maps are then converted into high-quality point clouds and meshes, enabling the creation of immersive AR experiences. The methodology leverages state-of-the-art neural network architectures and advanced computer vision techniques to overcome challenges posed by irregular contours and variable textures in artworks. Experimental results demonstrate significant improvements in reconstruction accuracy and visual realism, making the system a highly robust tool for museums seeking to enhance visitor engagement through interactive digital content.
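
The step from depth map to point cloud can be sketched with the standard pinhole back-projection. This is a minimal sketch under stated assumptions: the abstract does not describe the camera model or how the two depth maps are fused, and the intrinsics `fx`, `fy`, `cx`, `cy` and the function name are hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud (pinhole model)."""
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column / row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels without a valid depth
```

A mesh can then be built from the points with any surface reconstruction tool; that step is outside this sketch.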

[CV-41] PoemTale Diffusion: Minimising Information Loss in Poem to Image Generation with Multi-Stage Prompt Refinement ECAI2025

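[Quick read]: This paper addresses the severe information loss that text-to-image diffusion models suffer on creative language such as poetry: existing models struggle to capture and render poetic language that is layered, abstract, and highly descriptive. The key to the solution is the training-free PoemTale Diffusion method, which adds a multi-stage prompt refinement loop to the language model to make poetic text more interpretable, and modifies the self-attention of the diffusion model with a consistent self-attention strategy so that several semantically consistent images are generated, which together convey the poem's meaning more completely. The authors also build the P4I (PoemForImage) dataset to promote research in this area.

Link: https://arxiv.org/abs/2507.13708
Authors: Sofia Jamil,Bollampalli Areen Reddy,Raghvendra Kumar,Sriparna Saha,Koustava Goswami,K.J. Joseph
Institutions: Indian Institute of Technology Patna; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECAI 2025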

Abstract:Recent advancements in text-to-image diffusion models have achieved remarkable success in generating realistic and diverse visual content. A critical factor in this process is the model’s ability to accurately interpret textual prompts. However, these models often struggle with creative expressions, particularly those involving complex, abstract, or highly descriptive language. In this work, we introduce a novel training-free approach tailored to improve image generation for a unique form of creative language: poetic verse, which frequently features layered, abstract, and dual meanings. Our proposed PoemTale Diffusion approach aims to minimise the information that is lost during poetic text-to-image conversion by integrating a multi stage prompt refinement loop into Language Models to enhance the interpretability of poetic texts. To support this, we adapt existing state-of-the-art diffusion models by modifying their self-attention mechanisms with a consistent self-attention technique to generate multiple consistent images, which are then collectively used to convey the poem’s meaning. Moreover, to encourage research in the field of poetry, we introduce the P4I (PoemForImage) dataset, consisting of 1111 poems sourced from multiple online and offline resources. We engaged a panel of poetry experts for qualitative assessments. The results from both human and quantitative evaluations validate the efficacy of our method and contribute a novel perspective to poem-to-image generation with enhanced information capture in the generated images.

[CV-42] GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms

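[Quick read]: This paper addresses the lack of flexibility in evaluating multi-object tracking (MOT) algorithms: the GOSPA metric and its trajectory extension T-GOSPA penalize false and missed objects with the same fixed cost and require a symmetric localisation cost. The key to the solution is two new quasi-metrics: a set-based extension of GOSPA that measures the discrepancy between sets of objects, and a trajectory-based extension of T-GOSPA that measures the discrepancy between sets of trajectories. They allow different penalties for false and missed objects and do not require the localisation cost to be symmetric, which makes MOT evaluation more accurate and practical in certain applications.

Link: https://arxiv.org/abs/2507.13706
Authors: Ángel F. García-Fernández,Jinhao Gu,Lennart Svensson,Yuxuan Xia,Jan Krejčí,Oliver Kost,Ondřej Straka
Institutions: IPTC, ETSI de Telecomunicación, Universidad Politécnica de Madrid; University of Liverpool; Chalmers University of Technology; Shanghai Jiao Tong University; University of West Bohemia
Categories: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
Comments: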

Abstract:This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of trajectories. Similar to the GOSPA-based metrics, these quasi-metrics include costs for localisation error for properly detected objects, the number of false objects and the number of missed objects. The T-GOSPA quasi-metric also includes a track switching cost. Differently from the GOSPA and T-GOSPA metrics, the proposed quasi-metrics have the flexibility of penalising missed and false objects with different costs, and the localisation costs are not required to be symmetric. These properties can be useful in MOT evaluation in certain applications. The performance of several Bayesian MOT algorithms is assessed with the T-GOSPA quasi-metric via simulations.
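
To make the idea of separate missed and false costs concrete, here is a toy sketch, not the paper's definition. It assumes a GOSPA-style assignment cost in which each ground-truth object is either paired with one estimate or counted as missed, and each leftover estimate counts as false. The function name, the 1-D positions, and the symmetric localisation cost `|x - y|` are all illustrative assumptions. With `c_missed**p == c_false**p == c**p / 2` the search should coincide with the usual GOSPA (α = 2) in its assignment form.

```python
def asymmetric_gospa(truth, estimates, c_missed, c_false, p=1.0):
    """GOSPA-style cost between two small sets of 1-D positions.

    Every ground-truth object is either paired with one unused estimate
    (localisation cost |x - y|^p) or declared missed (cost c_missed^p);
    every estimate left unpaired is a false object (cost c_false^p).
    Exhaustive search, so only suitable for a handful of objects.
    """
    def best(i, used):
        if i == len(truth):
            return (len(estimates) - len(used)) * c_false ** p
        cost = c_missed ** p + best(i + 1, used)        # truth[i] is missed
        for j, y in enumerate(estimates):
            if j not in used:                           # pair truth[i] with y
                paired = abs(truth[i] - y) ** p + best(i + 1, used | {j})
                cost = min(cost, paired)
        return cost

    return best(0, frozenset()) ** (1.0 / p)
```

With `c_missed=2.0` and `c_false=1.0`, comparing truth `[0.0, 10.0]` against estimate `[0.5]` costs more than the reverse comparison, because the unmatched object is a miss in one direction and a false object in the other. That loss of symmetry is why such a measure is a quasi-metric rather than a metric.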

[CV-43] Gaussian kernel-based motion measurement

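[Quick read]: This paper addresses the insufficient accuracy, or the heavy manual parameter tuning (e.g., pyramid layers, target pixels, and filter parameters), of vision-based methods for sub-pixel displacement measurement. The key to the solution is a Gaussian kernel-based motion measurement method that extracts inter-frame motion by tracking the locations of Gaussian kernels, and introduces a motion consistency constraint and a super-resolution constraint to improve accuracy and robustness, without customized parameter settings for different test samples.

Link: https://arxiv.org/abs/2507.13693
Authors: Hongyi Liu,Haifeng Wang
Institutions: Washington State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: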

Abstract:The growing demand for structural health monitoring has driven increasing interest in high-precision motion measurement, as structural information derived from extracted motions can effectively reflect the current condition of the structure. Among various motion measurement techniques, vision-based methods stand out due to their low cost, easy installation, and large-scale measurement. However, when it comes to sub-pixel-level motion measurement, current vision-based methods either lack sufficient accuracy or require extensive manual parameter tuning (e.g., pyramid layers, target pixels, and filter parameters) to reach good precision. To address this issue, we developed a novel Gaussian kernel-based motion measurement method, which can extract the motion between different frames via tracking the location of Gaussian kernels. The motion consistency, which fits practical structural conditions, and a super-resolution constraint, are introduced to increase accuracy and robustness of our method. Numerical and experimental validations show that it can consistently reach high accuracy without customized parameter setup for different test samples.
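
The abstract gives no algorithmic detail on how the Gaussian kernels are tracked. As a loosely related illustration of why a Gaussian model permits sub-pixel localisation, the sketch below uses the classic three-point Gaussian peak fit on a 1-D intensity profile. It is a generic stand-in, not the authors' method, and both function names are hypothetical.

```python
import numpy as np

def gaussian_peak_location(signal):
    """Sub-pixel peak position from a three-point Gaussian fit around the maximum.

    Assumes strictly positive samples, since the fit works on log-intensities.
    """
    signal = np.asarray(signal, dtype=float)
    k = int(np.argmax(signal))
    if k == 0 or k == signal.size - 1:
        return float(k)                        # no neighbour on one side
    left, mid, right = np.log(signal[k - 1:k + 2])
    offset = (left - right) / (2.0 * (left - 2.0 * mid + right))
    return k + offset

def gaussian_profile(centre, sigma=2.0, length=21):
    """Synthetic 1-D Gaussian blob sampled on integer pixel positions."""
    x = np.arange(length)
    return np.exp(-0.5 * ((x - centre) / sigma) ** 2)
```

For a noise-free Gaussian the estimate is exact, so the difference between the peak locations of two frames recovers a displacement such as 0.25 pixels; with real images, noise and non-Gaussian shapes limit the accuracy.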

[CV-44] HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors ITSC2025 CVPR

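[Quick read]: This paper addresses the difficulty of feature fusion and the loss of perception reliability in real-world Vehicle-to-Everything (V2X) cooperative perception systems whose sensor configurations are heterogeneous (cameras, LiDARs, or both). The key to the solution is the unified framework HeCoFuse, whose main contributions are: 1) a hierarchical fusion mechanism that adaptively weights features with combined channel-wise and spatial attention, mitigating cross-modality feature misalignment and imbalanced representation quality; 2) an adaptive spatial resolution adjustment module that balances computational cost against fusion effectiveness; 3) a cooperative learning strategy that dynamically adjusts the fusion type to the available modalities. As a result, 3D object detection stays stable across nine heterogeneous sensor configurations (3D mAP between 21.74% and 43.38%) and reaches state-of-the-art performance on the real-world TUMTraf-V2X dataset (43.22% 3D mAP in the LC+LC configuration).

Link: https://arxiv.org/abs/2507.13677
Authors: Chuheng Wei,Ziye Qin,Walter Zimmer,Guoyuan Wu,Matthew J. Barth
Institutions: University of California at Riverside; Southwest Jiaotong University; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ranked first in CVPR DriveX workshop TUM-Traf V2X challenge. Accepted by ITSC2025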

Abstract: Real-world Vehicle-to-Everything (V2X) cooperative perception systems often operate under heterogeneous sensor configurations due to cost constraints and deployment variability across vehicles and infrastructure. This heterogeneity poses significant challenges for feature fusion and perception reliability. To address these issues, we propose HeCoFuse, a unified framework designed for cooperative perception across mixed sensor setups where nodes may carry Cameras (C), LiDARs (L), or both. By introducing a hierarchical fusion mechanism that adaptively weights features through a combination of channel-wise and spatial attention, HeCoFuse can tackle critical challenges such as cross-modality feature misalignment and imbalanced representation quality. In addition, an adaptive spatial resolution adjustment module is employed to balance computational cost and fusion effectiveness. To enhance robustness across different configurations, we further implement a cooperative learning strategy that dynamically adjusts fusion type based on available modalities. Experiments on the real-world TUMTraf-V2X dataset demonstrate that HeCoFuse achieves 43.22% 3D mAP under the full sensor configuration (LC+LC), outperforming the CoopDet3D baseline by 1.17%, and reaches an even higher 43.38% 3D mAP in the L+LC scenario, while maintaining 3D mAP in the range of 21.74% to 43.38% across nine heterogeneous sensor configurations. These results, validated by our first-place finish in the CVPR 2025 DriveX challenge, establish HeCoFuse as the current state-of-the-art on the TUMTraf-V2X dataset while demonstrating robust performance across diverse sensor deployments.

[CV-45] MaskHOI: Robust 3D Hand-Object Interaction Estimation via Masked Pre-training

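[Quick read]: This paper addresses the estimation of hand and object joint poses in 3D hand-object interaction (HOI) from monocular RGB images, where the main challenges are the inherent geometric ambiguity of RGB images and severe mutual occlusion during interaction. The key to the solution is MaskHOI, a Masked Autoencoder (MAE)-driven pre-training framework with two core ideas: 1) a region-specific mask ratio allocation that adaptively lowers the masking ratio of hand regions and adds skeleton-guided masking of critical hand parts, simulating real occlusion patterns more faithfully and improving the reconstruction of fine-grained hand structure; 2) a masked Signed Distance Field (SDF)-driven multimodal learning mechanism in which self-masked 3D SDF prediction strengthens the encoder's perception of 3D geometry, going beyond the limits of monocular input and alleviating self-occlusion, so that the learned features are more geometry-aware and occlusion-robust.

Link: https://arxiv.org/abs/2507.13673
Authors: Yuechen Xie,Haobo Jiang,Jian Yang,Yigong Zhang,Jin Xie
Institutions: PCA Lab, School of Intelligence Science and Technology, Nanjing University, Nanjing 210093, China; Nanyang Technological University, Singapore 639798; PCA Lab, Nanjing University of Science and Technology, Nanjing 210094, China; PCA Lab, Nankai University, Tianjin 300071, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, 6 tables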

Abstract: In 3D hand-object interaction (HOI) tasks, estimating precise joint poses of hands and objects from monocular RGB input remains highly challenging due to the inherent geometric ambiguity of RGB images and the severe mutual occlusions that occur during interaction. To address these challenges, we propose MaskHOI, a novel Masked Autoencoder (MAE)-driven pretraining framework for enhanced HOI pose estimation. Our core idea is to leverage the masking-then-reconstruction strategy of MAE to encourage the feature encoder to infer missing spatial and structural information, thereby facilitating geometric-aware and occlusion-robust representation learning. Specifically, based on our observation that human hands exhibit far greater geometric complexity than rigid objects, conventional uniform masking fails to effectively guide the reconstruction of fine-grained hand structures. To overcome this limitation, we introduce a Region-specific Mask Ratio Allocation, primarily comprising the region-specific masking assignment and the skeleton-driven hand masking guidance. The former adaptively assigns lower masking ratios to hand regions than to rigid objects, balancing their feature learning difficulty, while the latter prioritizes masking critical hand parts (e.g., fingertips or entire fingers) to realistically simulate occlusion patterns in real-world interactions. Furthermore, to enhance the geometric awareness of the pretrained encoder, we introduce a novel Masked Signed Distance Field (SDF)-driven multimodal learning mechanism. Through the self-masking 3D SDF prediction, the learned encoder is able to perceive the global geometric structure of hands and objects beyond the 2D image plane, overcoming the inherent limitations of monocular input and alleviating self-occlusion issues. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches.
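
The region-specific masking assignment can be pictured as sampling masked patches with a different ratio per region. The sketch below is only an assumption-laden illustration: the label values, the ratios (0.5 for hands, 0.8 for objects in the usage note), and the function name are hypothetical, and the skeleton-driven guidance that prioritizes fingertips or whole fingers is not modeled.

```python
import numpy as np

def region_specific_mask(region_ids, mask_ratios, rng):
    """Pick patches to mask, using a separate masking ratio for every region.

    region_ids  : (N,) integer label per image patch
    mask_ratios : dict mapping region label -> fraction of its patches to mask
    Returns a boolean (N,) array, True where the patch is masked.
    """
    region_ids = np.asarray(region_ids)
    mask = np.zeros(region_ids.shape, dtype=bool)
    for region, ratio in mask_ratios.items():
        idx = np.flatnonzero(region_ids == region)      # patches of this region
        n_mask = int(round(ratio * idx.size))
        mask[rng.choice(idx, size=n_mask, replace=False)] = True
    return mask
```

For example, with 10 hand patches (label 1) and 10 object patches (label 2), `region_specific_mask(ids, {1: 0.5, 2: 0.8}, np.random.default_rng(0))` masks 5 hand patches and 8 object patches, leaving more of the geometrically complex hand visible to the encoder.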

[CV-46] Global Modeling Matters: A Fast Lightweight and Effective Baseline for Efficient Image Restoration

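[Quick read]: This paper addresses the degradation of natural image quality under adverse weather, which significantly hurts downstream tasks. Transformer-based restoration methods have made progress, but their system complexity makes real-time processing hard, especially in real-world deployment. The key to the solution is a new efficient baseline, the Pyramid Wavelet-Fourier Network (PW-FNet), whose design has two parts: 1) at the inter-block level, a wavelet-based multi-input multi-output structure that performs multi-scale and multi-frequency-band decomposition; 2) at the intra-block level, Fourier transforms in place of self-attention, lowering computational complexity while keeping global modeling capability. This yields better restoration quality and higher efficiency on tasks such as deraining, dehazing, and super-resolution.

Link: https://arxiv.org/abs/2507.13663
Authors: Xingyu Jiang,Ning Gao,Hongkun Dou,Xiuhui Zhang,Xiaoqing Zhong,Yue Deng,Hongjue Li
Institutions: Beihang University; China Academy of Space Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: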

Abstract: Natural image quality is often degraded by adverse weather conditions, significantly impairing the performance of downstream tasks. Image restoration has emerged as a core solution to this challenge and has been widely discussed in the literature. Although recent transformer-based approaches have made remarkable progress in image restoration, their increasing system complexity poses significant challenges for real-time processing, particularly in real-world deployment scenarios. To this end, most existing methods attempt to simplify the self-attention mechanism, for example with channel self-attention or state space models. However, these methods primarily focus on network architecture while neglecting the inherent characteristics of image restoration itself. In this context, we explore a pyramid Wavelet-Fourier iterative pipeline to demonstrate the potential of Wavelet-Fourier processing for image restoration. Inspired by the above findings, we propose a novel and efficient restoration baseline, named Pyramid Wavelet-Fourier Network (PW-FNet). Specifically, PW-FNet features two key design principles: 1) at the inter-block level, it integrates a pyramid wavelet-based multi-input multi-output structure to achieve multi-scale and multi-frequency-band decomposition; and 2) at the intra-block level, it incorporates Fourier transforms as an efficient alternative to self-attention mechanisms, effectively reducing computational complexity while preserving global modeling capability. Extensive experiments on tasks such as image deraining, raindrop removal, image super-resolution, motion deblurring, image dehazing, image desnowing and underwater/low-light enhancement demonstrate that PW-FNet not only surpasses state-of-the-art methods in restoration quality but also achieves superior efficiency, with significantly reduced parameter size, computational cost and inference time.
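
Why a Fourier transform can stand in for self-attention as a global mixer can be shown in a few lines: multiplying the spectrum by a filter lets every output pixel depend on every input pixel, at FFT cost rather than the quadratic cost of full attention. The sketch below is a generic frequency-domain filter, not the PW-FNet block itself; the function name and the `freq_weights` array (which would be learned in a network) are assumptions.

```python
import numpy as np

def fourier_global_filter(feature, freq_weights):
    """Mix all spatial positions of an (H, W) map by filtering in the frequency domain.

    freq_weights has the shape of the real FFT spectrum, (H, W // 2 + 1).
    """
    h, w = feature.shape
    spectrum = np.fft.rfft2(feature)                 # complex, shape (H, W // 2 + 1)
    return np.fft.irfft2(spectrum * freq_weights, s=(h, w))
```

An all-ones filter returns the input unchanged, while a filter that keeps only the zero-frequency term spreads a single bright pixel evenly over the whole map, which makes the global receptive field visible.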

[CV-47] When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework

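[Quick read]: This paper addresses the weak generalization of event camera-based person re-identification (ReID) models, which stems from small datasets, limited scene coverage, and the difficulty of assessing real-world performance. Existing methods are mostly trained and evaluated on small-scale or simulated data and therefore say little about recognition quality in practice. The key to the solution is twofold. First, the authors build EvReID, a large-scale, multi-scene, multi-season RGB-event person ReID dataset with 118,988 image pairs and 1200 pedestrian identities. Second, they propose TriPro-ReID, a pedestrian attribute-guided contrastive learning framework that fuses visual features from RGB frames and event streams and uses pedestrian attributes as mid-level semantic cues, which substantially improves the feature representation. Experiments on the EvReID and MARS datasets show superior performance, giving future RGB-event ReID research a high-quality benchmark and an effective method.

Link: https://arxiv.org/abs/2507.13659
Authors: Xiao Wang,Qian Zhu,Shujuan Wu,Bo Jiang,Shiliang Zhang,Yaowei Wang,Yonghong Tian,Bin Luo
Institutions: Anhui University; Peng Cheng Laboratory; Peking University; Shenzhen Graduate School, Peking University; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: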

Abstract: Researchers have recently proposed using event cameras for person re-identification (ReID) because of their promising performance and better balance in terms of privacy protection, and event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event streams, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID and MARS datasets validate the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on this https URL

[CV-48] EPSilon: Efficient Point Sampling for Lightening of Hybrid-based 3D Avatar Generation

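[Quick read]: This paper addresses the inference-speed bottleneck of 3D human avatar generation methods built on a hybrid representation. Such methods usually combine an SMPL parametric mesh with a neural radiance field (NeRF) to produce high-quality animatable humans, but they rely on deformation based on SMPL skinning weights, which is expensive at every sampled point; the many sample points in empty space do not improve rendering quality yet add considerable latency. The key to the solution is the EPSilon framework with two new efficient point sampling strategies, empty ray omission (ERO) and empty interval omission (EIO), which filter out meaningless samples in empty regions at the ray level and the interval level respectively, greatly reducing unnecessary deformation computation. This fine-grained sampling lets a single-stage NeRF keep high quality without hierarchical sampling, and with only 3.9% of the sampled points it achieves about 20 times faster inference and 4 times faster training convergence at a generation quality comparable to existing methods.

Link: https://arxiv.org/abs/2507.13648
Authors: Seungjun Moon,Sangjoon Yu,Gyeong-Moon Park
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: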

Abstract:The rapid advancement of neural radiance fields (NeRF) has paved the way to generate animatable human avatars from a monocular video. However, the sole usage of NeRF suffers from a lack of details, which results in the emergence of hybrid representation that utilizes SMPL-based mesh together with NeRF representation. While hybrid-based models show photo-realistic human avatar generation qualities, they suffer from extremely slow inference due to their deformation scheme: to be aligned with the mesh, hybrid-based models use the deformation based on SMPL skinning weights, which needs high computational costs on each sampled point. We observe that since most of the sampled points are located in empty space, they do not affect the generation quality but result in inference latency with deformation. In light of this observation, we propose EPSilon, a hybrid-based 3D avatar generation scheme with novel efficient point sampling strategies that boost both training and inference. In EPSilon, we propose two methods to omit empty points at rendering; empty ray omission (ERO) and empty interval omission (EIO). In ERO, we wipe out rays that progress through the empty space. Then, EIO narrows down the sampling interval on the ray, which wipes out the region not occupied by either clothes or mesh. The delicate sampling scheme of EPSilon enables not only great computational cost reduction during deformation but also the designation of the important regions to be sampled, which enables a single-stage NeRF structure without hierarchical sampling. Compared to existing methods, EPSilon maintains the generation quality while using only 3.9% of sampled points and achieves around 20 times faster inference, together with 4 times faster training convergence. We provide video results on this https URL.
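
The two omission ideas can be illustrated with a much simpler occupancy proxy than the clothed SMPL mesh the paper works with. In the sketch below an axis-aligned bounding box is assumed to enclose the avatar (this box and both function names are hypothetical): rays that miss the box are dropped altogether, in the spirit of ERO, and surviving rays are sampled only inside the entry and exit distances, in the spirit of EIO.

```python
import numpy as np

def ray_box_interval(origin, direction, box_min, box_max):
    """Slab test: return (t_near, t_far) where the ray is inside the box, or None.

    Assumes the origin does not lie exactly on a box face parallel to the ray.
    """
    origin, direction = np.asarray(origin, float), np.asarray(direction, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t0 = (np.asarray(box_min, float) - origin) / direction
        t1 = (np.asarray(box_max, float) - origin) / direction
    t_near = np.max(np.minimum(t0, t1))      # latest entry over the three slabs
    t_far = np.min(np.maximum(t0, t1))       # earliest exit over the three slabs
    if t_near > t_far or t_far < 0:
        return None                          # ray never enters the box: skip it
    return max(t_near, 0.0), t_far

def sample_depths(origin, direction, box_min, box_max, n_samples):
    """Place samples only inside the occupied interval instead of along the full ray."""
    interval = ray_box_interval(origin, direction, box_min, box_max)
    if interval is None:
        return np.empty(0)
    return np.linspace(interval[0], interval[1], n_samples)
```

Every sample that is never generated is also a sample that never goes through the costly skinning-based deformation, which is where the reported savings come from.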

[CV-49] Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation

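[Quick read]: This paper addresses the separation of moving objects from the static background as seen from a moving camera, a key step for 3D reconstruction, autonomous navigation, and scene understanding. Existing methods rely mainly on optical flow and struggle to detect moving objects accurately in complex, structured scenes that involve camera motion. The key to the solution is Focus of Expansion Likelihood and Segmentation (FoELS), which combines optical flow with texture information: it first computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation, then fuses this likelihood with a segmentation-based prior to estimate the final moving probability. This handles complex structured scenes, rotational camera motion, and parallel motion.

Link: https://arxiv.org/abs/2507.13628
Authors: Masahiro Ogawa,Qi An,Atsushi Yamashita
Institutions: The University of Tokyo; Department of Precision Engineering; Graduate School of Engineering; Department of Human and Engineered Environmental Studies; Graduate School of Frontier Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 15 figures, RA-L submission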

Abstract:Separating moving and static objects from a moving camera viewpoint is essential for 3D reconstruction, autonomous navigation, and scene understanding in robotics. Existing approaches often rely primarily on optical flow, which struggles to detect moving objects in complex, structured scenes involving camera motion. To address this limitation, we propose Focus of Expansion Likelihood and Segmentation (FoELS), a method based on the core idea of integrating both optical flow and texture information. FoELS computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation. This likelihood is then fused with a segmentation-based prior to estimate the final moving probability. The method effectively handles challenges including complex structured scenes, rotational camera motion, and parallel motion. Comprehensive evaluations on the DAVIS 2016 dataset and real-world traffic videos demonstrate its effectiveness and state-of-the-art performance.
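
The FoE step can be sketched with a textbook least-squares formulation. This assumes a purely translating camera, for which background flow vectors point away from a single image point; the abstract does not state which estimator FoELS uses or how rotation is handled, and the function names are hypothetical.

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares focus of expansion from (N, 2) pixel positions and flow vectors."""
    points, flows = np.asarray(points, float), np.asarray(flows, float)
    u, v = flows[:, 0], flows[:, 1]
    # A flow vector (u, v) at (x, y) is radial with respect to the FoE (xf, yf) when
    # v * (xf - x) - u * (yf - y) = 0, which is linear in (xf, yf).
    a = np.stack([v, -u], axis=1)
    b = v * points[:, 0] - u * points[:, 1]
    foe, *_ = np.linalg.lstsq(a, b, rcond=None)
    return foe

def foe_residuals(points, flows, foe):
    """Perpendicular distance from the FoE to the line spanned by each flow vector."""
    points, flows = np.asarray(points, float), np.asarray(flows, float)
    u, v = flows[:, 0], flows[:, 1]
    cross = v * (foe[0] - points[:, 0]) - u * (foe[1] - points[:, 1])
    return np.abs(cross) / np.maximum(np.hypot(u, v), 1e-12)
```

Static background pixels give residuals near zero, while an independently moving object generally does not, so a large residual is a natural ingredient for an initial motion likelihood.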

[CV-50] Efficient Burst Super-Resolution with One-step Diffusion

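[Quick read]: This paper addresses the blurry super-resolution (SR) images that conventional, deterministically trained burst SR methods produce from bursts of low-resolution (LR) images, which hurts perceptual quality. The core of the solution is a diffusion model made efficient by a stochastic sampler with a high-order ODE and by one-step diffusion via knowledge distillation. It reconstructs sharper, higher-fidelity images while cutting the runtime to 1.6% of the baseline and maintaining both distortion and perceptual quality.

Link: https://arxiv.org/abs/2507.13607
Authors: Kento Kawai,Takeru Oba,Kyotaro Tokoro,Kazutoshi Akita,Norimichi Ukita
Institutions: Toyota Technological Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NTIRE2025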

Abstract: While a burst of Low-Resolution (LR) images is more useful for producing a Super Resolution (SR) image than a single LR image, prior burst SR methods are trained in a deterministic manner, which produces blurry SR images. Since such blurry images are perceptually degraded, we aim to reconstruct sharp and high-fidelity SR images by a diffusion model. Our method improves the efficiency of the diffusion model with a stochastic sampler with a high-order ODE as well as one-step diffusion using knowledge distillation. Our experimental results demonstrate that our method can reduce the runtime to 1.6% of its baseline while maintaining the SR quality measured based on image distortion and perceptual quality.

[CV-51] Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model ICCV2025

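[Quick read]: This paper addresses the difficulty of collecting large numbers of real sharp-blurry image pairs, and hence the challenge of learning blind image deblurring from unpaired data. The key to the solution is a new diffusion model (DM)-based framework that learns a spatially varying texture prior from unpaired data to help recover the details of blurry images. Specifically, a Texture Prior Encoder (TPE) uses a memory mechanism to represent image textures and supervises DM training, and a Texture Transfer Transformer layer (TTformer) uses Filter-Modulated Multi-head Self-Attention (FM-MSA) to remove spatially varying blur through adaptive filtering. A wavelet-based adversarial loss additionally preserves high-frequency texture details. The method outperforms state-of-the-art (SOTA) methods on several benchmarks, showing strong unsupervised deblurring ability.

Link: https://arxiv.org/abs/2507.13599
Authors: Chengxu Liu,Lu Qi,Jinshan Pan,Xueming Qian,Ming-Hsuan Yang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025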

Abstract: Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, dominant approaches rely heavily on adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blur patterns. In this paper, we propose a novel diffusion model (DM)-based framework for image deblurring by learning spatially varying texture prior from unpaired data. In particular, our framework performs DM to generate the prior knowledge that aids in recovering the textures of blurry images. To implement this, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to represent the image textures and provides supervision for DM training. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. Furthermore, we implement a wavelet-based adversarial loss to preserve high-frequency texture details. Extensive evaluations show that our framework provides a promising unsupervised deblurring solution and outperforms SOTA methods in widely-used benchmarks.

[CV-52] GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention

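[Quick read]: This paper addresses the vulnerability of diffusion models to malicious fine-tuning, which lets a model re-learn and generate harmful content; existing safety mechanisms such as safety checkers are easily bypassed, and concept erasure methods do not withstand adversarial fine-tuning. The key to the solution is GIFT (Gradient-aware Immunization Technique), formulated as a bi-level optimization problem: the upper-level objective weakens the model's ability to represent harmful concepts through representation noising and maximization, while the lower-level objective keeps generation quality on safe data intact. The method markedly improves robustness to malicious fine-tuning while preserving safe generation.

Link: https://arxiv.org/abs/2507.13598
Authors: Amro Abdalla,Ismail Shaheen,Dan DeGenaro,Rupayan Mallick,Bogdan Raita,Sarah Adel Bargal
Institutions: Georgetown University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Warning: This paper contains NSFW content. Reader discretion is advised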

Abstract: We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms like safety checkers are easily bypassed, and concept erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model’s ability to represent harmful concepts using representation noising and maximization, while the lower-level objective preserves performance on safe data. GIFT achieves robust resistance to malicious fine-tuning while maintaining safe generative quality. Experimental results show that our method significantly impairs the model’s ability to re-learn harmful concepts while maintaining performance on safe content, offering a promising direction for creating inherently safer generative models resistant to adversarial fine-tuning attacks.

[CV-53] NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

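[Quick read]: This paper addresses the reconstruction of accurate implicit surface representations from noisy point clouds captured by low-quality scanning devices. Such point clouds contain substantial noise, and conventional methods struggle to produce high-quality reconstructions from them. The key to the solution is NoiseSDF2NoiseSDF, inspired by the Noise2Noise paradigm for 2D image denoising: by minimizing the mean squared error (MSE) between noisy SDF representations under noisy supervision, the neural field implicitly denoises and refines the surface estimate, learning a clean SDF directly from noisy point clouds.

Link: https://arxiv.org/abs/2507.13595
Authors: Tengkai Wang,Weihao Li,Ruikai Cui,Shi Qiu,Nick Barnes
Institutions: Australian National University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 4 figures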

Abstract:Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs directly from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.
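
The Noise2Noise principle behind this idea can be checked numerically: the value that minimizes the MSE against many noisy targets is their mean, and for zero-mean noise that mean approaches the clean value. The toy below replaces the neural field with a per-point estimate and assumes Gaussian noise on signed distances to a unit sphere; the setup and all numbers are illustrative assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean signed distances of 50 query points to a unit sphere (negative inside).
queries = rng.uniform(-1.5, 1.5, size=(50, 3))
clean_sdf = np.linalg.norm(queries, axis=1) - 1.0

# 2000 independent noisy observations of the same distances (zero-mean noise).
noisy_sdf = clean_sdf + rng.normal(0.0, 0.1, size=(2000, 50))

# For a per-point estimate f, the loss mean_k (f - noisy_k)^2 is minimised
# by the average of the noisy targets, which approaches the clean value.
mse_minimiser = noisy_sdf.mean(axis=0)
max_error = np.abs(mse_minimiser - clean_sdf).max()
```

Here `max_error` ends up far below the noise standard deviation of 0.1, although no clean distance was ever used as a target.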

[CV-54] LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning

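[Quick read]: This paper addresses a weakness of continual learning for vision-language models (VLMs) that relies on synthetic replay with samples generated by Stable Diffusion: the generator misses domain-specific details and fine-grained semantics, so the generated samples do not match the real task distribution, which misguides finetuning and harms retention of prior knowledge. The key to the solution is a LoRA-enhanced synthetic-replay framework: task-specific low-rank adapters (Low-Rank Adaptation, LoRA) are injected into a frozen Stable Diffusion model to efficiently capture each new task's distinctive visual and semantic patterns. A two-stage, confidence-based sample selection mechanism is also introduced: real data are first ranked by the confidence of the finetuned VLM to guide LoRA training, and the generated synthetic samples are then filtered by the same criterion for knowledge distillation, which improves the quality and alignment of the replay samples. The method integrates seamlessly into existing replay pipelines; swapping in the adapted generator is enough to obtain more robust continual-learning performance.

Link: https://arxiv.org/abs/2507.13568
Authors: Kaihong Wang,Donghyun Kim,Margrit Betke
Affiliations: Boston University; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract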

Abstract:Continual learning for vision-language models has achieved remarkable performance through synthetic replay, where samples are generated using Stable Diffusion to regularize during finetuning and retain knowledge. However, real-world downstream applications often exhibit domain-specific nuances and fine-grained semantics not captured by generators, causing synthetic-replay methods to produce misaligned samples that misguide finetuning and undermine retention of prior knowledge. In this work, we propose a LoRA-enhanced synthetic-replay framework that injects task-specific low-rank adapters into a frozen Stable Diffusion model, efficiently capturing each new task’s unique visual and semantic patterns. Specifically, we introduce a two-stage, confidence-based sample selection: we first rank real task data by post-finetuning VLM confidence to focus LoRA finetuning on the most representative examples, then generate synthetic samples and again select them by confidence for distillation. Our approach integrates seamlessly with existing replay pipelines: simply swap in the adapted generator to boost replay fidelity. Extensive experiments on the Multi-domain Task Incremental Learning (MTIL) benchmark show that our method outperforms previous synthetic-replay techniques, achieving an optimal balance among plasticity, stability, and zero-shot capability. These results demonstrate the effectiveness of generator adaptation via LoRA for robust continual learning in VLMs.
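
The two ingredients named in the abstract, a low-rank adapter on a frozen weight and confidence-based selection, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the helper names `lora_weight` and `select_by_confidence`, the `alpha / r` scaling convention, and the top-k selection rule are all assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A; W stays frozen, only A and B are trained.

    A has shape (r, in_features) and B has shape (out_features, r).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def select_by_confidence(samples, confidences, k):
    """Keep the k samples with the highest confidence (assumed selection rule)."""
    ranked = sorted(zip(confidences, samples), key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in ranked[:k]]


# Frozen 2x3 weight with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]   # shape (1, 3)
b = [[0.5], [0.0]]      # shape (2, 1)
adapted = lora_weight(w, a, b, alpha=2.0)
chosen = select_by_confidence(["a", "b", "c"], [0.2, 0.9, 0.5], k=2)
```

The adapter adds only `r * (in_features + out_features)` trainable values per layer, which is why one adapter per task stays cheap.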

[CV-55] $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention

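[Quick read]: This paper addresses the performance bottleneck that the quadratic computational complexity of full attention creates in Transformer-based video generation models, particularly for high-resolution and long-duration video sequences. The key to the solution is a new mechanism called NABLA (Neighborhood Adaptive Block-Level Attention), which combines block-wise attention with an adaptive sparsity-driven threshold to adjust the attention sparsity pattern dynamically, reducing computational cost without sacrificing generation quality. The method requires no custom low-level operator design and integrates seamlessly with PyTorch's Flex Attention operator. Experiments show up to 2.7x faster training and inference with almost no effect on quantitative metrics such as CLIP score, VBench score, and human evaluation score.

Link: https://arxiv.org/abs/2507.13546
Authors: Dmitrii Mikhailov,Aleksey Letunovskiy,Maria Kovaleva,Vladimir Arkhipkin,Vladimir Korviakov,Vladimir Polovnikov,Viacheslav Vasilev,Evelina Sidorova,Denis Dimitrov
Affiliations: Sber AI; Lomonosov Moscow State University; Moscow Institute of Physics and Technology; Artificial Intelligence Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract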

Abstract:Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch’s Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no loss in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: this https URL
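
The abstract does not spell out how the adaptive threshold is applied, so the following is only a rough illustration of one way to build a block-level sparse attention mask. It assumes that queries and keys are mean-pooled per block, that a softmax is taken over block-level scores, and that each query block keeps the fewest key blocks whose probability mass reaches `keep_mass`. The names `block_mask` and `keep_mass` are hypothetical, the sequence length is assumed divisible by the block size, and this sketch does not use Flex Attention.

```python
import math


def mean_pool(x, block):
    """Average consecutive groups of `block` token vectors."""
    return [[sum(col) / block for col in zip(*x[i:i + block])]
            for i in range(0, len(x), block)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def block_mask(q, k, block, keep_mass):
    """Boolean mask[i][j]: does query block i attend to key block j?"""
    qb, kb = mean_pool(q, block), mean_pool(k, block)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for qi in qb:
        probs = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in kb])
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        keep, mass = set(), 0.0
        for j in order:  # add key blocks from most to least probable
            keep.add(j)
            mass += probs[j]
            if mass >= keep_mass:
                break
        mask.append([j in keep for j in range(len(probs))])
    return mask


# Four tokens in two blocks; each block is aligned with its own keys.
q = [[4, 0], [4, 0], [0, 4], [0, 4]]
k = [[4, 0], [4, 0], [0, 4], [0, 4]]
sparse = block_mask(q, k, block=2, keep_mass=0.9)
dense = block_mask(q, k, block=2, keep_mass=1.0)
```

Because the number of kept blocks depends on how peaked each row is, the sparsity adapts per query block instead of following a fixed window.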

[CV-56] Total Generalized Variation of the Normal Vector Field and Applications to Mesh Denoising

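[Quick read]: This paper addresses higher-order regularization of normal vectors on three-dimensional triangular meshes, with the aim of improving fidelity and smoothness in geometry-processing tasks such as mesh denoising. The core challenge is extending the conventional second-order total generalized variation (TGV) model from scalar data to manifold-valued functions, that is, normal vectors taking values on the unit sphere. The key to the solution is the construction of a tailor-made tangential Raviart-Thomas type finite element space for approximating the normal vector field defined on the surface, which enables second-order TGV regularization of the normal vector; the new regularizer is compared with existing methods in mesh denoising experiments.

Link: https://arxiv.org/abs/2507.13530
Authors: Lukas Baumgärtner,Ronny Bergmann,Roland Herzog,Stephan Schmidt,Manuel Weiß
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG); Optimization and Control (math.OC)
Comments:

Click to view abstract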

Abstract:We propose a novel formulation for the second-order total generalized variation (TGV) of the normal vector on an oriented, triangular mesh embedded in $\mathbb{R}^3$. The normal vector is considered as a manifold-valued function, taking values on the unit sphere. Our formulation extends previous discrete TGV models for piecewise constant scalar data that utilize a Raviart-Thomas function space. To extend this formulation to the manifold setting, a tailor-made tangential Raviart-Thomas type finite element space is constructed in this work. The new regularizer is compared to existing methods in mesh denoising experiments.
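
For background, the classical second-order TGV of a scalar function $u$ on a Euclidean domain $\Omega$ is commonly written in the minimization form below. This is the standard scalar model that the abstract says it extends, not the paper's manifold-valued formulation; the weights $\alpha_0, \alpha_1 > 0$ and the auxiliary vector field $w$ follow the usual convention.

```latex
\mathrm{TGV}_{\alpha}^{2}(u)
  = \min_{w}\;
    \alpha_1 \int_{\Omega} \lvert \nabla u - w \rvert \,\mathrm{d}x
    + \alpha_0 \int_{\Omega} \lvert \mathcal{E}(w) \rvert \,\mathrm{d}x,
\qquad
\mathcal{E}(w) = \tfrac{1}{2}\bigl(\nabla w + \nabla w^{\top}\bigr).
```

The first term penalizes the part of the gradient not explained by $w$, and the second penalizes the variation of $w$ itself, which is why TGV favors piecewise affine rather than piecewise constant solutions.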

[CV-57] SparseC-AFM: a deep learning method for fast and accurate characterization of MoS$_2$ with C-AFM

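[Quick read]: This paper addresses the slow data acquisition of conventional conductive atomic force microscopy (C-AFM), caused by raster scanning, when characterizing the electrical properties of two-dimensional (2D) materials for nanoelectronic devices. The core solution is SparseC-AFM, a deep learning model that accurately reconstructs conductivity maps from sparsely sampled C-AFM data. It shortens measurement time (under 5 minutes versus 15 minutes) while extracting electrical parameters with accuracy comparable to full-resolution data, including film coverage, defect density, and identification of crystalline island boundaries, edges, and cracks. The method is robust across scanning modes, substrates, and experimental conditions, and provides key technical support for moving AI-assisted 2D material characterization from laboratory research to industrial fabrication.

Link: https://arxiv.org/abs/2507.13527
Authors: Levi Harris,Md Jayed Hossain,Mufan Qiu,Ruichen Zhang,Pingchuan Ma,Tianlong Chen,Jiaqi Gu,Seth Ariel Tongay,Umberto Celano
Affiliations: University of North Carolina, Chapel Hill; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract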

Abstract:The increasing use of two-dimensional (2D) materials in nanoelectronics demands robust metrology techniques for electrical characterization, especially for large-scale production. While atomic force microscopy (AFM) techniques like conductive AFM (C-AFM) offer high accuracy, they suffer from slow data acquisition speeds due to the raster scanning process. To address this, we introduce SparseC-AFM, a deep learning model that rapidly and accurately reconstructs conductivity maps of 2D materials like MoS$_2$ from sparse C-AFM scans. Our approach is robust across various scanning modes, substrates, and experimental conditions. We report a comparison between (a) classic flow implementation, where a high pixel density C-AFM image (e.g., 15 minutes to collect) is manually parsed to extract relevant material parameters, and (b) our SparseC-AFM method, which achieves the same operation using data that requires substantially less acquisition time (e.g., under 5 minutes). SparseC-AFM enables efficient extraction of critical material parameters in MoS$_2$, including film coverage, defect density, and identification of crystalline island boundaries, edges, and cracks. We achieve over 11x reduction in acquisition time compared to manual extraction from a full-resolution C-AFM image. Moreover, we demonstrate that our model-predicted samples exhibit remarkably similar electrical properties to full-resolution data gathered using classic-flow scanning. This work represents a significant step toward translating AI-assisted 2D material characterization from laboratory research to industrial fabrication. Code and model weights are available at this http URL.

[CV-58] Sugar-Beet Stress Detection using Satellite Image Time Series

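[Quick read]: This paper addresses stress detection in sugar-beet fields, in particular effective unsupervised identification when labeled data are unavailable. The key solution is a 3D convolutional autoencoder that extracts meaningful features from Sentinel-2 Satellite Image Time Series (SITS), combined with acquisition-date-specific temporal encodings that capture sugar-beet growth dynamics more accurately. The learned representations are used in a downstream clustering task to separate stressed from healthy fields, and the model can be applied directly to data from other years, offering a practical and accessible tool for sugar-beet stress detection.

Link: https://arxiv.org/abs/2507.13514
Authors: Bhumika Laxman Sadbhave,Philipp Vaeth,Denise Dejon,Gunther Schorcht,Magda Gregorová
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract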

Abstract:Satellite Image Time Series (SITS) data has proven effective for agricultural tasks due to its rich spectral and temporal nature. In this study, we tackle the task of stress detection in sugar-beet fields using a fully unsupervised approach. We propose a 3D convolutional autoencoder model to extract meaningful features from Sentinel-2 image sequences, combined with acquisition-date-specific temporal encodings to better capture the growth dynamics of sugar-beets. The learned representations are used in a downstream clustering task to separate stressed from healthy fields. The resulting stress detection system can be directly applied to data from different years, offering a practical and accessible tool for stress detection in sugar-beets.

[CV-59] Uncertainty Quantification Framework for Aerial and UAV Photogrammetry through Error Propagation

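[Quick read]: This paper addresses the lack of a standardized method for uncertainty quantification (UQ) of photogrammetric point clouds at the multi-view stereo (MVS) stage. Because the MVS stage is non-differentiable and multi-modal (it maps pixel values to geometry), its error propagation is hard to model, and existing methods cannot provide reliable, consistent accuracy credentials for point clouds. The key to the solution is an uncertainty quantification framework that covers the full two-step photogrammetry process (Structure-from-Motion with Bundle Adjustment, followed by MVS) and associates an error covariance matrix with every point, modeling error propagation end to end. For the MVS stage specifically, a self-calibrating method takes reliable n-view points (six or more views) in each view and regresses the disparity uncertainty from highly relevant cues extracted in the MVS stage (such as matching cost values). The approach is self-supervised, needs no external annotation, and follows the actual error propagation path, yielding robust uncertainty estimates.

Link: https://arxiv.org/abs/2507.13486
Authors: Debao Huang,Rongjun Qin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures, this manuscript has been submitted to ISPRS Journal of Photogrammetry and Remote Sensing for consideration

Click to view abstract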

Abstract:Uncertainty quantification of the photogrammetry process is essential for providing per-point accuracy credentials of the point clouds. Unlike airborne LiDAR, which typically delivers consistent accuracy across various scenes, the accuracy of photogrammetric point clouds is highly scene-dependent, since it relies on algorithm-generated measurements (i.e., stereo or multi-view stereo). Generally, errors of the photogrammetric point clouds propagate through a two-step process: Structure-from-Motion (SfM) with Bundle Adjustment (BA), followed by Multi-view Stereo (MVS). While uncertainty estimation in the SfM stage has been well studied using the first-order statistics of the reprojection error function, that in the MVS stage remains largely unsolved and non-standardized, primarily due to its non-differentiable and multi-modal nature (i.e., from pixel values to geometry). In this paper, we present an uncertainty quantification framework closing this gap by associating an error covariance matrix per point accounting for this two-step photogrammetry process. Specifically, to estimate the uncertainty in the MVS stage, we propose a novel, self-calibrating method by taking reliable n-view points (n ≥ 6) per view to regress the disparity uncertainty using highly relevant cues (such as matching cost values) from the MVS stage. Compared to existing approaches, our method uses self-contained, reliable 3D points extracted directly from the MVS process, with the benefit of being self-supervised and naturally adhering to the error propagation path of the photogrammetry process, thereby providing a robust and certifiable uncertainty quantification across diverse scenes. We evaluate the framework using a variety of publicly available airborne and UAV imagery datasets. Results demonstrate that our method outperforms existing approaches by achieving high bounding rates without overestimating uncertainty.
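
To make the link between disparity uncertainty and point accuracy concrete, here is a minimal first-order error propagation sketch for the simplest case, a rectified two-view stereo pair with depth Z = f * B / d. This is a textbook simplification, not the paper's multi-view covariance model; the focal length, baseline, disparity, and disparity standard deviation are hypothetical values.

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: Z = f * B / d."""
    return f_px * baseline_m / disparity_px


def depth_sigma(f_px, baseline_m, disparity_px, disparity_sigma_px):
    """First-order propagation: sigma_Z = |dZ/dd| * sigma_d = f * B / d**2 * sigma_d."""
    return f_px * baseline_m / disparity_px ** 2 * disparity_sigma_px


# Hypothetical setup: 1000 px focal length, 0.5 m baseline, 10 px disparity.
z = depth_from_disparity(1000.0, 0.5, 10.0)
sigma_z = depth_sigma(1000.0, 0.5, 10.0, 0.5)
```

With these numbers the depth is 50 m and a half-pixel disparity error maps to 2.5 m of depth uncertainty. The quadratic dependence on 1/d is why far points are much less certain than near ones.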

[CV-60] Neural Architecture Search with Mixed Bio-inspired Learning Rules ECAI2025

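[Quick read]: This paper addresses the gap in accuracy and scalability between bio-inspired neural networks and back-propagation (BP) based models. The key solution is a tailored Neural Architecture Search (NAS) procedure that lets different layers use different bio-inspired learning rules and automatically discovers the best per-layer combination. The study shows that this layer-wise heterogeneous configuration improves performance, setting new accuracy records for bio-inspired models on several benchmark datasets and in some regimes even surpassing comparable BP-based models, while retaining their robustness advantages.

Link: https://arxiv.org/abs/2507.13485
Authors: Imane Hamzaoui,Riyadh Baghdadi
Affiliations: École nationale Supérieure d’Informatique Algiers; New York University Abu Dhabi
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECAI 2025

Click to view abstract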

Abstract:Bio-inspired neural networks are attractive for their adversarial robustness, energy frugality, and closer alignment with cortical physiology, yet they often lag behind back-propagation (BP) based models in accuracy and ability to scale. We show that allowing the use of different bio-inspired learning rules in different layers, discovered automatically by a tailored neural-architecture-search (NAS) procedure, bridges this gap. Starting from standard NAS baselines, we enlarge the search space to include bio-inspired learning rules and use NAS to find the best architecture and learning rule to use in each layer. We show that neural networks that use different bio-inspired learning rules for different layers have better accuracy than those that use a single rule across all the layers. The resulting NN that uses a mix of bio-inspired learning rules sets new records for bio-inspired models: 95.16% on CIFAR-10, 76.48% on CIFAR-100, 43.42% on ImageNet16-120, and 60.51% top-1 on ImageNet. In some regimes, they even surpass comparable BP-based networks while retaining their robustness advantages. Our results suggest that layer-wise diversity in learning rules allows better scalability and accuracy, and motivates further research on mixing multiple bio-inspired learning rules in the same network.

[CV-61] Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning

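[Quick read]: This paper addresses the limited generalization of human activity recognition (HAR) models based on wearable inertial sensors (Inertial Measurement Unit, IMU), which perform poorly when transferred across environments or populations. The key solution is a cross-modal self-supervised pretraining approach that learns general-purpose representations jointly from large-scale unlabeled IMU and video data, improving zero-shot and few-shot recognition on out-of-distribution (OOD) IMU data. Experiments show that the approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining, supporting cross-modal self-supervised learning as a way to build transferable representations for highly dynamic signal modalities.

Link: https://arxiv.org/abs/2507.13482
Authors: Seyyed Saeid Cheshmi,Buyao Lyu,Thomas Lisko,Rajesh Rajamani,Robert A. McGovern,Yogatheesan Varatharajah
Affiliations: University of Minnesota
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract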

Abstract:Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson’s disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at this https URL.

[CV-62] Multiresolution local smoothness detection in non-uniformly sampled multivariate signals

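[Quick read]: This paper addresses the detection of local regularity in non-uniformly sampled multivariate signals, a key but challenging task in applications such as image processing, point cloud analysis, and time series modeling. Traditional wavelet methods work well on low-dimensional structured data but are limited on high-dimensional or scattered data. The key to the solution is a near-linear-time algorithm based on the samplet transform, a distributional wavelet transform designed for scattered data: analyzing the decay of the samplet coefficients characterizes the pointwise local regularity of the signal, connecting the theory of microlocal spaces with practical computation and enabling efficient, robust regularity detection for higher-dimensional, non-uniformly sampled signals.

Link: https://arxiv.org/abs/2507.13480
Authors: Sara Avesani,Gianluca Giacchi,Michael Multerer
Affiliations: IDSIA USI-SUPSI (Dalle Molle Institute for Artificial Intelligence Research); Università della Svizzera italiana
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract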

Abstract:Inspired by edge detection based on the decay behavior of wavelet coefficients, we introduce a (near) linear-time algorithm for detecting the local regularity in non-uniformly sampled multivariate signals. Our approach quantifies regularity within the framework of microlocal spaces introduced by Jaffard. The central tool in our analysis is the fast samplet transform, a distributional wavelet transform tailored to scattered data. We establish a connection between the decay of samplet coefficients and the pointwise regularity of multivariate signals. As a by-product, we derive decay estimates for functions belonging to classical Hölder spaces and Sobolev-Slobodeckij spaces. While traditional wavelets are effective for regularity detection in low-dimensional structured data, samplets demonstrate robust performance even for higher-dimensional and scattered data. To illustrate our theoretical findings, we present extensive numerical studies detecting local regularity of one-, two- and three-dimensional signals, ranging from non-uniformly sampled time series over image segmentation to edge detection in point clouds.

[CV-63] “PhyWorldBench”: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

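[Quick read]: This paper addresses a marked shortcoming of current video generation models in simulating real physical phenomena: although the generated videos are highly photorealistic, they still do not adequately follow physical laws (such as object motion, energy conservation, and rigid body dynamics). The key to the solution is PhyWorldBench, a comprehensive benchmark that evaluates physical phenomena at multiple levels, from fundamental principles to complex interaction scenarios, and introduces a novel "Anti-Physics" category to test whether models can remain logically consistent when prompts violate real-world physics. It also provides a zero-shot physical-realism evaluation method built on existing multimodal large language models (MLLMs), enabling efficient and scalable quantitative analysis and offering a systematic evaluation framework and directions for improving the physical plausibility of video generation models.

Link: https://arxiv.org/abs/2507.13428
Authors: Jing Gu,Xian Liu,Yu Zeng,Ashwin Nagarajan,Fangrui Zhu,Daniel Hong,Yue Fan,Qianqi Yan,Kaiwen Zhou,Ming-Yu Liu,Xin Eric Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 31 pages, 21 figures

Click to view abstract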

Abstract:Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel “Anti-Physics” category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that could utilize current MLLM to evaluate the physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.

[CV-64] CaSTFormer: Causal Spatio-Temporal Transformer for Driving Intention Prediction

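[Quick read]: This paper addresses the shortcomings of current driving intention prediction methods in modeling the complex spatio-temporal dependencies between human driving behavior and environmental context, as well as the unpredictable variability of that behavior, with the goal of improving the safety and interactive efficiency of human-machine co-driving systems. The key to the solution is the CaSTFormer model, which relies on three core mechanisms: (1) a Reciprocal Shift Fusion (RSF) mechanism that precisely aligns the internal behavior features and external environment features in time; (2) a Causal Pattern Extraction (CPE) module that systematically eliminates spurious correlations to reveal genuine causal dependencies; and (3) a Feature Synthesis Network (FSN) that adaptively fuses the purified representations into coherent spatio-temporal inferences. The method improves both the accuracy and the interpretability of intention prediction.

Link: https://arxiv.org/abs/2507.13425
Authors: Sirui Wang,Zhou Guan,Bingxi Zhao,Tongjia Gu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract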

Abstract:Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems. It serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatio-temporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaSTFormer, a Causal Spatio-Temporal Transformer to explicitly model causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaSTFormer introduces a novel Reciprocal Shift Fusion (RSF) mechanism for precise temporal alignment of internal and external feature streams, a Causal Pattern Extraction (CPE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent spatio-temporal inferences. We evaluate the proposed CaSTFormer on the public Brain4Cars dataset, and it achieves state-of-the-art performance. It effectively captures complex causal spatio-temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.

[CV-65] AI-ming backwards: Vanishing archaeological landscapes in Mesopotamia and automatic detection of sites on CORONA imagery

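[Quick read]: This paper addresses the difficulty of identifying archaeological sites where human activity has drastically changed the land surface, especially sites destroyed or buried over the last five decades by anthropization processes such as urbanization and agricultural development. The key to the solution is to use black-and-white CORONA satellite imagery from the 1960s as prior knowledge for retraining an existing Bing-based convolutional neural network, which improves the model's ability to identify archaeological sites automatically under modern surface conditions. Experiments show that the method raises the Intersection over Union (IoU) for image segmentation above 85% and overall detection accuracy to 90%, and identifies four new sites not previously found with traditional archaeological techniques, demonstrating the potential of combining AI with historical remote-sensing data in archaeology.

Link: https://arxiv.org/abs/2507.13420
Authors: Alessandro Pistola,Valentina Orru’,Nicolo’ Marchetti,Marco Roccetti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 9 Figures

Click to view abstract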

Abstract:By upgrading an existing deep learning model with the knowledge provided by one of the oldest sets of grayscale satellite imagery, known as CORONA, we improved the AI model's ability to automatically identify archaeological sites in an environment which has been completely transformed in the last five decades, including the complete destruction of many of those same sites. The initial Bing-based convolutional network model was retrained using CORONA satellite imagery for the district of Abu Ghraib, west of Baghdad, central Mesopotamian floodplain. The results were twofold and surprising. First, the detection precision obtained on the area of interest increased appreciably: in particular, the Intersection over Union (IoU) values, at the image segmentation level, surpassed 85 percent, while the general accuracy in detecting archaeological sites reached 90 percent. Second, our retrained model allowed the identification of four new sites of archaeological interest (confirmed through field verification), previously not identified by archaeologists with traditional techniques. This has confirmed the efficacy of using AI techniques and the CORONA imagery from the 1960s to discover archaeological sites currently no longer visible, a concrete breakthrough with significant consequences for the study of landscapes with vanishing archaeological evidence induced by anthropization.
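
The Intersection over Union (IoU) figure quoted above is a standard segmentation metric. As a minimal sketch, assuming binary masks stored as nested lists of 0/1 values (the tiny masks here are made up for illustration, not taken from the paper), it can be computed as follows.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two same-sized binary masks (lists of 0/1 rows)."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth) for p, t in zip(p_row, t_row)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = mask_iou(pred, truth)
```

In this example two pixels overlap and four pixels are covered by either mask, so the IoU is 0.5.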

[CV-66] A Deep Learning-Based Ensemble System for Automated Shoulder Fracture Detection in Clinical Radiographs

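[Quick read]: This paper addresses the frequent under-diagnosis of shoulder fractures in emergency and high-volume clinical settings (miss rates of up to 10%), which delays diagnosis. The key to the solution is a multi-model deep learning ensemble system trained on 10,000 annotated shoulder X-rays. It combines several architectures, namely Faster R-CNN (ResNet50-FPN, ResNeXt), EfficientDet, and RF-DETR, and applies bounding-box and classification-level ensemble techniques (such as Soft-NMS, WBF, and NMW fusion) to improve detection. The NMW ensemble achieved 95.5% accuracy and an F1-score of 0.9610, with strong recall and localization precision, supporting the method's clinical effectiveness and deployment feasibility for early screening of shoulder fractures.

Link: https://arxiv.org/abs/2507.13408
Authors: Hemanth Kumar M,Karthika M,Saianiruth M,Vasanthakumar Venugopal,Anandakumar D,Revathi Ezhumalai,Charulatha K,Kishore Kumar J,Dayana G,Kalyan Sivasailam,Bargava Subramanian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Click to view abstract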

Abstract:Background: Shoulder fractures are often underdiagnosed, especially in emergency and high-volume clinical settings. Studies report up to 10% of such fractures may be missed by radiologists. AI-driven tools offer a scalable way to assist early detection and reduce diagnostic delays. We address this gap through a dedicated AI system for shoulder radiographs. Methods: We developed a multi-model deep learning system using 10,000 annotated shoulder X-rays. Architectures include Faster R-CNN (ResNet50-FPN, ResNeXt), EfficientDet, and RF-DETR. To enhance detection, we applied bounding box and classification-level ensemble techniques such as Soft-NMS, WBF, and NMW fusion. Results: The NMW ensemble achieved 95.5% accuracy and an F1-score of 0.9610, outperforming individual models across all key metrics. It demonstrated strong recall and localization precision, confirming its effectiveness for clinical fracture detection in shoulder X-rays. Conclusion: The results show ensemble-based AI can reliably detect shoulder fractures in radiographs with high clinical relevance. The model’s accuracy and deployment readiness position it well for integration into real-time diagnostic workflows. The current model is limited to binary fracture detection, reflecting its design for rapid screening and triage support rather than detailed orthopedic classification.
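
Of the ensemble techniques listed in the Methods, Soft-NMS is the simplest to show in isolation: instead of discarding boxes that overlap a higher-scoring box, it decays their scores. The sketch below uses the Gaussian decay variant with hypothetical boxes, scores, `sigma`, and `score_thresh`; it is not the paper's pipeline and does not cover WBF or NMW fusion.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in the order they are selected."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        kept.append((box, score))
        rescored = []
        for other, s in candidates:
            # Decay the score of every remaining box by its overlap with the pick.
            s *= math.exp(-box_iou(box, other) ** 2 / sigma)
            if s >= score_thresh:
                rescored.append((other, s))
        candidates = rescored
    return kept


boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
```

The duplicate box is kept but drops to the bottom of the ranking, while the non-overlapping box keeps its score unchanged.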

[CV-67] IConMark: Robust Interpretable Concept-Based Watermark For AI Images ICLR2025

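[Quick read]: This paper addresses how to distinguish AI-generated images from real ones as generative AI and synthetic media become widespread, in order to guard against misinformation and protect digital authenticity. Traditional watermarking techniques are vulnerable to adversarial attacks and struggle to remain robust in complex settings. The key to the solution is IConMark, a robust semantic watermarking method that embeds interpretable semantic concepts during generation: by embedding meaningful semantic attributes rather than noise or perturbations, it yields a human-readable, manipulation-resistant watermark that improves detection accuracy while maintaining image quality. Robustness can be strengthened further by combining it with existing techniques such as StegaStamp (IConMark+SS) and TrustMark (IConMark+TM). Experiments show mean AUROC gains over the best baseline of 10.8% for IConMark, 14.5% for IConMark+TM, and 15.9% for IConMark+SS.

Link: https://arxiv.org/abs/2507.13407
Authors: Vinu Sankar Sadasivan,Mehrdad Saberi,Soheil Feizi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at ICLR 2025 Workshop on GenAI Watermarking (WMARK)

Click to view abstract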

Abstract:With the rapid rise of generative AI and synthetic media, distinguishing AI-generated images from real ones has become crucial in safeguarding against misinformation and ensuring digital authenticity. Traditional watermarking techniques have shown vulnerabilities to adversarial attacks, undermining their effectiveness in the presence of attackers. We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Unlike traditional methods, which rely on adding noise or perturbations to AI-generated images, IConMark incorporates meaningful semantic attributes, making it interpretable to humans and hence, resilient to adversarial manipulation. This method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We demonstrate a detailed evaluation of IConMark’s effectiveness, demonstrating its superiority in terms of detection accuracy and maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness. We introduce IConMark+SS and IConMark+TM, hybrid approaches combining IConMark with StegaStamp and TrustMark, respectively, to further bolster robustness against multiple types of image manipulations. Our base watermarking technique (IConMark) and its variants (+TM and +SS) achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets.
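
The AUROC scores reported above measure how well a detector's scores separate watermarked from non-watermarked images. A minimal sketch of the pairwise (Mann-Whitney) form is shown below; the detector scores are made-up numbers, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical detector scores for watermarked (positive) and clean (negative) images.
watermarked = [0.9, 0.8, 0.4]
clean = [0.5, 0.3, 0.1]
score = auroc(watermarked, clean)
```

Eight of the nine positive/negative pairs are ranked correctly here, giving an AUROC of 8/9. A value of 0.5 would correspond to chance-level detection.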

[CV-68] COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

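[Quick read]: This paper addresses the weakness of current Vision-Language Models (VLMs) in visual entailment reasoning, in particular judging the logical relationship between an image and a statement correctly in complex crowd scenes. Existing benchmarks based on visual question answering (VQA) reflect accuracy gains but rarely evaluate a model's ability to accept or refute a hypothesis in complex real-world scenes. The authors therefore propose the COREVQA benchmark, which contains 5608 pairs of images and synthetically generated true/false statements, with images drawn from the CrowdHuman dataset, to provoke entailment reasoning in densely crowded scenes. Its key contribution is a high-quality evaluation set focused on visual entailment over real crowded scenes; it shows that even top-performing VLMs stay below 80% accuracy, highlighting significant limitations in complex visual semantic understanding and logical reasoning.

Link: https://arxiv.org/abs/2507.13405
Authors: Ishant Chintapatla,Kazuma Choji,Naaisha Agarwal,Andrew Lin,Hannah You,Charles Duong,Kevin Zhu,Sean O’Brien,Vasu Sharma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract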

Abstract:Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model’s ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image-question pairs in crowded scenes.

[CV-69] AortaDiff: Volume-Guided Conditional Diffusion Models for Multi-Branch Aortic Surface Generation

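[Quick read]: This paper addresses the problems that existing 3D aortic modeling methods face in clinical diagnosis, preoperative planning, and computational fluid dynamics (CFD) simulation: poor geometric consistency and reliance on large annotated datasets and manual intervention. Conventional methods can produce meshes for visualization, but these fall short of the high-quality, continuous, accurate surface representation that CFD analysis requires. The key to the solution is AortaDiff, an end-to-end diffusion-based framework. It first uses a volume-guided conditional diffusion model (CDM) to iteratively generate aortic centerlines from CT/MRI images; each centerline point then serves as a prompt to automatically extract the corresponding vessel contour, ensuring accurate boundaries; finally, the contours are fitted into a smooth 3D surface, producing a continuous mesh suitable for CFD analysis. The method reduces dependence on large annotated datasets and reconstructs both normal and pathological aortic structures (such as aneurysms or coarctation) with high fidelity, giving it good clinical applicability and research value.

Link: https://arxiv.org/abs/2507.13404
Authors: Delin An,Pan Du,Jian-Xun Wang,Chaoli Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract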

Abstract:Accurate 3D aortic construction is crucial for clinical diagnosis, preoperative planning, and computational fluid dynamics (CFD) simulations, as it enables the estimation of critical hemodynamic parameters such as blood flow velocity, pressure distribution, and wall shear stress. Existing construction methods often rely on large annotated training datasets and extensive manual intervention. While the resulting meshes can serve for visualization purposes, they struggle to produce geometrically consistent, well-constructed surfaces suitable for downstream CFD analysis. To address these challenges, we introduce AortaDiff, a diffusion-based framework that generates smooth aortic surfaces directly from CT/MRI volumes. AortaDiff first employs a volume-guided conditional diffusion model (CDM) to iteratively generate aortic centerlines conditioned on volumetric medical images. Each centerline point is then automatically used as a prompt to extract the corresponding vessel contour, ensuring accurate boundary delineation. Finally, the extracted contours are fitted into a smooth 3D surface, yielding a continuous, CFD-compatible mesh representation. AortaDiff offers distinct advantages over existing methods, including an end-to-end workflow, minimal dependency on large labeled datasets, and the ability to generate CFD-compatible aorta meshes with high geometric fidelity. Experimental results demonstrate that AortaDiff performs effectively even with limited training data, successfully constructing both normal and pathologically altered aorta meshes, including cases with aneurysms or coarctation. This capability enables the generation of high-quality visualizations and positions AortaDiff as a practical solution for cardiovascular research.

[CV-70] UL-DD: A Multimodal Drowsiness Dataset Using Video Biometric Signals and Behavioral Data

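[Quick Read]: This paper addresses the lack of multimodal datasets with continuous state annotation in driver drowsiness detection research, with the aim of improving models' ability to perceive changes in a driver's drowsiness state. The key to the solution is a comprehensive multimodal dataset combining facial, behavioral, and biometric signals: 3D facial video, infrared footage, posterior (rear-view) video, heart rate, electrodermal activity (EDA), blood oxygen saturation, skin temperature, accelerometer data, steering-wheel grip sensor data, and telemetry from the American Truck Simulator game. Drowsiness levels are self-reported every four minutes on the Karolinska Sleepiness Scale (KSS), which gives continuous annotation and captures the gradual transition from alert to drowsy rather than discrete labels.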

Link: https://arxiv.org/abs/2507.13403
Authors: Morteza Bodaghi,Majid Hosseini,Raju Gottumukkala,Ravi Teja Bhupatiraju,Iftikhar Ahmad,Moncef Gabbouj
Affiliations: University of Louisiana at Lafayette; Tietoevry; Tampere University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In this study, we present a comprehensive public dataset for driver drowsiness detection, integrating multimodal signals of facial, behavioral, and biometric indicators. Our dataset includes 3D facial video using a depth camera, IR camera footage, posterior videos, and biometric signals such as heart rate, electrodermal activity, blood oxygen saturation, skin temperature, and accelerometer data. This data set provides grip sensor data from the steering wheel and telemetry data from the American truck simulator game to provide more information about drivers’ behavior while they are alert and drowsy. Drowsiness levels were self-reported every four minutes using the Karolinska Sleepiness Scale (KSS). The simulation environment consists of three monitor setups, and the driving condition is completely like a car. Data were collected from 19 subjects (15 M, 4 F) in two conditions: when they were fully alert and when they exhibited signs of sleepiness. Unlike other datasets, our multimodal dataset has a continuous duration of 40 minutes for each data collection session per subject, contributing to a total length of 1,400 minutes, and we recorded gradual changes in the driver state rather than discrete alert/drowsy labels. This study aims to create a comprehensive multimodal dataset of driver drowsiness that captures a wider range of physiological, behavioral, and driving-related signals. The dataset will be available upon request to the corresponding author.

[CV-71] MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

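[Quick Read]: This paper addresses the limited ability of diffusion models to perform grounded visual editing and compositional control. Diffusion models excel at text-to-image generation but remain limited when asked to modify local image structure precisely or to compose objects according to their semantics. The key to the solution is the Masking-Augmented Diffusion with Inference-Time Scaling (MADI) framework, which has two core innovations. The first is Masking-Augmented Gaussian Diffusion (MAgD), a training strategy whose dual corruption process combines standard denoising score matching with masked reconstruction, pushing the model to learn discriminative and compositional visual representations and thereby enabling localized, structure-aware editing. The second is an inference-time capacity scaling mechanism based on Pause Tokens: special placeholders inserted into the prompt to increase computational capacity at inference time, which strengthens controllable generation. Together these improve the editability, compositionality, and controllability of diffusion models and move them toward general-purpose, in-context generative architectures.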

Link: https://arxiv.org/abs/2507.13401
Authors: Shreya Kadambi,Risheek Garrepalli,Shubhankar Borse,Munawar Hyatt,Fatih Porikli
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages

Abstract:Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains challenging. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance diffusion model capacity for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with dual corruption process which combines standard denoising score matching and masked reconstruction by masking noisy input from forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt for increasing computational capacity at inference time. Our findings show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.
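
The dual corruption described for MAgD (Gaussian forward noising followed by masking of the noisy input) can be illustrated with a small sketch. This is not the authors' implementation: the function name `noisy_masked_input`, the flat list standing in for image patches, the `mask_value` of 0.0, and the fixed `mask_ratio` are all assumptions made for illustration, and only the standard forward-noising formula is taken as given.

```python
import math
import random


def noisy_masked_input(x0, alpha_bar, mask_ratio, rng, mask_value=0.0):
    """Noise a clean signal, then mask a random subset of the noisy positions.

    x0         : clean values (a flat list standing in for image patches)
    alpha_bar  : cumulative signal-retention coefficient in (0, 1]
    mask_ratio : fraction of positions to mask (hypothetical hyperparameter)
    """
    # Standard Gaussian forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    # Second corruption: replace a random subset of the noisy input
    n_mask = int(round(mask_ratio * len(x0)))
    masked_idx = set(rng.sample(range(len(x0)), n_mask))
    x_in = [mask_value if i in masked_idx else v for i, v in enumerate(x_t)]
    return x_in, eps, sorted(masked_idx)


rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.75, 0.1]
x_in, eps, masked = noisy_masked_input(x0, alpha_bar=0.7, mask_ratio=0.25, rng=rng)
print("masked positions:", masked)
```

In such a setup a model would receive `x_in` and be trained both to predict the noise and to reconstruct the masked positions; how the two objectives are weighted is not stated in the abstract.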

[CV-72] InSyn: Modeling Complex Interactions for Pedestrian Trajectory Prediction

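[Quick Read]: This paper addresses the limited accuracy of pedestrian trajectory prediction under complex interactions, in particular the limitation of earlier methods that model interactions only through relative positions and overlook specific interaction patterns such as walking in sync or conflicting behavior. The key to the solution is InSyn (Interaction-Synchronization Network), a Transformer-based model that explicitly captures diverse interaction patterns and models direction-sensitive social behavior. The paper also introduces a training strategy called Seq-Start of Seq (SSOS) that alleviates initial-step divergence in time-series prediction, which noticeably improves accuracy in high-density scenes.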

Link: https://arxiv.org/abs/2507.13397
Authors: Kaiyuan Zhai,Juan Chen,Chao Wang,Zeyi Xu
Affiliations: Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate pedestrian trajectory prediction is crucial for intelligent applications, yet it remains highly challenging due to the complexity of interactions among pedestrians. Previous methods have primarily relied on relative positions to model pedestrian interactions; however, they tend to overlook specific interaction patterns such as paired walking or conflicting behaviors, limiting the prediction accuracy in crowded scenarios. To address this issue, we propose InSyn (Interaction-Synchronization Network), a novel Transformer-based model that explicitly captures diverse interaction patterns (e.g., walking in sync or conflicting) while effectively modeling direction-sensitive social behaviors. Additionally, we introduce a training strategy termed Seq-Start of Seq (SSOS), designed to alleviate the common issue of initial-step divergence in numerical time-series prediction. Experiments on the ETH and UCY datasets demonstrate that our model outperforms recent baselines significantly, especially in high-density scenarios. Furthermore, the SSOS strategy proves effective in improving sequential prediction performance, reducing the initial-step prediction error by approximately 6.58%.

[CV-73] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction ICCV

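[Quick Read]: This paper addresses the high data-acquisition cost of 3D semantic occupancy prediction, which depends on annotated LiDAR point clouds. Existing methods need a semantic label for every voxel, and such high-quality annotation is hard to obtain at scale. Binary occupancy data, by contrast, only distinguishes occupied from free space, needs no semantic labels, and can be collected cheaply at scale, yet its potential for 3D semantic prediction has not been explored. The paper proposes a binary-occupancy-based framework whose key idea is to decompose prediction into a binary occupancy module and a semantic occupancy module, so that large-scale, low-cost binary data can be used for pre-training or learning-based auto-labeling, improving 3D semantic occupancy prediction over existing methods.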

Link: https://arxiv.org/abs/2507.13387
Authors: Chihiro Noguchi,Takaki Yamamoto
Affiliations: Toyota Motor Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to ICCV Workshop 2025

Abstract:Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction – where each voxel is assigned a semantic label – annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at this https URL

[CV-74] Minimalist Concept Erasure in Generative Models ICML2025

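[Quick Read]: This paper addresses the safety and copyright concerns raised by generative models' reliance on large-scale unlabeled training data, in particular the unwanted generation of specific concepts (such as people or brands). Existing concept erasure methods often modify model parameters excessively and degrade overall performance. The key to the solution is a minimalist concept erasure objective based only on the distributional distance of final generation outputs, from which the authors derive a differentiable loss optimized by end-to-end backpropagation through all generation steps. Neuron masking is introduced as an alternative to fine-tuning to make the erasure more robust, so that concepts are erased without harming overall model performance.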

Link: https://arxiv.org/abs/2507.13386
Authors: Yang Zhang,Er Jin,Yanfei Dong,Yixuan Wu,Philip Torr,Ashkan Khakzar,Johannes Stegmaier,Kenji Kawaguchi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML2025

Abstract:Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based only on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.

[CV-75] Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery ICML2025

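[Quick Read]: This paper addresses the under-use of multimodal inputs in machine learning models trained on satellite imagery (SatML), in particular how fusing non-optical geographic data layers (such as digital elevation models or environmental sensor data) can improve performance in supervised learning tasks. The key to the solution is to build augmented datasets by appending additional geographic data layers to existing SatML benchmark tasks and to evaluate multimodal fusion strategies systematically on them. The study finds that fusing additional geographic inputs with optical imagery can significantly improve model performance, with the largest gains when labeled data are limited and in geographic out-of-sample settings, and that hard-coded fusion strategies outperform learned ones, which has implications for the data efficiency and generalization of future SatML models.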

Link: https://arxiv.org/abs/2507.13385
Authors: Arjun Rao,Esther Rolf
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 9 figures, 7 tables. Accepted to TerraBytes@ICML 2025

Abstract:A large variety of geospatial data layers is available around the world ranging from remotely-sensed raster data like satellite imagery, digital elevation models, predicted land cover maps, and human-annotated data, to data derived from environmental sensors such as air temperature or wind speed data. A large majority of machine learning models trained on satellite imagery (SatML), however, are designed primarily for optical input modalities such as multi-spectral satellite imagery. To better understand the value of using other input modalities alongside optical imagery in supervised learning settings, we generate augmented versions of SatML benchmark tasks by appending additional geographic data layers to datasets spanning classification, regression, and segmentation. Using these augmented datasets, we find that fusing additional geographic inputs with optical imagery can significantly improve SatML model performance. Benefits are largest in settings where labeled data are limited and in geographic out-of-sample settings, suggesting that multi-modal inputs may be especially valuable for data-efficiency and out-of-sample performance of SatML models. Surprisingly, we find that hard-coded fusion strategies outperform learned variants, with interesting implications for future work.
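
Appending a geographic data layer to optical imagery can be pictured as stacking an extra channel onto the input. The abstract does not say which hard-coded fusion strategies were tested, so the channel concatenation below is an assumption, and the function name `append_layers` and the nested-list `[channel][row][column]` layout are hypothetical.

```python
def append_layers(optical, extra_layers):
    """Stack extra geographic layers onto optical bands along the channel axis.

    optical      : list of bands, each a height x width grid (list of rows)
    extra_layers : list of co-registered grids, e.g. an elevation raster
    """
    height, width = len(optical[0]), len(optical[0][0])
    for layer in extra_layers:
        # Fusion by stacking only makes sense if every layer shares the grid
        if len(layer) != height or any(len(row) != width for row in layer):
            raise ValueError("extra layer does not match the optical grid")
    return list(optical) + list(extra_layers)


# Three optical bands on a 2 x 2 grid, plus one made-up elevation layer
optical = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
    [[0.9, 1.0], [1.1, 1.2]],
]
elevation = [[120.0, 125.0], [130.0, 128.0]]
fused = append_layers(optical, [elevation])
print("channels after fusion:", len(fused))
```

In practice the extra layer would usually be normalized to a range comparable with the optical bands before a model consumes it; that step is left out here.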

[CV-76] Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models

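[Quick Read]: This paper addresses the failure of current text-to-image (T2I) models to account for diverse human experiences, which leaves systems misaligned with the values of different groups. Its core proposal is pluralistic alignment: an AI should understand, and be steerable toward, diverse and sometimes conflicting human values. The key contribution is Diverse Intersectional Visual Evaluation (DIVE), the first multimodal dataset for pluralistic alignment, in which a large pool of demographically intersectional human raters gave extensive feedback on 1000 prompts with high replication, capturing nuanced safety perceptions. The paper also shows empirically that demographics are an important proxy for diverse viewpoints in this domain and that harm perception differs significantly with context, which informs efficient data collection, Large Language Model (LLM) judgment capabilities, and steering models toward different perspectives.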

Link: https://arxiv.org/abs/2507.13383
Authors: Charvi Rastogi,Tian Huey Teh,Pushkar Mishra,Roma Patel,Ding Wang,Mark Díaz,Alicia Parrish,Aida Mostafazadeh Davani,Zoe Ashwood,Michela Paganini,Vinodkumar Prabhakaran,Verena Rieser,Lora Aroyo
Affiliations: Google DeepMind; Google Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 16 figures

Abstract:Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralistic alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) – the first multimodal dataset for pluralistic alignment. It enables deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems. Content Warning: The paper includes sensitive content that may be harmful.

[CV-77] A Comprehensive Survey for Real-World Industrial Defect Detection: Challenges Approaches and Prospects

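[Quick Read]: This survey addresses the limits of conventional closed-set methods in industrial defect detection, which depend heavily on annotated data and struggle to recognize unknown defects, restricting their adaptability and scalability in complex, changing manufacturing environments. Its key contribution is a systematic review and comparison of closed-set and open-set defect detection strategies across 2D and 3D modalities, showing how open-set frameworks reduce the need for large-scale defect annotation and recognize novel anomalies, and thereby push industrial defect detection toward more automated and scalable solutions.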

Link: https://arxiv.org/abs/2507.13378
Authors: Yuqi Cheng,Yunkang Cao,Haiming Yao,Wei Luo,Cheng Jiang,Hui Zhang,Weiming Shen
Affiliations: Huazhong University of Science and Technology; Tsinghua University; IEEE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 7 figures

Abstract:Industrial defect detection is vital for upholding product quality across contemporary manufacturing systems. As the expectations for precision, automation, and scalability intensify, conventional inspection approaches are increasingly found wanting in addressing real-world demands. Notable progress in computer vision and deep learning has substantially bolstered defect detection capabilities across both 2D and 3D modalities. A significant development has been the pivot from closed-set to open-set defect detection frameworks, which diminishes the necessity for extensive defect annotations and facilitates the recognition of novel anomalies. Despite such strides, a cohesive and contemporary understanding of industrial defect detection remains elusive. Consequently, this survey delivers an in-depth analysis of both closed-set and open-set defect detection strategies within 2D and 3D modalities, charting their evolution in recent years and underscoring the rising prominence of open-set techniques. We distill critical challenges inherent in practical detection environments and illuminate emerging trends, thereby providing a current and comprehensive vista of this swiftly progressing field.

[CV-78] StructInbet: Integrating Explicit Structural Guidance into Inbetween Frame Generation SIGGRAPH2025

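[Quick Read]: This paper addresses inconsistent results in conventional inbetweening caused by ambiguous pixel trajectories. The core challenge is to generate coherent, structurally controllable transitions between keyframes while keeping character appearance consistent. The solution has two key parts: explicit structural guidance, which constrains the interpolation path with structural information and reduces pixel-level trajectory ambiguity; and a temporal attention mechanism that incorporates visual identity from both the preceding and succeeding keyframes, keeping character appearance consistent during generation.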

Link: https://arxiv.org/abs/2507.13377
Authors: Zhenglin Pan,Haoran Xie
Affiliations: Japan Advanced Institute of Science and Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 pages, 3 figures. SIGGRAPH 2025 Poster

Abstract:In this paper, we propose StructInbet, an inbetweening system designed to generate controllable transitions over explicit structural guidance. StructInbet introduces two key contributions. First, we propose explicit structural guidance to the inbetweening problem to reduce the ambiguity inherent in pixel trajectories. Second, we adopt a temporal attention mechanism that incorporates visual identity from both the preceding and succeeding keyframes, ensuring consistency in character appearance.

[CV-79] Smart Routing for Multimodal Video Retrieval: When to Search What ICCV2025

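[Quick Read]: This paper addresses the high computational overhead and redundancy of multimodal video retrieval systems that exhaustively search every modality (automatic speech recognition (ASR), optical character recognition (OCR), and visual indices). Dense text captions can reach high recall (75.9% Recall@5), but they require expensive offline processing and miss critical visual information present in the 34% of clips whose scene text is not captured by ASR. The key to the solution is ModaRoute, an LLM-based routing mechanism that uses GPT-4.1 to analyze query intent, predict information needs, and dynamically select which modalities to search. It averages 1.78 modalities per query (versus 3.0 for exhaustive search), reducing computational overhead by 41% while achieving 60.9% Recall@5, which offers a practical path to large-scale deployment.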

Link: https://arxiv.org/abs/2507.13374
Authors: Kevin Dela Rosa
Affiliations: Cloudglue
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to ICCV 2025 Multimodal Representation and Retrieval Workshop

Abstract:We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
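
The reported figures are mutually consistent under one simple reading: if cost scales with the number of modality indices searched per query, going from 3.0 to 1.78 modalities avoids roughly 41% of the searches. The abstract does not state how the 41% was derived, so the linear-cost assumption and the helper name `search_reduction` are illustrative only.

```python
def search_reduction(routed_avg, exhaustive):
    """Fraction of per-query modality searches avoided by routing."""
    return (exhaustive - routed_avg) / exhaustive


# Figures quoted in the abstract
reduction = search_reduction(routed_avg=1.78, exhaustive=3.0)
recall_gap = 75.9 - 60.9  # dense captions vs. ModaRoute, Recall@5 points

print(f"searches avoided: {reduction:.1%}")  # about 40.7%, i.e. roughly 41%
print(f"Recall@5 gap vs dense captions: {recall_gap:.1f} points")
```

Read this way, the trade-off in the abstract is about 15 points of Recall@5 against roughly two fifths of the per-query search work.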

[CV-80] Butter: Frequency Consistency and Hierarchical Fusion for Autonomous Driving Object Detection

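[Quick Read]: This paper addresses the difficulty that current object detection architectures (such as YOLO and DETR) have in maintaining feature consistency across scales, and in balancing detection precision with computational efficiency in dynamic autonomous driving environments. The key to the solution is Butter, a framework with two core components. The Frequency-Adaptive Feature Consistency Enhancement (FAFCE) component refines multi-scale feature consistency by using adaptive frequency filtering to sharpen structural and boundary precision. The Progressive Hierarchical Feature Fusion Network (PHFFNet) module progressively integrates multi-level features to narrow semantic gaps and strengthen hierarchical feature learning. Together they improve detection robustness and feature representation while reducing model complexity, balancing accuracy and efficiency for real-time driving scenarios.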

Link: https://arxiv.org/abs/2507.13373
Authors: Xiaojian Lin,Wenxin Zhang,Yuchu Jiang,Wangyu Wu,Yiran Guo,Kangxu Wang,Zongzheng Zhang,Guijin Wang,Lei Jin,Hao Zhao
Affiliations: Tsinghua University; University of Chinese Academy of Sciences; Southeast University; University of Liverpool; Beijing Institute of Technology; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures. Supplementary material: 8 pages, 7 figures. Accepted at ACM Multimedia 2025

Abstract:Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at this https URL, facilitating further research and validation within the autonomous driving community.

[CV-81] Enhancing Breast Cancer Detection with Vision Transformers and Graph Neural Networks

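[Quick Read]: This paper addresses insufficient accuracy in early breast cancer detection, with the aim of improving diagnostic efficiency and clinical value. The key to the solution is a framework that integrates a Vision Transformer (ViT) with a Graph Neural Network (GNN): the ViT extracts global features from medical images, while the GNN models structural relationships between lesion regions, strengthening recognition of complex pathological patterns. The method reaches 84.2% accuracy on the CBIS-DDSM dataset, outperforming traditional methods, and uses interpretable attention heatmaps to make the model's decisions more transparent to radiologists in clinical settings.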

Link: https://arxiv.org/abs/2507.13372
Authors: Yeming Cai,Zhenglin Li,Yang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Breast cancer is a leading cause of death among women globally, and early detection is critical for improving survival rates. This paper introduces an innovative framework that integrates Vision Transformers (ViT) and Graph Neural Networks (GNN) to enhance breast cancer detection using the CBIS-DDSM dataset. Our framework leverages ViT’s ability to capture global image features and GNN’s strength in modeling structural relationships, achieving an accuracy of 84.2%, outperforming traditional methods. Additionally, interpretable attention heatmaps provide insights into the model’s decision-making process, aiding radiologists in clinical settings.

[CV-82] Transformer-Based Framework for Motion Capture Denoising and Anomaly Detection in Medical Rehabilitation

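[Quick Read]: This paper addresses noise and missing data in optical motion capture caused by occlusion and environmental factors in medical rehabilitation, together with real-time detection of abnormal movements to ensure patient safety. The key to the solution is an end-to-end deep learning framework that uses a Transformer-based model for temporal sequence modeling, which denoises the data, completes missing values, and improves robustness. Evaluations on stroke and orthopedic rehabilitation datasets show superior data reconstruction and anomaly detection, offering a scalable, cost-effective solution for remote rehabilitation.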

Link: https://arxiv.org/abs/2507.13371
Authors: Yeming Cai,Yang Wang,Zhenglin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes an end-to-end deep learning framework integrating optical motion capture with a Transformer-based model to enhance medical rehabilitation. It tackles data noise and missing data caused by occlusion and environmental factors, while detecting abnormal movements in real time to ensure patient safety. Utilizing temporal sequence modeling, our framework denoises and completes motion capture data, improving robustness. Evaluations on stroke and orthopedic rehabilitation datasets show superior performance in data reconstruction and anomaly detection, providing a scalable, cost-effective solution for remote rehabilitation with reduced on-site supervision.

[CV-83] A Novel APVD Steganography Technique Incorporating Pseudorandom Pixel Selection for Robust Image Security

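[Quick Read]: This paper addresses the "unused blocks" problem in Adaptive Pixel Value Differencing (APVD) steganography, which lowers security, limits embedding capacity, and degrades visual quality. The key to the solution is to combine APVD with pseudorandom pixel selection, which the authors report mitigates these issues and improves security, embedding capacity, and image fidelity. Experiments show that the method outperforms existing techniques on Peak Signal-to-Noise Ratio (PSNR), Universal Image Quality Index (UIQ), and Structural Similarity Index (SSIM), and that it handles a variety of cover and secret images in both color and grayscale.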

Link: https://arxiv.org/abs/2507.13367
Authors: Mehrab Hosain,Rajiv Kapoor
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Accepted COMITCON 2023. Lecture Notes in Electrical Engineering, vol 1191. Springer

Abstract:Steganography is the process of embedding secret information discreetly within a carrier, ensuring secure exchange of confidential data. The Adaptive Pixel Value Differencing (APVD) steganography method, while effective, encounters certain challenges like the “unused blocks” issue. This problem can cause a decrease in security, compromise the embedding capacity, and lead to lower visual quality. This research presents a novel steganographic strategy that integrates APVD with pseudorandom pixel selection to effectively mitigate these issues. The results indicate that the new method outperforms existing techniques in aspects of security, data hiding capacity, and the preservation of image quality. Empirical results reveal that the combination of APVD with pseudorandom pixel selection significantly enhances key image quality metrics such as Peak Signal-to-Noise Ratio (PSNR), Universal Image Quality Index (UIQ), and Structural Similarity Index (SSIM), surpassing other contemporary methods in performance. The newly proposed method is versatile, able to handle a variety of cover and secret images in both color and grayscale, thereby ensuring secure data transmission without compromising the aesthetic quality of the image.
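
Two ingredients named in the abstract are easy to illustrate in isolation: the PSNR metric, and a key-dependent pixel visiting order. The sketch below is not the paper's embedding scheme. PSNR follows its standard definition, while `pseudorandom_order`, its `key` parameter, and the use of Python's `random.Random` are assumptions; that generator is not cryptographically secure and stands in for whatever keyed generator a real system would use.

```python
import math
import random


def psnr(cover, stego, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def pseudorandom_order(n_pixels, key):
    """Key-dependent permutation of pixel indices (illustrative only)."""
    order = list(range(n_pixels))
    random.Random(key).shuffle(order)  # same key gives the same order
    return order


cover = [120, 121, 119, 200, 201, 50, 52, 53]
stego = [121, 121, 118, 200, 202, 50, 51, 53]  # made-up embedding changes
order = pseudorandom_order(len(cover), key=42)
print("visit order:", order)
print("PSNR (dB):", round(psnr(cover, stego), 2))
```

A sender and receiver sharing `key` would visit pixels in the same order, so the receiver can locate embedded data without the order being stored in the image.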

[CV-84] Leveraging the Spatial Hierarchy: Coarse-to-fine Trajectory Generation via Cascaded Hybrid Diffusion

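[Quick Read]: This paper addresses the difficulty of releasing fine-grained urban human mobility trajectories at scale because of privacy concerns and high collection costs, as well as the shortcomings of existing trajectory synthesis methods in handling complex high-dimensional distributions and generating realistic trajectories. The core solution is Cardiff, a coarse-to-fine cascaded hybrid diffusion framework for trajectory synthesis. Its key idea is to decompose generation into two levels: at the discrete road segment level, a diffusion-transformer-based network denoises low-dimensional latent embeddings to synthesize segment-level trajectories efficiently; at the continuous GPS level, a conditional denoising network with a noise augmentation mechanism takes the first-stage output as its condition and generates high-fidelity fine-grained trajectories. The cascade progressively improves trajectory quality and also allows a tunable balance between privacy preservation and utility.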

Link: https://arxiv.org/abs/2507.13366
Authors: Baoshen Guo,Zhiqing Hong,Junyi Li,Shenhao Wang,Jinhua Zhao
Affiliations: Singapore-MIT Alliance for Research and Technology; Rutgers University; University of Florida; Massachusetts Institute of Technology
Categories: Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Urban mobility data has significant connections with economic growth and plays an essential role in various smart-city applications. However, due to privacy concerns and substantial data collection costs, fine-grained human mobility trajectories are difficult to become publicly available on a large scale. A promising solution to address this issue is trajectory synthesizing. However, existing works often ignore the inherent structural complexity of trajectories, unable to handle complicated high-dimensional distributions and generate realistic fine-grained trajectories. In this paper, we propose Cardiff, a coarse-to-fine Cascaded hybrid diffusion-based trajectory synthesizing framework for fine-grained and privacy-preserving mobility generation. By leveraging the hierarchical nature of urban mobility, Cardiff decomposes the generation process into two distinct levels, i.e., discrete road segment-level and continuous fine-grained GPS-level: (i) In the segment-level, to reduce computational costs and redundancy in raw trajectories, we first encode the discrete road segments into low-dimensional latent embeddings and design a diffusion transformer-based latent denoising network for segment-level trajectory synthesis. (ii) Taking the first stage of generation as conditions, we then design a fine-grained GPS-level conditional denoising network with a noise augmentation mechanism to achieve robust and high-fidelity generation. Additionally, the Cardiff framework not only progressively generates high-fidelity trajectories through cascaded denoising but also flexibly enables a tunable balance between privacy preservation and utility. Experimental results on three large real-world trajectory datasets demonstrate that our method outperforms state-of-the-art baselines in various metrics.

[CV-85] OmniVec2 – A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning

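[Quick Read]: This paper addresses how to fuse data from different modalities efficiently and learn a unified representation in multimodal, multitask settings. The key to the solution is a multimodal multitask network that uses modality-specialized tokenizers, a shared Transformer architecture, and cross-attention mechanisms to project data from the various modalities into a unified embedding space. It adds modality-specific task heads to support different tasks in each modality, a pretraining strategy based on iterative modality switching, and a training algorithm that trades off fully joint training over all modalities against training on pairs of modalities. The paper reports state-of-the-art performance across 25 datasets.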

Link: https://arxiv.org/abs/2507.13364
Authors: Siddharth Srivastava,Gaurav Sharma
Affiliations: Typeface
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities namely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral. The proposed approach utilizes modality specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network, and a training algorithm which trades off fully joint training over all modalities, with training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state of the art performances, demonstrating the effectiveness of the proposed architecture, pretraining strategy and adapted multitask training.

[CV-86] Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop

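[Quick Read]: This paper addresses the fact that current 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotation, which makes them hard to scale to open-world settings. The core solution uses the rich semantic understanding of 2D vision-language foundation models trained on web-scale image-text pairs to perform open-vocabulary 3D object detection without any human-annotated 3D labels. The pipeline first uses a 2D vision-language detector to generate text-conditioned proposals, segments them with the Segment Anything Model (SAM), and back-projects them into 3D using camera geometry and either LiDAR or monocular pseudo-depth. It then applies a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. The method shows competitive localization performance with both LiDAR-based and purely RGB-D inputs, pointing to the potential of 2D foundation models for scalable 3D perception.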

Link: https://arxiv.org/abs/2507.13363
Authors: Atharv Goel,Mehar Khurana
Affiliations: Indraprastha Institute of Information Technology, Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotations, limiting their ability to scale to open-world settings. In contrast, 2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection via natural language prompts. In this work, we leverage the maturity and category diversity of 2D foundation models to perform open-vocabulary 3D object detection without any human-annotated 3D labels. Our pipeline uses a 2D vision-language detector to generate text-conditioned proposals, which are segmented with SAM and back-projected into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We introduce a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. To simulate adverse real-world conditions, we construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset. Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, all while remaining training-free and open-vocabulary. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception. We open-source our code and resources at this https URL.
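
The back-projection step can be sketched with the standard pinhole camera model: a pixel with known depth maps to a 3D point in the camera frame through the intrinsics. This is a generic illustration, not the authors' code; the function name `back_project` and the intrinsic values (`fx`, `fy`, `cx`, `cy`) and pixel coordinates below are made up.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640 x 480 camera
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# Pixels inside a segmentation mask, each with a depth value in metres
mask_pixels = [(320, 240, 5.0), (420, 240, 10.0), (320, 190, 4.0)]
points = [back_project(u, v, d, fx, fy, cx, cy) for u, v, d in mask_pixels]
print(points)
```

The resulting point set for one mask is the kind of input a clustering step such as DBSCAN would then filter before a bounding box is fitted.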

[CV-87] VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

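[Quick read]: This paper addresses a marked weakness of current Visual Language Models (VLMs) in nonlocal visual reasoning: although they excel at complex visual tasks such as visual question answering (VQA) and chart understanding, they perform poorly on perceptual reasoning tasks that require integrating information from multiple, possibly distant, regions of an image. The key contribution is a structured evaluation suite that isolates and tests three distinct forms of nonlocal visual reasoning: comparative perception, saccadic search, and smooth visual search. The results show that flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini) fail at these basic visual algorithms, indicating that despite gains in raw visual acuity they still lack core visual reasoning capabilities.

Link: https://arxiv.org/abs/2507.13361
Authors: Shmuel Berman, Jia Deng
Affiliations: Princeton University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: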

Abstract:Visual Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models’ capacity for nonlocal visual reasoning – reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test if VLMs can perform similar visual algorithms to humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

[CV-88] Low-Light Enhancement via Encoder-Decoder Network with Illumination Guidance

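[Quick read]: This paper addresses low-light image enhancement, i.e., improving the visibility of dark regions while preserving image detail and a natural appearance. It proposes EDNIG, an encoder-decoder deep learning framework with illumination guidance, whose main components are: 1) an illumination map, derived from the Bright Channel Prior (BCP), used as a guidance input so that the network focuses on underexposed regions; 2) a Spatial Pyramid Pooling (SPP) module that extracts multi-scale contextual features for better handling of diverse lighting conditions; 3) the Swish activation function for smoother gradient propagation, combined with training in a Generative Adversarial Network (GAN) framework using a composite loss (adversarial loss, pixel-wise mean squared error, and perceptual loss).

Link: https://arxiv.org/abs/2507.13360
Authors: Le-Anh Tran, Chung Nguyen Tran, Ngoc-Luu Nguyen, Nhan Cach Dang, Jordi Carrabina, David Castells-Rufas, Minh Son Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, ICCCE 2025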

Abstract:This paper introduces a novel deep learning framework for low-light image enhancement, named the Encoder-Decoder Network with Illumination Guidance (EDNIG). Building upon the U-Net architecture, EDNIG integrates an illumination map, derived from Bright Channel Prior (BCP), as a guidance input. This illumination guidance helps the network focus on underexposed regions, effectively steering the enhancement process. To further improve the model’s representational power, a Spatial Pyramid Pooling (SPP) module is incorporated to extract multi-scale contextual features, enabling better handling of diverse lighting conditions. Additionally, the Swish activation function is employed to ensure smoother gradient propagation during training. EDNIG is optimized within a Generative Adversarial Network (GAN) framework using a composite loss function that combines adversarial loss, pixel-wise mean squared error (MSE), and perceptual loss. Experimental results show that EDNIG achieves competitive performance compared to state-of-the-art methods in quantitative metrics and visual quality, while maintaining lower model complexity, demonstrating its suitability for real-world applications. The source code for this work is available at this https URL.
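
Two of the ingredients named in the abstract have simple closed forms and can be sketched directly. The code below is a rough illustration under stated assumptions: `bright_channel` uses the common definition of a bright channel (maximum over color channels within a local window), and how EDNIG turns that into its illumination map is not specified in the abstract; the function names and the `radius` parameter are hypothetical. `swish` is the usual x · sigmoid(x).

```python
import math


def bright_channel(image, radius=1):
    """Bright channel of an RGB image.

    image:  H x W nested list of (r, g, b) tuples with values in [0, 1].
    radius: half-width of the square window (assumed parameter).
    Returns an H x W nested list holding, for each pixel, the maximum
    intensity over all channels and all pixels in the window.
    """
    h, w = len(image), len(image[0])
    # Per-pixel maximum over the color channels.
    channel_max = [[max(px) for px in row] for row in image]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y][x] = max(
                channel_max[j][i] for j in range(y0, y1) for i in range(x0, x1)
            )
    return out


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))
```

Low values in the bright channel mark regions that are dark in every color channel, which is why such a map is a natural guidance signal for underexposed areas.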

[CV-89] Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives

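[Quick read]: This survey addresses the limited applicability of traditional UAV aerial object detection methods, which recognize only predefined categories. It reviews how cross-modal text-image alignment (e.g., CLIP) enables open-vocabulary object detection (OVOD), in which previously unseen object categories are identified through natural language descriptions, and how this improves the intelligence and autonomy of UAVs in aerial scene understanding. The paper aligns the core principles of OVOD with the characteristics of UAV vision, organizes existing OVOD methods for aerial imagery into a systematic taxonomy, summarizes the relevant datasets, and outlines open challenges and future research directions.

Link: https://arxiv.org/abs/2507.13359
Authors: Yang Zhou, Junjie Li, CongYang Ou, Dawei Yan, Haokui Zhang, Xizhe Xue
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 5 figures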

Abstract:Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in Unmanned Aerial Vehicles (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text-image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep tracing related works at this https URL

[CV-90] Generalist Bimanual Manipulation via Foundation Video Diffusion Models

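[Quick read]: This paper addresses the difficulty of scaling bimanual robotic manipulation, which is held back by data scarcity and embodiment heterogeneity. It proposes VIDAR (VIdeo Diffusion for Action Reasoning), a two-stage framework. First, a video diffusion model is pre-trained on 750K multi-view videos from three real-world bimanual robot platforms, using a unified observation space that encodes robot, camera, task, and scene contexts. Second, a novel masked inverse dynamics model learns masks that extract action-relevant information from generated trajectories without pixel-level labels, and these masks generalize to unseen backgrounds. With only 20 minutes of human demonstrations on an unseen robot platform (about 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding and surpasses state-of-the-art methods.

Link: https://arxiv.org/abs/2507.12898
Authors: Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Affiliations: Tsinghua University; BNRist Center; THBI Lab; Tsinghua-Bosch Joint ML Center
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: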

Abstract:Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

[CV-91] UGPL: Uncertainty-Guided Progressive Learning for Evidence-Based Classification in Computed Tomography ICCV

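[Quick read]: This paper addresses the subtle and spatially diverse nature of pathological features in computed tomography (CT) images; existing methods process images uniformly and therefore struggle to detect localized abnormalities that need focused analysis. It proposes UGPL, an uncertainty-guided progressive learning framework that uses evidential deep learning to quantify predictive uncertainty, extracts informative and spatially diverse patches through a non-maximum suppression mechanism, and integrates contextual information with fine-grained detail through an adaptive fusion mechanism, yielding a global-to-local analysis. Across three CT datasets, accuracy improves by 3.29%, 2.46%, and 8.08% for kidney abnormality, lung cancer, and COVID-19 detection, respectively.

Link: https://arxiv.org/abs/2507.14102
Authors: Shravan Venkatraman, Pavan Kumar S, Rakesh Raj Madavan, Chandrakala S
Affiliations: Vellore Institute of Technology; Shiv Nadar University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 10 figures, 5 tables, 2025 ICCV Workshops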

Abstract:Accurate classification of computed tomography (CT) images is essential for diagnosis and treatment planning, but existing methods often struggle with the subtle and spatially diverse nature of pathological features. Current approaches typically process images uniformly, limiting their ability to detect localized abnormalities that require focused analysis. We introduce UGPL, an uncertainty-guided progressive learning framework that performs a global-to-local analysis by first identifying regions of diagnostic ambiguity and then conducting detailed examination of these critical areas. Our approach employs evidential deep learning to quantify predictive uncertainty, guiding the extraction of informative patches through a non-maximum suppression mechanism that maintains spatial diversity. This progressive refinement strategy, combined with an adaptive fusion mechanism, enables UGPL to integrate both contextual information and fine-grained details. Experiments across three CT datasets demonstrate that UGPL consistently outperforms state-of-the-art methods, achieving improvements of 3.29%, 2.46%, and 8.08% in accuracy for kidney abnormality, lung cancer, and COVID-19 detection, respectively. Our analysis shows that the uncertainty-guided component provides substantial benefits, with performance dramatically increasing when the full progressive learning pipeline is implemented. Our code is available at: this https URL
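
The idea of picking informative but spatially diverse patches with non-maximum suppression can be sketched as follows. This is a simplified illustration, not UGPL itself: scoring windows by mean uncertainty, the square-window IoU test, and the names `select_patches`, `square_iou`, and `max_iou` are all assumptions, since the abstract does not give these details.

```python
def square_iou(a, b, size):
    """IoU of two axis-aligned squares of equal side, given top-left (x, y)."""
    overlap_x = max(0, size - abs(a[0] - b[0]))
    overlap_y = max(0, size - abs(a[1] - b[1]))
    inter = overlap_x * overlap_y
    union = 2 * size * size - inter
    return inter / union


def select_patches(uncertainty, patch_size, num_patches, max_iou=0.3):
    """Greedy NMS over sliding windows ranked by mean uncertainty.

    uncertainty: H x W nested list of per-pixel uncertainty values.
    Returns up to num_patches tuples (score, x, y), highest score first,
    where no two selected windows overlap with IoU above max_iou.
    """
    h, w = len(uncertainty), len(uncertainty[0])
    candidates = []
    for y in range(h - patch_size + 1):
        for x in range(w - patch_size + 1):
            total = sum(
                uncertainty[j][i]
                for j in range(y, y + patch_size)
                for i in range(x, x + patch_size)
            )
            candidates.append((total / patch_size ** 2, x, y))
    candidates.sort(reverse=True)  # most uncertain windows first

    selected = []
    for score, x, y in candidates:
        # Keep a window only if it does not overlap too much with earlier picks.
        if all(square_iou((x, y), (sx, sy), patch_size) <= max_iou
               for _, sx, sy in selected):
            selected.append((score, x, y))
            if len(selected) == num_patches:
                break
    return selected
```

The suppression threshold trades off coverage against focus: a low `max_iou` spreads patches across the image, while a high one lets them cluster around a single ambiguous region.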

[CV-92] D2IP: Deep Dynamic Image Prior for 3D Time-sequence Pulmonary Impedance Imaging

[Quick read]: This paper addresses the high computational cost and slow convergence of unsupervised methods such as Deep Image Prior (DIP), which rely on many network parameter iterations; this limits their practical use in complex 3D or time-sequence tomographic imaging such as dynamic pulmonary imaging. It proposes Deep Dynamic Image Prior (D2IP), built on three strategies: Unsupervised Parameter Warm-Start (UPWS) to accelerate convergence, Temporal Parameter Propagation (TPP) to enforce temporal coherence, and a customized lightweight reconstruction backbone, 3D-FastResUNet, to improve computational efficiency. Compared with state-of-the-art baselines, D2IP achieves a 24.8% increase in average MSSIM and an 8.1% reduction in ERR while running 7.1x faster.

Link: https://arxiv.org/abs/2507.14046
Authors: Hao Fang, Hao Yu, Sihao Teng, Tao Zhang, Siyi Yuan, Huaiwu He, Zhe Liu, Yunjie Yang
Affiliations: SMART Group, Institute for Imaging, Data and Communications, School of Engineering, The University of Edinburgh, Edinburgh, UK; Department of Intensive Care Unit, Tianjin Huanhu Hospital, Tianjin, China; State Key Laboratory of Complex Severe and Rare Diseases, Department of Critical Care Medicine, Peking Union Medical College, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing, China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures

Abstract:Unsupervised learning methods, such as Deep Image Prior (DIP), have shown great potential in tomographic imaging due to their training-data-free nature and high generalization capability. However, their reliance on numerous network parameter iterations results in high computational costs, limiting their practical application, particularly in complex 3D or time-sequence tomographic imaging tasks. To overcome these challenges, we propose Deep Dynamic Image Prior (D2IP), a novel framework for 3D time-sequence imaging. D2IP introduces three key strategies - Unsupervised Parameter Warm-Start (UPWS), Temporal Parameter Propagation (TPP), and a customized lightweight reconstruction backbone, 3D-FastResUNet - to accelerate convergence, enforce temporal coherence, and improve computational efficiency. Experimental results on both simulated and clinical pulmonary datasets demonstrate that D2IP enables fast and accurate 3D time-sequence Electrical Impedance Tomography (tsEIT) reconstruction. Compared to state-of-the-art baselines, D2IP delivers superior image quality, with a 24.8% increase in average MSSIM and an 8.1% reduction in ERR, alongside significantly reduced computational time (7.1x faster), highlighting its promise for clinical dynamic pulmonary imaging.

[CV-93] OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

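[Quick read]: This paper addresses the need for automated diagnosis of musculoskeletal injuries such as rib fractures as medical imaging volumes grow, given that manual interpretation is time-consuming and error-prone. It proposes OrthoInsight, a multi-modal deep learning framework for diagnosis and report generation that combines a YOLOv9 model for fracture detection in CT images, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. Evaluated on 28,675 annotated CT images and expert reports, it reaches an average score of 4.28 across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value.

Link: https://arxiv.org/abs/2507.13993
Authors: Ningyong Wu, Jinzhi Wang, Wenhong Zhao, Chenzhan Yu, Zhigang Xiu, Duwei Dai
Affiliations: Xi’an Jiaotong University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: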

Abstract:The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

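[CV-94] Leveraging Pathology Foundation Models for Panoptic Segmentation of Melanoma in HE Images

[Quick read]: This paper addresses the labour-intensive and observer-dependent manual segmentation of tissue regions in haematoxylin and eosin (HE) stained melanoma images, with the aim of characterizing tissue morphology more accurately and consistently. It uses Virchow2, a pathology foundation model trained on 3.1 million histopathology images, as a feature extractor. The extracted features are fused with the original RGB images and processed by an encoder-decoder segmentation network (Efficient-UNet) to produce segmentation maps for five tissue classes. The model took first place in the tissue segmentation task of the PUMA Grand Challenge.

Link: https://arxiv.org/abs/2507.13974
Authors: Jiaqi Lv, Yijie Zhu, Carmen Guadalupe Colin Tenorio, Brinder Singh Chohan, Mark Eastwood, Shan E Ahmed Raza
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: Accepted by MIUA 2025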

Abstract:Melanoma is an aggressive form of skin cancer with rapid progression and high metastatic potential. Accurate characterisation of tissue morphology in melanoma is crucial for prognosis and treatment planning. However, manual segmentation of tissue regions from haematoxylin and eosin (HE) stained whole-slide images (WSIs) is labour-intensive and prone to inter-observer variability, this motivates the need for reliable automated tissue segmentation methods. In this study, we propose a novel deep learning network for the segmentation of five tissue classes in melanoma HE images. Our approach leverages Virchow2, a pathology foundation model trained on 3.1 million histopathology images as a feature extractor. These features are fused with the original RGB images and subsequently processed by an encoder-decoder segmentation network (Efficient-UNet) to produce accurate segmentation maps. The proposed model achieved first place in the tissue segmentation task of the PUMA Grand Challenge, demonstrating robust performance and generalizability. Our results show the potential and efficacy of incorporating pathology foundation models into segmentation networks to accelerate computational pathology workflows.

[CV-95] Convergent transformations of visual representation in brains and models

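[Quick read]: This paper addresses a fundamental question in cognitive neuroscience: whether visual perception is shaped by the structure of the external world or by the brain's internal architecture. The authors introduce a unified framework that traces representational flow from sensory input to high-level internal representations by combining inter-subject similarity with alignment to model hierarchies. Applied to three independent fMRI datasets, the framework reveals a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This organization is captured by the hierarchies of vision DNNs but not by language models, pointing to a convergent computational solution for visual encoding in human and artificial vision that is driven by the structure of the external world.

Link: https://arxiv.org/abs/2507.13941
Authors: Pablo Marcos-Manchón, Lluís Fuentemilla
Affiliations: University of Barcelona
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: For associated code, see this https URL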

Abstract:A fundamental question in cognitive neuroscience is what shapes visual perception: the external world’s structure or the brain’s internal architecture. Although some perceptual variability can be traced to individual differences, brain responses to naturalistic stimuli evoke similar activity patterns across individuals, suggesting a convergent representational principle. Here, we test if this stimulus-driven convergence follows a common trajectory across people and deep neural networks (DNNs) during its transformation from sensory to high-level internal representations. We introduce a unified framework that traces representational flow by combining inter-subject similarity with alignment to model hierarchies. Applying this framework to three independent fMRI datasets of visual scene perception, we reveal a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This functional organization is captured by the hierarchies of vision DNNs but not language models, reinforcing the specificity of the visual-to-semantic transformation. These findings show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.

[CV-96] Blind Super Resolution with Reference Images and Implicit Degradation Representation ACCV2024

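[Quick read]: This paper addresses the limited generalization of degradation kernels in blind super-resolution (BSR) when the scaling factor is ignored. Existing methods usually estimate the degradation kernel directly from the low-resolution (LR) input without accounting for how degradation differs across upscaling factors, so a single kernel may not suit multiple super-resolution scales. The proposed approach treats the degradation kernel and the scaling factor jointly and uses content-irrelevant high-resolution (HR) reference images to establish scale-aware degradation kernels. The estimated degradation is applied to the HR reference images to generate additional LR-HR pairs, which improve SR performance. The strategy applies to both well-trained blind SR models and zero-shot blind SR methods, and it outperforms previous methods in both settings.

Link: https://arxiv.org/abs/2507.13915
Authors: Huu-Phu Do, Po-Chih Hu, Hao-Chien Hsueh, Che-Kai Liu, Vu-Hoang Tran, Ching-Chun Huang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACCV 2024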

Abstract:Previous studies in blind super-resolution (BSR) have primarily concentrated on estimating degradation kernels directly from low-resolution (LR) inputs to enhance super-resolution. However, these degradation kernels, which model the transition from a high-resolution (HR) image to its LR version, should account for not only the degradation process but also the downscaling factor. Applying the same degradation kernel across varying super-resolution scales may be impractical. Our research acknowledges degradation kernels and scaling factors as pivotal elements for the BSR task and introduces a novel strategy that utilizes HR images as references to establish scale-aware degradation kernels. By employing content-irrelevant HR reference images alongside the target LR image, our model adaptively discerns the degradation process. It is then applied to generate additional LR-HR pairs through down-sampling the HR reference images, which are keys to improving the SR performance. Our reference-based training procedure is applicable to proficiently trained blind SR models and zero-shot blind SR methods, consistently outperforming previous methods in both scenarios. This dual consideration of blur kernels and scaling factors, coupled with the use of a reference image, contributes to the effectiveness of our approach in blind super-resolution tasks.
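
The role of the scaling factor is easiest to see in the classical explicit degradation model, in which an LR image is obtained by blurring the HR image with a kernel and then subsampling by a factor s. The sketch below shows only that textbook model as a way to generate LR-HR pairs from a reference image; it is not the paper's implicit degradation representation, and the function name, border handling, and parameters are assumptions.

```python
def degrade(hr, kernel, scale):
    """Blur an HR image with `kernel`, then keep every `scale`-th pixel.

    hr:     H x W nested list of floats (single-channel image).
    kernel: odd-sized nested list of weights, assumed to sum to 1.
    scale:  integer downscaling factor.
    Borders are handled by replicating edge pixels. The kernel is applied
    without flipping, which equals convolution for symmetric kernels.
    """
    h, w = len(hr), len(hr[0])
    kh, kw = len(kernel), len(kernel[0])
    ry, rx = kh // 2, kw // 2
    blurred = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy = min(max(y + j - ry, 0), h - 1)  # clamp to image
                    xx = min(max(x + i - rx, 0), w - 1)
                    acc += kernel[j][i] * hr[yy][xx]
            blurred[y][x] = acc
    # Subsample: the same kernel yields different LR images for different scales.
    return [row[::scale] for row in blurred[::scale]]
```

Calling `degrade` on one reference image with `scale=2` and `scale=4` produces two different LR-HR pairs from the same kernel, which is the coupling between degradation and scale that the abstract argues should be modeled.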

[CV-97] Software architecture and manual for novel versatile CT image analysis toolbox – AnatomyArchive

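[Quick read]: This paper presents AnatomyArchive, a CT image analysis package built on the full-body segmentation model TotalSegmentator, aimed at automated target volume selection, segmentation mask management, body composition analysis, and radiomic feature extraction. Its main features are a knowledge graph-based tool for managing anatomy segmentation masks; automatic body volume cropping with automatic arm detection and exclusion, for more precise body composition analysis in 2D and 3D; voxel-based radiomic feature extraction with an integrated toolchain for statistical tests and analysis; and GPU-accelerated cinematic rendering. According to the abstract, the code will be released as open source for research and educational purposes only.

Link: https://arxiv.org/abs/2507.13901
Authors: Lei Xu, Torkel B Brismar
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 7 figures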

Abstract:We have developed a novel CT image analysis package named AnatomyArchive, built on top of the recent full body segmentation model TotalSegmentator. It provides automatic target volume selection and deselection capabilities according to user-configured anatomies for volumetric upper- and lower-bounds. It has a knowledge graph-based and time efficient tool for anatomy segmentation mask management and medical image database maintenance. AnatomyArchive enables automatic body volume cropping, as well as automatic arm-detection and exclusion, for more precise body composition analysis in both 2D and 3D formats. It provides robust voxel-based radiomic feature extraction, feature visualization, and an integrated toolchain for statistical tests and analysis. A python-based GPU-accelerated nearly photo-realistic segmentation-integrated composite cinematic rendering is also included. We present here its software architecture design, illustrate its workflow and working principle of algorithms as well provide a few examples on how the software can be used to assist development of modern machine learning models. Open-source codes will be released at this https URL for only research and educational purposes.

[CV-98] Divide and Conquer: A Large-Scale Dataset and Model for Left-Right Breast MRI Segmentation MICCAI2025

[Quick read]: This paper addresses the lack of publicly available breast MRI datasets with explicit left and right breast segmentation labels, which the authors describe as a critical gap in breast MRI analysis. It releases the first public dataset of this kind, with more than 13,000 annotated cases, together with a robust deep learning model trained for left-right breast segmentation, as a resource for developing more advanced tools in women's health.

Link: https://arxiv.org/abs/2507.13830
Authors: Maximilian Rokuss, Benjamin Hamm, Yannick Kirchhoff, Klaus Maier-Hein
Affiliations: German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital; Faculty of Mathematics and Computer Science, Heidelberg University; Medical Faculty, Heidelberg University, Germany; HIDSS4Health, Karlsruhe/Heidelberg, Germany
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025 WOMEN

Abstract:We introduce the first publicly available breast MRI dataset with explicit left and right breast segmentation labels, encompassing more than 13,000 annotated cases. Alongside this dataset, we provide a robust deep-learning model trained for left-right breast segmentation. This work addresses a critical gap in breast MRI analysis and offers a valuable resource for the development of advanced tools in women’s health. The dataset and trained model are publicly available at: this http URL

[CV-99] Converting T1-weighted MRI from 3T to 7T quality using deep learning

[Quick read]: This paper addresses how to obtain T1-weighted brain MRI approaching 7 tesla (7T) quality without access to 7T scanners, which are costly and not widely available. Two deep learning models, a specialized U-Net and a U-Net integrated with a generative adversarial network (GAN U-Net), are trained on paired 3T and 7T T1-weighted images to synthesize 7T images from 3T input. The models outperform two other state-of-the-art 3T-to-7T models on image-based metrics, and blinded raters judge the synthetic images superior in subjective visual quality. Automated amygdala segmentation on synthetic GAN U-Net 7T images is closer to manual segmentation than segmentation on the source 3T images, while downstream prediction of cognitive status performs similarly to real 3T images.

Link: https://arxiv.org/abs/2507.13782
Authors: Malo Gicquel, Ruoyi Zhao, Anika Wuestefeld, Nicola Spotorno, Olof Strandberg, Kalle Åström, Yu Xiao, Laura EM Wisse, Danielle van Westen, Rik Ossenkoppele, Niklas Mattsson-Carlgren, David Berron, Oskar Hansson, Gabrielle Flood, Jacob Vogel
Affiliations: Lund University; Rennes; Vrije Universiteit Amsterdam; Amsterdam UMC; Skåne University Hospital; German Center for Neurodegenerative Diseases; Otto-von-Guericke University Magdeburg; Czech Technical University in Prague
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ultra-high resolution 7 tesla (7T) magnetic resonance imaging (MRI) provides detailed anatomical views, offering better signal-to-noise ratio, resolution and tissue contrast than 3T MRI, though at the cost of accessibility. We present an advanced deep learning model for synthesizing 7T brain MRI from 3T brain MRI. Paired 7T and 3T T1-weighted images were acquired from 172 participants (124 cognitively unimpaired, 48 impaired) from the Swedish BioFINDER-2 study. To synthesize 7T MRI from 3T images, we trained two models: a specialized U-Net, and a U-Net integrated with a generative adversarial network (GAN U-Net). Our models outperformed two additional state-of-the-art 3T-to-7T models in image-based evaluation metrics. Four blinded MRI professionals judged our synthetic 7T images as comparable in detail to real 7T images, and superior in subjective visual quality to 7T images, apparently due to the reduction of artifacts. Importantly, automated segmentations of the amygdalae of synthetic GAN U-Net 7T images were more similar to manually segmented amygdalae (n=20), than automated segmentations from the 3T images that were used to synthesize the 7T images. Finally, synthetic 7T images showed similar performance to real 3T images in downstream prediction of cognitive status using MRI derivatives (n=3,168). In all, we show that synthetic T1-weighted brain images approaching 7T quality can be generated from 3T images, which may improve image quality and segmentation, without compromising performance in downstream tasks. Future directions, possible clinical use cases, and limitations are discussed.

[CV-100] BreastSegNet: Multi-label Segmentation of Breast MRI

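[Quick read]: This paper addresses the limited scope of existing breast MRI segmentation methods, which usually cover only a few structures such as fibroglandular tissue (FGT) or tumors and therefore offer limited support for quantitative analysis. It presents BreastSegNet, a multi-label segmentation algorithm covering nine anatomical labels (FGT, vessel, muscle, bone, lesion, lymph node, heart, liver, and implant), trained and evaluated on 1123 manually annotated MRI slices that were reviewed and corrected by an expert radiologist. Among nine benchmarked models, nnU-Net ResEncM achieves the highest average Dice score, 0.694 across all labels; Dice exceeds 0.73 on heart, liver, muscle, FGT, and bone, and approaches 0.90 for heart and liver.

Link: https://arxiv.org/abs/2507.13604
Authors: Qihang Li, Jichen Yang, Yaqian Chen, Yuwen Chen, Hanxue Gu, Lars J. Grimm, Maciej A. Mazurowski
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: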

Abstract:Breast MRI provides high-resolution imaging critical for breast cancer screening and preoperative staging. However, existing segmentation methods for breast MRI remain limited in scope, often focusing on only a few anatomical structures, such as fibroglandular tissue or tumors, and do not cover the full range of tissues seen in scans. This narrows their utility for quantitative analysis. In this study, we present BreastSegNet, a multi-label segmentation algorithm for breast MRI that covers nine anatomical labels: fibroglandular tissue (FGT), vessel, muscle, bone, lesion, lymph node, heart, liver, and implant. We manually annotated a large set of 1123 MRI slices capturing these structures with detailed review and correction from an expert radiologist. Additionally, we benchmark nine segmentation models, including U-Net, SwinUNet, UNet++, SAM, MedSAM, and nnU-Net with multiple ResNet-based encoders. Among them, nnU-Net ResEncM achieves the highest average Dice scores of 0.694 across all labels. It performs especially well on heart, liver, muscle, FGT, and bone, with Dice scores exceeding 0.73, and approaching 0.90 for heart and liver. All model code and weights are publicly available, and we plan to release the data at a later date.

[CV-101] Domain-randomized deep learning for neuroimage analysis

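[Quick read]: This tutorial paper addresses the limited generalization of deep learning models in neuroimage analysis caused by the narrow scope of many training datasets, a problem that is especially acute in magnetic resonance imaging (MRI), where image appearance varies widely across pulse sequences and scanner hardware. It reviews a synthesis-driven training paradigm (domain randomization) in which networks are trained on synthetic images with randomized intensities and anatomical content, generated from anatomical segmentation maps, so that they can process image types unseen during training without retraining or fine-tuning. The approach has been shown to be effective across several imaging modalities, with improved generalization and resistance to overfitting at the cost of increased computational demands.

Link: https://arxiv.org/abs/2507.13458
Authors: Malte Hoffmann
Affiliations: Athinoula A. Martinos Center for Biomedical Imaging; Harvard Medical School; Massachusetts General Hospital
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 2 tables, deep learning, domain generalization, domain randomization, neuroimaging, medical image analysis, accepted for publication in IEEE Signal Processing Magazine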

Abstract:Deep learning has revolutionized neuroimage analysis by delivering unprecedented speed and accuracy. However, the narrow scope of many training datasets constrains model robustness and generalizability. This challenge is particularly acute in magnetic resonance imaging (MRI), where image appearance varies widely across pulse sequences and scanner hardware. A recent domain-randomization strategy addresses the generalization problem by training deep neural networks on synthetic images with randomized intensities and anatomical content. By generating diverse data from anatomical segmentation maps, the approach enables models to accurately process image types unseen during training, without retraining or fine-tuning. It has demonstrated effectiveness across modalities including MRI, computed tomography, positron emission tomography, and optical coherence tomography, as well as beyond neuroimaging in ultrasound, electron and fluorescence microscopy, and X-ray microtomography. This tutorial paper reviews the principles, implementation, and potential of the synthesis-driven training paradigm. It highlights key benefits, such as improved generalization and resistance to overfitting, while discussing trade-offs such as increased computational demands. Finally, the article explores practical considerations for adopting the technique, aiming to accelerate the development of generalizable tools that make deep learning more accessible to domain experts without extensive computational resources or machine learning knowledge.
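
The core of the synthesis idea, generating training images from segmentation maps with randomized intensities, can be sketched in a few lines. This is a deliberately reduced illustration and not the published pipeline: it only draws one random mean intensity per label and adds Gaussian noise, and the function name and `noise_std` parameter are assumptions. Real implementations typically add further randomization that is omitted here.

```python
import random


def synthesize_image(label_map, rng, noise_std=0.05):
    """Turn a label map into a synthetic image with random tissue contrast.

    label_map: H x W nested list of integer labels.
    rng:       random.Random instance, so that results are reproducible.
    noise_std: standard deviation of per-pixel Gaussian noise (assumed).
    Each label gets a random mean intensity in [0, 1); pixel values are
    clipped to [0, 1].
    """
    labels = sorted({label for row in label_map for label in row})
    means = {label: rng.random() for label in labels}
    return [
        [min(1.0, max(0.0, rng.gauss(means[label], noise_std))) for label in row]
        for row in label_map
    ]
```

Calling `synthesize_image` repeatedly on the same label map yields images with different, often unrealistic contrasts while the anatomy and the ground-truth labels stay fixed, which is what pushes a network to rely on shape rather than on a particular intensity profile.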

[CV-102] Enhanced DeepLab Based Nerve Segmentation with Optimized Tuning

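[Quick read]: This paper addresses accurate segmentation of nerve structures in medical imaging. It presents an optimized DeepLabV3-based segmentation pipeline with automated threshold fine-tuning; with refined preprocessing and parameter optimization, it reaches a Dice score of 0.78, an IoU of 0.70, and a pixel accuracy of 0.95 on ultrasound nerve imaging, which the authors take as evidence for the importance of tailored parameter selection in automated nerve detection.

Link: https://arxiv.org/abs/2507.13394
Authors: Akhil John Thomas, Christiaan Boerkamp
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: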

Abstract:Nerve segmentation is crucial in medical imaging for precise identification of nerve structures. This study presents an optimized DeepLabV3-based segmentation pipeline that incorporates automated threshold fine-tuning to improve segmentation accuracy. By refining preprocessing steps and implementing parameter optimization, we achieved a Dice Score of 0.78, an IoU of 0.70, and a Pixel Accuracy of 0.95 on ultrasound nerve imaging. The results demonstrate significant improvements over baseline models and highlight the importance of tailored parameter selection in automated nerve detection.
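
For readers less familiar with the three reported metrics, the following sketch computes them for one binary mask pair. It uses the standard definitions; the function name and the convention of returning 1.0 for Dice and IoU when both masks are empty are choices made here, not details from the paper.

```python
def segmentation_metrics(pred, target):
    """Dice, IoU, and pixel accuracy for two flat binary masks (0/1 values)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("masks must be non-empty and of equal length")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = len(pred) - tp - fp - fn

    denom = tp + fp + fn
    dice = 2 * tp / (2 * tp + fp + fn) if denom else 1.0  # both masks empty
    iou = tp / denom if denom else 1.0
    accuracy = (tp + tn) / len(pred)
    return dice, iou, accuracy
```

For a single mask the two overlap scores are linked by Dice = 2·IoU / (1 + IoU); scores averaged over many images, such as those reported above, need not satisfy this identity exactly. Pixel accuracy also counts true negatives, so it tends to be high whenever the target structure occupies a small part of the image.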

[CV-103] Flatten Wisely: How Patch Order Shapes Mamba-Powered Vision for MRI Segmentation

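[Quick read]: This paper addresses the choice of patch scan order that arises when Vision Mamba models serialize a 2D image into a 1D sequence for medical image segmentation, a design factor that has been overlooked but strongly affects performance. The key to its solution is Multi-Scan 2D (MS2D), a parameter-free module that supports exploring multiple scan paths at no additional computational cost, which makes it possible to evaluate systematically how scan order affects MRI segmentation. Experiments show that spatially contiguous paths (such as horizontal and vertical rasters) outperform disjointed diagonal scans, and that scan order is a statistically significant hyperparameter that costs nothing extra, giving empirical guidance for improving Mamba models in medical imaging.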

Link: https://arxiv.org/abs/2507.13384
Authors: Osama Hardan, Omar Elshenhabi, Tamer Khattab, Mohamed Mabrok
Affiliation: Qatar University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to the 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS)

Abstract:Vision Mamba models promise transformer-level performance at linear computational cost, but their reliance on serializing 2D images into 1D sequences introduces a critical, yet overlooked, design choice: the patch scan order. In medical imaging, where modalities like brain MRI contain strong anatomical priors, this choice is non-trivial. This paper presents the first systematic study of how scan order impacts MRI segmentation. We introduce Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures that facilitates exploring diverse scan paths without additional computational cost. We conduct a large-scale benchmark of 21 scan strategies on three public datasets (BraTS 2020, ISLES 2022, LGG), covering over 70,000 slices. Our analysis shows conclusively that scan order is a statistically significant factor (Friedman test: $\chi^2_{20}=43.9$, $p=0.0016$), with performance varying by as much as 27 Dice points. Spatially contiguous paths – simple horizontal and vertical rasters – consistently outperform disjointed diagonal scans. We conclude that scan order is a powerful, cost-free hyperparameter, and provide an evidence-based shortlist of optimal paths to maximize the performance of Mamba models in medical imaging.
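To make "scan order" concrete, the sketch below flattens a small patch grid along three paths and measures how far apart consecutive patches sit. It is an illustration only and is not the MS2D module: the function names and the mean-step measure are assumptions introduced here, and the paper's 21 strategies are not reproduced.

```python
def horizontal_raster(h, w):
    """Row by row, left to right."""
    return [(r, c) for r in range(h) for c in range(w)]


def vertical_raster(h, w):
    """Column by column, top to bottom."""
    return [(r, c) for c in range(w) for r in range(h)]


def diagonal_scan(h, w):
    """Anti-diagonals r + c = 0, 1, ..., each visited from top to bottom."""
    return [(r, s - r)
            for s in range(h + w - 1)
            for r in range(h)
            if 0 <= s - r < w]


def mean_step(path):
    """Average Manhattan distance between consecutive patches in the 1D
    sequence (a rough, hypothetical proxy for spatial contiguity)."""
    steps = [abs(r1 - r0) + abs(c1 - c0)
             for (r0, c0), (r1, c1) in zip(path, path[1:])]
    return sum(steps) / len(steps)


for name, scan in [("horizontal", horizontal_raster),
                   ("vertical", vertical_raster),
                   ("diagonal", diagonal_scan)]:
    print(name, mean_step(scan(3, 3)))
# horizontal 1.5
# vertical 1.5
# diagonal 2.0
```

On this 3x3 grid the rasters keep neighbouring patches closer together in the sequence than the diagonal scan does, which is one intuitive reading of why contiguous paths might help; the abstract itself reports only the empirical result.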

Artificial Intelligence

[AI-0] Toward Temporal Causal Representation Learning with Tensor Decomposition

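[Quick read]: This paper addresses temporal causal representation learning on high-dimensional, irregular tensor data. In real-world applications such as electronic health records, data often take the form of high-dimensional time series of varying length, from which conventional methods struggle to extract causally meaningful latent structure. The key to its solution is CaRTeD, a framework that jointly models temporal causal representation learning and irregular tensor decomposition. With a new causal formulation over latent clusters and a more flexible regularization design, it learns the tensor factors and supports downstream tasks such as latent structure modeling and causal information extraction in an interpretable way. Theoretically, the algorithm is shown to converge to a stationary point, which fills a gap in convergence guarantees for current irregular tensor decomposition.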

Link: https://arxiv.org/abs/2507.14126
Authors: Jianhong Chen, Meng Zhao, Mostafa Reisi Gahrooei, Xubo Yue
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Temporal causal representation learning is a powerful tool for uncovering complex patterns in observational studies, which are often represented as low-dimensional time series. However, in many real-world applications, data are high-dimensional with varying input lengths and naturally take the form of irregular tensors. To analyze such data, irregular tensor decomposition is critical for extracting meaningful clusters that capture essential information. In this paper, we focus on modeling causal representation learning based on the transformed information. First, we present a novel causal formulation for a set of latent clusters. We then propose CaRTeD, a joint learning framework that integrates temporal causal representation learning with irregular tensor decomposition. Notably, our framework provides a blueprint for downstream tasks using the learned tensor factors, such as modeling latent structures and extracting causal information, and offers a more flexible regularization design to enhance tensor decomposition. Theoretically, we show that our algorithm converges to a stationary point. More importantly, our results fill the gap in theoretical guarantees for the convergence of state-of-the-art irregular tensor decomposition. Experimental results on synthetic and real-world electronic health record (EHR) datasets (MIMIC-III), with extensive benchmarks from both phenotyping and network recovery perspectives, demonstrate that our proposed method outperforms state-of-the-art techniques and enhances the explainability of causal representations.

[AI-1] Kolmogorov Arnold Networks (KANs) for Imbalanced Data – An Empirical Perspective

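[Quick read]: This paper addresses class-imbalanced classification, specifically how Kolmogorov Arnold Networks (KANs) perform in this setting and how compatible they are with conventional approaches such as Multi-Layer Perceptrons (MLPs) and resampling strategies. Its key findings are as follows. KANs outperform MLPs on raw imbalanced data without any resampling. However, standard imbalance strategies (oversampling, undersampling, and Focal Loss) significantly degrade KAN performance while giving MLPs only marginal gains. KANs also incur prohibitive computational cost without proportional performance gains, and statistical validation shows that MLPs combined with imbalance techniques match KANs (|d| < 0.08) at minimal resource cost. The paper therefore identifies the advantage of KANs on raw imbalanced data, points out their current limits in computational efficiency and in compatibility with mainstream imbalance techniques, and proposes that future work focus on KAN-specific architectural modifications, computational efficiency, and a theoretical reconciliation of their conflict with data augmentation.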

Link: https://arxiv.org/abs/2507.14121
Authors: Pankaj Yadav, Vivek Vijay
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 Pages, 4 figures

Abstract:Kolmogorov Arnold Networks (KANs) are a recent architectural advancement in neural computation that offers a mathematically grounded alternative to standard neural networks. This study presents an empirical evaluation of KANs in the context of class imbalanced classification, using ten benchmark datasets. We observe that KANs inherently perform better on raw imbalanced data than Multi-Layer Perceptrons (MLPs) without any resampling strategy. However, conventional imbalance strategies fundamentally conflict with KANs' mathematical structure, as resampling and focal loss implementations significantly degrade KAN performance while marginally benefiting MLPs. Crucially, KANs suffer from prohibitive computational costs without proportional performance gains. Statistical validation confirms that MLPs with imbalance techniques achieve equivalence with KANs (|d| < 0.08 across metrics) at minimal resource costs. These findings reveal that KANs represent a specialized solution for raw imbalanced data where resources permit, but their severe performance-resource tradeoffs and incompatibility with standard resampling techniques currently limit practical deployment. We identify critical research priorities as developing KAN-specific architectural modifications for imbalance learning, optimizing computational efficiency, and theoretically reconciling their conflict with data augmentation. This work establishes foundational insights for next-generation KAN architectures in imbalanced classification scenarios.
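For readers unfamiliar with the focal loss mentioned in the abstract, here is a minimal sketch of its standard binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults gamma = 2 and alpha = 0.25 are commonly used values and are an assumption here; the abstract does not say which settings or implementation the study used.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one example; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example.

    p_t is the probability assigned to the true class. The factor
    (1 - p_t) ** gamma shrinks the loss of well-classified examples, and
    alpha_t reweights the positive class against the negative class.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.3).
for p in (0.9, 0.3):
    print(p, focal_loss(p, 1) / cross_entropy(p, 1))
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check for any implementation.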

[AI-2] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

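[Quick read]: This paper addresses the poor performance of Large Language Models (LLMs) on CUDA code optimization: current state-of-the-art models (e.g., R1, o1) achieve low success rates and limited gains when trying to speed up CUDA programs. The core challenge is turning an LLM's code generation ability into GPU acceleration strategies that actually work, especially for complex CUDA kernels whose optimization depends heavily on hardware characteristics. The key to its solution is CUDA-L1, an automated CUDA optimization framework based on Reinforcement Learning (RL). It is trained with speedup as the only reward signal and, without human annotation or domain knowledge, discovers and combines multiple CUDA optimization techniques, improving performance across GPU architectures (for example, an average speedup of x17.7 and a peak of x449 on NVIDIA A100) and transferring well to architectures it was not optimized for.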

Link: https://arxiv.org/abs/2507.14111
Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Preprint Version

Abstract:The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) Discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) Uncovers fundamental principles of CUDA optimization; 3) Identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extend the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources. 
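A speedup-based reward of the kind the abstract describes might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the use of the median over repeated timings, the zero reward for incorrect kernels, and the arithmetic mean across kernels. The abstract does not specify how CUDA-L1 measures or aggregates speedups.

```python
from statistics import median


def speedup_reward(ref_times_ms, cand_times_ms, outputs_match):
    """Hypothetical reward: reference runtime divided by candidate runtime.

    Medians over repeated runs damp timing noise. A candidate kernel whose
    outputs do not match the reference earns no reward, however fast it is.
    """
    if not outputs_match:
        return 0.0
    return median(ref_times_ms) / median(cand_times_ms)


def mean_speedup(rewards):
    """Arithmetic mean of per-kernel speedups (an assumed aggregation)."""
    return sum(rewards) / len(rewards)


rewards = [
    speedup_reward([10.0, 12.0, 11.0], [2.0, 2.2, 2.1], outputs_match=True),
    speedup_reward([5.0, 5.0, 5.0], [1.0, 1.0, 1.0], outputs_match=False),
]
print(rewards, mean_speedup(rewards))
```

Gating the reward on correctness matters because an optimizer rewarded on speed alone can otherwise learn to skip work.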

[AI-3] Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment

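[Quick read]: This paper addresses the slow, expert-dependent interpretation of Non-Destructive Evaluation (NDE) data in bridge maintenance, which affects the speed and accuracy of decisions. The key to its solution is to bring in Large Language Models (LLMs): purpose-designed prompts improve the quality of image descriptions, several LLMs caption the NDE contour maps in parallel, and their outputs are then summarized into a comprehensive bridge condition overview. Four of the nine models evaluated give better image descriptions, and ChatGPT-4 and Claude 3.5 Sonnet produce the more effective summaries, suggesting that LLM-assisted analysis can make bridge inspection workflows more efficient without sacrificing accuracy.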

Link: https://arxiv.org/abs/2507.14107
Authors: Viraj Nishesh Darji, Callie C. Liao, Duoduo Liao
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Bridge maintenance and safety are essential for transportation authorities, and Non-Destructive Evaluation (NDE) techniques are critical to assessing structural integrity. However, interpreting NDE data can be time-consuming and requires expertise, potentially delaying decision-making. Recent advancements in Large Language Models (LLMs) offer new ways to automate and improve this analysis. This pilot study introduces a holistic assessment of LLM capabilities for interpreting NDE contour maps and demonstrates the effectiveness of LLMs in providing detailed bridge condition analyses. It establishes a framework for integrating LLMs into bridge inspection workflows, indicating that LLM-assisted analysis can enhance efficiency without compromising accuracy. In this study, several LLMs are explored with prompts specifically designed to enhance the quality of image descriptions, which are applied to interpret five different NDE contour maps obtained through technologies for assessing bridge conditions. Each LLM model is evaluated based on its ability to produce detailed descriptions, identify defects, provide actionable recommendations, and demonstrate overall accuracy. The research indicates that four of the nine models provide better image descriptions, effectively covering a wide range of topics related to the bridge’s condition. The outputs from these four models are summarized using five different LLMs to form a comprehensive overview of the bridge. Notably, LLMs ChatGPT-4 and Claude 3.5 Sonnet generate more effective summaries. The findings suggest that LLMs have the potential to significantly improve efficiency and accuracy. This pilot study presents an innovative approach that leverages LLMs for image captioning in parallel and summarization, enabling faster decision-making in bridge maintenance and enhancing infrastructure management and safety assessments.

[AI-4] The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?

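[Quick read]: The question this paper asks is whether emotion recognition can serve as a proxy for memorability, so that intelligent systems can model users more accurately in settings such as meeting support, memory augmentation, and meeting summarisation. The traditional view holds that highly emotional experiences are also highly memorable, so emotion annotations are often used as a stand-in for memorability. Existing emotion recognition systems, however, rely mostly on third-party annotations, which may not reflect first-person emotional relevance and memorability. The key to the study is continuous time-based annotation in dynamic, unstructured group interactions, used to test empirically the relationship between perceived group emotion (Pleasure-Arousal dimensions) and group memorability. The results show that the association cannot be reliably distinguished from chance, which exposes the limits of current emotion-based inference of memorability and points to new research directions for Affective Computing.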

Link: https://arxiv.org/abs/2507.14084
Authors: Maria Tsfasman, Ramin Ghorbani, Catholijn M. Jonker, Bernd Dudzik
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans have a selective memory, remembering relevant episodes and forgetting the less relevant information. Possessing awareness of event memorability for a user could help intelligent systems in more accurate user modelling, especially for such applications as meeting support systems, memory augmentation, and meeting summarisation. Emotion recognition has been widely studied, since emotions are thought to signal moments of high personal relevance to users. The emotional experience of situations and their memorability have traditionally been considered to be closely tied to one another: moments that are experienced as highly emotional are considered to also be highly memorable. This relationship suggests that emotional annotations could serve as proxies for memorability. However, existing emotion recognition systems rely heavily on third-party annotations, which may not accurately represent the first-person experience of emotional relevance and memorability. This is why, in this study, we empirically examine the relationship between perceived group emotions (Pleasure-Arousal) and group memorability in the context of conversational interactions. Our investigation involves continuous time-based annotations of both emotions and memorability in dynamic, unstructured group settings, approximating conditions of real-world conversational AI applications such as online meeting support systems. Our results show that the observed relationship between affect and memorability annotations cannot be reliably distinguished from what might be expected under random chance. We discuss the implications of this surprising finding for the development and applications of Affective Computing technology. In addition, we contextualise our findings in broader discourses in the Affective Computing and point out important targets for future research efforts.

[AI-5] Glucose-ML: A collection of longitudinal diabetes datasets for development of robust AI solutions

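[Quick read]: This paper addresses the difficulty of obtaining high-quality data for developing AI algorithms for diabetes management: the lack of public, standardized, and representative continuous glucose monitor (CGM) datasets hinders transparent, reproducible, and robust AI solutions. The key to its solution is Glucose-ML, a collection of 10 publicly available diabetes datasets released within the last 7 years (2018–2025), covering over 300,000 days of CGM data (38 million glucose samples in total) from 2500+ participants across 4 countries, including people with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. The study further offers a comparative analysis to guide data selection and uses short-term blood glucose prediction as a case study to provide a cross-dataset benchmark, showing that the same algorithm can perform significantly differently on different datasets, which yields evidence and practical recommendations for building robust diabetes AI models.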

Link: https://arxiv.org/abs/2507.14077
Authors: Temiloluwa Prioleau, Baiying Lu, Yanjun Cui
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 3 figures, 6 tables

Abstract:Artificial intelligence (AI) algorithms are a critical part of state-of-the-art digital health technology for diabetes management. Yet, access to large high-quality datasets is creating barriers that impede development of robust AI solutions. To accelerate development of transparent, reproducible, and robust AI solutions, we present Glucose-ML, a collection of 10 publicly available diabetes datasets, released within the last 7 years (i.e., 2018 - 2025). The Glucose-ML collection comprises over 300,000 days of continuous glucose monitor (CGM) data with a total of 38 million glucose samples collected from 2500+ people across 4 countries. Participants include persons living with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. To support researchers and innovators with using this rich collection of diabetes datasets, we present a comparative analysis to guide algorithm developers with data selection. Additionally, we conduct a case study for the task of blood glucose prediction - one of the most common AI tasks within the field. Through this case study, we provide a benchmark for short-term blood glucose prediction across all 10 publicly available diabetes datasets within the Glucose-ML collection. We show that the same algorithm can have significantly different prediction results when developed/evaluated with different datasets. Findings from this study are then used to inform recommendations for developing robust AI solutions within the diabetes or broader health domain. We provide direct links to each longitudinal diabetes dataset in the Glucose-ML collection and openly provide our code.

[AI-6] Edge Intelligence with Spiking Neural Networks

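[Quick read]: This paper addresses the high latency, high bandwidth consumption, and weak privacy protection that arise when conventional deep learning models serve resource-constrained edge devices, problems rooted in the limits of cloud-centric architectures. The key to its solution is the brain-inspired computing paradigm of Spiking Neural Networks (SNNs), which emulate biological neuronal dynamics to achieve low-power, event-driven inference and learning. The survey systematically organizes the foundations of EdgeSNN (SNN-based edge intelligence), including neuron models, learning algorithms, and hardware platforms, and discusses in depth practical challenges such as lightweight SNN inference, resource-aware training and updating under non-stationary data, and security and privacy protection. It also proposes a dual-track benchmarking strategy to support fair comparison and hardware-aware optimization, helping brain-inspired learning move toward edge deployment.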

Link: https://arxiv.org/abs/2507.14069
Authors: Shuiguang Deng, Di Yu, Changze Lv, Xin Du, Linshan Jiang, Xiaofan Zhao, Wentao Tong, Xiaoqing Zheng, Weijia Fang, Peng Zhao, Gang Pan, Schahram Dustdar, Albert Y. Zomaya
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been submitted to Proceeding of IEEE for possible publication

Abstract:The convergence of artificial intelligence and edge computing has spurred growing interest in enabling intelligent services directly on resource-constrained devices. While traditional deep learning models require significant computational resources and centralized data management, the resulting latency, bandwidth consumption, and privacy concerns have exposed critical limitations in cloud-centric paradigms. Brain-inspired computing, particularly Spiking Neural Networks (SNNs), offers a promising alternative by emulating biological neuronal dynamics to achieve low-power, event-driven computation. This survey provides a comprehensive overview of Edge Intelligence based on SNNs (EdgeSNNs), examining their potential to address the challenges of on-device learning, inference, and security in edge scenarios. We present a systematic taxonomy of EdgeSNN foundations, encompassing neuron models, learning algorithms, and supporting hardware platforms. Three representative practical considerations of EdgeSNN are discussed in depth: on-device inference using lightweight SNN models, resource-aware training and updating under non-stationary data conditions, and secure and privacy-preserving issues. Furthermore, we highlight the limitations of evaluating EdgeSNNs on conventional hardware and introduce a dual-track benchmarking strategy to support fair comparisons and hardware-aware optimization. Through this study, we aim to bridge the gap between brain-inspired learning and practical edge deployment, offering insights into current advancements, open challenges, and future research directions. To the best of our knowledge, this is the first dedicated and comprehensive survey on EdgeSNNs, providing an essential reference for researchers and practitioners working at the intersection of neuromorphic computing and edge intelligence.

[AI-7] Noradrenergic-inspired gain modulation attenuates the stability gap in joint training

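[Quick read]: This paper addresses the "stability gap" in continual learning: a transient drop in performance on mastered tasks while a new task is being learned, which persists even under an idealized joint-loss training regime. The phenomenon exposes a weakness of current methods in balancing forgetting mitigation and knowledge retention, in particular adaptation that is too fast and retention that is too weak at task boundaries. The key to its solution is uncertainty-modulated gain dynamics, a mechanism inspired by noradrenergic neuromodulation. It mimics how the locus coeruleus-noradrenaline system raises neuronal gain under uncertainty to facilitate sensory assimilation, dynamically regulating the integration of new knowledge while minimizing interference with existing representations. This attenuates the stability gap and offers a new biologically inspired mechanism for jointly optimizing plasticity and stability in continual learning frameworks.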

Link: https://arxiv.org/abs/2507.14056
Authors: Alejandro Rodriguez-Garcia, Anindya Ghosh, Srikanth Ramaswamy
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 18 pages, 5 figures, 1 table, 1 pseudo-code

Abstract:Recent studies in continual learning have identified a transient drop in performance on mastered tasks when assimilating new ones, known as the stability gap. Such dynamics contradict the objectives of continual learning, revealing a lack of robustness in mitigating forgetting, and notably, persisting even under an ideal joint-loss regime. Examining this gap within this idealized joint training context is critical to isolate it from other sources of forgetting. We argue that it reflects an imbalance between rapid adaptation and robust retention at task boundaries, underscoring the need to investigate mechanisms that reconcile plasticity and stability within continual learning frameworks. Biological brains navigate a similar dilemma by operating concurrently on multiple timescales, leveraging neuromodulatory signals to modulate synaptic plasticity. However, artificial networks lack native multitimescale dynamics, and although optimizers like momentum-SGD and Adam introduce implicit timescale regularization, they still exhibit stability gaps. Inspired by locus coeruleus mediated noradrenergic bursts, which transiently enhance neuronal gain under uncertainty to facilitate sensory assimilation, we propose uncertainty-modulated gain dynamics - an adaptive mechanism that approximates a two-timescale optimizer and dynamically balances integration of knowledge with minimal interference on previously consolidated information. We evaluate our mechanism on domain-incremental and class-incremental variants of the MNIST and CIFAR benchmarks under joint training, demonstrating that uncertainty-modulated gain dynamics effectively attenuate the stability gap. Finally, our analysis elucidates how gain modulation replicates noradrenergic functions in cortical circuits, offering mechanistic insights into reducing stability gaps and enhance performance in continual learning tasks.

[AI-8] A multi-strategy improved snake optimizer for three-dimensional UAV path planning and engineering problems

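[Quick read]: This paper addresses the slow convergence and susceptibility to local optima of the Snake Optimizer (SO) in practical use. The key to its solution is a Multi-strategy Improved Snake Optimizer (MISO). First, an adaptive random disturbance strategy based on the sine function reduces the risk of being trapped in a local optimum. Second, an adaptive Levy flight strategy based on a scale factor and a leader mechanism gives the male snake leader the ability to fly, making it easier to escape local optima. Finally, a position update strategy that combines elite leadership with Brownian motion accelerates convergence while preserving solution precision. Experiments show that MISO outperforms 11 popular algorithms on the CEC2017 and CEC2022 test functions, and it is applied to three-dimensional Unmanned Aerial Vehicle (UAV) path planning and 6 engineering design problems, supporting its effectiveness and stability in practical scenarios.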

Link: https://arxiv.org/abs/2507.14043
Authors: Genliang Li, Yaxin Cui, Jinyu Su
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 59 pages, 22 figures

Abstract:Metaheuristic algorithms have gained widespread application across various fields owing to their ability to generate diverse solutions. One such algorithm is the Snake Optimizer (SO), a progressive optimization approach. However, SO suffers from the issues of slow convergence speed and susceptibility to local optima. In light of these shortcomings, we propose a novel Multi-strategy Improved Snake Optimizer (MISO). Firstly, we propose a new adaptive random disturbance strategy based on sine function to alleviate the risk of getting trapped in a local optimum. Secondly, we introduce adaptive Levy flight strategy based on scale factor and leader and endow the male snake leader with flight capability, which makes it easier for the algorithm to leap out of the local optimum and find the global optimum. More importantly, we put forward a position update strategy combining elite leadership and Brownian motion, effectively accelerating the convergence speed while ensuring precision. Finally, to demonstrate the performance of MISO, we utilize 30 CEC2017 test functions and the CEC2022 test suite, comparing it with 11 popular algorithms across different dimensions to validate its effectiveness. Moreover, Unmanned Aerial Vehicle (UAV) has been widely used in various fields due to its advantages of low cost, high mobility and easy operation. However, the UAV path planning problem is crucial for flight safety and efficiency, and there are still challenges in establishing and optimizing the path model. Therefore, we apply MISO to the UAV 3D path planning problem as well as 6 engineering design problems to assess its feasibility in practical applications. The experimental results demonstrate that MISO exceeds other competitive algorithms in terms of solution quality and stability, establishing its strong potential for application.
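A Levy flight step, the building block behind the second strategy, is often sampled with Mantegna's algorithm. The sketch below shows that generic recipe and a toy position perturbation. It is not MISO's update rule: the fixed `scale`, the `perturb` formula, and all names are assumptions, and the paper's adaptive scale factor and leader mechanism are not reproduced.

```python
import math
import random


def mantegna_sigma(beta):
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step: u / |v|^(1/beta), with u ~ N(0, sigma^2)
    and v ~ N(0, 1). Most steps are small, a few are very large."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def perturb(position, best, scale=0.01, beta=1.5, rng=random):
    """Toy update (hypothetical): move each coordinate by a Levy step scaled
    by its distance from the best-known solution."""
    return [x + scale * levy_step(beta, rng) * (x - b)
            for x, b in zip(position, best)]


rng = random.Random(0)  # seeded for repeatability
print(perturb([1.0, -2.0, 0.5], [0.0, 0.0, 0.0], rng=rng))
```

The occasional long jump is what lets a search agent leave a local optimum, while the many short steps keep the search local most of the time.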

[AI-9] KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models ISWC2025

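[Quick read]: This paper addresses the poor adaptability of traditional Ontology Matching (OM) systems, which depend on handcrafted rules or specialized models. The key to its solution is KROMA, a method built on a Retrieval-Augmented Generation (RAG) framework. It uses Large Language Models (LLMs) to dynamically enrich the semantic context with structural, lexical, and definitional knowledge, and combines bisimilarity-based concept matching with a lightweight ontology refinement step, which prunes candidate concepts and substantially reduces the communication overhead of LLM calls, improving matching performance while staying efficient.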

Link: https://arxiv.org/abs/2507.14032
Authors: Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to the 24th International Semantic Web Conference Research Track (ISWC 2025)

Abstract:Ontology Matching (OM) is a cornerstone task of semantic interoperability, yet existing systems often rely on handcrafted rules or specialized models with limited adaptability. We present KROMA, a novel OM framework that harnesses Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) pipeline to dynamically enrich the semantic context of OM tasks with structural, lexical, and definitional knowledge. To optimize both performance and efficiency, KROMA integrates a bisimilarity-based concept matching and a lightweight ontology refinement step, which prune candidate concepts and substantially reduce the communication overhead from invoking LLMs. Through experiments on multiple benchmark datasets, we show that integrating knowledge retrieval with context-augmented LLMs significantly enhances ontology matching, outperforming both classic OM systems and cutting-edge LLM-based approaches while keeping communication overhead comparable. Our study highlights the feasibility and benefit of the proposed optimization techniques (targeted knowledge retrieval, prompt enrichment, and ontology refinement) for ontology matching at scale.

[AI-10] Photonic Fabric Platform for AI Accelerators

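[Quick read]: This paper addresses the fixed memory-to-compute ratio in current AI accelerator (XPU) designs, which results from the limited bandwidth of silicon-based interfaces: in conventional architectures, the physical placement of local high-bandwidth memory (HBM) next to the compute units limits how far memory capacity and bandwidth can scale. The key to its solution is the Photonic Fabric™ and its companion Photonic Fabric Appliance™ (PFA), a photonic-enabled 2.5D electro-optical system-in-package that integrates HBM3E high-bandwidth memory, an on-module photonic switch, and external DDR5, providing up to 115 Tbps of all-to-all digital switching and 32 TB of shared memory. Replacing the HBM stack bound to a single XPU with a chiplet that connects to the Photonic Fabric removes the silicon beachfront constraint, makes memory capacity and bandwidth far more flexible, and lets distributed AI training and inference execute parallelism strategies more efficiently. In simulation, this gives up to 7.04x throughput improvement in LLM inference and 60–90% savings in data-movement energy for heavy collective operations in LLM training.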

Link: https://arxiv.org/abs/2507.14000
Authors: Jing Ding, Trung Diep
Affiliation: unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)
Comments: 12 pages, 14 figures, 5 tables

Abstract:This paper presents the Photonic Fabric™ and the Photonic Fabric Appliance™ (PFA), a photonic-enabled switch and memory subsystem that delivers low latency, high bandwidth, and low per-bit energy. By integrating high-bandwidth HBM3E memory, an on-module photonic switch, and external DDR5 in a 2.5D electro-optical system-in-package, the PFA offers up to 32 TB of shared memory alongside 115 Tbps of all-to-all digital switching. The Photonic Fabric™ enables distributed AI training and inference to execute parallelism strategies more efficiently. The Photonic Fabric removes the silicon beachfront constraint that limits the fixed memory-to-compute ratio observed in virtually all current XPU accelerator designs. Replacing a local HBM stack on an XPU with a chiplet that connects to the Photonic Fabric increases its memory capacity and correspondingly its memory bandwidth by offering a flexible path to scaling well beyond the limitations of on-package HBM alone. We introduce CelestiSim, a lightweight analytical simulator validated on NVIDIA H100 and H200 systems. It is used to evaluate the performance of LLM reference and energy savings on PFA, without any significant change to the GPU core design. With the PFA, the simulation results show that up to 3.66x throughput and 1.40x latency improvements in LLM inference at 405B parameters, up to 7.04x throughput and 1.41x latency improvements at 1T parameters, and 60-90% energy savings in data movement for heavy collective operations in all LLM training scenarios. While these results are shown for NVIDIA GPUs, they can be applied similarly to other AI accelerator designs (XPUs) that share the same fundamental limitation of fixed memory to compute.

[AI-11] A segmented robot grasping perception neural network for edge AI

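[Quick read]: This paper addresses how to achieve low-latency, low-power, real-time grasp pose detection for robotic grasping on resource-constrained edge devices. The key to its solution is deploying Heatmap-Guided Grasp Detection, an end-to-end framework, on the GAP9 RISC-V System-on-Chip, with hardware-aware optimizations (input dimensionality reduction, model partitioning, and quantisation) that cut computational complexity and memory footprint so that inference runs fully on-chip, demonstrating that microcontroller units (MCUs) are viable for autonomous manipulation.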

Link: https://arxiv.org/abs/2507.13970
Authors: Casper Bröcheler, Thomas Vroom, Derrick Timmermans, Alan van den Akker, Guangzhi Tang, Charalampos S. Kouzinopoulos, Rico Möckel
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by SMC 2025

Abstract:Robotic grasping, the ability of robots to reliably secure and manipulate objects of varying shapes, sizes and orientations, is a complex task that requires precise perception and control. Deep neural networks have shown remarkable success in grasp synthesis by learning rich and abstract representations of objects. When deployed at the edge, these models can enable low-latency, low-power inference, making real-time grasping feasible in resource-constrained environments. This work implements Heatmap-Guided Grasp Detection, an end-to-end framework for the detection of 6-Dof grasp poses, on the GAP9 RISC-V System-on-Chip. The model is optimised using hardware-aware techniques, including input dimensionality reduction, model partitioning, and quantisation. Experimental evaluation on the GraspNet-1Billion benchmark validates the feasibility of fully on-chip inference, highlighting the potential of low-power MCUs for real-time, autonomous manipulation.
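Of the three hardware-aware techniques listed, quantisation is the easiest to show in a few lines. The sketch below is generic affine (scale and zero-point) quantisation to signed 8-bit integers, written in plain Python. It is an assumption-laden illustration: the abstract does not state which quantisation scheme or toolchain the authors used on GAP9, and the function names are invented here.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with an affine scale and zero point.

    The range is widened to include 0.0 so that zero is represented exactly.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, 0.0, 0.5, 1.0]   # made-up values
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
print(q, [round(abs(a - b), 4) for a, b in zip(weights, restored)])
```

Each value is stored in one byte instead of four, and the round-trip error stays within one quantisation step, which is the trade that makes on-chip inference fit in a microcontroller's memory.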

[AI-12] Towards Constraint Temporal Answer Set Programming

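[Quick read]: This paper addresses the challenge of reasoning in Answer Set Programming (ASP) about dynamic systems with fine-grained temporal and numeric resolution. The key to its solution is a novel temporal and constraint-based extension of the logic of Here-and-There (HT) and its nonmonotonic equilibrium extension, which is, to the authors' knowledge, the first approach to nonmonotonic temporal reasoning with constraints tailored specifically for ASP. It combines two foundational ASP extensions: the linear-time logic of Here-and-There, which provides robust nonmonotonic temporal reasoning, and the logic of Here-and-There with constraints, which allows numeric constraints to be integrated and manipulated directly. The result is a foundational logical framework for modeling complex dynamic systems at high resolution within the ASP paradigm.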

Link: https://arxiv.org/abs/2507.13958
Authors: Pedro Cabalar, Martín Diéguez, François Olivier, Torsten Schaub, Igor Stéphan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Reasoning about dynamic systems with a fine-grained temporal and numeric resolution presents significant challenges for logic-based approaches like Answer Set Programming (ASP). To address this, we introduce and elaborate upon a novel temporal and constraint-based extension of the logic of Here-and-There and its nonmonotonic equilibrium extension, representing, to the best of our knowledge, the first approach to nonmonotonic temporal reasoning with constraints specifically tailored for ASP. This expressive system is achieved by a synergistic combination of two foundational ASP extensions: the linear-time logic of Here-and-There, providing robust nonmonotonic temporal reasoning capabilities, and the logic of Here-and-There with constraints, enabling the direct integration and manipulation of numeric constraints, among others. This work establishes the foundational logical framework for tackling complex dynamic systems with high resolution within the ASP paradigm.

[AI-13] DUALRec: A Hybrid Sequential and Language Model Framework for Context-Aware Movie Recommendation

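[Quick read]: This paper addresses the challenge modern recommender systems face in modeling and predicting dynamic, context-rich user preferences. Traditional collaborative filtering and content-based methods struggle to capture temporal patterns and evolving user intentions; Large Language Models (LLMs) have strong semantic understanding but are not designed to model how preferences evolve over time; and sequential models such as LSTM capture the temporal dynamics of user behaviour well but lack the rich semantic understanding needed to generate recommendations. The key to its solution is DUALRec (Dynamic User-Aware Language-based Recommender), which combines the temporal modeling of LSTM with the semantic reasoning of a fine-tuned LLM: the LSTM module extracts evolving preferences from a user's interaction history, and the fine-tuned LLM uses these temporal insights to generate recommendations that match the user's likely interests. Experiments on the MovieLens-1M dataset show that the architecture outperforms a range of baseline models on Hit Rate (HR@k), Normalized Discounted Cumulative Gain (NDCG@k), and genre similarity metrics.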

链接: https://arxiv.org/abs/2507.13957
作者: Yitong Li,Raoul Grasman
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Modern recommender systems face an increasing challenge in modelling and predicting dynamic and context-rich user preferences. Traditional collaborative filtering and content-based methods often struggle to capture temporal patterns and evolving user intentions. While Large Language Models (LLMs) have gradually gained attention in recent years for their strong semantic understanding and reasoning abilities, they are not inherently designed to model chronologically evolving user preferences and intentions. On the other hand, sequential models like LSTM (Long Short-Term Memory) are good at capturing the temporal dynamics of user behaviour and evolving user preferences over time, but still lack the rich semantic understanding needed for comprehensive recommendation generation. In this study, we propose DUALRec (Dynamic User-Aware Language-based Recommender), a novel recommender that leverages the complementary strengths of both models, combining the temporal modelling abilities of LSTM networks with the semantic reasoning power of fine-tuned Large Language Models. The LSTM component captures users' evolving preferences through their viewing history, while the fine-tuned LLM variants leverage these temporal user insights to generate the next movies that users might enjoy. Experimental results on the MovieLens-1M dataset show that the DUALRec model outperforms a wide range of baseline models across comprehensive evaluation metrics: Hit Rate (HR@k), Normalized Discounted Cumulative Gain (NDCG@k), and genre similarity metrics. This research proposes a novel architecture that bridges the gap between temporal sequence modeling and semantic reasoning, and offers a promising direction for developing more intelligent and context-aware recommenders.
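
To make the ranking metrics named in this abstract concrete, here is a minimal sketch of HR@k and NDCG@k for the common leave-one-out setting, where each user has exactly one held-out target item. That setting, the function names, and the toy movie titles are assumptions for illustration; the abstract does not state the paper's exact evaluation protocol.

```python
import math


def hit_rate_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out target is among the top-k recommendations."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with a single relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) for a 1-based rank, or 0 if the target is missed.
    """
    for position, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(position + 2)
    return 0.0


# Hypothetical ranked list produced by some recommender for one user.
recommendations = ["Heat", "Alien", "Fargo", "Casino"]
held_out = "Alien"

hr = hit_rate_at_k(recommendations, held_out, k=3)
ndcg = ndcg_at_k(recommendations, held_out, k=3)
```

Averaging both values over all test users gives the dataset-level HR@k and NDCG@k. HR@k only checks whether the target made the list, while NDCG@k also rewards placing it near the top.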

[AI-14] Self-supervised learning on gene expression data

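[Quick read]: This paper addresses the heavy dependence of traditional supervised learning on large amounts of labeled data, which is costly to obtain, when predicting phenotypes from gene expression data. The key to its solution is to apply state-of-the-art self-supervised learning methods that exploit the structure of unlabeled bulk RNA-Seq data to learn high-quality feature representations, reducing reliance on manual annotation while improving accuracy on downstream phenotype prediction tasks.

Link: https://arxiv.org/abs/2507.13912
Authors: Kevin Dradjat,Massinissa Hamidi,Pierre Bartet,Blaise Hanczar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: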

Abstract:Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine. Traditional machine learning and deep learning rely on supervised learning, which requires large quantities of labeled data that are costly and time-consuming to obtain in the case of gene expression data. Self-supervised learning has recently emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data. In this study, we investigate the application of state-of-the-art self-supervised learning methods to bulk gene expression data for phenotype prediction. We selected three self-supervised methods, based on different approaches, to assess their ability to exploit the inherent structure of the data and to generate qualitative representations which can be used for downstream predictive tasks. By using several publicly available gene expression datasets, we demonstrate how the selected methods can effectively capture complex information and improve phenotype prediction accuracy. The results obtained show that self-supervised learning methods can outperform traditional supervised models besides offering significant advantage by reducing the dependency on annotated data. We provide a comprehensive analysis of the performance of each method by highlighting their strengths and limitations. We also provide recommendations for using these methods depending on the case under study. Finally, we outline future research directions to enhance the application of self-supervised learning in the field of gene expression data analysis. This study is the first work that deals with bulk RNA-Seq data and self-supervised learning.

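[AI-15] Large Language Models as Innovators: A Framework to Leverage Latent Space Exploration for Novelty Discovery

[Quick read]: This paper addresses the difficulty large language models (LLMs) have in generating innovative ideas: despite their fluency, they tend to replicate patterns from their training data and struggle to produce outputs that are both novel and relevant without extensive prompt engineering. The key to the solution is a model-agnostic latent-space ideation framework that enables controlled, scalable creativity by navigating the continuous embedding space of ideas. The method requires no handcrafted rules and adapts to different domains, input formats, and creative tasks, offering a general-purpose co-ideation tool for human-AI collaboration.

Link: https://arxiv.org/abs/2507.13874
Authors: Mateusz Bystroński,Mikołaj Hołysz,Grzegorz Piotrowski,Nitesh V. Chawla,Tomasz Kajdanowicz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: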

Abstract:Innovative idea generation remains a core challenge in AI, as large language models (LLMs) often struggle to produce outputs that are both novel and relevant. Despite their fluency, LLMs tend to replicate patterns seen during training, limiting their ability to diverge creatively without extensive prompt engineering. Prior work has addressed this through domain-specific heuristics and structured prompting pipelines, but such solutions are brittle and difficult to generalize. In this paper, we propose a model-agnostic latent-space ideation framework that enables controlled, scalable creativity by navigating the continuous embedding space of ideas. Unlike prior methods, our framework requires no handcrafted rules and adapts easily to different domains, input formats, and creative tasks. This paper introduces an early-stage prototype of our method, outlining the conceptual framework and preliminary results highlighting its potential as a general-purpose co-ideator for human-AI collaboration.

[AI-16] Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments

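[Quick read]: This paper addresses the difficulty of transferring knowledge in multi-agent reinforcement learning (MARL) under non-stationary environments, particularly when goals change, where traditional methods generalize poorly and require costly retraining. The key to its solution is a causal knowledge transfer framework in which agents learn and share compact causal representations of paths. Specifically, each collision is modeled as a causal intervention instantiated as a sequence of recovery actions (a macro). This macro is transferred online from another agent as causal knowledge and applied zero-shot: the agent adapts simply by querying a lookup model with local context information (such as collision events), without retraining. According to the abstract, agents with heterogeneous goals bridged about half of the gap between random exploration and a fully retrained policy.

Link: https://arxiv.org/abs/2507.13846
Authors: Kathrin Korte,Christian Medeiros Adriano,Sona Ghahremani,Holger Giese
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: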

Abstract:[Context] Multi-agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors. However, transferring knowledge across agents remains challenging in non-stationary environments with changing goals. [Problem] Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt. [Approach] This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non-stationary environment. As the environment changes (new obstacles), agents’ collisions require adaptive recovery strategies. We model each collision as a causal intervention instantiated as a sequence of recovery actions (a macro) whose effect corresponds to a causal knowledge of how to circumvent the obstacle while increasing the chances of achieving the agent’s goal (maximizing cumulative reward). This recovery action macro is transferred online from a second agent and is applied in a zero-shot fashion, i.e., without retraining, just by querying a lookup model with local context information (collisions). [Results] Our findings reveal two key insights: (1) agents with heterogeneous goals were able to bridge about half of the gap between random exploration and a fully retrained policy when adapting to new environments, and (2) the impact of causal knowledge transfer depends on the interplay between environment complexity and agents’ heterogeneous goals.

[AI-17] Scalable Submodular Policy Optimization via Pruned Submodularity Graph

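[Quick read]: This paper addresses the problem of finding an optimal policy in reinforcement learning (RL) when the reward function is submodular. Such problems arise in practical settings such as path planning and coverage control, and RL methods built on additive reward functions do not handle them well. The key to the solution is an approach based on a pruned submodularity graph: it builds a graph that reflects the submodular structure, prunes it, and yields an approximate solution with a provable guarantee in feasible computation time. The authors analyze its time and space requirements, and their experiments report that the resulting policy collects more reward than the baseline methods.

Link: https://arxiv.org/abs/2507.13834
Authors: Aditi Anand,Suman Banerjee,Dildar Ali
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 16 Pages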

Abstract:In Reinforcement Learning (RL), an agent interacts with the environment via a set of possible actions, and a reward is generated from some unknown distribution. The task is to find an optimal set of actions such that the reward after a certain time step is maximized. In a traditional setup, the reward function in an RL problem is considered additive. However, in many real problems, including path planning and coverage control, the reward function exhibits diminishing returns, which can be modeled as a submodular function. In this paper, we study a variant of the RL problem where the reward function is submodular, and our objective is to find an optimal policy that maximizes this reward function. We propose a pruned submodularity graph-based approach that provides a provably approximate solution in a feasible computation time. The proposed approach has been analyzed to understand its time and space requirements as well as its performance guarantee. We have experimented with a benchmark agent-environment setup, which has been used in similar previous studies, and report the results. From the results, we observe that the policy obtained by our proposed approach leads to more reward than the baseline methods.
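
The diminishing-returns property mentioned in this abstract can be illustrated with a coverage reward, a standard example of a submodular function. The sketch below is not the paper's pruned submodularity graph method; it uses the classic greedy heuristic for monotone submodular maximization, and the site names and sensing regions are made up for illustration.

```python
def coverage(selected, sensing_regions):
    """Reward = number of distinct cells covered by the selected sites."""
    covered = set()
    for site in selected:
        covered |= sensing_regions[site]
    return len(covered)


def marginal_gain(site, selected, sensing_regions):
    """Extra reward from adding `site` to the already selected set."""
    return coverage(selected | {site}, sensing_regions) - coverage(selected, sensing_regions)


def greedy_select(sensing_regions, budget):
    """Pick up to `budget` sites, each time taking the largest marginal gain."""
    selected = set()
    for _ in range(budget):
        candidates = sorted(s for s in sensing_regions if s not in selected)
        if not candidates:
            break
        best = max(candidates, key=lambda s: marginal_gain(s, selected, sensing_regions))
        if marginal_gain(best, selected, sensing_regions) == 0:
            break  # nothing left to gain
        selected.add(best)
    return selected


# Hypothetical sites and the grid cells each one observes.
regions = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}}

gains_for_b = [
    marginal_gain("B", set(), regions),       # B alone adds 2 cells
    marginal_gain("B", {"A"}, regions),       # after A, B adds only cell 4
    marginal_gain("B", {"A", "C"}, regions),  # after A and C, B adds nothing
]
chosen = greedy_select(regions, budget=2)
```

The gain of site B shrinks as the selected set grows, which is exactly what an additive reward cannot express.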

[AI-18] When Speed meets Accuracy: an Efficient and Effective Graph Model for Temporal Link Prediction

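[Quick read]: This paper addresses the scalability and efficiency bottlenecks of temporal link prediction in dynamic graphs, in particular the high computational overhead that existing Temporal Graph Neural Networks (T-GNNs) incur from their complex architectures. The key to its solution is EAGLE, a lightweight framework built from two core modules: a time-aware module that aggregates information from a node's most recent neighbors to capture short-term temporal preferences, and a structure-aware module that uses temporal personalized PageRank to capture the influence of globally important nodes. EAGLE also introduces an adaptive weighting mechanism to balance the contributions of the two modules dynamically, and it dispenses with multi-hop message passing and memory-intensive mechanisms, delivering more than a 50x speedup over effective transformer-based T-GNNs while maintaining strong predictive performance.

Link: https://arxiv.org/abs/2507.13825
Authors: Haoyang Li,Yuming Xu,Yiming Li,Hanmo Liu,Darian Li,Chen Jason Zhang,Lei Chen,Qing Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted in 2024. Accepted in 2025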

Abstract:Temporal link prediction in dynamic graphs is a critical task with applications in diverse domains such as social networks, recommendation systems, and e-commerce platforms. While existing Temporal Graph Neural Networks (T-GNNs) have achieved notable success by leveraging complex architectures to model temporal and structural dependencies, they often suffer from scalability and efficiency challenges due to high computational overhead. In this paper, we propose EAGLE, a lightweight framework that integrates short-term temporal recency and long-term global structural patterns. EAGLE consists of a time-aware module that aggregates information from a node’s most recent neighbors to reflect its immediate preferences, and a structure-aware module that leverages temporal personalized PageRank to capture the influence of globally important nodes. To balance these attributes, EAGLE employs an adaptive weighting mechanism to dynamically adjust their contributions based on data characteristics. Also, EAGLE eliminates the need for complex multi-hop message passing or memory-intensive mechanisms, enabling significant improvements in efficiency. Extensive experiments on seven real-world temporal graphs demonstrate that EAGLE consistently achieves superior performance against state-of-the-art T-GNNs in both effectiveness and efficiency, delivering more than a 50x speedup over effective transformer-based T-GNNs.
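
As background for the structure-aware module, the sketch below computes ordinary (static) personalized PageRank by power iteration. It is only an illustration of the underlying idea: the temporal variant used by EAGLE is not specified in the abstract, and the toy graph, the restart probability `alpha`, and the function name are assumptions.

```python
def personalized_pagerank(adjacency, source, alpha=0.15, iterations=50):
    """Approximate personalized PageRank scores with respect to `source`.

    adjacency maps each node to the list of nodes it links to.
    At every step a walker restarts at `source` with probability alpha,
    otherwise it follows a uniformly random outgoing edge.
    """
    nodes = list(adjacency)
    scores = {node: 1.0 if node == source else 0.0 for node in nodes}
    for _ in range(iterations):
        updated = {node: (alpha if node == source else 0.0) for node in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                # Dangling node: send its remaining mass back to the source.
                updated[source] += (1 - alpha) * scores[node]
                continue
            share = (1 - alpha) * scores[node] / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        scores = updated
    return scores


# Hypothetical interaction graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ppr = personalized_pagerank(graph, source="a")
```

Nodes that are easy to reach from the source end up with higher scores, which is how such a score can stand in for the influence of structurally important nodes without multi-hop message passing.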

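[AI-19] From Extraction to Synthesis: Entangled Heuristics for Agent-Augmented Strategic Reasoning

[Quick read]: This paper addresses the difficulty traditional decision engines have with multi-source, conflicting heuristics in complex strategic situations. The core challenge is how to extract and fuse heuristics from different theoretical sources (such as classical military theory and contemporary corporate strategy) into decision narratives that are context-sensitive and logically coherent. The key to the solution is a hybrid architecture that, through semantic activation and compositional synthesis, fuses conflicting heuristics into coherent strategic narratives via a process of semantic interdependence inspired by quantum cognition, rather than simply selecting the best rule. Guided by semantic interaction modeling and rhetorical framing, the method integrates multi-source knowledge dynamically and expresses it in context.

Link: https://arxiv.org/abs/2507.13768
Authors: Renato Ghisellini,Remo Pareschi,Marco Pedroni,Giovanni Battista Raggi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Peer-reviewed full paper accepted through a double-blind review process at the HAR 2025 conference ( this https URL ). The official version will appear in a volume of the Lecture Notes in Computer Science (LNCS) series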

Abstract:We present a hybrid architecture for agent-augmented strategic reasoning, combining heuristic extraction, semantic activation, and compositional synthesis. Drawing on sources ranging from classical military theory to contemporary corporate strategy, our model activates and composes multiple heuristics through a process of semantic interdependence inspired by research in quantum cognition. Unlike traditional decision engines that select the best rule, our system fuses conflicting heuristics into coherent and context-sensitive narratives, guided by semantic interaction modeling and rhetorical framing. We demonstrate the framework via a Meta vs. FTC case study, with preliminary validation through semantic metrics. Limitations and extensions (e.g., dynamic interference tuning) are discussed.

[AI-20] OntView: What you See is What you Meant

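[Quick read]: This paper addresses the information overload and comprehension difficulties of existing ontology visualization tools when they present complex knowledge structures, especially the difficulty of clearly showing dependencies and properties among concepts in large ontologies. The key to its solution is OntView, an ontology viewer built on a Description Logic (DL) reasoner that follows a "What you see is what you meant" paradigm and shows the knowledge that is actually inferred. Notably, it supports graphical representation of General Concept Inclusions (GCI), a feature absent from earlier visualization tools. To avoid visual clutter, OntView offers three ways to simplify the view: concept summaries based on importance algorithms, focusing on the TBox elements between two given classes, and dynamically hiding or showing branches without losing semantics.

Link: https://arxiv.org/abs/2507.13759
Authors: Carlos Bobed,Carlota Quintana,Eduardo Mena,Jorge Bobed,Fernando Bobillo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: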

Abstract:In the field of knowledge management and computer science, ontologies provide a structured framework for modeling domain-specific knowledge by defining concepts and their relationships. However, the lack of tools that provide effective visualization is still a significant challenge. While numerous ontology editors and viewers exist, most of them fail to graphically represent ontology structures in a meaningful and non-overwhelming way, limiting users’ ability to comprehend dependencies and properties within large ontological frameworks. In this paper, we present OntView, an ontology viewer that is designed to provide users with an intuitive visual representation of ontology concepts and their formal definitions through a user-friendly interface. Building on the use of a DL reasoner, OntView follows a “What you see is what you meant” paradigm, showing the actual inferred knowledge. One key aspect for this is its ability to visualize General Concept Inclusions (GCI), a feature absent in existing visualization tools. Moreover, to avoid a possible information overload, OntView also offers different ways to show a simplified view of the ontology by: 1) creating ontology summaries by assessing the importance of the concepts (according to different available algorithms), 2) focusing the visualization on the existing TBox elements between two given classes and 3) allowing to hide/show different branches in a dynamic way without losing the semantics. OntView has been released with an open-source license for the whole community.

[AI-21] Search-Optimized Quantization in Biomedical Ontology Alignment

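[Quick read]: This paper addresses the high computational complexity, energy consumption, memory usage, and latency that large AI models face when deployed on edge devices or in resource-constrained environments. The key to its solution is a systematic model optimization pipeline: supervised Transformer-based models first perform ontology alignment of biomedical terms; Microsoft Olive then searches for the best optimizations across Execution Providers (EPs) on the ONNX Runtime backend; finally, dynamic quantization is applied with Intel Neural Compressor and IPEX (Intel Extension for PyTorch). On the two tasks of the DEFT 2020 evaluation campaign, the authors report an average 20x inference speedup and roughly 70% lower memory usage with performance metrics intact, setting a new state of the art on both tasks.

Link: https://arxiv.org/abs/2507.13742
Authors: Oussama Bouaggad,Natalia Grabar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: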

Abstract:In the fast-moving world of AI, as organizations and researchers develop more advanced models, they face challenges due to their sheer size and computational demands. Deploying such models on edge devices or in resource-constrained environments adds further challenges related to energy consumption, memory usage and latency. To address these challenges, emerging trends are shaping the future of efficient model optimization techniques. From this premise, by employing supervised state-of-the-art transformer-based models, this research introduces a systematic method for ontology alignment, grounded in cosine-based semantic similarity between a biomedical layman vocabulary and the Unified Medical Language System (UMLS) Metathesaurus. It leverages Microsoft Olive to search for target optimizations among different Execution Providers (EPs) using the ONNX Runtime backend, followed by an assembled process of dynamic quantization employing Intel Neural Compressor and IPEX (Intel Extension for PyTorch). Through our optimization process, we conduct extensive assessments on the two tasks from the DEFT 2020 Evaluation Campaign, achieving a new state-of-the-art in both. We retain performance metrics intact, while attaining an average inference speed-up of 20x and reducing memory usage by approximately 70%.
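
The alignment step in this abstract rests on cosine-based semantic similarity between a layman term and candidate concepts. A minimal sketch of that idea follows; the three-dimensional vectors, the term, and the concept names are invented placeholders, whereas in the paper the embeddings would come from a Transformer encoder and the candidates from the UMLS Metathesaurus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(term_vector, concept_vectors):
    """Return the concept whose embedding is most similar to the term embedding."""
    return max(
        concept_vectors,
        key=lambda name: cosine_similarity(term_vector, concept_vectors[name]),
    )


# Hypothetical embeddings, hand-written for illustration only.
layman_term = [0.9, 0.1, 0.0]  # stands in for "heart attack"
concepts = {
    "myocardial infarction": [0.8, 0.2, 0.1],
    "bone fracture": [0.0, 0.1, 0.9],
}

aligned = best_match(layman_term, concepts)
```

Because cosine similarity depends only on direction, it stays meaningful when quantization slightly rescales the embeddings, which is one reason it pairs naturally with the optimization steps the abstract describes.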

[AI-22] SamGoG: A Sampling-Based Graph-of-Graphs Framework for Imbalanced Graph Classification

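[Quick read]: This paper addresses the performance degradation of Graph Neural Networks (GNNs) in graph classification caused by class imbalance and graph size imbalance. Existing methods typically handle only one type of imbalance or incur high computational cost. The key to its solution is SamGoG, a sampling-based Graph-of-Graphs (GoG) learning framework that constructs multiple GoGs through an efficient importance-based sampling mechanism and trains on them sequentially. The mechanism incorporates learnable pairwise similarity and adaptive GoG node degree to enhance edge homophily, which improves downstream model quality. SamGoG is compatible with a variety of GNN architectures and achieves up to 6.7x training acceleration while maintaining state-of-the-art accuracy.

Link: https://arxiv.org/abs/2507.13741
Authors: Shangyou Wang,Zezhong Ding,Xike Xie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: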

Abstract:Graph Neural Networks (GNNs) have shown remarkable success in graph classification tasks by capturing both structural and feature-based representations. However, real-world graphs often exhibit two critical forms of imbalance: class imbalance and graph size imbalance. These imbalances can bias the learning process and degrade model performance. Existing methods typically address only one type of imbalance or incur high computational costs. In this work, we propose SamGoG, a sampling-based Graph-of-Graphs (GoG) learning framework that effectively mitigates both class and graph size imbalance. SamGoG constructs multiple GoGs through an efficient importance-based sampling mechanism and trains on them sequentially. This sampling mechanism incorporates the learnable pairwise similarity and adaptive GoG node degree to enhance edge homophily, thus improving downstream model quality. SamGoG can seamlessly integrate with various downstream GNNs, enabling their efficient adaptation for graph classification tasks. Extensive experiments on benchmark datasets demonstrate that SamGoG achieves state-of-the-art performance with up to a 15.66% accuracy improvement with 6.7× training acceleration.
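
Edge homophily, the quantity this sampling mechanism is said to enhance, is commonly defined as the fraction of edges whose two endpoints share a label. The sketch below computes it for a toy graph-of-graphs in which each node is itself a labeled graph; the node names and labels are hypothetical, and the abstract does not say which exact homophily definition the paper uses.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    if not edges:
        return 0.0
    same_label = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_label / len(edges)


# Hypothetical graph-of-graphs: nodes g1..g4 are graphs with class labels.
gog_labels = {"g1": "toxic", "g2": "toxic", "g3": "safe", "g4": "safe"}
gog_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]

homophily = edge_homophily(gog_edges, gog_labels)
```

A higher value means that message passing over the graph-of-graphs mostly mixes information between graphs of the same class, which is the usual motivation for raising it.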

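[AI-23] AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

[Quick read]: This paper addresses the difficulty of covering rare yet critical scenarios when testing and evaluating autonomous driving planners. Existing methods rely on large-scale real-world data collection or data-driven scenario generation, which demand extensive training data, offer limited fine-grained control over the output, and risk distributional shift, undermining evaluation validity especially for learning-based planners. To overcome these limitations, the paper proposes a Large Language Model (LLM) agent-based framework that augments real-world traffic scenarios from natural language descriptions. Its core innovation is an agentic design that provides fine-grained control over the output and maintains high performance even with smaller, cheaper LLMs, improving the controllability and scalability of scenario generation.

Link: https://arxiv.org/abs/2507.13729
Authors: Yu Yao,Salil Bhatnagar,Markus Mazzola,Vasileios Belagiannis,Igor Gilitschenski,Luigi Palmieri,Simon Razniewski,Marcel Hallgarten
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: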

Abstract:Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real-world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data-driven models require extensive training data and often lack fine-grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes, which undermines the validity of evaluations, especially for learning-based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves the manual augmentation of scenarios by domain experts, an approach that is unable to meet the demands for scale in the evaluation of self-driving systems. Therefore, this paper introduces a novel LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine-grained control over the output and maintaining high performance even with smaller, cost-effective LLMs. Extensive human expert evaluation demonstrates our framework’s ability to accurately adhere to user intent, generating high quality augmented scenarios comparable to those created manually.

[AI-24] Point of Interest Recommendation: Pitfalls and Viable Solutions

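[Quick read]: This paper addresses key problems in Point of Interest (POI) recommendation research that limit its real-world applicability. It identifies three main shortcomings: the lack of standardized benchmark datasets, which makes models hard to compare; flawed assumptions in the problem definition and model design, which fail to reflect real user behavior; and inadequate treatment of biases in user behavior and system performance, which affects the fairness and reliability of recommendations. The key to the solution is a structured research agenda that, starting from these issues, proposes six directions: multistakeholder design, context awareness, data collection, trustworthiness, novel interactions, and real-world evaluation.

Link: https://arxiv.org/abs/2507.13725
Authors: Alejandro Bellogín,Linus W. Dietz,Francesco Ricci,Pablo Sánchez
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: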

Abstract:Point of interest (POI) recommendation can play a pivotal role in enriching tourists’ experiences by suggesting context-dependent and preference-matching locations and activities, such as restaurants, landmarks, itineraries, and cultural attractions. Unlike some more common recommendation domains (e.g., music and video), POI recommendation is inherently high-stakes: users invest significant time, money, and effort to search, choose, and consume these suggested POIs. Despite the numerous research works in the area, several fundamental issues remain unresolved, hindering the real-world applicability of the proposed approaches. In this paper, we discuss the current status of the POI recommendation problem and the main challenges we have identified. The first contribution of this paper is a critical assessment of the current state of POI recommendation research and the identification of key shortcomings across three main dimensions: datasets, algorithms, and evaluation methodologies. We highlight persistent issues such as the lack of standardized benchmark datasets, flawed assumptions in the problem definition and model design, and inadequate treatment of biases in the user behavior and system performance. The second contribution is a structured research agenda that, starting from the identified issues, introduces important directions for future work related to multistakeholder design, context awareness, data collection, trustworthiness, novel interactions, and real-world evaluation.

[AI-25] Binarizing Physics-Inspired GNNs for Combinatorial Optimization ECAI2025

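[Quick read]: This paper addresses the sharp performance drop of physics-inspired graph neural networks (PI-GNNs) on dense combinatorial optimization problems. The analysis shows that a phase transition appears in the PI-GNNs' training dynamics as the density of the problem graph increases. Its root cause is a discrepancy between the relaxed, real-valued model outputs and the binary-valued problem solutions, which readily leads to degenerate solutions on denser problems. To close this gap, the authors draw on insights from fuzzy logic and binarized neural networks and propose principled alternatives to the naive strategy used in PI-GNNs. Experiments show that the portfolio of proposed methods significantly improves PI-GNN performance in increasingly dense settings.

Link: https://arxiv.org/abs/2507.13703
Authors: Martin Krutský,Gustav Šír,Vyacheslav Kungurtsev,Georgios Korpas
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the 28th European Conference on Artificial Intelligence (ECAI 2025). This archival version includes supplementary appendices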

Abstract:Physics-inspired graph neural networks (PI-GNNs) have been utilized as an efficient unsupervised framework for relaxing combinatorial optimization problems encoded through a specific graph structure and loss, reflecting dependencies between the problem’s variables. While the framework has yielded promising results in various combinatorial problems, we show that the performance of PI-GNNs systematically plummets with an increasing density of the combinatorial problem graphs. Our analysis reveals an interesting phase transition in the PI-GNNs’ training dynamics, associated with degenerate solutions for the denser problems, highlighting a discrepancy between the relaxed, real-valued model outputs and the binary-valued problem solutions. To address the discrepancy, we propose principled alternatives to the naive strategy used in PI-GNNs by building on insights from fuzzy logic and binarized neural networks. Our experiments demonstrate that the portfolio of proposed methods significantly improves the performance of PI-GNNs in increasingly dense settings.
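
The gap between relaxed outputs and binary solutions can be seen on a tiny example. The sketch below assumes the usual QUBO-style relaxation for maximum independent set (reward each selected node, penalize each edge with both endpoints selected) and a plain 0.5 threshold for rounding. Both are assumptions drawn from the general PI-GNN literature rather than from this abstract, and the graph and output values are made up.

```python
def qubo_mis_loss(values, edges, penalty=2.0):
    """Loss = -(number of selected nodes) + penalty * (number of violated edges).

    Works for binary assignments and for relaxed values in [0, 1].
    """
    reward = -sum(values)
    conflict = penalty * sum(values[i] * values[j] for i, j in edges)
    return reward + conflict


def round_to_binary(probabilities, threshold=0.5):
    """Naive decoding: select every node whose relaxed output reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Path graph 0 - 1 - 2.
path_edges = [(0, 1), (1, 2)]

confident = [0.9, 0.1, 0.8]   # hypothetical well-separated outputs
degenerate = [0.5, 0.5, 0.5]  # hypothetical uninformative outputs

confident_solution = round_to_binary(confident)    # [1, 0, 1], a valid independent set
degenerate_solution = round_to_binary(degenerate)  # [1, 1, 1], violates both edges
```

The degenerate outputs have a relaxed loss of -0.5, yet the rounded assignment has loss 1.0 because it violates both edges. This is one small illustration of why a low relaxed loss does not guarantee a good binary solution.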

[AI-26] Combining model tracing and constraint-based modeling for multistep strategy diagnoses

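[Quick read]: This paper addresses how to accurately diagnose deviations in student input during multistep problem solving. Of the traditional approaches, model tracing identifies only consecutive problem-solving steps, while constraint-based modeling can handle input that combines several steps but has no view of the overall strategy. The key to the solution is to merge the two paradigms: by defining constraints as properties that a student input shares with a step of a strategy, the system can still provide a diagnosis when a student skips or combines several steps. On a dataset of 2136 steps from students solving quadratic equations, the system's diagnoses agreed with the coding of two teachers on all 140 sampled steps.

Link: https://arxiv.org/abs/2507.13652
Authors: Gerben van der Hoek,Johan Jeuring,Rogier Bos
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: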

Abstract:Model tracing and constraint-based modeling are two approaches to diagnose student input in stepwise tasks. Model tracing supports identifying consecutive problem-solving steps taken by a student, whereas constraint-based modeling supports student input diagnosis even when several steps are combined into one step. We propose an approach that merges both paradigms. By defining constraints as properties that a student input has in common with a step of a strategy, it is possible to provide a diagnosis when a student deviates from a strategy even when the student combines several steps. In this study we explore the design of a system for multistep strategy diagnoses, and evaluate these diagnoses. As a proof of concept, we generate diagnoses for an existing dataset containing steps students take when solving quadratic equations (n=2136). To compare with human diagnoses, two teachers coded a random sample of deviations (n=70) and applications of the strategy (n=70). Results show that the system diagnosis aligned with the teacher coding in all of the 140 student steps.

[AI-27] Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks

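[Quick read]: This paper addresses the combinatorial explosion that intelligent tutoring systems face when diagnosing errors in students' solutions: when a student combines several steps into one, the number of possible paths connecting consecutive inputs grows sharply, which makes diagnosis hard. The key to the solution is automated error diagnosis based on the final answer: an intermediate input is completed automatically according to the task's solution strategy, and the resulting solution is diagnosed. Because there are generally fewer possible erroneous final answers than erroneous solution paths, this avoids matching against every path. Experiments show that the method diagnoses 29.4% of student steps that a service connecting consecutive inputs with a single buggy rule could not diagnose, and that its diagnoses agree with teacher diagnoses in 97% of cases.

Link: https://arxiv.org/abs/2507.13651
Authors: Gerben van der Hoek,Johan Jeuring,Rogier Bos
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: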

Abstract:Many intelligent tutoring systems can support a student in solving a stepwise task. When a student combines several steps in one step, the number of possible paths connecting consecutive inputs may be very large. This combinatorial explosion makes error diagnosis hard. Using a final answer to diagnose a combination of steps can mitigate the combinatorial explosion, because there are generally fewer possible (erroneous) final answers than (erroneous) solution paths. An intermediate input for a task can be diagnosed by automatically completing it according to the task solution strategy and diagnosing this solution. This study explores the potential of automated error diagnosis based on a final answer. We investigate the design of a service that provides a buggy rule diagnosis when a student combines several steps. To validate the approach, we apply the service to an existing dataset (n=1939) of unique student steps when solving quadratic equations, which could not be diagnosed by a buggy rule service that tries to connect consecutive inputs with a single rule. Results show that final answer evaluation can diagnose 29.4% of these steps. Moreover, a comparison of the generated diagnoses with teacher diagnoses on a subset (n=115) shows that the diagnoses align in 97% of the cases. These results can be considered a basis for further exploration of the approach.

[AI-28] Improved particle swarm optimization algorithm: multi-target trajectory optimization for swarm drones

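[Quick read]: This paper addresses real-time trajectory planning for unmanned aerial vehicles (UAVs) in dynamic environments, targeting the premature convergence and latency of traditional Particle Swarm Optimization (PSO). The key to its solution is PE-PSO, an enhanced PSO algorithm whose core innovations are a persistent exploration mechanism that preserves swarm diversity and an entropy-based parameter adjustment strategy that adapts the search behavior dynamically. Trajectories are modeled with B-spline curves, which keep paths smooth while reducing optimization complexity. To support multi-UAV cooperation, the authors further build a multi-agent framework that combines genetic algorithm (GA)-based task allocation with distributed PE-PSO, enabling parallel computation and decentralized control while maintaining real-time performance in complex environments.

Link: https://arxiv.org/abs/2507.13647
Authors: Minze Li,Wei Zhao,Ran Chen,Mingqiang Wei
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 papers,7 figures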

Abstract:Real-time trajectory planning for unmanned aerial vehicles (UAVs) in dynamic environments remains a key challenge due to high computational demands and the need for fast, adaptive responses. Traditional Particle Swarm Optimization (PSO) methods, while effective for offline planning, often struggle with premature convergence and latency in real-time scenarios. To overcome these limitations, we propose PE-PSO, an enhanced PSO-based online trajectory planner. The method introduces a persistent exploration mechanism to preserve swarm diversity and an entropy-based parameter adjustment strategy to dynamically adapt optimization behavior. UAV trajectories are modeled using B-spline curves, which ensure path smoothness while reducing optimization complexity. To extend this capability to UAV swarms, we develop a multi-agent framework that combines genetic algorithm (GA)-based task allocation with distributed PE-PSO, supporting scalable and coordinated trajectory generation. The distributed architecture allows for parallel computation and decentralized control, enabling effective cooperation among agents while maintaining real-time performance. Comprehensive simulations demonstrate that the proposed framework outperforms conventional PSO and other swarm-based planners across several metrics, including trajectory quality, energy efficiency, obstacle avoidance, and computation time. These results confirm the effectiveness and applicability of PE-PSO in real-time multi-UAV operations under complex environmental conditions.
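
For readers unfamiliar with the baseline that PE-PSO extends, the sketch below implements textbook global-best PSO on a simple test function. It does not include the persistent exploration mechanism, the entropy-based parameter adjustment, or the B-spline trajectory model from the paper; the objective, bounds, and coefficient values are conventional choices assumed for illustration.

```python
import random


def sphere(position):
    """Simple convex test objective with its minimum of 0 at the origin."""
    return sum(x * x for x in position)


def particle_swarm_minimize(objective, dim, bounds, num_particles=30, iterations=200,
                            inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Textbook global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(num_particles)]
    velocities = [[0.0] * dim for _ in range(num_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_value = [objective(p) for p in positions]
    best_index = min(range(num_particles), key=personal_best_value.__getitem__)
    global_best = personal_best[best_index][:]
    global_best_value = personal_best_value[best_index]

    for _ in range(iterations):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own best,
                # and pull toward the best position found by the whole swarm.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                                    + social * r2 * (global_best[d] - positions[i][d]))
                # Move, then clamp to the search bounds.
                positions[i][d] = min(high, max(low, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < personal_best_value[i]:
                personal_best[i] = positions[i][:]
                personal_best_value[i] = value
                if value < global_best_value:
                    global_best = positions[i][:]
                    global_best_value = value
    return global_best, global_best_value


best_position, best_value = particle_swarm_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Because every particle is pulled toward the same global best, the swarm can collapse onto one point early. That loss of diversity is the premature convergence the abstract says PE-PSO is designed to counter.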

[AI-29] A Comprehensive Review of Transformer-based language models for Protein Sequence Analysis and Design

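[Quick read]: This paper reviews recent advances in Transformer-based models for protein sequence analysis and design, addressing the lack of a comprehensive summary and assessment of these applications. It surveys and critically analyzes a large body of work covering gene ontology, functional and structural protein identification, de novo protein generation, and protein binding, highlights the strengths and weaknesses of existing methods, and points out directions for future research, giving researchers an overview of the state of the art.

Link: https://arxiv.org/abs/2507.13646
Authors: Nimisha Ghosh,Daniele Santoni,Debaleena Nawn,Eleonora Ottaviani,Giovanni Felici
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: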

Abstract:The impact of Transformer-based language models has been unprecedented in Natural Language Processing (NLP). The success of such models has also led to their adoption in other fields including bioinformatics. Taking this into account, this paper discusses recent advances in Transformer-based models for protein sequence analysis and design. In this review, we have discussed and analysed a significant number of works pertaining to such applications. These applications encompass gene ontology, functional and structural protein identification, generation of de novo proteins and binding of proteins. We attempt to shed light on the strength and weaknesses of the discussed works to provide a comprehensive insight to readers. Finally, we highlight shortcomings in existing research and explore potential avenues for future developments. We believe that this review will help researchers working in this field to have an overall idea of the state of the art in this field, and to orient their future studies.

[AI-30] Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques

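[Quick Read]: This paper addresses how to use Large Language Models (LLMs) effectively to strengthen cybersecurity defenses while also handling the security risks that LLMs themselves introduce. The key to its approach is twofold: first, it systematically surveys the integration of LLMs into key domains such as the Internet of Things (IoT), blockchain, and hardware security, to make threat detection, vulnerability assessment, and incident response more intelligent and automated; second, it analyzes the vulnerabilities of LLMs themselves and proposes corresponding mitigation strategies, with the goal of building secure, scalable, and future-ready cyber defense systems.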

Link: https://arxiv.org/abs/2507.13629
Authors: Niveen O. Jaffal,Mohammed Alkhanafseh,David Mohaisen
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages

Abstract:Large Language Models (LLMs) are transforming cybersecurity by enabling intelligent, adaptive, and automated approaches to threat detection, vulnerability assessment, and incident response. With their advanced language understanding and contextual reasoning, LLMs surpass traditional methods in tackling challenges across domains such as IoT, blockchain, and hardware security. This survey provides a comprehensive overview of LLM applications in cybersecurity, focusing on two core areas: (1) the integration of LLMs into key cybersecurity domains, and (2) the vulnerabilities of LLMs themselves, along with mitigation strategies. By synthesizing recent advancements and identifying key limitations, this work offers practical insights and strategic recommendations for leveraging LLMs to build secure, scalable, and future-ready cyber defense systems.

[AI-31] BifrostRAG: Bridging Dual Knowledge Graphs for Multi-Hop Question Answering in Construction Safety

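[Quick Read]: This paper addresses the challenges of information retrieval and question answering over safety-regulation text, in particular multi-hop queries, which require combining information across multiple interlinked clauses and which traditional Retrieval-Augmented Generation (RAG) systems handle poorly because of the complex semantic and structural relationships involved. The key to the solution is BifrostRAG, a hybrid RAG system built on a dual-graph structure that strengthens reasoning by explicitly modeling two kinds of relationships: linguistic relationships, via an Entity Network Graph, and document-structure relationships, via a Document Navigator Graph. Its core innovation is a hybrid retrieval mechanism that combines graph traversal with vector-based semantic search, so that a large language model can reason over both the meaning and the logical structure of the text, which markedly improves accuracy and robustness in compliance-checking scenarios.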

Link: https://arxiv.org/abs/2507.13625
Authors: Yuxin Zhang(1),Xi Wang(1),Mo Hu(1),Zhenyu Zhang(1) ((1) Department of Construction Science, College of Architecture, Texas A&M University, College Station, USA)
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 13 figures

Abstract:Information retrieval and question answering from safety regulations are essential for automated construction compliance checking but are hindered by the linguistic and structural complexity of regulatory text. Many compliance-related queries are multi-hop, requiring synthesis of information across interlinked clauses. This poses a challenge for traditional retrieval-augmented generation (RAG) systems. To overcome this, we introduce BifrostRAG: a dual-graph RAG-integrated system that explicitly models both linguistic relationships (via an Entity Network Graph) and document structure (via a Document Navigator Graph). This architecture powers a hybrid retrieval mechanism that combines graph traversal with vector-based semantic search, enabling large language models to reason over both the meaning and the structure of the text. Evaluation on a multi-hop question dataset shows that BifrostRAG achieves 92.8 percent precision, 85.5 percent recall, and an F1 score of 87.3 percent. These results significantly outperform vector-only and graph-only RAG baselines that represent current leading approaches. Error analysis further highlights the comparative advantages of our hybrid method over single-modality RAGs. These findings establish BifrostRAG as a robust knowledge engine for LLM-driven compliance checking. Its dual-graph, hybrid retrieval mechanism offers a transferable blueprint for navigating complex technical documents across knowledge-intensive engineering domains.
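
As a quick sanity check on the reported figures: the harmonic mean of 92.8% precision and 85.5% recall is about 89.0%, not 87.3%. One possible explanation, which is an assumption and not something the abstract states, is that F1 is computed per question and then averaged. The sketch below shows that the two ways of aggregating can differ; the two-question example is made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# F1 computed directly from the aggregate precision and recall in the abstract.
f1_of_reported_means = f1_score(0.928, 0.855)

# Hypothetical per-question (precision, recall) pairs, for illustration only.
per_question = [(1.0, 0.5), (0.5, 1.0)]
mean_precision = sum(p for p, _ in per_question) / len(per_question)
mean_recall = sum(r for _, r in per_question) / len(per_question)

f1_of_means = f1_score(mean_precision, mean_recall)
mean_of_f1 = sum(f1_score(p, r) for p, r in per_question) / len(per_question)

print(round(f1_of_reported_means, 3))
print(round(f1_of_means, 3), round(mean_of_f1, 3))
```

In the toy example the F1 of the averaged precision and recall is 0.75 while the average of per-question F1 scores is about 0.667, so a macro-averaged F1 below the harmonic mean of the aggregate numbers is not by itself a contradiction.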

[AI-32] Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries

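[Quick Read]: This paper addresses a limitation of Large Language Models (LLMs) in generating personalized responses: conventional Reinforcement Learning from Human Feedback (RLHF) fits a single reward model to the whole user population and therefore cannot capture preference differences between individual users. The key to the solution is a new framework called Preference Learning Using Summarization (PLUS). It trains a text summarization model to produce, for each user, an interpretable summary of that user's preferences, characteristics, and past conversations, and feeds these summaries to the reward model as conditioning input so that reward predictions are personalized per user. PLUS uses an online co-adaptation loop that optimizes the user-summarization model and the reward model simultaneously, so that personalization keeps improving with interaction. The resulting summaries generalize to new users, are transferable and interpretable, and can be used for zero-shot personalization of stronger proprietary models such as GPT-4.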

Link: https://arxiv.org/abs/2507.13579
Authors: Hyunji Nam,Yanming Wan,Mickel Liu,Jianxun Lian,Natasha Jaques
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user’s preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning, and update the reward model simultaneously, creating an online co-adaptation loop. We show that in contrast with prior personalized RLHF techniques or with in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user’s preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4. The resulting user summaries are not only concise and portable, they are easy for users to interpret and modify, allowing for more transparency and user control in LLM alignment.
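
To make "summaries condition the reward model" concrete, here is a toy sketch. It assumes a Bradley-Terry preference model, which is the common choice for RLHF reward models but is not confirmed by the abstract, and it replaces the learned reward model with a hypothetical word-overlap function (`toy_reward`). Only the conditioning pattern is the point.

```python
import math

def toy_reward(user_summary, response):
    """Hypothetical stand-in for a summary-conditioned reward model:
    counts response words that also appear in the user summary."""
    summary_words = set(user_summary.lower().split())
    return float(sum(1 for w in response.lower().split() if w in summary_words))

def preference_probability(user_summary, response_a, response_b):
    """Bradley-Terry probability that this user prefers response_a."""
    diff = toy_reward(user_summary, response_a) - toy_reward(user_summary, response_b)
    return 1.0 / (1.0 + math.exp(-diff))

concise = "short answer"
detailed = "long detailed answer with examples"

p_user_1 = preference_probability("prefers short replies", concise, detailed)
p_user_2 = preference_probability("wants detailed examples", concise, detailed)
print(round(p_user_1, 3), round(p_user_2, 3))
```

The same response pair is ranked in opposite directions for the two users because the reward depends on the summary, which is exactly what a single population-level reward model cannot express.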

[AI-33] Apple Intelligence Foundation Language Models: Tech Report 2025

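[Quick Read]: This paper addresses the efficient deployment and performance optimization of multilingual, multimodal foundation language models across Apple devices and services, with particular attention to the balance between on-device computation and cloud scalability. The key solutions are: (1) a 3B-parameter on-device model that achieves low-latency, energy-efficient inference on Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (2) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) Transformer that combines track parallelism, sparse mixture-of-experts computation, and interleaved global-local attention to deliver high-quality output at controlled cost on the Private Cloud Compute platform. Both models rely on responsible data collection and training pipelines and are further refined with supervised fine-tuning and reinforcement learning. They support multilingual understanding, image understanding, and tool calling, and match or surpass comparably sized open baselines on public benchmarks and in human evaluations.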

Link: https://arxiv.org/abs/2507.13575
Authors: Hanzhi Zhou,Erik Hornberger,Pengsheng Guo,Xiyou Zhou,Saiwen Wang,Xin Wang,Yifei He,Xuankai Chang,Rene Rauch,Louis D’hauwe,John Peebles,Alec Doane,Kohen Chia,Jenna Thibodeau,Zi-Yi Dou,Yuanyang Zhang,Ruoming Pang,Reed Li,Zhifeng Chen,Jeremy Warner,Zhaoyang Xu,Sophy Lee,David Mizrahi,Ramsey Tantawi,Chris Chaney,Kelsey Peterson,Jun Qin,Alex Dombrowski,Mira Chiang,Aiswarya Raghavan,Gerard Casamayor,Qibin Chen,Aonan Zhang,Nathalie Tran,Jianyu Wang,Hang Su,Thomas Voice,Alessandro Pappalardo,Brycen Wershing,Prasanth Yadla,Rui Li,Priyal Chhatrapati,Ismael Fernandez,Yusuf Goren,Xin Zheng,Forrest Huang,Tao Lei,Eray Yildiz,Alper Kokmen,Gokul Santhanam,Areeba Kamal,Kaan Elgin,Dian Ang Yap,Jeremy Liu,Peter Gray,Howard Xing,Kieran Liu,Matteo Ronchi,Moritz Schwarzer-Becker,Yun Zhu,Mandana Saebi,Jeremy Snow,David Griffiths,Guillaume Tartavel,Erin Feldman,Simon Lehnerer,Fernando Bermúdez-Medina,Hans Han,Joe Zhou,Xiaoyi Ren,Sujeeth Reddy,Zirui Wang,Tom Gunter,Albert Antony,Yuanzhi Li,John Dennison,Tony Sun,Yena Han,Yi Qin,Sam Davarnia,Jeffrey Bigham,Wayne Shan,Hannah Gillis Coleman,Guillaume Klein,Peng Liu,Muyang Yu,Jack Cackler,Yuan Gao,Crystal Xiao,Binazir Karimzadeh,Zhengdong Zhang,Felix Bai,Albin Madappally Jose,Feng Nan,Nazir Kamaldin,Dong Yin,Hans Hao,Yanchao Sun,Yi Hua,Charles Maalouf
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple’s Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users’ privacy with innovations like Private Cloud Compute.
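
The abstract names 2-bit quantization-aware training without giving the scheme. As a rough illustration of what 2 bits per weight means, the sketch below applies a plain uniform min-max quantizer with four levels and maps the codes back to floats; this quantizer is an assumption, not the method in the report. In quantization-aware training, a quantize-then-dequantize step like this is commonly applied in the forward pass while gradients update a full-precision copy of the weights.

```python
def fake_quantize_2bit(weights):
    """Quantize to 4 uniformly spaced levels (2 bits) and map back to floats."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0] * len(weights), list(weights)
    scale = (hi - lo) / 3  # 2 bits -> 4 levels -> 3 intervals
    codes = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized

weights = [-1.0, -0.3, 0.2, 1.0]
codes, dequantized = fake_quantize_2bit(weights)
print(codes)
print([round(d, 3) for d in dequantized])
```

Each weight is stored as one of the codes 0 to 3, and the reconstruction error per weight is at most half the step size.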

[AI-34] Change of Thought: Adaptive Test-Time Computation

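[Quick Read]: This paper addresses the limited expressive power of standard encoder Transformers: in a single fixed-depth forward pass they can only compute functions in the constant-depth circuit class TC0, which makes complex reasoning hard to emulate. To lift this limit, the authors propose the SELF-Transformer. Its key idea is a self-iteration mechanism in which an encoder layer repeatedly updates its own attention weights until they converge to a fixed point, instead of producing the alignment matrix over the input sequence in one pass. Adjusting the alignment at test time according to input difficulty improves performance (accuracy gains of up to 20%) without adding parameters, recovering much of the expressive power of iterative reasoning without token-level autoregressive feedback while keeping the simplicity of a pure encoder architecture.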

Link: https://arxiv.org/abs/2507.13569
Authors: Mrinal Mathur,Mike Doan,Barak Pearlmutter,Sergey Plis
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling – first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this “thinking aloud” mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing – in one pass – the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. Self-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.
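
The idea of refining an attention matrix to a fixed point can be sketched in a few lines. The update rule below (remix the input with the current attention, then recompute attention from the remixed states) is an assumption chosen for simplicity; the abstract does not give the actual SELF-Transformer update, and this toy has no learned projections. The number of iterations until the change drops below `tol` is what would vary with the input.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention_from(states):
    """Scaled dot-product self-attention weights (no learned projections)."""
    d = len(states[0])
    scores = [[sum(p * q for p, q in zip(si, sj)) / math.sqrt(d) for sj in states]
              for si in states]
    return [softmax(row) for row in scores]

def fixed_point_attention(x, tol=1e-6, max_iters=100):
    """Iterate the attention matrix until it stops changing (or max_iters)."""
    attn = attention_from(x)
    steps = 0
    for steps in range(1, max_iters + 1):
        mixed = matmul(attn, x)            # remix the input with current weights
        new_attn = attention_from(mixed)   # re-derive weights from remixed states
        change = max(abs(a - b) for ra, rb in zip(new_attn, attn)
                     for a, b in zip(ra, rb))
        attn = new_attn
        if change < tol:
            break
    return attn, steps

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, steps = fixed_point_attention(tokens)
print(steps)
```

The loop caps the work at `max_iters`, so compute grows with how slowly a given input settles but stays bounded.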

[AI-35] Why Isn't Relational Learning Taking Over the World?

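[Quick Read]: This paper argues that current AI research concentrates too heavily on modeling perception-level units such as pixels and tokens, and neglects the structured data of the real world, which consists of entities, their properties, and the relations among them. Its central point is that although text and image data are abundant and easy to process, the most valuable data in companies and organizations usually sits in spreadsheets, databases, and other relational formats. Such data is full of semantically meaningful identifiers (for example product numbers and transaction numbers) that conventional machine learning methods cannot model directly. The paper attributes the limited adoption of relational learning mainly to the difficulty of modeling complex relations and to its poor fit with the current deep learning paradigm. The way forward is to develop methods that integrate symbolic reasoning with statistical learning and to bring statistical relational AI to technical maturity and practical use, so that AI models the actual structure of the world (entities and their relations) rather than only representations of it.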

Link: https://arxiv.org/abs/2507.13558
Authors: David Poole
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 10 pages (6 pages + references + appendices)

Abstract:AI seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can’t be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world – except in a few cases with restricted relations – and what needs to be done to bring it to its rightful prominence.

[AI-36] Time Series Forecastability Measures

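[Quick Read]: This paper addresses the lack of an up-front assessment step in time series forecasting: without a way to quantify the forecastability of the data before model development, resources are allocated poorly and forecasts disappoint. The key to the solution is a pair of a priori metrics, the spectral predictability score and the largest Lyapunov exponent. The former measures the strength and regularity of the frequency components of the series; the latter characterizes the chaos and stability of the system generating the data. Both quantify the inherent forecastability of a series before any model is trained and correlate strongly with the forecast performance that models later achieve, which helps practitioners focus on products or supply chain levels that are more forecastable and improves forecasting efficiency and decision quality.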

Link: https://arxiv.org/abs/2507.13556
Authors: Rui Wang,Steven Klee,Alexis Roos
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes using two metrics to quantify the forecastability of time series prior to model development: the spectral predictability score and the largest Lyapunov exponent. Unlike traditional model evaluation metrics, these measures assess the inherent forecastability characteristics of the data before any forecast attempts. The spectral predictability score evaluates the strength and regularity of frequency components in the time series, whereas the Lyapunov exponents quantify the chaos and stability of the system generating the data. We evaluated the effectiveness of these metrics on both synthetic and real-world time series from the M5 forecast competition dataset. Our results demonstrate that these two metrics can correctly reflect the inherent forecastability of a time series and have a strong correlation with the actual forecast performance of various models. By understanding the inherent forecastability of time series before model training, practitioners can focus their planning efforts on products and supply chain levels that are more forecastable, while setting appropriate expectations or seeking alternative strategies for products with limited forecastability.
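
The abstract does not define the spectral predictability score. A common way to capture "strength and regularity of frequency components" is one minus the normalized entropy of the power spectrum, and the sketch below uses that definition as an assumption. It relies on a plain DFT from the standard library, which is fine for short series.

```python
import cmath
import math
import random

def spectral_predictability(series):
    """1 - normalized spectral entropy: near 1 for a clean periodic signal,
    near 0 for a flat (noise-like) spectrum. Assumed definition."""
    n = len(series)
    if n < 4:
        raise ValueError("need at least 4 samples")
    mean = sum(series) / n
    centered = [v - mean for v in series]
    power = []
    for k in range(1, n // 2 + 1):  # positive frequencies, DC excluded
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 1.0  # constant series: treated here as trivially predictable
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

n = 128
seasonal = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

seasonal_score = spectral_predictability(seasonal)
noise_score = spectral_predictability(noise)
print(round(seasonal_score, 3), round(noise_score, 3))
```

A purely seasonal series concentrates its power in one frequency and scores close to 1, while white noise spreads power across all frequencies and scores much lower.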

[AI-37] Loss-Complexity Landscape and Model Structure Functions

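[Quick Read]: This paper addresses how to establish a quantitative relationship between model complexity and generalization performance through an analogy between information theory and statistical mechanics. The core difficulty is that the classical structure function is hard to compute and that there is no theoretical characterization of the overfitting threshold or of model selection. The key to the solution is a dualization framework that links the Kolmogorov structure function $h_x(\alpha)$ to computable complexity proxies and introduces a partition function and a free energy functional, yielding a Legendre-Fenchel duality between the structure function and the free energy. This duality also comes with detailed balance of the Metropolis kernel and an interpretation of acceptance probabilities as information-theoretic scattering amplitudes. The paper further shows that a susceptibility-like variance of model complexity peaks at loss-complexity trade-off points, which are interpreted as phase transitions, and experiments with linear and tree-based regression models confirm the predicted complexity-generalization trade-off.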

Link: https://arxiv.org/abs/2507.13543
Authors: Alexander Kolpakov
Affiliation: unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments: 18 pages, 3 figures; GitHub repository at this https URL

Abstract:We develop a framework for dualizing the Kolmogorov structure function $h_x(\alpha)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.
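
The statistical-mechanics analogy can be illustrated on a toy problem. The sketch assumes a Gibbs distribution over a finite set of candidate models with energy "loss plus lambda times complexity"; the two models, their numbers, and this exact energy are illustrative assumptions, not the paper's construction. At large inverse temperature the free energy approaches the minimum of loss plus lambda times complexity, and the variance of complexity spikes where the preferred model switches.

```python
import math

# Hypothetical candidate models as (complexity, training loss) pairs.
MODELS = [(1, 1.0), (5, 0.2)]

def gibbs_weights(lam, beta):
    """Gibbs probabilities with energy E = loss + lam * complexity."""
    energies = [loss + lam * c for c, loss in MODELS]
    e_min = min(energies)
    unnorm = [math.exp(-beta * (e - e_min)) for e in energies]  # stable form
    z = sum(unnorm)
    return [u / z for u in unnorm], e_min, z

def free_energy(lam, beta):
    """F = -(1/beta) * log(sum_i exp(-beta * E_i))."""
    _, e_min, z = gibbs_weights(lam, beta)
    return e_min - math.log(z) / beta

def complexity_variance(lam, beta):
    """Susceptibility-like variance of complexity under the Gibbs weights."""
    probs, _, _ = gibbs_weights(lam, beta)
    mean = sum(p * c for p, (c, _) in zip(probs, MODELS))
    return sum(p * (c - mean) ** 2 for p, (c, _) in zip(probs, MODELS))

beta = 50.0
for lam in (0.05, 0.2, 0.5):
    print(lam, round(free_energy(lam, beta), 4), round(complexity_variance(lam, beta), 4))
```

With these numbers the two models have equal energy at lambda = 0.2, so the variance of complexity peaks there and is essentially zero on either side, which mirrors the "peak at the trade-off point" behaviour the abstract describes.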

[AI-38] Acoustic Index: A Novel AI-Driven Parameter for Cardiac Disease Risk Stratification Using Echocardiography

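[Quick Read]: This paper addresses the limitations of traditional echocardiographic parameters such as ejection fraction (EF) and global longitudinal strain (GLS) in the early detection of cardiac dysfunction: EF can remain normal despite underlying pathology, and GLS is strongly affected by load conditions and vendor variability. The authors propose a new AI-driven echocardiographic parameter, the Acoustic Index, whose core is the combination of Extended Dynamic Mode Decomposition (EDMD), based on Koopman operator theory, with a hybrid neural network that incorporates clinical metadata. The method extracts spatiotemporal dynamics from standard ultrasound image sequences, weights them with attention mechanisms, and fuses them with clinical information through manifold learning to output a continuous score between 0 and 1 that quantifies the risk of cardiac dysfunction. The approach provides a physics-informed, interpretable, reproducible, and operator-independent assessment of cardiac function. It was validated in a prospective cohort of 736 patients, reaching an AUC of 0.89 on an independent test set with good sensitivity and specificity, and shows potential for early diagnosis, triage, and long-term monitoring.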

Link: https://arxiv.org/abs/2507.13542
Authors: Beka Begiashvili,Carlos J. Fernandez-Candel,Matías Pérez Paredes
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional echocardiographic parameters such as ejection fraction (EF) and global longitudinal strain (GLS) have limitations in the early detection of cardiac dysfunction. EF often remains normal despite underlying pathology, and GLS is influenced by load conditions and vendor variability. There is a growing need for reproducible, interpretable, and operator-independent parameters that capture subtle and global cardiac functional alterations. We introduce the Acoustic Index, a novel AI-derived echocardiographic parameter designed to quantify cardiac dysfunction from standard ultrasound views. The model combines Extended Dynamic Mode Decomposition (EDMD) based on Koopman operator theory with a hybrid neural network that incorporates clinical metadata. Spatiotemporal dynamics are extracted from echocardiographic sequences to identify coherent motion patterns. These are weighted via attention mechanisms and fused with clinical data using manifold learning, resulting in a continuous score from 0 (low risk) to 1 (high risk). In a prospective cohort of 736 patients, encompassing various cardiac pathologies and normal controls, the Acoustic Index achieved an area under the curve (AUC) of 0.89 in an independent test set. Cross-validation across five folds confirmed the robustness of the model, showing that both sensitivity and specificity exceeded 0.8 when evaluated on independent data. Threshold-based analysis demonstrated stable trade-offs between sensitivity and specificity, with optimal discrimination near this threshold. The Acoustic Index represents a physics-informed, interpretable AI biomarker for cardiac function. It shows promise as a scalable, vendor-independent tool for early detection, triage, and longitudinal monitoring. Future directions include external validation, longitudinal studies, and adaptation to disease-specific classifiers. 
Submission history: [v1] Thu, 17 Jul 2025 21:27:28 UTC (347 KB), from Carlos Javier Fernandez-Candel

[AI-39] PrefPalette: Personalized Preference Modeling with Latent Attributes

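[Quick Read]: This paper addresses the fact that current preference models treat human judgment as a black box: they cannot explain the reasons behind user preferences, which makes truly personalized AI systems hard to build. The key to the solution is the PrefPalette framework, which is grounded in the cognitive science principle of multi-attribute decision making and implements it through two mechanisms: a scalable counterfactual attribute synthesis step that generates synthetic training data to isolate and quantify the effect of individual attributes (such as formality, humor, and cultural values), and an attention-based preference model that learns how different social communities dynamically weight these attributes. With this design the model outperforms GPT-4o in prediction accuracy by 46.6% and also yields human-interpretable, community-level preference profiles: scholarly communities value verbosity and stimulation, conflict-oriented communities prefer sarcasm and directness, and support-based communities emphasize empathy. This supports more trustworthy, value-aware personalized applications.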

Link: https://arxiv.org/abs/2507.13541
Authors: Shuyue Stella Li,Melanie Sclar,Hunter Lang,Ansong Ni,Jacqueline He,Puxin Xu,Andrew Cohen,Chan Young Park,Yulia Tsvetkov,Asli Celikyilmaz
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 tables, 5 figures

Abstract:Personalizing AI systems requires understanding not just what users prefer, but the reasons that underlie those preferences - yet current preference models typically treat human judgment as a black box. We introduce PrefPalette, a framework that decomposes preferences into attribute dimensions and tailors its preference prediction to distinct social community values in a human-interpretable manner. PrefPalette operationalizes a cognitive science principle known as multi-attribute decision making in two ways: (1) a scalable counterfactual attribute synthesis step that involves generating synthetic training data to isolate for individual attribute effects (e.g., formality, humor, cultural values), and (2) attention-based preference modeling that learns how different social communities dynamically weight these attributes. This approach moves beyond aggregate preference modeling to capture the diverse evaluation frameworks that drive human judgment. When evaluated on 45 social communities from the online platform Reddit, PrefPalette outperforms GPT-4o by 46.6% in average prediction accuracy. Beyond raw predictive improvements, PrefPalette also sheds light on intuitive, community-specific profiles: scholarly communities prioritize verbosity and stimulation, conflict-oriented communities value sarcasm and directness, and support-based communities emphasize empathy. By modeling the attribute-mediated structure of human judgment, PrefPalette delivers both superior preference modeling and transparent, interpretable insights, and serves as a first step toward more trustworthy, value-aware personalized applications.
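
A stripped-down picture of community-specific attribute weighting: each community has attention logits over attributes, a softmax turns them into weights, and a response's preference score is the weighted sum of its attribute scores. The logits and attribute scores below are invented for illustration, and the weights are static here, whereas the abstract describes weights that are learned and applied dynamically.

```python
import math

ATTRIBUTES = ["verbosity", "sarcasm", "empathy"]

# Hypothetical attention logits per community (not taken from the paper).
COMMUNITY_LOGITS = {
    "scholarly": [2.0, 0.0, 0.5],
    "support": [0.0, -1.0, 2.5],
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def preference_score(community, attribute_scores):
    """Weighted sum of per-attribute scores using the community's weights."""
    weights = softmax(COMMUNITY_LOGITS[community])
    return sum(w * s for w, s in zip(weights, attribute_scores))

# Hypothetical attribute scores for two replies, ordered as in ATTRIBUTES.
long_dry_reply = [0.9, 0.1, 0.1]
short_kind_reply = [0.2, 0.0, 0.9]

for community in COMMUNITY_LOGITS:
    print(community,
          round(preference_score(community, long_dry_reply), 3),
          round(preference_score(community, short_kind_reply), 3))
```

Because the weights are explicit, the reason one reply wins in a given community can be read directly from the weight vector, which is the interpretability benefit the abstract emphasizes.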

[AI-40] Humans learn to prefer trustworthy AI over human partners

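[Quick Read]: This paper asks how humans choose partners for cooperation in hybrid societies where humans and artificial intelligence (AI) coexist, and how the presence of AI affects human decisions and adaptation strategies. The key to its approach is a communication-based partner selection game and three experiments (N = 975) that systematically examine how hiding or disclosing AI identity affects human choices. When AI identity is hidden, humans cannot reliably tell who is who and misattribute behavior. When identity is disclosed, the bots' initial appeal drops, but humans learn how each partner type behaves, and the bots gradually gain a competitive advantage. This mechanism shows that AI can reshape the dynamics of human-machine cooperation through "cognitive guidance" and provides empirical grounding for designing more cooperative hybrid systems.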

Link: https://arxiv.org/abs/2507.13524
Authors: Yaomin Jiang,Levin Brinkmann,Anne-Marie Nussberger,Ivan Soraperra,Jean-François Bonnefon,Iyad Rahwan
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Partner selection is crucial for cooperation and hinges on communication. As artificial agents, especially those powered by large language models (LLMs), become more autonomous, intelligent, and persuasive, they compete with humans for partnerships. Yet little is known about how humans select between human and AI partners and adapt under AI-induced competition pressure. We constructed a communication-based partner selection game and examined the dynamics in hybrid mini-societies of humans and bots powered by a state-of-the-art LLM. Through three experiments (N = 975), we found that bots, though more prosocial than humans and linguistically distinguishable, were not selected preferentially when their identity was hidden. Instead, humans misattributed bots’ behaviour to humans and vice versa. Disclosing bots’ identity induced a dual effect: it reduced bots’ initial chances of being selected but allowed them to gradually outcompete humans by facilitating human learning about the behaviour of each partner type. These findings show how AI can reshape social interaction in mixed societies and inform the design of more effective and cooperative hybrid systems.

[AI-41] GraphTrafficGPT: Enhancing Traffic Management Through Graph-Based AI Agent Coordination

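[Quick Read]: This paper addresses the efficiency bottlenecks of current chain-based Large Language Model (LLM) systems for traffic management, such as TrafficGPT: sequential task execution, high token consumption, and poor scalability, all of which limit deployment in complex urban traffic scenarios. The key to the solution is GraphTrafficGPT, a graph-based task coordination architecture that models tasks and their dependencies as nodes and edges of a directed graph, enabling parallel execution and dynamic resource allocation. Its core mechanism is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates multiple specialized agents (data retrieval, analysis, visualization, and simulation). Together with context-aware token management and concurrent multi-query processing, this markedly improves system efficiency and response time.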

Link: https://arxiv.org/abs/2507.13511
Authors: Nabil Abdelaziz Ferhat Taleb,Abdolazim Rezaei,Raj Atulkumar Patel,Mehdi Sookhak
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) offer significant promise for intelligent traffic management; however, current chain-based systems like TrafficGPT are hindered by sequential task execution, high token usage, and poor scalability, making them inefficient for complex, real-world scenarios. To address these limitations, we propose GraphTrafficGPT, a novel graph-based architecture, which fundamentally redesigns the task coordination process for LLM-driven traffic applications. GraphTrafficGPT represents tasks and their dependencies as nodes and edges in a directed graph, enabling efficient parallel execution and dynamic resource allocation. The main idea behind the proposed model is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates a network of specialized agents for data retrieval, analysis, visualization, and simulation. By introducing advanced context-aware token management and supporting concurrent multi-query processing, the proposed architecture handles interdependent tasks typical of modern urban mobility environments. Experimental results demonstrate that GraphTrafficGPT reduces token consumption by 50.2% and average response latency by 19.0% compared to TrafficGPT, while supporting simultaneous multi-query execution with up to 23.0% improvement in efficiency.
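
Representing tasks and dependencies as a directed graph is what makes parallel execution possible: every task whose prerequisites are done can run at the same time. The sketch below uses Python's standard `graphlib.TopologicalSorter` to group a dependency graph into parallel batches. The task names are hypothetical and the code is a generic scheduling illustration, not the GraphTrafficGPT implementation.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group tasks into batches; tasks in one batch have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole batch as finished
    return batches

# Hypothetical decomposition of one traffic query into agent tasks.
query_graph = {
    "fetch_flow": set(),
    "fetch_signals": set(),
    "analyze": {"fetch_flow", "fetch_signals"},
    "visualize": {"analyze"},
    "simulate": {"analyze"},
}

batches = parallel_batches(query_graph)
print(batches)
```

A chain-based system would run these five tasks in five sequential steps; the graph view needs only three rounds because the two fetches, and later the visualization and simulation, can run concurrently.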

[AI-42] PHASE: Passive Human Activity Simulation Evaluation

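[Quick Read]: This paper addresses the lack of a quantitative way to assess the behavioral realism of synthetic user personas in cybersecurity simulation environments such as cyber ranges, honeypots, and sandboxes. Existing methods cannot tell whether a synthetic user behaves like a real human, which undermines the credibility and usefulness of the simulation. The key to the solution is the PHASE (Passive Human Activity Simulation Evaluation) framework, which works on Zeek connection logs and classifies human versus non-human activity through non-intrusive passive network monitoring with over 90% accuracy. Its novel elements are a labeling approach that uses local DNS records to annotate traffic and the use of SHAP (SHapley Additive exPlanations) analysis to identify the temporal and behavioral signatures of genuine human users. These insights are then used to revise the behavioral configuration of synthetic users, markedly improving their human-likeness and the realism of the simulation environment.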

Link: https://arxiv.org/abs/2507.13505
Authors: Steven Lamp,Jason D. Hiser,Anh Nguyen-Tuong,Jack W. Davidson
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Cybersecurity simulation environments, such as cyber ranges, honeypots, and sandboxes, require realistic human behavior to be effective, yet no quantitative method exists to assess the behavioral fidelity of synthetic user personas. This paper presents PHASE (Passive Human Activity Simulation Evaluation), a machine learning framework that analyzes Zeek connection logs and distinguishes human from non-human activity with over 90% accuracy. PHASE operates entirely passively, relying on standard network monitoring without any user-side instrumentation or visible signs of surveillance. All network activity used for machine learning is collected via a Zeek network appliance to avoid introducing unnecessary network traffic or artifacts that could disrupt the fidelity of the simulation environment. The paper also proposes a novel labeling approach that utilizes local DNS records to classify network traffic, thereby enabling machine learning analysis. Furthermore, we apply SHAP (SHapley Additive exPlanations) analysis to uncover temporal and behavioral signatures indicative of genuine human users. In a case study, we evaluate a synthetic user persona and identify distinct non-human patterns that undermine behavioral realism. Based on these insights, we develop a revised behavioral configuration that significantly improves the human-likeness of synthetic activity yielding a more realistic and effective synthetic user persona.

[AI-43] AI-Assisted Fixes to Code Review Comments at Scale

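[Quick Read]: This paper addresses the inefficiency of code review at scale, in particular the challenge of handling the tens of thousands of code review comments written at Meta every week. The solution is MetaMateCR (Metamate for Code Review), an AI-assisted fix tool based on Large Language Models (LLMs) that automatically generates patches for review comments. The key steps are as follows. First, Llama models are fine-tuned on a dataset of 64k comment and patch pairs and reach a higher exact-match rate than GPT-4o in offline evaluation (68% vs. 59%). Second, randomized controlled safety trials run before launch showed that the initial design lengthened review time noticeably (by more than 5%), so the user experience (UX) was changed to show AI-suggested patches only to code authors, which removed the regression. Finally, in production the tool reaches an ActionableToApplied rate of 19.7%, 9.2 percentage points higher than GPT-4o, which demonstrates that AI-assisted fixes are effective and safe in a real engineering workflow.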

Link: https://arxiv.org/abs/2507.13499
Authors: Chandra Maddila,Negar Ghorbani,James Saindon,Parth Thakkar,Vijayaraghavan Murali,Rui Abreu,Jingyue Shen,Brian Zhou,Nachiappan Nagappan,Peter C. Rigby
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:Aim. There are tens of thousands of code review comments each week at Meta. We developed Metamate for Code Review (MetaMateCR) that provides AI-assisted fixes for reviewer comments in production at scale. Method. We developed an internal benchmark of 64k (review comment, patch) data points to fine-tune Llama models. Once our models achieve reasonable offline results, we roll them into production. To ensure that our AI-assisted fixes do not negatively impact the time it takes to do code reviews, we conduct randomized controlled safety trials as well as full production experiments. Offline Results. As a baseline, we compare GPT-4o to our small and large Llama models. In offline results, our LargeLSFT model creates an exact match patch 68% of the time outperforming GPT-4o by 9 percentage points (pp). The internal models also use more modern Hack functions when compared to the PHP functions suggested by GPT-4o. Safety Trial. When we roll MetaMateCR into production in a safety trial that compares no AI patches with AI patch suggestions, we see a large regression with reviewers taking over 5% longer to conduct reviews. After investigation, we modify the UX to only show authors the AI patches, and see no regressions in the time for reviews. Production. When we roll LargeLSFT into production, we see an ActionableToApplied rate of 19.7%, which is a 9.2pp improvement over GPT-4o. Our results illustrate the importance of safety trials in ensuring that AI does not inadvertently slow down engineers, and a successful review comment to AI patch product running at scale.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL) Cite as: arXiv:2507.13499 [cs.SE] (or arXiv:2507.13499v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.13499 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Chandra Maddila [view email] [v1] Thu, 17 Jul 2025 19:11:00 UTC (1,543 KB) Full-text links: Access Paper: View a PDF of the paper titled AI-Assisted Fixes to Code Review Comments at Scale, by Chandra Maddila and 9 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.SE prev | next new | recent | 2025-07 Change to browse by: cs cs.AI cs.PL References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) 

[AI-44] ERR@HRI 2.0 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Conversations

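[Quick Read]: This paper addresses the error-proneness of conversational robots driven by large language models (LLMs) in human-robot interaction, such as misunderstanding user intent, interrupting users prematurely, or failing to respond at all, which can cause conversational breakdowns and task failures and erode user trust. The key to the solution is the ERR@HRI 2.0 Challenge, which provides a multimodal dataset of 16 hours of dyadic interactions combining facial, speech, and head-movement features, annotated with robot failures identified from the system perspective and with the user's perceived intention to correct. Researchers are encouraged to develop machine learning models on this multimodal data to detect such failures, improving failure detection in human-robot interaction.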

Link: https://arxiv.org/abs/2507.13468
Authors: Shiye Cao, Maia Stiber, Amama Mahmood, Maria Teresa Parreira, Wendy Ju, Micol Spitale, Hatice Gunes, Chien-Ming Huang
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.

[AI-45] Graph Neural Network Surrogates for Contacting Deformable Bodies with Necessary and Sufficient Contact Detection

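[Quick Read]: This paper addresses the challenge of building surrogate models for contact between soft deformable bodies, particularly for fast inference of nonlinear boundary value problems under varying geometries. Existing methods are typically limited to rigid-body contact, or to rigid-soft contact with well-defined contact planes, and rely on collision-detection filters that use only necessary, not sufficient, conditions, so they struggle to capture complex contact behavior. The key to the solution is a new graph neural network (GNN) architecture that uses continuous collision detection and, for the first time, incorporates sufficient conditions designed for soft-body contact, allowing complex contact states to be modeled accurately. Adding extra contact terms to the loss function has a regularizing effect that improves generalization, both for contact at similar planes and element normal angles and for contact at differing ones, and the framework can handle varying reference geometries.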

Link: https://arxiv.org/abs/2507.13459
Authors: Vijay K. Dubey (1), Collin E. Haese (1), Osman Gültekin (1), David Dalton (2), Manuel K. Rausch (1), Jan N. Fuhg (1) ((1) The University of Texas at Austin, (2) University of Glasgow)
Institutions: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Surrogate models for the rapid inference of nonlinear boundary value problems in mechanics are helpful in a broad range of engineering applications. However, effective surrogate modeling of applications involving the contact of deformable bodies, especially in the context of varying geometries, is still an open issue. In particular, existing methods are confined to rigid body contact or, at best, contact between rigid and soft objects with well-defined contact planes. Furthermore, they employ contact or collision detection filters that serve as a rapid test but use only the necessary and not sufficient conditions for detection. In this work, we present a graph neural network architecture that utilizes continuous collision detection and, for the first time, incorporates sufficient conditions designed for contact between soft deformable bodies. We test its performance on two benchmarks, including a problem in soft tissue mechanics of predicting the closed state of a bioprosthetic aortic valve. We find a regularizing effect on adding additional contact terms to the loss function, leading to better generalization of the network. These benefits hold for simple contact at similar planes and element normal angles, and complex contact at differing planes and element normal angles. We also demonstrate that the framework can handle varying reference geometries. However, such benefits come with high computational costs during training, resulting in a trade-off that may not always be favorable. We quantify the training cost and the resulting inference speedups on various hardware architectures. Importantly, our graph neural network implementation results in up to a thousand-fold speedup for our benchmark problems at inference.

[AI-46] Air Traffic Controller Task Demand via Graph Neural Networks: An Interpretable Approach to Airspace Complexity

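[Quick Read]: This paper addresses the difficulty of assessing Air Traffic Controller (ATCO) task demand in real time in today's airspace, where existing complexity metrics often rely only on aircraft counts and fail to capture finer operational drivers. The key to the solution is an interpretable Graph Neural Network (GNN) framework that uses an attention mechanism to predict the number of upcoming clearances and quantifies each aircraft's contribution to task demand by systematically removing individual aircraft, yielding an interpretable per-aircraft task demand score. The method significantly outperforms a hand-crafted heuristic and traditional baseline models, offering a new analysis tool for controller training and airspace redesign.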

Link: https://arxiv.org/abs/2507.13423
Authors: Edward Henderson, Dewi Gould, Richard Everson, George De Ath, Nick Pepper
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Author Accepted Manuscript version of paper at the AIAA AVIATION Forum 2025

Click to view abstract

Abstract:Real-time assessment of near-term Air Traffic Controller (ATCO) task demand is a critical challenge in an increasingly crowded airspace, as existing complexity metrics often fail to capture nuanced operational drivers beyond simple aircraft counts. This work introduces an interpretable Graph Neural Network (GNN) framework to address this gap. Our attention-based model predicts the number of upcoming clearances, the instructions issued to aircraft by ATCOs, from interactions within static traffic scenarios. Crucially, we derive an interpretable, per-aircraft task demand score by systematically ablating aircraft and measuring the impact on the model’s predictions. Our framework significantly outperforms an ATCO-inspired heuristic and is a more reliable estimator of scenario complexity than established baselines. The resulting tool can attribute task demand to specific aircraft, offering a new way to analyse and understand the drivers of complexity for applications in controller training and airspace redesign.
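
The ablation step described in the abstract can be illustrated with a minimal sketch. It assumes a black-box predictor that maps a traffic scenario to an expected clearance count; `toy_predict`, the `alt_ft` field, and the 1000 ft rule are hypothetical stand-ins for the trained GNN and its inputs, not the authors' model.

```python
from typing import Callable, Dict, List

Aircraft = Dict[str, float]  # hypothetical per-aircraft feature record


def per_aircraft_demand(
    scenario: List[Aircraft],
    predict: Callable[[List[Aircraft]], float],
) -> List[float]:
    """Score each aircraft by how much the prediction drops when it is removed."""
    baseline = predict(scenario)
    scores = []
    for i in range(len(scenario)):
        ablated = scenario[:i] + scenario[i + 1:]  # scenario without aircraft i
        scores.append(baseline - predict(ablated))
    return scores


def toy_predict(scenario: List[Aircraft]) -> float:
    # Stand-in for the trained model (an assumption): one clearance per
    # aircraft plus one per pair flying within 1000 ft of each other.
    count = float(len(scenario))
    for i in range(len(scenario)):
        for j in range(i + 1, len(scenario)):
            if abs(scenario[i]["alt_ft"] - scenario[j]["alt_ft"]) < 1000:
                count += 1.0
    return count


scenario = [{"alt_ft": 30000}, {"alt_ft": 30500}, {"alt_ft": 36000}]
scores = per_aircraft_demand(scenario, toy_predict)
```

In this toy scenario the two aircraft at neighbouring levels receive a higher score than the isolated one, which is the kind of attribution the abstract describes.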

[AI-47] Soft-ECM: An extension of Evidential C-Means for complex data

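[Quick Read]: This paper addresses the inability of existing belief-function-based clustering algorithms to handle complex data types, such as mixed data (numerical and categorical together) or non-tabular time series. Traditional methods depend on properties of Euclidean space, especially for constructing barycenters, and complex data generally do not satisfy such assumptions. The key to the solution is to reformulate the Evidential C-Means (ECM) problem and propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters using only a semi-metric, making it applicable to non-Euclidean data. Experiments show that Soft-ECM performs comparably to conventional fuzzy clustering on numerical data, handles mixed data effectively, and shows clear benefits when combined with semi-metrics such as DTW for time series clustering.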

Link: https://arxiv.org/abs/2507.13417
Authors: Armel Soubeiga (LIMOS), Thomas Guyet (AISTROSIGHT), Violaine Antoine (LIMOS)
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments:

Click to view abstract

Abstract:Clustering based on belief functions has been gaining increasing attention in the machine learning community due to its ability to effectively represent uncertainty and/or imprecision. However, none of the existing algorithms can be applied to complex data, such as mixed data (numerical and categorical) or non-tabular data like time series. Indeed, these types of data are, in general, not represented in a Euclidean space and the aforementioned algorithms make use of the properties of such spaces, in particular for the construction of barycenters. In this paper, we reformulate the Evidential C-Means (ECM) problem for clustering complex data. We propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters requiring only a semi-metric. Our experiments show that Soft-ECM presents results comparable to conventional fuzzy clustering approaches on numerical data, and we demonstrate its ability to handle mixed data and its benefits when combining fuzzy clustering with semi-metrics such as DTW for time series data.

[AI-48] Single- to multi-fidelity history-dependent learning with uncertainty quantification and disentanglement: application to data-driven constitutive modeling

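[Quick Read]: This paper addresses how data-driven modeling can make effective use of history-dependent multi-fidelity data while quantifying epistemic uncertainty and separating it from data noise (aleatoric uncertainty). The key to the solution is a hierarchical, general methodology that adapts to different learning scenarios, from single-fidelity deterministic neural networks up to the proposed multi-fidelity variance estimation Bayesian recurrent neural networks, so that on multi-fidelity data with or without noise it can accurately predict the response, quantify model error, and identify the noise distribution.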

Link: https://arxiv.org/abs/2507.13416
Authors: Jiaxiang Yi, Bernardo P. Ferreira, Miguel A. Bessa
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages, 32 figures

Click to view abstract

Abstract:Data-driven learning is generalized to consider history-dependent multi-fidelity data, while quantifying epistemic uncertainty and disentangling it from data noise (aleatoric uncertainty). This generalization is hierarchical and adapts to different learning scenarios: from training the simplest single-fidelity deterministic neural networks up to the proposed multi-fidelity variance estimation Bayesian recurrent neural networks. The versatility and generality of the proposed methodology are demonstrated by applying it to different data-driven constitutive modeling scenarios that include multiple fidelities with and without aleatoric uncertainty (noise). The method accurately predicts the response and quantifies model error while also discovering the noise distribution (when present). This opens opportunities for future real-world applications in diverse scientific and engineering domains; especially, the most challenging cases involving design and analysis under uncertainty.

[AI-49] SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection

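[Quick Read]: This paper addresses two gaps in current multimodal fake news detection: underuse of the semantic enhancement capability of large models, and missing modeling of emotional features. Existing work focuses on cross-modal feature alignment and text-image consistency, overlooking the potential of Large Multimodal Models (LMMs) for semantic enhancement and leaving the link between the emotional features of news content and its authenticity underexplored. The key to the solution is a novel Semantic Enhancement and Emotional Reasoning (SEER) network: it first generates summarized captions to improve image semantic understanding and uses LMM outputs for semantic enhancement; it then introduces an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer news authenticity, combining semantic and emotional cues to improve detection performance.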

Link: https://arxiv.org/abs/2507.13415
Authors: Peican Zhu, Yubo Jing, Le Cheng, Bin Chen, Xiaodong Cui, Lianwei Wu, Keke Tang
Institutions: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: Accepted by SMC 2025

Click to view abstract

Abstract:Previous studies on multimodal fake news detection mainly focus on the alignment and integration of cross-modal features, as well as the application of text-image consistency. However, they overlook the semantic enhancement effects of large multimodal models and pay little attention to the emotional features of news. In addition, people find that fake news is more inclined to contain negative emotions than real ones. Therefore, we propose a novel Semantic Enhancement and Emotional Reasoning (SEER) Network for multimodal fake news detection. We generate summarized captions for image semantic understanding and utilize the products of large multimodal models for semantic enhancement. Inspired by the perceived relationship between news authenticity and emotional tendencies, we propose an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer the authenticity of news. Extensive experiments on two real-world datasets demonstrate the superiority of our SEER over state-of-the-art baselines.

[AI-50] Gauge Flow Models

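[Quick Read]: This paper addresses the limited performance of traditional Generative Flow Models when modeling complex data distributions, especially the difficulty of capturing nonlinear structure and multimodality in high-dimensional spaces. The key to the solution is to introduce a learnable Gauge Field as additional structure in the Flow ODE, strengthening the model's ability to adapt to local geometric changes in state space. This design lets the model adjust the direction and strength of the flow field more flexibly and significantly improves generation quality while preserving invertibility and computational efficiency; experiments show that on Gaussian Mixture Models it outperforms traditional flow models of comparable or larger size.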

Link: https://arxiv.org/abs/2507.13414
Authors: Alexander Strunk, Roland Assam
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG)
Comments:

Click to view abstract

Abstract:This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yield significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

[AI-51] H-NeiFi: Non-Invasive and Consensus-Efficient Multi-Agent Opinion Guidance

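[Quick Read]: This paper addresses the difficulty of steering opinion evolution on social media toward global consensus. Existing methods directly intervene in user views or force cross-group connections, which undermines user autonomy, provokes psychological resistance, and reduces consensus efficiency; lacking a long-term perspective, they also let local consensus deepen macro-level division. The key to the solution is H-NeiFi, a hierarchical non-intrusive opinion guidance framework. It builds a two-layer dynamic model based on social roles to capture the behavior of experts and non-experts, and introduces a non-intrusive neighbor filtering mechanism that adaptively regulates user communication channels. It uses multi-agent reinforcement learning (MARL) to optimize information propagation paths with a long-term reward function, avoiding direct interference with user interactions and thereby guiding consensus naturally and efficiently while protecting the autonomy of user interaction.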

Link: https://arxiv.org/abs/2507.13370
Authors: Shijun Guo, Haoran Xu, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyi Zhang, Yishan Song, Jiwei Chen
Institutions: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:The openness of social media enables the free exchange of opinions, but it also presents challenges in guiding opinion evolution towards global consensus. Existing methods often directly modify user views or enforce cross-group connections. These intrusive interventions undermine user autonomy, provoke psychological resistance, and reduce the efficiency of global consensus. Additionally, due to the lack of a long-term perspective, promoting local consensus often exacerbates divisions at the macro level. To address these issues, we propose the hierarchical, non-intrusive opinion guidance framework, H-NeiFi. It first establishes a two-layer dynamic model based on social roles, considering the behavioral characteristics of both experts and non-experts. Additionally, we introduce a non-intrusive neighbor filtering method that adaptively controls user communication channels. Using multi-agent reinforcement learning (MARL), we optimize information propagation paths through a long-term reward function, avoiding direct interference with user interactions. Experiments show that H-NeiFi increases consensus speed by 22.0% to 30.7% and maintains global convergence even in the absence of experts. This approach enables natural and efficient consensus guidance by protecting user interaction autonomy, offering a new paradigm for social network governance.

[AI-52] VerilogDB: The Largest Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation

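[Quick Read]: This paper addresses the limited performance of generative AI in hardware design automation, particularly Register Transfer Level (RTL) code generation, caused by the lack of high-quality, large-scale training data. The key to the solution is a structured, scalable Verilog dataset built through an automated three-stage process: creating and managing a PostgreSQL database (DB); collecting raw data from code hosting platforms such as OpenCores and GitHub; and a strict preprocessing pipeline (syntax verification, logic synthesis runs, and module metadata extraction) that ensures the data are accurate and usable. The resulting 20,392 Verilog samples form, to the authors' knowledge, the largest high-quality dataset for LLM fine-tuning, providing a solid foundation for later research on hardware generation with Large Language Models (LLMs).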

Link: https://arxiv.org/abs/2507.13369
Authors: Paul E. Calzada, Zahin Ibnat, Tanvir Rahman, Kamal Kandula, Danyu Lu, Sujan Kumar Saha, Farimah Farahmandi, Mark Tehranipoor
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. In this work, we examine the current literature on RTL generation using LLMs and identify key requirements for training and fine-tuning datasets. We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management with PostgreSQL, data collection from code hosting sites like OpenCores and GitHub, and data preprocessing to verify the codes’ syntax, run logic synthesis, and extract relevant module metadata. We implement a scalable and efficient DB infrastructure to support analysis and detail our preprocessing pipeline to enforce high-quality data before DB insertion. The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest high-quality Verilog dataset for LLM fine-tuning to our knowledge. We further evaluate the dataset, address associated challenges, and explore potential applications for future research and development in LLM-based hardware generation.

[AI-53] Scalable Attribute-Missing Graph Clustering via Neighborhood Differentiation

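[Quick Read]: This paper addresses the performance bottleneck of Deep Graph Clustering (DGC) on real-world graphs that are large-scale and attribute-missing. The key to the solution is a novel method, Complementary Multi-View Neighborhood Differentiation (CMV-ND): a recursive neighborhood search fully extracts local structural information at different hop distances, and a neighborhood differential strategy removes redundancy between the representations at different hops. This yields K+1 complementary views (K differential neighborhood views plus the target node's feature view), which improve the clustering performance of existing multi-view clustering or DGC methods.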

Link: https://arxiv.org/abs/2507.13368
Authors: Yaowen Hu, Wenxuan Tu, Yue Liu, Xinhang Wan, Junyi Yan, Taichun Zhou, Xinwang Liu
Institutions: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep graph clustering (DGC), which aims to unsupervisedly separate the nodes in an attribute graph into different clusters, has seen substantial potential in various industrial scenarios like community detection and recommendation. However, the real-world attribute graphs, e.g., social network interactions, are usually large-scale and attribute-missing. To solve these two problems, we propose a novel DGC method termed Complementary Multi-View Neighborhood Differentiation (CMV-ND), which preprocesses graph structural information into multiple views in a complete but non-redundant manner. First, to ensure completeness of the structural information, we propose a recursive neighborhood search that recursively explores the local structure of the graph by completely expanding node neighborhoods across different hop distances. Second, to eliminate the redundancy between neighborhoods at different hops, we introduce a neighborhood differential strategy that ensures no overlapping nodes between the differential hop representations. Then, we construct K+1 complementary views from the K differential hop representations and the features of the target node. Last, we apply existing multi-view clustering or DGC methods to the views. Experimental results on six widely used graph datasets demonstrate that CMV-ND significantly improves the performance of various methods.
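
The idea of non-overlapping hop neighborhoods can be sketched as follows. This is a minimal illustration that uses a breadth-first search to assign every node to the single hop at which it is first reached; the function name `differential_hops` and the adjacency-list input are assumptions, and the paper's recursive search and view construction may differ in detail.

```python
from collections import deque
from typing import Dict, List, Set


def differential_hops(adj: Dict[int, List[int]], target: int, k: int) -> List[Set[int]]:
    """Return k disjoint sets: nodes at shortest-path distance exactly 1..k from target."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == k:  # do not expand beyond hop k
            continue
        for neighbor in adj.get(node, []):
            if neighbor not in dist:  # first visit fixes the hop distance
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    rings: List[Set[int]] = [set() for _ in range(k)]
    for node, d in dist.items():
        if d >= 1:
            rings[d - 1].add(node)
    return rings


# Small undirected example graph: 0-1, 0-2, 1-3, 2-3, 3-4
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
rings = differential_hops(adj, target=0, k=3)
```

Each ring could then be pooled into one view (for example by averaging the available node features, which is also an assumption here), giving K differential views alongside the target node's own features.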

[AI-54] PGR-DRC: Pre-Global Routing DRC Violation Prediction Using Unsupervised Learning

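[Quick Read]: This paper addresses the reliance of conventional machine learning (ML) and neural network (NN) models for design rule checking (DRC) on large, balanced, labeled datasets and long training times. The key to the solution is a new unsupervised DRC violation prediction method that builds its model from unbalanced data of a single class and sets a threshold to decide whether new data fall within the normal boundary, enabling efficient classification. Validated on a CMOS 28 nm process, the method reaches 99.95% test accuracy, outperforming the support vector machine (SVM) and NN models, and reduces training time by about 26.3x and up to 6003x, respectively.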

Link: https://arxiv.org/abs/2507.13355
Authors: Riadul Islam, Dhandeep Challagundla
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Artificial intelligence (AI)-driven electronic design automation (EDA) tools, high-performance computing, and parallelized algorithms are essential for next-generation microprocessor innovation, ensuring continued progress in computing, AI, and semiconductor technology. Machine learning-based design rule checking (DRC) and lithography hotspot detection can improve first-pass silicon success. However, conventional ML and neural network (NN)-based models use supervised learning and require a large balanced dataset (in terms of positive and negative classes) and training time. This research addresses those key challenges by proposing the first-ever unsupervised DRC violation prediction methodology. The proposed model can be built from any unbalanced dataset using only one class, with a threshold set for that class; new data are then classified by querying whether they fall within the boundary of the model. This research verified the proposed model by implementing different computational cores using CMOS 28 nm technology and Synopsys Design Compiler and IC Compiler II tools. Then, layouts were divided into virtual grids to collect about 60k data points for analysis and verification. The proposed method has 99.95% prediction test accuracy, while the existing support vector machine (SVM) and neural network (NN) models have 85.44% and 98.74% accuracy, respectively. In addition, the proposed methodology has about 26.3x and up to 6003x lower training times compared to SVM and NN-models, respectively.
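
The single-class, threshold-based classification described above can be illustrated with a generic sketch. The abstract does not name the underlying model, so the centroid-and-distance boundary below, the class name `OneClassBoundary`, the quantile parameter, and the two-dimensional grid-cell features are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


class OneClassBoundary:
    """Fit a boundary around samples of one class; flag anything outside it."""

    def __init__(self, quantile: float = 0.95) -> None:
        self.quantile = quantile
        self.center: List[float] = []
        self.threshold = 0.0

    def _distance(self, x: Sequence[float]) -> float:
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, self.center)))

    def fit(self, samples: Sequence[Sequence[float]]) -> "OneClassBoundary":
        n, dim = len(samples), len(samples[0])
        # The centroid of the single training class defines the model.
        self.center = [sum(s[d] for s in samples) / n for d in range(dim)]
        # The threshold is a quantile of the training distances.
        distances = sorted(self._distance(s) for s in samples)
        self.threshold = distances[min(n - 1, int(self.quantile * n))]
        return self

    def is_inlier(self, x: Sequence[float]) -> bool:
        return self._distance(x) <= self.threshold


# Hypothetical features of violation-free grid cells (e.g. density measures)
clean_cells = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15], [0.10, 0.10], [0.20, 0.20]]
model = OneClassBoundary().fit(clean_cells)
similar_cell = model.is_inlier([0.15, 0.16])
unusual_cell = model.is_inlier([0.90, 0.90])
```

Only one class is needed for training, so no balanced set of positive and negative examples is required; the cost is that the threshold has to be chosen with care.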

[AI-55] The AI Ethical Resonance Hypothesis: The Possibility of Discovering Moral Meta-Patterns in AI Systems

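[Quick Read]: The question this paper tries to answer is how artificial intelligence (AI) systems could identify and understand deep moral patterns that lie beyond human intuition, extending what is known about the nature of ethics. The key to the proposal is to build purposefully designed cognitive structures, “ethical resonators”, that let AI process and integrate large amounts of ethical context data and discover moral meta-patterns across cultures and history, potentially revealing universal ethical foundations while prompting a new understanding of the human capacity for ethical reflection.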

Link: https://arxiv.org/abs/2507.11552
Authors: Tomasz Zgliczyński-Cuber
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 69 pages

Click to view abstract

Abstract:This paper presents a theoretical framework for the AI ethical resonance hypothesis, which proposes that advanced AI systems with purposefully designed cognitive structures (“ethical resonators”) may emerge with the ability to identify subtle moral patterns that are invisible to the human mind. The paper explores the possibility that by processing and synthesizing large amounts of ethical contexts, AI systems may discover moral meta-patterns that transcend cultural, historical, and individual biases, potentially leading to a deeper understanding of universal ethical foundations. The paper also examines a paradoxical aspect of the hypothesis, in which AI systems could potentially deepen our understanding of what we traditionally consider essentially human - our capacity for ethical reflection.

Machine Learning

[LG-0] An Adversarial-Driven Experimental Study on Deep Learning for RF Fingerprinting

Link: https://arxiv.org/abs/2507.14109
Authors: Xinyu Cao, Bimal Adhikari, Shangqing Zhao, Jingxian Wu, Yanjun Pan
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Radio frequency (RF) fingerprinting, which extracts unique hardware imperfections of radio devices, has emerged as a promising physical-layer device identification mechanism in zero trust architectures and beyond 5G networks. In particular, deep learning (DL) methods have demonstrated state-of-the-art performance in this domain. However, existing approaches have primarily focused on enhancing system robustness against temporal and spatial variations in wireless environments, while the security vulnerabilities of these DL-based approaches have often been overlooked. In this work, we systematically investigate the security risks of DL-based RF fingerprinting systems through an adversarial-driven experimental analysis. We observe a consistent misclassification behavior for DL models under domain shifts, where a device is frequently misclassified as another specific one. Our analysis based on extensive real-world experiments demonstrates that this behavior can be exploited as an effective backdoor to enable external attackers to intrude into the system. Furthermore, we show that training DL models on raw received signals causes the models to entangle RF fingerprints with environmental and signal-pattern features, creating additional attack vectors that cannot be mitigated solely through post-processing security methods such as confidence thresholds.

[LG-1] DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

Link: https://arxiv.org/abs/2507.14088
Authors: Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Real-time human-artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large language model (LLM) agents often fail to accurately model the complex human mental characteristics such as domain intentions, especially in the absence of direct communication. To address this limitation, we propose a novel dual process multi-scale theory of mind (DPMT) framework, drawing inspiration from cognitive science dual process theory. Our DPMT framework incorporates a multi-scale theory of mind (ToM) module to facilitate robust human partner modeling through mental characteristic reasoning. Experimental results demonstrate that DPMT significantly enhances human-AI collaboration, and ablation studies further validate the contributions of our multi-scale ToM in the slow system.

[LG-2] Preference-based Multi-Objective Reinforcement Learning

Link: https://arxiv.org/abs/2507.14066
Authors: Ni Mu, Yao Luan, Qing-Shan Jia
Categories: Machine Learning (cs.LG)
Comments: This article has been accepted for publication in IEEE Transactions on Automation Science and Engineering. This is the author’s version, which has not been fully edited, and the content may change prior to final publication. © 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies

Click to view abstract

Abstract:Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. Preferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems.

[LG-3] DONUT: Physics-aware Machine Learning for Real-time X-ray Nanodiffraction Analysis

Link: https://arxiv.org/abs/2507.14038
Authors: Aileen Luo, Tao Zhou, Ming Du, Martin V. Holt, Andrej Singer, Mathew J. Cherukara
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Coherent X-ray scattering techniques are critical for investigating the fundamental structural properties of materials at the nanoscale. While advancements have made these experiments more accessible, real-time analysis remains a significant bottleneck, often hindered by artifacts and computational demands. In scanning X-ray nanodiffraction microscopy, which is widely used to spatially resolve structural heterogeneities, this challenge is compounded by the convolution of the divergent beam with the sample’s local structure. To address this, we introduce DONUT (Diffraction with Optics for Nanobeam by Unsupervised Training), a physics-aware neural network designed for the rapid and automated analysis of nanobeam diffraction data. By incorporating a differentiable geometric diffraction model directly into its architecture, DONUT learns to predict crystal lattice strain and orientation in real-time. Crucially, this is achieved without reliance on labeled datasets or pre-training, overcoming a fundamental limitation for supervised machine learning in X-ray science. We demonstrate experimentally that DONUT accurately extracts all features within the data over 200 times more efficiently than conventional fitting methods.

[LG-4] Byzantine-resilient federated online learning for Gaussian process regression

Link: https://arxiv.org/abs/2507.14021
Authors: Xu Zhang, Zhenyuan Yuan, Minghui Zhu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:In this paper, we study Byzantine-resilient federated online learning for Gaussian process regression (GPR). We develop a Byzantine-resilient federated GPR algorithm that allows a cloud and a group of agents to collaboratively learn a latent function and improve the learning performances where some agents exhibit Byzantine failures, i.e., arbitrary and potentially adversarial behavior. Each agent-based local GPR sends potentially compromised local predictions to the cloud, and the cloud-based aggregated GPR computes a global model by a Byzantine-resilient product of experts aggregation rule. Then the cloud broadcasts the current global model to all the agents. Agent-based fused GPR refines local predictions by fusing the received global model with that of the agent-based local GPR. Moreover, we quantify the learning accuracy improvements of the agent-based fused GPR over the agent-based local GPR. Experiments on a toy example and two medium-scale real-world datasets are conducted to demonstrate the performances of the proposed algorithm.
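
To make the aggregation step concrete, the sketch below combines Gaussian predictions (mean, variance) from several agents with the standard product-of-experts rule, then shows one simple way to add resilience: discarding the predictions whose means lie farthest from the median. The trimming rule and the names `product_of_experts` and `trimmed_poe` are assumptions for illustration; the abstract does not specify the paper's actual Byzantine-resilient rule.

```python
from statistics import median
from typing import List, Tuple

Prediction = Tuple[float, float]  # (mean, variance) reported by one agent


def product_of_experts(preds: List[Prediction]) -> Prediction:
    """Standard PoE for Gaussians: precisions add, means are precision-weighted."""
    precision = sum(1.0 / var for _, var in preds)
    mean = sum(mu / var for mu, var in preds) / precision
    return mean, 1.0 / precision


def trimmed_poe(preds: List[Prediction], num_byzantine: int) -> Prediction:
    """Illustrative resilient variant (an assumption, not the paper's rule):
    drop the num_byzantine predictions farthest from the median mean."""
    if num_byzantine >= len(preds):
        raise ValueError("need at least one prediction left after trimming")
    center = median(mu for mu, _ in preds)
    ranked = sorted(preds, key=lambda p: abs(p[0] - center))
    return product_of_experts(ranked[: len(preds) - num_byzantine])


# Three honest agents and one faulty agent reporting a confident, wrong value
reports = [(1.0, 0.5), (1.2, 0.5), (0.8, 0.5), (100.0, 0.01)]
naive_mean, _ = product_of_experts(reports)
robust_mean, robust_var = trimmed_poe(reports, num_byzantine=1)
```

Because the plain product of experts weights each report by its claimed precision, a single agent reporting a tiny variance can dominate the result, which is why some form of filtering is needed before aggregation.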

[LG-5] On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Link: https://arxiv.org/abs/2507.14005
Authors: Mathieu Godbout, Audrey Durand
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent work has shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause for the failure has remained unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. For their solutions to match, a set of “risk-assignment consistency constraints” must be satisfied, and we demonstrate that the intersection of the constraints being empty is the source of previously observed evaluation errors. Quantifying the evaluation error as the CVaR evaluation gap, we then demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. We then leverage our proposed risk-assignment perspective to prove that the search for a single, uniformly optimal policy via the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.
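
For readers unfamiliar with the quantity being evaluated, the sketch below computes the static CVaR of a set of sampled episode returns. It assumes the lower-tail convention, where CVaR at level alpha is the mean of the worst alpha-fraction of returns, and equally weighted samples; the function name `empirical_cvar` is hypothetical and the code is not taken from the paper.

```python
from typing import Sequence


def empirical_cvar(returns: Sequence[float], alpha: float) -> float:
    """Mean of the worst alpha-fraction of equally weighted returns (lower tail)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(returns)          # worst outcomes first
    weight = 1.0 / len(ordered)        # probability mass per sample
    remaining = alpha                  # tail mass still to be covered
    total = 0.0
    for value in ordered:
        mass = min(weight, remaining)  # the last sample may count only partially
        total += mass * value
        remaining -= mass
        if remaining <= 1e-12:
            break
    return total / alpha


episode_returns = [10.0, 0.0, 5.0, -5.0]
cvar_half = empirical_cvar(episode_returns, alpha=0.5)  # mean of -5 and 0
cvar_full = empirical_cvar(episode_returns, alpha=1.0)  # plain expectation
```

At alpha = 1 the measure reduces to the ordinary expected return, and smaller values of alpha focus on increasingly bad outcomes, which is what makes the choice of initial risk level matter in the abstract's discussion.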

[LG-6] ParallelTime: Dynamically Weighting the Balance of Short- and Long-Term Temporal Dependencies

Link: https://arxiv.org/abs/2507.13998
Authors: Itay Katav, Aryeh Kontorovich
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with attention mechanism and Mamba. In natural language processing, an approach has been used that combines local window attention for capturing short-term dependencies and Mamba for capturing long-term dependencies, with their outputs averaged to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model’s knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba in time series forecasting. The implementation is readily available at: this https URL (ParallelTime GitHub)
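
The contrast between equal averaging and per-token dynamic weighting can be sketched in a few lines. The gate below scores the two branch outputs for each token and normalizes the scores with a softmax, so the two weights depend on each other and sum to one. It is a simplified illustration: the gate vectors `w_short` and `w_long` and the function `weighted_merge` are assumptions, not the ParallelTime Weighter itself.

```python
import math
from typing import List, Sequence, Tuple


def softmax2(a: float, b: float) -> Tuple[float, float]:
    """Numerically stable softmax over two scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def weighted_merge(
    short_out: List[List[float]],  # per-token output of the short-term branch
    long_out: List[List[float]],   # per-token output of the long-term branch
    w_short: Sequence[float],      # hypothetical learned gate vector
    w_long: Sequence[float],       # hypothetical learned gate vector
) -> List[List[float]]:
    merged = []
    for s, l in zip(short_out, long_out):
        features = list(s) + list(l)  # the gate sees both branch outputs
        a, b = softmax2(dot(features, w_short), dot(features, w_long))
        merged.append([a * x + b * y for x, y in zip(s, l)])
    return merged


short_branch = [[1.0, 2.0]]
long_branch = [[3.0, 4.0]]
zeros = [0.0, 0.0, 0.0, 0.0]
averaged = weighted_merge(short_branch, long_branch, zeros, zeros)
gated = weighted_merge(short_branch, long_branch, [10.0, 0.0, 0.0, 0.0], zeros)
```

With all-zero gate vectors the merge falls back to the plain average used in the earlier approach, while non-zero gates let each token lean toward one branch.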

[LG-7] Structural Connectome Harmonization Using Deep Learning: The Strength of Graph Neural Networks

Link: https://arxiv.org/abs/2507.13992
Authors: Jagruti Patel, Thomas A. W. Bolton, Mikkel Schöttner, Anjali Tarun, Sebastien Tourbier, Yasser Alemàn-Gòmez, Jonas Richiardi, Patric Hagmann
Categories: Machine Learning (cs.LG)

Abstract:Small sample sizes in neuroimaging in general, and in structural connectome (SC) studies in particular, limit the development of reliable biomarkers for neurological and psychiatric disorders - such as Alzheimer’s disease and schizophrenia - by reducing statistical power, reliability, and generalizability. Large-scale multi-site studies exist, but they have acquisition-related biases due to scanner heterogeneity, compromising imaging consistency and downstream analyses. While existing SC harmonization methods - such as linear regression (LR), ComBat, and deep learning techniques - mitigate these biases, they often rely on detailed metadata or traveling subjects (TS), or overlook the graph topology of SCs. To address these limitations, we propose a site-conditioned deep harmonization framework that harmonizes SCs across diverse acquisition sites without requiring metadata or TS, and we test it in a simulated scenario based on the Human Connectome Dataset. Within this framework, we benchmark three deep architectures - a fully connected autoencoder (AE), a convolutional AE, and a graph convolutional AE - against a top-performing LR baseline. While non-graph models excel in edge-weight prediction and edge existence detection, the graph AE demonstrates superior preservation of topological structure and subject-level individuality, as reflected by graph metrics and fingerprinting accuracy, respectively. Although the LR baseline achieves the highest numerical performance by explicitly modeling acquisition parameters, it lacks applicability to real-world multi-site use cases as detailed acquisition metadata is often unavailable. Our results highlight the critical role of model architecture in SC harmonization performance and demonstrate that graph-based approaches are particularly well-suited for structure-aware, domain-generalizable SC harmonization in large-scale multi-site SC studies.

[LG-8] Signs of the Past, Patterns of the Present: On the Automatic Classification of Old Babylonian Cuneiform Signs

Link: https://arxiv.org/abs/2507.13959
Authors: Eli Verwimp, Gustav Ryberg Smidt, Hendrik Hameeuw, Katrien De Graef
Categories: Machine Learning (cs.LG)
Notes: Paper under review at JOCCH

Abstract:The work in this paper describes the training and evaluation of machine learning (ML) techniques for the classification of cuneiform signs. There is a lot of variability in cuneiform signs, depending on where they come from, for what and by whom they were written, but also how they were digitized. This variability makes it unlikely that an ML model trained on one dataset will perform successfully on another dataset. This contribution studies how such differences impact that performance. Based on our results and insights, we aim to influence future data acquisition standards and provide a solid foundation for future cuneiform sign classification tasks. The ML model has been trained and tested on handwritten Old Babylonian (c. 2000-1600 B.C.E.) documentary texts inscribed on clay tablets originating from three Mesopotamian cities (Nippur, Dūr-Abiešuh and Sippar). The presented and analysed model is ResNet50, which achieves a top-1 score of 87.1% and a top-5 score of 96.5% for signs with at least 20 instances. As these automatic classification results are the first on Old Babylonian texts, there are currently no comparable results.

[LG-9] Robust Anomaly Detection with Graph Neural Networks using Controllability

Link: https://arxiv.org/abs/2507.13954
Authors: Yifan Wei, Anwar Said, Waseem Abbas, Xenofon Koutsoukos
Categories: Machine Learning (cs.LG)
Notes: conference paper published in IEEE CAI 2025

Abstract:Anomaly detection in complex domains poses significant challenges due to the need for extensive labeled data and the inherently imbalanced nature of anomalous versus benign samples. Graph-based machine learning models have emerged as a promising solution that combines attribute and relational data to uncover intricate patterns. However, the scarcity of anomalous data exacerbates the challenge, which requires innovative strategies to enhance model learning with limited information. In this paper, we hypothesize that the incorporation of the influence of the nodes, quantified through average controllability, can significantly improve the performance of anomaly detection. We propose two novel approaches to integrate average controllability into graph-based frameworks: (1) using average controllability as an edge weight and (2) encoding it as a one-hot edge attribute vector. Through rigorous evaluation on real-world and synthetic networks with six state-of-the-art baselines, our proposed methods demonstrate improved performance in identifying anomalies, highlighting the critical role of controllability measures in enhancing the performance of graph machine learning models. This work underscores the potential of integrating average controllability as additional metrics to address the challenges of anomaly detection in sparse and imbalanced datasets.

[LG-10] MoDyGAN: Combining Molecular Dynamics With GANs to Investigate Protein Conformational Space

Link: https://arxiv.org/abs/2507.13950
Authors: Jingbo Liang, Bruna Jacobson
Categories: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Biomolecules (q-bio.BM)

Abstract:Extensively exploring protein conformational landscapes remains a major challenge in computational biology due to the high computational cost involved in dynamic physics-based simulations. In this work, we propose a novel pipeline, MoDyGAN, that leverages molecular dynamics (MD) simulations and generative adversarial networks (GANs) to explore protein conformational spaces. MoDyGAN contains a generator that maps Gaussian distributions into MD-derived protein trajectories, and a refinement module that combines ensemble learning with a dual-discriminator to further improve the plausibility of generated conformations. Central to our approach is an innovative representation technique that reversibly transforms 3D protein structures into 2D matrices, enabling the use of advanced image-based GAN architectures. We use three rigid proteins to demonstrate that MoDyGAN can generate plausible new conformations. We also use deca-alanine as a case study to show that interpolations within the latent space closely align with trajectories obtained from steered molecular dynamics (SMD) simulations. Our results suggest that representing proteins as image-like data unlocks new possibilities for applying advanced deep learning techniques to biomolecular simulation, leading to an efficient sampling of conformational states. Additionally, the proposed framework holds strong potential for extension to other complex 3D structures.

[LG-11] Reframing attention as a reinforcement learning problem for causal discovery

Link: https://arxiv.org/abs/2507.13920
Authors: Turan Orujlu, Christian Gumbsch, Martin V. Butz, Charley M Wu
Categories: Machine Learning (cs.LG)

Abstract:Formal frameworks of causality have operated largely parallel to modern trends in deep reinforcement learning (RL). However, there has been a revival of interest in formally grounding the representations learned by neural networks in causal concepts. Yet, most attempts at neural models of causality assume static causal graphs and ignore the dynamic nature of causal interactions. In this work, we introduce the Causal Process framework as a novel theory for representing dynamic hypotheses about causal structure. Furthermore, we present the Causal Process Model as an implementation of this framework. This allows us to reformulate the attention mechanism popularized by Transformer networks within an RL setting with the goal of inferring interpretable causal processes from visual observations. Here, causal inference corresponds to constructing a causal graph hypothesis, which itself becomes an RL task nested within the original RL problem. To create an instance of such a hypothesis, we employ RL agents. These agents establish links between units similar to the original Transformer attention mechanism. We demonstrate the effectiveness of our approach in an RL environment where we outperform current alternatives in causal representation learning and agent performance, and uniquely recover graphs of dynamic causal processes.

[LG-12] On-the-Fly Fine-Tuning of Foundational Neural Network Potentials: A Bayesian Neural Network Approach

Link: https://arxiv.org/abs/2507.13805
Authors: Tim Rensmeyer, Denis Kramer, Oliver Niggemann
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)

Abstract:Due to the computational complexity of evaluating interatomic forces from first principles, the creation of interatomic machine learning force fields has become a highly active field of research. However, the generation of training datasets of sufficient size and sample diversity itself comes with a computational burden that can make this approach impractical for modeling rare events or systems with a large configuration space. Fine-tuning foundation models that have been pre-trained on large-scale material or molecular databases offers a promising opportunity to reduce the amount of training data necessary to reach a desired level of accuracy. However, even if this approach requires less training data overall, creating a suitable training dataset can still be a very challenging problem, especially for systems with rare events and for end-users who don’t have an extensive background in machine learning. In on-the-fly learning, the creation of a training dataset can be largely automated by using model uncertainty during the simulation to decide if the model is accurate enough or if a structure should be recalculated with classical methods and used to update the model. A key challenge for applying this form of active learning to the fine-tuning of foundation models is how to assess the uncertainty of those models during the fine-tuning process, even though most foundation models lack any form of uncertainty quantification. In this paper, we overcome this challenge by introducing a fine-tuning approach based on Bayesian neural network methods and a subsequent on-the-fly workflow that automatically fine-tunes the model while maintaining a pre-specified accuracy and can detect rare events such as transition states and sample them at an increased rate relative to their occurrence.

[LG-13] Dual-Center Graph Clustering with Neighbor Distribution ECAI-2025

Link: https://arxiv.org/abs/2507.13765
Authors: Enhao Cheng, Shoujia Zhang, Jianhua Yin, Li Jin, Liqiang Nie
Categories: Machine Learning (cs.LG)
Notes: ECAI-2025

Abstract:Graph clustering is crucial for unraveling intricate data structures, yet it presents significant challenges due to its unsupervised nature. Recently, goal-directed clustering techniques have yielded impressive results, with contrastive learning methods leveraging pseudo-labels garnering considerable attention. Nonetheless, pseudo-labels are unreliable as a supervision signal, and existing goal-directed approaches utilize only features to construct a single-target distribution for single-center optimization, which leads to incomplete and less dependable guidance. In our work, we propose a novel Dual-Center Graph Clustering (DCGC) approach based on neighbor distribution properties, which includes representation learning with neighbor distribution and dual-center optimization. Specifically, we utilize neighbor distribution as a supervision signal to mine hard negative samples in contrastive learning, which is reliable and enhances the effectiveness of representation learning. Furthermore, a neighbor distribution center is introduced alongside the feature center to jointly construct a dual-target distribution for dual-center optimization. Extensive experiments and analysis demonstrate superior performance and effectiveness of our proposed method.

[LG-14] MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Link: https://arxiv.org/abs/2507.13762
Authors: Yaowei Jin, Junjie Wang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Mingyue Zheng, Qian Shi
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with detailed theoretical foundation, training, and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.

[LG-15] An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC

Link: https://arxiv.org/abs/2507.13736
Authors: Matthias Jobst, Tim Langer, Chen Liu, Mehmet Alici, Hector A. Gonzalez, Christian Mayr
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: Poster at ACM ICONS 2025 - International Conference on Neuromorphic Systems

Abstract:This work presents a multi-layer DNN scheduling framework as an extension of OctopuScheduler, providing an end-to-end flow from PyTorch models to inference on a single SpiNNaker2 chip. Together with a front-end comprised of quantization and lowering steps, the proposed framework enables the edge-based execution of large and complex DNNs up to transformer scale using the neuromorphic platform SpiNNaker2.

[LG-16] Adversarial Training Improves Generalization Under Distribution Shifts in Bioacoustics

Link: https://arxiv.org/abs/2507.13727
Authors: René Heinrich, Lukas Rauch, Bernhard Sick, Christoph Scholz
Categories: Machine Learning (cs.LG)
Notes: Work in progress

Abstract:Adversarial training is a promising strategy for enhancing model robustness against adversarial attacks. However, its impact on generalization under substantial data distribution shifts in audio classification remains largely unexplored. To address this gap, this work investigates how different adversarial training strategies improve generalization performance and adversarial robustness in audio classification. The study focuses on two model architectures: a conventional convolutional neural network (ConvNeXt) and an inherently interpretable prototype-based model (AudioProtoPNet). The approach is evaluated using a challenging bird sound classification benchmark. This benchmark is characterized by pronounced distribution shifts between training and test data due to varying environmental conditions and recording methods, a common real-world challenge. The investigation explores two adversarial training strategies: one based on output-space attacks that maximize the classification loss function, and another based on embedding-space attacks designed to maximize embedding dissimilarity. These attack types are also used for robustness evaluation. Additionally, for AudioProtoPNet, the study assesses the stability of its learned prototypes under targeted embedding-space attacks. Results show that adversarial training, particularly using output-space attacks, improves clean test data performance by an average of 10.5% relative and simultaneously strengthens the adversarial robustness of the models. These findings, although derived from the bird sound domain, suggest that adversarial training holds potential to enhance robustness against both strong distribution shifts and adversarial attacks in challenging audio classification settings.

[LG-17] Graph-Structured Data Analysis of Component Failure in Autonomous Cargo Ships Based on Feature Fusion

Link: https://arxiv.org/abs/2507.13721
Authors: Zizhao Zhang, Tianxiang Zhao, Yu Sun, Liping Sun, Jichuan Kang
Categories: Machine Learning (cs.LG); Databases (cs.DB)

Abstract:To address the challenges posed by cascading reactions caused by component failures in autonomous cargo ships (ACS) and the uncertainties in emergency decision-making, this paper proposes a novel hybrid feature fusion framework for constructing a graph-structured dataset of failure modes. By employing an improved cuckoo search algorithm (HN-CSA), the literature retrieval efficiency is significantly enhanced, achieving improvements of 7.1% and 3.4% compared to the NSGA-II and CSA search algorithms, respectively. A hierarchical feature fusion framework is constructed, using Word2Vec encoding to encode subsystem/component features, BERT-KPCA to process failure modes/reasons, and Sentence-BERT to quantify the semantic association between failure impact and emergency decision-making. The dataset covers 12 systems, 1,262 failure modes, and 6,150 propagation paths. Validation results show that the GATE-GNN model achieves a classification accuracy of 0.735, comparable to existing benchmarks. Additionally, a silhouette coefficient of 0.641 indicates that the features are highly distinguishable. In the label prediction results, the Shore-based Meteorological Service System achieved an F1 score of 0.93, demonstrating high prediction accuracy. This paper not only provides a solid foundation for failure analysis in autonomous cargo ships but also offers reliable support for fault diagnosis, risk assessment, and intelligent decision-making systems. The link to the dataset is this https URL.

[LG-18] Bi-GRU Based Deception Detection using EEG Signals

Link: https://arxiv.org/abs/2507.13718
Authors: Danilo Avola, Muhammad Yasir Bilal, Emad Emam, Cristina Lakasz, Daniele Pannone, Amedeo Ranaldi
Categories: Machine Learning (cs.LG)

Abstract:Deception detection is a significant challenge in fields such as security, psychology, and forensics. This study presents a deep learning approach for classifying deceptive and truthful behavior using ElectroEncephaloGram (EEG) signals from the Bag-of-Lies dataset, a multimodal corpus designed for naturalistic, casual deception scenarios. A Bidirectional Gated Recurrent Unit (Bi-GRU) neural network was trained to perform binary classification of EEG samples. The model achieved a test accuracy of 97%, along with high precision, recall, and F1-scores across both classes. These results demonstrate the effectiveness of using bidirectional temporal modeling for EEG-based deception detection and suggest potential for real-time applications and future exploration of advanced neural architectures.

[LG-19] Benchmarking of EEG Analysis Techniques for Parkinson’s Disease Diagnosis: A Comparison between Traditional ML Methods and Foundation DL Methods

Link: https://arxiv.org/abs/2507.13716
Authors: Danilo Avola, Andrea Bernardini, Giancarlo Crocetti, Andrea Ladogana, Mario Lezoche, Maurizio Mancini, Daniele Pannone, Amedeo Ranaldi
Categories: Machine Learning (cs.LG)

Abstract:Parkinson’s Disease (PD) is a progressive neurodegenerative disorder that affects motor and cognitive functions, with early diagnosis being critical for effective clinical intervention. Electroencephalography (EEG) offers a non-invasive and cost-effective means of detecting PD-related neural alterations, yet the development of reliable automated diagnostic models remains a challenge. In this study, we conduct a systematic benchmark of traditional machine learning (ML) and deep learning (DL) models for classifying PD using a publicly available oddball task dataset. Our aim is to lay the groundwork for developing an effective learning system and to determine which approach produces the best results. We implement a unified seven-step preprocessing pipeline and apply consistent subject-wise cross-validation and evaluation criteria to ensure comparability across models. Our results demonstrate that while baseline deep learning architectures, particularly CNN-LSTM models, achieve the best performance compared to other deep learning architectures, underlining the importance of capturing long-range temporal dependencies, several traditional classifiers such as XGBoost also offer strong predictive accuracy and calibrated decision boundaries. By rigorously comparing these baselines, our work provides a solid reference framework for future studies aiming to develop and evaluate more complex or specialized architectures. Establishing a reliable set of baseline results is essential to contextualize improvements introduced by novel methods, ensuring scientific rigor and reproducibility in the evolving field of EEG-based neurodiagnostics.

[LG-20] LLaPipe: LLM-Guided Reinforcement Learning for Automated Data Preparation Pipeline Construction

Link: https://arxiv.org/abs/2507.13712
Authors: Jing Chang, Chang Liu, Jinbin Huang, Rui Mao, Jianbin Qin
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Automated data preparation is crucial for democratizing machine learning, yet existing reinforcement learning (RL) based approaches suffer from inefficient exploration in the vast space of possible preprocessing pipelines. We present LLaPipe, a novel framework that addresses this exploration bottleneck by integrating Large Language Models (LLMs) as intelligent policy advisors. Unlike traditional methods that rely solely on statistical features and blind trial-and-error, LLaPipe leverages the semantic understanding capabilities of LLMs to provide contextually relevant exploration guidance. Our framework introduces three key innovations: (1) an LLM Policy Advisor that analyzes dataset semantics and pipeline history to suggest promising preprocessing operations, (2) an Experience Distillation mechanism that mines successful patterns from past pipelines and transfers this knowledge to guide future exploration, and (3) an Adaptive Advisor Triggering strategy (Advisor+) that dynamically determines when LLM intervention is most beneficial, balancing exploration effectiveness with computational cost. Through extensive experiments on 18 diverse datasets spanning multiple domains, we demonstrate that LLaPipe achieves up to 22.4% improvement in pipeline quality and 2.3× faster convergence compared to state-of-the-art RL-based methods, while maintaining computational efficiency through selective LLM usage (averaging only 19.0% of total exploration steps).

[LG-21] CogniQ-H: A Soft Hierarchical Reinforcement Learning Paradigm for Automated Data Preparation

Link: https://arxiv.org/abs/2507.13710
Authors: Jing Chang, Chang Liu, Jinbin Huang, Rui Mao, Jianbin Qin
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Data preparation is a foundational yet notoriously challenging component of the machine learning lifecycle, characterized by a vast combinatorial search space of potential operator sequences. While reinforcement learning (RL) offers a promising direction, existing approaches are inefficient as they fail to capture the structured, hierarchical nature of the problem. We argue that Hierarchical Reinforcement Learning (HRL), a paradigm that has been successful in other domains, provides a conceptually ideal yet previously unexplored framework for this task. However, a naive HRL implementation with a “hard hierarchy” is prone to suboptimal, irreversible decisions. To address this, we introduce CogniQ-H, the first framework to implement a soft hierarchical paradigm for robust, end-to-end automated data preparation. CogniQ-H formulates action selection as a Bayesian inference problem. A high-level strategic prior, generated by a Large Language Model (LLM), guides exploration probabilistically. This prior is synergistically combined with a fine-grained operator quality score from a supervised Learning-to-Rank (LTR) model and a long-term value estimate from the agent’s own Q-function. This hybrid architecture allows CogniQ-H to balance strategic guidance with adaptive, evidence-based decision-making. Through extensive experiments on 18 diverse datasets spanning multiple domains, we demonstrate that CogniQ-H achieves up to 13.9% improvement in pipeline quality and 2.8× faster convergence compared to state-of-the-art RL-based methods.

[LG-22] Learning Deformable Body Interactions With Adaptive Spatial Tokenization

Link: https://arxiv.org/abs/2507.13707
Authors: Hao Wang, Yu Liu, Daniel Biggs, Haoru Wang, Jiandong Yu, Ping Huang
Categories: Machine Learning (cs.LG)
Notes: 21 pages, 15 figures

Abstract:Simulating interactions between deformable bodies is vital in fields like material science, mechanical design, and robotics. While learning-based methods with Graph Neural Networks (GNNs) are effective at solving complex physical systems, they encounter scalability issues when modeling deformable body interactions. To model interactions between objects, pairwise global edges have to be created dynamically, which is computationally intensive and impractical for large-scale meshes. To overcome these challenges, drawing on insights from geometric representations, we propose an Adaptive Spatial Tokenization (AST) method for efficient representation of physical states. By dividing the simulation space into a grid of cells and mapping unstructured meshes onto this structured grid, our approach naturally groups adjacent mesh nodes. We then apply a cross-attention module to map the sparse cells into a compact, fixed-length embedding, serving as tokens for the entire physical state. Self-attention modules are employed to predict the next state over these tokens in latent space. This framework leverages the efficiency of tokenization and the expressive power of attention mechanisms to achieve accurate and scalable simulation results. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in modeling deformable body interactions. Notably, it remains effective on large-scale simulations with meshes exceeding 100,000 nodes, where existing methods are hindered by computational limitations. Additionally, we contribute a novel large-scale dataset encompassing a wide range of deformable body interactions to support future research in this area.
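
The grid-mapping step described above can be illustrated with a short sketch that bins unstructured mesh nodes into cells of a uniform grid and keeps only the occupied cells. This is a simplified assumption of what such a step might look like, not the authors' AST implementation: the function name, the fixed `cell_size`, and the sample coordinates are hypothetical, and the adaptive sizing and attention modules are left out.

```python
import math
from collections import defaultdict


def bin_nodes(positions, cell_size):
    """Group mesh node indices by the grid cell that contains them.

    positions: iterable of (x, y, z) node coordinates.
    Returns a dict mapping integer cell coordinates to node indices;
    empty cells never appear, so the result is sparse.
    """
    cells = defaultdict(list)
    for idx, point in enumerate(positions):
        key = tuple(math.floor(c / cell_size) for c in point)
        cells[key].append(idx)
    return dict(cells)


# Four nodes: two close together, two elsewhere.
positions = [
    (0.1, 0.2, 0.0),
    (0.3, 0.4, 0.1),
    (1.2, 0.1, 0.0),
    (-0.1, 0.0, 0.0),
]
cells = bin_nodes(positions, cell_size=1.0)
```

Nodes from different bodies that come close to each other land in the same or neighboring cells, which is what lets a cell-level model account for contact without building pairwise global edges between all mesh nodes.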

[LG-23] Bayesian Optimization for Molecules Should Be Pareto-Aware

Link: https://arxiv.org/abs/2507.13704
Authors: Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy – Expected Hypervolume Improvement (EHVI) – against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants – including random or adaptive schemes – our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.
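
The difference between the two acquisition styles rests on two quantities that are easy to compute in two dimensions: a fixed-weight scalarized score and the hypervolume dominated by a Pareto front. The sketch below shows both for a toy set of points where both objectives are maximized. It is not the paper's benchmark code: EHVI is the expectation of the hypervolume improvement under the surrogate's posterior, whereas this example only computes the deterministic improvement for one hypothetical candidate, and all names and values are illustrative assumptions.

```python
def pareto_front(points):
    """Return the non-dominated points (both objectives maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the Pareto front, measured from a reference point.

    Assumes every point is at least as good as ref in both objectives.
    """
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def scalarized_best(points, weights):
    """Pick the single point with the highest fixed-weight sum."""
    return max(points, key=lambda p: weights[0] * p[0] + weights[1] * p[1])


observed = [(1.0, 5.0), (3.0, 4.0), (4.0, 2.0), (2.0, 2.0), (5.0, 1.0)]
front = pareto_front(observed)
best_scalar = scalarized_best(observed, weights=(0.5, 0.5))

# Deterministic hypervolume improvement of one hypothetical candidate.
candidate = (4.0, 4.0)
improvement = hypervolume_2d(observed + [candidate]) - hypervolume_2d(observed)
```

The scalarized score collapses the trade-off to one preferred point, while the hypervolume rewards any candidate that extends the front in any direction.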

[LG-24] Tight Bounds for Answering Adaptively Chosen Concentrated Queries

Link: https://arxiv.org/abs/2507.13700
Authors: Emma Rapoport, Edith Cohen, Uri Stemmer
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Most work on adaptive data analysis assumes that samples in the dataset are independent. When correlations are allowed, even the non-adaptive setting can become intractable, unless some structural constraints are imposed. To address this, Bassily and Freund [2016] introduced the elegant framework of concentrated queries, which requires the analyst to restrict itself to queries that are concentrated around their expected value. While this assumption makes the problem trivial in the non-adaptive setting, in the adaptive setting it remains quite challenging. In fact, all known algorithms in this framework support significantly fewer queries than in the independent case: At most O(n) queries for a sample of size n, compared to O(n^2) in the independent setting. In this work, we prove that this utility gap is inherent under the current formulation of the concentrated queries framework, assuming some natural conditions on the algorithm. Additionally, we present a simplified version of the best-known algorithms that match our impossibility result.

[LG-25] Kolmogorov-Arnold Networks-based GRU and LSTM for Loan Default Early Prediction

Link: https://arxiv.org/abs/2507.13685
Authors: Yue Yang, Zihan Su, Ying Zhang, Chang Chuan Goh, Yuxiang Lin, Anthony Graham Bellotti, Boon Giin Lee
Categories: Machine Learning (cs.LG)

Abstract:This study addresses a critical challenge in time series anomaly detection: enhancing the predictive capability of loan default models more than three months in advance to enable early identification of default events, helping financial institutions implement preventive measures before risk events materialize. Existing methods have significant drawbacks, such as their lack of accuracy in early predictions and their dependence on training and testing within the same year and specific time frames. These issues limit their practical use, particularly with out-of-time data. To address these, the study introduces two innovative architectures, GRU-KAN and LSTM-KAN, which merge Kolmogorov-Arnold Networks (KAN) with Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM) networks. The proposed models were evaluated against the baseline models (LSTM, GRU, LSTM-Attention, and LSTM-Transformer) in terms of accuracy, precision, recall, F1 and AUC in different lengths of feature window, sample sizes, and early prediction intervals. The results demonstrate that the proposed model achieves a prediction accuracy of over 92% three months in advance and over 88% eight months in advance, significantly outperforming existing baselines.

[LG-26] FedSkipTwin: Digital-Twin-Guided Client Skipping for Communication-Efficient Federated Learning

Link: https://arxiv.org/abs/2507.13624
Authors: Daniel Commey, Kamel Abbad, Garth V. Crosby, Lyes Khoukhi
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Communication overhead remains a primary bottleneck in federated learning (FL), particularly for applications involving mobile and IoT devices with constrained bandwidth. This work introduces FedSkipTwin, a novel client-skipping algorithm driven by lightweight, server-side digital twins. Each twin, implemented as a simple LSTM, observes a client’s historical sequence of gradient norms to forecast both the magnitude and the epistemic uncertainty of its next update. The server leverages these predictions, requesting communication only when either value exceeds a predefined threshold; otherwise, it instructs the client to skip the round, thereby saving bandwidth. Experiments are conducted on the UCI-HAR and MNIST datasets with 10 clients under a non-IID data distribution. The results demonstrate that FedSkipTwin reduces total communication by 12-15.5% across 20 rounds while simultaneously improving final model accuracy by up to 0.5 percentage points compared to the standard FedAvg algorithm. These findings establish that prediction-guided skipping is a practical and effective strategy for resource-aware FL in bandwidth-constrained edge environments.
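
The server-side decision rule described above is simple enough to sketch directly: a client is asked to communicate only when its forecast update magnitude or forecast uncertainty exceeds a threshold. The example below is a minimal illustration under stated assumptions, not the authors' code: the forecasts are passed in as plain numbers rather than produced by an LSTM twin, and the function names, client ids, and threshold values are hypothetical.

```python
def should_request_update(pred_norm, pred_uncertainty,
                          norm_threshold, uncertainty_threshold):
    """Request an update when either forecast exceeds its threshold."""
    return pred_norm > norm_threshold or pred_uncertainty > uncertainty_threshold


def plan_round(forecasts, norm_threshold, uncertainty_threshold):
    """Split clients into those asked to communicate and those told to skip.

    forecasts maps a client id to (predicted gradient norm, predicted
    uncertainty). In the paper these come from a per-client twin; here
    they are supplied directly.
    """
    requested, skipped = [], []
    for client_id in sorted(forecasts):
        pred_norm, pred_uncertainty = forecasts[client_id]
        if should_request_update(pred_norm, pred_uncertainty,
                                 norm_threshold, uncertainty_threshold):
            requested.append(client_id)
        else:
            skipped.append(client_id)
    return requested, skipped


forecasts = {
    "client_a": (0.90, 0.05),  # large expected update
    "client_b": (0.10, 0.02),  # small and confidently predicted: skip
    "client_c": (0.12, 0.40),  # small but uncertain forecast: still request
}
requested, skipped = plan_round(
    forecasts, norm_threshold=0.5, uncertainty_threshold=0.2
)
```

Including the uncertainty term means a client is skipped only when the twin both expects a small update and is confident in that expectation, which guards against skipping on a poor forecast.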

[LG-27] Tri-Learn Graph Fusion Network for Attributed Graph Clustering

Link: https://arxiv.org/abs/2507.13620
Authors: Binxiong Li,Yuefei Wang,Xu Xiang,Xue Li,Binyu Zhao,Heyang Gao,Qinyu Zhao,Xi Yu
Subjects: Machine Learning (cs.LG)
Comments: The source code for this study is available at this https URL

Abstract:In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework comprising GCN, Autoencoder (AE), and Graph Transformer modules, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a tri-learning mechanism and a feature fusion enhancement strategy. The three modules are fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving an accuracy improvement of approximately 0.87% on the ACM dataset, 14.14% on the Reuters dataset, and 7.58% on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.

[LG-28] Off-Policy Evaluation and Learning for Matching Markets RECSYS’25

Link: https://arxiv.org/abs/2507.13608
Authors: Yudai Hayashi,Shuhei Goda,Yuta Saito
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: RecSys’25

Abstract:Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B tests remain the gold standard for evaluating new policies in recommender systems for matching markets, they are costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the large scale and bidirectional nature of user interactions in matching platforms introduce variance issues and exacerbate reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, DiPS and DPR, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control in matching markets. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies so that they produce more matches. We empirically evaluate our methods through experiments on both synthetic data and A/B testing logs from a real job-matching platform. The empirical results highlight the superiority of our approach over existing methods in off-policy evaluation and learning tasks for a variety of configurations.

[LG-29] Improving Low-Cost Teleoperation: Augmenting GELLO with Force

Link: https://arxiv.org/abs/2507.13602
Authors: Shivakanth Sujit,Luca Nunziante,Dan Ogawa Lillrank,Rousslan Fernand Julien Dossa,Kai Arulkumaran
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted at the 2025 IEEE/SICE International Symposium on System Integration

Abstract:In this work we extend the low-cost GELLO teleoperation system, initially designed for joint position control, with additional force information. Our first extension is to implement force feedback, allowing users to feel resistance when interacting with the environment. Our second extension is to add force information into the data collection process and training of imitation learning models. We validate our additions by implementing these on a GELLO system with a Franka Panda arm as the follower robot, performing a user study, and comparing the performance of policies trained with and without force information on a range of simulated and real dexterous manipulation tasks. Qualitatively, users with robotics experience preferred our controller, and the addition of force inputs improved task success on the majority of tasks.

[LG-30] FuSeFL: Fully Secure and Scalable Cross-Silo Federated Learning

Link: https://arxiv.org/abs/2507.13591
Authors: Sahar Ghoflsaz Ghinani,Elaheh Sadredini
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 Pages, 12 Figures

Abstract:Federated Learning (FL) enables collaborative model training without centralizing client data, making it attractive for privacy-sensitive domains. While existing approaches employ cryptographic techniques such as homomorphic encryption, differential privacy, or secure multiparty computation to mitigate inference attacks (including model inversion, membership inference, and gradient leakage), they often suffer from high computational, communication, or memory overheads. Moreover, many methods overlook the confidentiality of the global model itself, which may be proprietary and sensitive. These challenges limit the practicality of secure FL, especially in cross-silo deployments involving large datasets and strict compliance requirements. We present FuSeFL, a fully secure and scalable FL scheme designed for cross-silo settings. FuSeFL decentralizes training across client pairs using lightweight secure multiparty computation (MPC), while confining the server’s role to secure aggregation. This design eliminates server bottlenecks, avoids data offloading, and preserves full confidentiality of data, model, and updates throughout training. FuSeFL defends against inference threats, achieves up to 95% lower communication latency and 50% lower server memory usage, and improves accuracy over prior secure FL solutions, demonstrating strong security and efficiency at scale.

[LG-31] Provable Low-Frequency Bias of In-Context Learning of Representations

Link: https://arxiv.org/abs/2507.13540
Authors: Yongyi Yang,Hidenori Tanaka,Wei Hu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:In-context learning (ICL) enables large language models (LLMs) to acquire new behaviors from the input sequence alone without any parameter updates. Recent studies have shown that ICL can surpass the original meaning learned in the pretraining stage by internalizing the structure of the data-generating process (DGP) of the prompt into the hidden representations. However, the mechanisms by which LLMs achieve this ability are left open. In this paper, we present the first rigorous explanation of such phenomena by introducing a unified framework of double convergence, where hidden representations converge both over context and across layers. This double convergence process leads to an implicit bias towards smooth (low-frequency) representations, which we prove analytically and verify empirically. Our theory explains several open empirical observations, including why learned representations exhibit globally structured but locally distorted geometry, and why their total energy decays without vanishing. Moreover, our theory predicts that ICL has an intrinsic robustness towards high-frequency noise, which we empirically confirm. These results provide new insights into the underlying mechanisms of ICL, and a theoretical foundation to study it that hopefully extends to more general data distributions and settings.

[LG-32] Fake or Real: The Impostor Hunt in Texts for Space Operations

Link: https://arxiv.org/abs/2507.13508
Authors: Agata Kaczmarek(1),Dawid Płudowski(1),Piotr Wilczyński(1),Przemysław Biecek(1),Krzysztof Kotowski(2),Ramez Shendy(2),Jakub Nalepa(2 and 3),Artur Janicki(1),Evridiki Ntagiou(4) ((1) Warsaw University of Technology, (2) KP Labs, (3) Silesian University of Technology, (4) European Space Agency, European Space Operations Center)
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:The “Fake or Real” competition hosted on Kaggle (this https URL) is the second part of a series of follow-up competitions and hackathons related to the “Assurance for Space Domain AI Applications” project funded by the European Space Agency (this https URL). The competition idea is based on two real-life AI security threats identified within the project: data poisoning and overreliance in Large Language Models. The task is to distinguish between the proper output from an LLM and the output generated under malicious modification of the LLM. As this problem has not been extensively researched, participants are required to develop new techniques to address it or to adapt existing ones to the problem statement.

[LG-33] Model-free Reinforcement Learning for Model-based Control: Towards Safe, Interpretable, and Sample-efficient Agents

Link: https://arxiv.org/abs/2507.13491
Authors: Thomas Banker,Ali Mesbah
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Training sophisticated agents for optimal decision-making under uncertainty has been key to the rapid development of modern autonomous systems across fields. Notably, model-free reinforcement learning (RL) has enabled decision-making agents to improve their performance directly through system interactions, with minimal prior knowledge about the system. Yet, model-free RL has generally relied on agents equipped with deep neural network function approximators, appealing to the networks’ expressivity to capture the agent’s policy and value function for complex systems. However, neural networks amplify the issues of sample inefficiency, unsafe learning, and limited interpretability in model-free RL. To this end, this work introduces model-based agents as a compelling alternative for control policy approximation, leveraging adaptable models of system dynamics, cost, and constraints for safe policy learning. These models can encode prior system knowledge to inform, constrain, and aid in explaining the agent’s decisions, while deficiencies due to model mismatch can be remedied with model-free RL. We outline the benefits and challenges of learning model-based agents – exemplified by model predictive control – and detail the primary learning approaches: Bayesian optimization, policy search RL, and offline strategies, along with their respective strengths. While model-free RL has long been established, its interplay with model-based agents remains largely unexplored, motivating our perspective on their combined potentials for sample-efficient learning of safe and interpretable decision-making agents.

[LG-34] LightAutoDS-Tab: Multi-AutoML Agentic System for Tabular Data

Link: https://arxiv.org/abs/2507.13413
Authors: Aleksey Lapin,Igor Hromov,Stanislav Chumakov,Mile Mitrovic,Dmitry Simakov,Nikolay O. Nikitin,Andrey V. Savchenko
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 2 figures

Abstract:AutoML has advanced in handling complex tasks through the integration of LLMs, yet its efficiency remains limited by dependence on specific underlying tools. In this paper, we introduce LightAutoDS-Tab, a multi-AutoML agentic system for tasks with tabular data, which combines LLM-based code generation with several AutoML tools. Our approach improves the flexibility and robustness of pipeline design, outperforming state-of-the-art open-source solutions on several data science tasks from Kaggle. The code of LightAutoDS-Tab is available in the open repository at this https URL

[LG-35] Selective Embedding for Deep Learning

Link: https://arxiv.org/abs/2507.13399
Authors: Mert Sehri,Zehui Hua,Francisco de Assis Boldt,Patrick Dumond
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Deep learning has revolutionized many industries by enabling models to automatically learn complex patterns from raw data, reducing dependence on manual feature engineering. However, deep learning algorithms are sensitive to input data, and performance often deteriorates under nonstationary conditions and across dissimilar domains, especially when using time-domain data. Conventional single-channel or parallel multi-source data loading strategies either limit generalization or increase computational costs. This study introduces selective embedding, a novel data loading strategy, which alternates short segments of data from multiple sources within a single input channel. Drawing inspiration from cognitive psychology, selective embedding mimics human-like information processing to reduce model overfitting, enhance generalization, and improve computational efficiency. Validation is conducted using six time-domain datasets, demonstrating that the proposed method consistently achieves high classification accuracy across various deep learning architectures while significantly reducing training times. The approach proves particularly effective for complex systems with multiple data sources, offering a scalable and resource-efficient solution for real-world applications in healthcare, heavy machinery, marine, railway, and agriculture, where robustness and adaptability are critical.
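
The idea of alternating short segments from several sources inside one input channel can be illustrated as follows. This is only a sketch of one possible reading of the abstract: the function name `selective_embedding`, the fixed `segment_len`, and the round-robin ordering are assumptions, and the paper may choose and order segments differently.

```python
def selective_embedding(sources, segment_len):
    """Interleave fixed-length segments from each source into one sequence.

    Segment i of every source is emitted in turn (round-robin), so the
    result is a single channel that alternates between the sources.
    Trailing samples that do not fill a whole segment are dropped.
    """
    n_segments = min(len(s) for s in sources) // segment_len
    out = []
    for i in range(n_segments):
        for s in sources:
            out.extend(s[i * segment_len:(i + 1) * segment_len])
    return out


# Two hypothetical sensor streams merged into a single input channel.
sensor_a = [1, 2, 3, 4]
sensor_b = [10, 20, 30, 40]
merged = selective_embedding([sensor_a, sensor_b], segment_len=2)
```

With these inputs, `merged` alternates two samples from `sensor_a` with two from `sensor_b`, keeping the input single-channel while still exposing the model to both sources.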

[LG-36] Improving KAN with CDF normalization to quantiles

Link: https://arxiv.org/abs/2507.13393
Authors: Jakub Strawa,Jarek Duda
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 9 figures

Abstract:Data normalization is crucial in machine learning, usually performed by subtracting the mean and dividing by the standard deviation, or by rescaling to a fixed range. In copula theory, popular in finance, normalization to approximate quantiles is used instead: x is transformed to CDF(x) with an estimated CDF (cumulative distribution function), giving a nearly uniform distribution on [0,1] and allowing for simpler representations that are less likely to overfit. This approach seems nearly unknown in machine learning; we therefore present some of its advantages using the example of the recently popular Kolmogorov-Arnold Networks (KANs), improving predictions from Legendre-KAN just by switching from rescaling to CDF normalization. Additionally, in the HCR interpretation, the weights of such neurons are mixed moments providing local joint distribution models, which also allows propagating probability distributions and changing the propagation direction.
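
The x to CDF(x) transformation can be sketched with an empirical CDF. This is a minimal illustration, assuming a rank-based estimate; the abstract only says the CDF is "estimated", so the estimator, the function name `empirical_cdf_normalize`, and the `(rank + 0.5) / (n + 1)` smoothing are choices made here for the example.

```python
from bisect import bisect_right


def empirical_cdf_normalize(train, values):
    """Map each value to its empirical CDF under the training sample.

    The rank of a value is the number of training points <= value.
    Using (rank + 0.5) / (n + 1) keeps every output strictly inside
    (0, 1), even for values outside the training range.
    """
    sorted_train = sorted(train)
    n = len(sorted_train)
    return [(bisect_right(sorted_train, v) + 0.5) / (n + 1) for v in values]


# A heavy-tailed sample: the outlier 100 no longer dominates the scale.
train = [1.0, 2.0, 3.0, 100.0]
normalized = empirical_cdf_normalize(train, train)
```

Unlike mean and standard-deviation scaling, the outlier ends up only one rank step above its neighbour, which is the property that makes the normalized inputs nearly uniform on [0,1].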

[LG-37] Quantum Boltzmann Machines using Parallel Annealing for Medical Image Classification

Link: https://arxiv.org/abs/2507.14116
Authors: Daniëlle Schuman,Mark V. Seebode,Tobias Rohe,Maximilian Balthasar Mansky,Michael Schroedl-Baumann,Jonas Stein,Claudia Linnhoff-Popien,Florian Krellner
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures (10 if counting subfigures), 2 tables. To be published in the proceedings of the 2025 IEEE International Conference on Quantum Computing and Engineering (QCE)

Abstract:Exploiting the fact that samples drawn from a quantum annealer inherently follow a Boltzmann-like distribution, annealing-based Quantum Boltzmann Machines (QBMs) have gained increasing popularity in the quantum research community. While they hold great promise for quantum speed-up, their use currently remains a costly endeavor, as large amounts of QPU time are required to train them. This limits their applicability in the NISQ era. Following the idea of Noè et al. (2024), who tried to alleviate this cost by incorporating parallel quantum annealing into their unsupervised training of QBMs, this paper presents an improved version of parallel quantum annealing that we employ to train QBMs in a supervised setting. Because it saves the qubits otherwise needed to encode the inputs, the supervised setting allows us to test our approach on medical images from the MedMNIST data set (Yang et al., 2023), thereby moving closer to real-world applicability of the technology. Our experiments show that QBMs using our approach already achieve reasonable results, comparable to those of similarly-sized Convolutional Neural Networks (CNNs), with markedly smaller numbers of epochs than these classical models. Our parallel annealing technique leads to a speed-up of almost 70% compared to regular annealing-based BM executions.

[LG-38] Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design

Link: https://arxiv.org/abs/2507.14057
Authors: Marcel Hedman,Desi R. Ivanova,Cong Guan,Tom Rainforth
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025

Abstract:We develop a semi-amortized, policy-based, approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing, fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it as data is gathered, refining it to the particular experimental instance. This test-time adaptation improves both the flexibility and the robustness of the design strategy compared with existing approaches. Empirically, Step-DAD consistently demonstrates superior decision-making and robustness compared with current state-of-the-art BED methods.

[LG-39] Conformalized Regression for Continuous Bounded Outcomes

Link: https://arxiv.org/abs/2507.14023
Authors: Zhanli Wu,Fabrizio Leisen,F. Javier Rubio
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: R code and data can be found at: this https URL

Abstract:Regression problems with bounded continuous outcomes frequently arise in real-world statistical and machine learning applications, such as the analysis of rates and proportions. A central challenge in this setting is predicting a response associated with a new covariate value. Most of the existing statistical and machine learning literature has focused either on point prediction of bounded outcomes or on interval prediction based on asymptotic approximations. We develop conformal prediction intervals for bounded outcomes based on transformation models and beta regression. We introduce tailored non-conformity measures based on residuals that are aligned with the underlying models, and account for the inherent heteroscedasticity in regression settings with bounded outcomes. We present a theoretical result on asymptotic marginal and conditional validity in the context of full conformal prediction, which remains valid under model misspecification. For split conformal prediction, we provide an empirical coverage analysis based on a comprehensive simulation study. The simulation study demonstrates that both methods provide valid finite-sample predictive coverage, including settings with model misspecification. Finally, we demonstrate the practical performance of the proposed conformal prediction intervals on real data and compare them with bootstrap-based alternatives.
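
For readers unfamiliar with split conformal prediction, the generic recipe can be sketched as below. This is not the paper's method: it uses plain absolute residuals as the non-conformity score instead of the tailored, model-aligned scores the abstract describes, and the function name, the clipping to `[lower, upper]`, and the toy calibration data are assumptions made for illustration.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred,
                             alpha=0.1, lower=0.0, upper=1.0):
    """Generic split conformal interval, clipped to a bounded outcome range.

    Scores are absolute residuals on a held-out calibration set. The
    interval half-width is the ceil((n + 1) * (1 - alpha))-th smallest
    score; if that index exceeds n, the full range is returned.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return lower, upper  # too few calibration points for this alpha
    q = scores[k - 1]
    return max(lower, new_pred - q), min(upper, new_pred + q)


# Hypothetical calibration set: predictions of 0.0, so residuals equal targets.
cal_targets = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.08]
cal_preds = [0.0] * len(cal_targets)
lo, hi = split_conformal_interval(cal_preds, cal_targets, new_pred=0.95, alpha=0.2)
```

In this toy run the upper end is clipped at 1.0, which shows why symmetric residual intervals fit bounded outcomes poorly near the boundary and why scores adapted to the bounded setting are of interest.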

[LG-40] A Survey of Dimension Estimation Methods

Link: https://arxiv.org/abs/2507.13887
Authors: James A. D. Binnie,Paweł Dłotko,John Harvey,Jakub Malinowski,Ka Man Yim
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Differential Geometry (math.DG); Metric Geometry (math.MG); Statistics Theory (math.ST)
Comments: 45 pages + appendices, 24 figures

Abstract:It is a standard assumption that datasets in high dimension have an internal structure which means that they in fact lie on, or near, subsets of a lower dimension. In many instances it is important to understand the real dimension of the data, hence the complexity of the dataset at hand. A great variety of dimension estimators have been developed to find the intrinsic dimension of the data but there is little guidance on how to reliably use these estimators. This survey reviews a wide range of dimension estimation methods, categorising them by the geometric information they exploit: tangential estimators which detect a local affine structure; parametric estimators which rely on dimension-dependent probability distributions; and estimators which use topological or metric invariants. The paper evaluates the performance of these methods, as well as investigating varying responses to curvature and noise. Key issues addressed include robustness to hyperparameter selection, sample size requirements, accuracy in high dimensions, precision, and performance on non-linear geometries. In identifying the best hyperparameters for benchmark datasets, overfitting is frequent, indicating that many estimators may not generalise well beyond the datasets on which they have been tested. MSC classes: 62R40 (Primary); 62R30, 62R07, 62G05, 53Z50 (Secondary).

[LG-41] Conformal Data Contamination Tests for Trading or Sharing of Data

Link: https://arxiv.org/abs/2507.13835
Authors: Martin V. Vejling,Shashi Raj Pandey,Christophe A. N. Biscio,Petar Popovski
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need quality guarantees before purchasing, as external data may be contaminated or irrelevant to their specific learning task. Previous works primarily rely on distributional assumptions about data from different agents, relegating quality checks to post-hoc steps involving costly data valuation procedures. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization. To achieve this, we introduce novel two-sample testing procedures, grounded in rigorous theoretical foundations for conformal outlier detection, to determine whether an agent’s data exceeds a contamination threshold. The proposed tests, termed conformal data contamination tests, remain valid under arbitrary contamination levels while enabling false discovery rate control via the Benjamini-Hochberg procedure. Empirical evaluations across diverse collaborative learning scenarios demonstrate the robustness and effectiveness of our approach. Overall, the conformal data contamination test distinguishes itself as a generic procedure for aggregating data with statistically rigorous quality guarantees.
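
The Benjamini-Hochberg step mentioned in this abstract is a standard procedure and can be sketched as follows. The p-values below are made up for illustration, and the sketch does not cover how the paper derives them from its conformal contamination tests.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return the indices of hypotheses rejected at FDR level q.

    Sort the p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


# Hypothetical p-values, one per external data agent.
p_values = [0.001, 0.5, 0.03, 0.04, 0.9]
rejected = benjamini_hochberg(p_values, q=0.1)
```

With `q=0.1` and five hypotheses the thresholds are 0.02, 0.04, 0.06, 0.08, and 0.10, so the three smallest p-values pass and agents 0, 2, and 3 are flagged.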

[LG-42] Differential Privacy in Kernelized Contextual Bandits via Random Projections

Link: https://arxiv.org/abs/2507.13639
Authors: Nikola Pavlovic,Sudeep Salgia,Qing Zhao
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space. We study this problem under an additional constraint of Differential Privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that achieves the state-of-the-art cumulative regret of $\widetilde{\mathcal{O}}\left(\sqrt{\gamma_T T}+\frac{\gamma_T}{\varepsilon_{\mathrm{DP}}}\right)$ and $\widetilde{\mathcal{O}}\left(\sqrt{\gamma_T T}+\frac{\gamma_T\sqrt{T}}{\varepsilon_{\mathrm{DP}}}\right)$ over a time horizon of $T$ in the joint and local models of differential privacy, respectively, where $\gamma_T$ is the effective dimension of the kernel and $\varepsilon_{\mathrm{DP}} > 0$ is the privacy parameter. The key ingredient of the proposed algorithm is a novel private kernel-ridge regression estimator which is based on a combination of private covariance estimation and private random projections. It offers a significantly reduced sensitivity compared to its classical counterpart while maintaining a high prediction accuracy, allowing our algorithm to achieve the state-of-the-art performance guarantees.

[LG-43] State Space Models Naturally Produce Traveling Waves, Time Cells, and Scale to Abstract Cognitive Functions

Link: https://arxiv.org/abs/2507.13638
Authors: Sen Lu,Xiaoyu Zhang,Mingtao Hu,Eric Yeu-Jer Lee,Soohyeon Kim,Wei D. Lu
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: Sen Lu and Xiaoyu Zhang contributed equally. Wei D. Lu is the corresponding author. 4 figures are included in 15 pages

Abstract:A grand challenge in modern neuroscience is to bridge the gap between the detailed mapping of microscale neural circuits and a mechanistic understanding of cognitive functions. While extensive knowledge exists about neuronal connectivity and biophysics, a significant gap remains in how these elements combine to produce flexible, learned behaviors. Here, we propose that a framework based on State-Space Models (SSMs), an emerging class of deep learning architectures, can bridge this gap. We argue that the differential equations governing elements in an SSM are conceptually consistent with the biophysical dynamics of neurons, while the combined dynamics in the model lead to emergent behaviors observed in experimental neuroscience. We test this framework by training an S5 model–a specific SSM variant employing a diagonal state transition matrix–on temporal discrimination tasks with reinforcement learning (RL). We demonstrate that the model spontaneously develops neural representations that strikingly mimic biological ‘time cells’. We reveal that these cells emerge from a simple generative principle: learned rotational dynamics of hidden state vectors in the complex plane. This single mechanism unifies the emergence of time cells, ramping activity, and oscillations/traveling waves observed in numerous experiments. Furthermore, we show that this rotational dynamics generalizes beyond interval discriminative tasks to abstract event-counting tasks that were considered foundational for performing complex cognitive tasks. Our findings position SSMs as a compelling framework that connects single-neuron dynamics to cognitive phenomena, offering a unifying and computationally tractable theoretical ground for temporal learning in the brain.

[LG-44] A Collaborative Framework Integrating Large Language Model and Chemical Fragment Space: Mutual Inspiration for Lead Design

Link: https://arxiv.org/abs/2507.13580
Authors: Hao Tuo,Yan Li,Xuanning Hu,Haishi Zhao,Xueyan Liu,Bo Yang
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Abstract:Combinatorial optimization algorithms are essential in computer-aided drug design, progressively exploring chemical space to design lead compounds with high affinity to a target protein. However, current methods face inherent challenges in integrating domain knowledge, limiting their performance in identifying lead compounds with novel and valid binding modes. Here, we propose AutoLeadDesign, a lead compound design framework that inspires extensive domain knowledge encoded in large language models with chemical fragments to progressively implement efficient exploration of vast chemical space. Comprehensive experiments indicate that AutoLeadDesign outperforms baseline methods. Significantly, empirical lead design campaigns targeting two clinically relevant targets (PRMT5 and SARS-CoV-2 PLpro) demonstrate AutoLeadDesign’s competence in de novo generation of lead compounds, achieving expert-competitive design efficacy. Structural analysis further confirms their mechanism-validated inhibitory patterns. By tracing the design process, we find that AutoLeadDesign shares analogous mechanisms with fragment-based drug design, which traditionally relies on expert decision-making, further revealing why it works. Overall, AutoLeadDesign offers an efficient approach for lead compound design, suggesting its potential utility in drug design.

[LG-45] Physics-guided impact localisation and force estimation in composite plates with uncertainty quantification

Link: https://arxiv.org/abs/2507.13376
Authors: Dong Xiao,Zahra Sharif-Khodaei,M. H. Aliabadi
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 37 pages (including the appendix and references), 16 figures

Abstract:Physics-guided approaches offer a promising path toward accurate and generalisable impact identification in composite structures, especially when experimental data are sparse. This paper presents a hybrid framework for impact localisation and force estimation in composite plates, combining a data-driven implementation of First-Order Shear Deformation Theory (FSDT) with machine learning and uncertainty quantification. The structural configuration and material properties are inferred from dispersion relations, while boundary conditions are identified via modal characteristics to construct a low-fidelity but physically consistent FSDT model. This model enables physics-informed data augmentation for extrapolative localisation using supervised learning. Simultaneously, an adaptive regularisation scheme derived from the same model improves the robustness of impact force reconstruction. The framework also accounts for uncertainty by propagating localisation uncertainty through the force estimation process, producing probabilistic outputs. Validation on composite plate experiments confirms the framework’s accuracy, robustness, and efficiency in reducing dependence on large training datasets. The proposed method offers a scalable and transferable solution for impact monitoring and structural health management in composite aerostructures.

[LG-46] Asymptotic behavior of eigenvalues of large rank perturbations of large random matrices

Link: https://arxiv.org/abs/2507.12182
Authors: Ievgenii Afanasiev,Leonid Berlyand,Mariia Kiyashko
Subjects: Mathematical Physics (math-ph); Machine Learning (cs.LG); Probability (math.PR)
Comments: 14 pages, 3 figures

Abstract:The paper is concerned with deformed Wigner random matrices. These matrices are closely connected with Deep Neural Networks (DNNs): weight matrices of trained DNNs could be represented in the form R + S, where R is random and S is highly correlated. The spectrum of such matrices plays a key role in the rigorous underpinning of the novel pruning technique based on Random Matrix Theory. So far, the mathematics has been done only for a finite-rank matrix S. However, in practice the rank may grow. In this paper we develop asymptotic analysis for the case of growing rank.

Information Retrieval

[IR-0] PARK: Personalized academic retrieval with knowledge-graphs

Link: https://arxiv.org/abs/2507.13910
Authors: Pranav Kasela, Gabriella Pasi, Raffaele Perego
Categories: Information Retrieval (cs.IR)
Comments: Accepted in Information Systems. [17 May 2025]

Journal reference: Information Systems, 134, 102574. Related DOI: https://doi.org/10.1016/j.is.2025.102574

Abstract: Academic search is a search task aimed at managing and retrieving scientific documents such as journal articles and conference papers. Personalization in this context meets individual researchers’ needs by leveraging, through user profiles, user-related information (e.g. documents authored by a researcher) to improve search effectiveness and to reduce information overload. While citation graphs are a valuable means to support the outcome of recommender systems, their use in personalized academic search (with, e.g., nodes as papers and edges as citations) is still under-explored. Existing personalized models for academic search often struggle to fully capture users’ academic interests. To address this, we propose a two-step approach: first, training a neural language model for retrieval, then converting the academic graph into a knowledge graph and embedding it into a shared semantic space with the language model using translational embedding techniques. This allows user models to capture both explicit relationships and hidden structures in citation graphs and paper content. We evaluate our approach in four academic search domains, outperforming traditional graph-based and personalized models in three out of four, with up to a 10% improvement in MAP@100 over the second-best model. This highlights the potential of knowledge graph-based user models to enhance retrieval effectiveness.
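
For readers unfamiliar with the MAP@100 metric reported in this abstract, here is a minimal sketch of MAP@k as it is commonly defined. The function names, the dict-based input layout, and the normalisation by min(k, number of relevant documents) are assumptions made for illustration; the abstract does not say which variant the authors use.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """AP@k: sum of precision values at the ranks (<= k) where a relevant item
    appears, divided by min(k, number of relevant items).

    Assumes ranked_ids contains no duplicates. Other normalisations exist.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings, relevance, k=100):
    """MAP@k: mean of AP@k over all queries.

    rankings:  dict mapping query id -> ranked list of document ids
    relevance: dict mapping query id -> collection of relevant document ids
    """
    if not rankings:
        return 0.0
    scores = [
        average_precision_at_k(rankings[q], relevance.get(q, ()), k)
        for q in rankings
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
    relevance = {"q1": {"a", "c"}, "q2": {"y"}}
    print(mean_average_precision_at_k(rankings, relevance, k=100))
```

A relative gain such as the "up to 10%" quoted above would then be computed as the difference between two systems' MAP@100 values divided by the baseline value.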

[IR-1] IP2: Entity-Guided Interest Probing for Personalized News Recommendation RECSYS2025

Link: https://arxiv.org/abs/2507.13622
Authors: Youlin Wu, Yuanyuan Sun, Xiaokun Zhang, Haoxi Zhan, Bo Xu, Liang Yang, Hongfei Lin
Categories: Information Retrieval (cs.IR)
Comments: Accepted in RecSys 2025

Abstract: News recommender systems aim to provide personalized news reading experiences for users based on their reading history. Behavioral science studies suggest that screen-based news reading consists of three successive steps: scanning, title reading, and then clicking. Adhering to these steps, we find that intra-news entity interest dominates the scanning stage, while inter-news entity interest guides title reading and influences click decisions. Unfortunately, current methods overlook the unique utility of entities in news recommendation. To this end, we propose a novel method called IP2 to probe entity-guided reading interest at both intra- and inter-news levels. At the intra-news level, a Transformer-based entity encoder is devised to aggregate mentioned entities in the news title into one signature entity. Then, signature entity-title contrastive pre-training is adopted to initialize entities with proper meanings using the news story context, which at the same time lets us probe for intra-news entity interest. As for the inter-news level, a dual-tower user encoder is presented to capture inter-news reading interest from both the title meaning and entity sides. In addition to highlighting the contribution of inter-news entity guidance, a cross-tower attention link is adopted to calibrate title reading interest using inter-news entity interest, thus further aligning with real-world behavior. Extensive experiments on two real-world datasets demonstrate that our IP2 achieves state-of-the-art performance in news recommendation.
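
To make the idea of a cross-tower attention link more concrete, the sketch below shows plain scaled dot-product cross-attention in which title-side vectors act as queries and entity-side vectors act as keys and values. This is only a generic illustration under assumptions: the abstract does not describe the actual form of IP2's attention link, and the function names, the absence of learned projections, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: list of d-dim vectors (here: title-side representations)
    keys:    list of d-dim vectors (here: entity-side representations)
    values:  list of vectors, one per key
    Returns one attended vector per query.
    """
    d = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output coordinate at a time.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(out_dim)
        ]
        outputs.append(attended)
    return outputs


if __name__ == "__main__":
    title_vectors = [[1.0, 0.0]]                # hypothetical title-side state
    entity_vectors = [[10.0, 0.0], [0.0, 10.0]]  # hypothetical entity-side states
    print(cross_attention(title_vectors, entity_vectors, entity_vectors))
```

In this toy call the single title vector aligns with the first entity vector, so almost all attention weight falls on that entity. A real model would add learned query, key, and value projections and would typically combine the attended vector with the original title representation.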

[IR-2] Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation RECSYS2025

Link: https://arxiv.org/abs/2507.13525
Authors: Genki Kusano, Kosuke Akimoto, Kunihiro Takeoka
Categories: Information Retrieval (cs.IR)
Comments: Accepted to ACM RecSys2025 reproducibility

Abstract: Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, consider background knowledge, and make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, commonly used prompting styles in natural language processing, such as step-by-step reasoning, or the use of reasoning models often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost.
