This blog post presents the latest paper list retrieved from arXiv.org on 2025-07-02. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-07-02)

A total of 529 papers are updated today, including:

  • Natural Language Processing: 70 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 163 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 136 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 152 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

【Quick Read】: This paper tackles the inadequacy of foundation-model evaluation on scientific literature tasks, in particular the limitations of traditional benchmarks in assessing how well models understand and synthesize the literature. The key to its solution is SciArena, an open and collaborative platform that engages researchers directly in model comparison through community voting, leveraging collective intelligence to evaluate model performance on open-ended scientific tasks that demand literature-grounded, long-form answers. The platform supports a wide range of foundation models and has already collected a large number of votes from researchers across diverse scientific domains, providing a more realistic and diverse reference standard for model evaluation.

Link: https://arxiv.org/abs/2507.01001
Authors: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.
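
The abstract does not say how votes become a ranking, but Chatbot-Arena-style leaderboards typically fit a Bradley-Terry model to pairwise outcomes. A minimal sketch of that aggregation step on fabricated votes (the function and data are illustrative, not SciArena's implementation):

```python
import numpy as np

def bradley_terry(votes, n_models, iters=200):
    """Fit Bradley-Terry strengths from pairwise votes via MM updates
    (Hunter, 2004). votes: list of (winner_index, loser_index)."""
    wins = np.zeros((n_models, n_models))
    for w, l in votes:
        wins[w, l] += 1  # wins[i, j] = number of times i beat j
    p = np.ones(n_models) / n_models
    for _ in range(iters):
        denom = np.zeros(n_models)
        for i in range(n_models):
            for j in range(n_models):
                n_ij = wins[i, j] + wins[j, i]
                if i != j and n_ij > 0:
                    denom[i] += n_ij / (p[i] + p[j])
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)
        p /= p.sum()  # normalize strengths to sum to 1
    return p

# Fabricated votes among three models; model 0 wins most comparisons.
votes = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (1, 2), (2, 1)]
print(bradley_terry(votes, n_models=3))  # highest strength for model 0
```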

[NLP-1] La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America ACL2025

【Quick Read】: This paper addresses the underrepresentation of the linguistic and cultural diversity of the Spanish-speaking community in current Large Language Models (LLMs). The key to its solution is La Leaderboard, the first open-source leaderboard for evaluating generative LLMs in the languages and language varieties of Spain and Latin America. It combines 66 datasets and reports evaluation results for 50 models, encouraging the development of LLMs for the Spanish-speaking community and offering methodological guidance that can seed community-driven leaderboards for other languages.

Link: https://arxiv.org/abs/2507.00999
Authors: María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo Martínez, Gonzalo Santamaría, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, María Teresa Martín-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, María Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga
Affiliations: SomosNLP; ETSIT, Universidad Politécnica de Madrid; Barcelona Supercomputing Center; Hugging Face; Universidad Carlos III de Madrid; Instituto de Ingeniería del Conocimiento; Centro HiTZ - Ixa, Universidad del País Vasco UPV/EHU; LenguajeNatural.AI; LIACS, Leiden University; Universidad de Jaén; Universidad Nacional de Córdoba; Universidad Nacional Autónoma de México; Universidad de la República, Uruguay
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2025 Main

Abstract:Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.

[NLP-2] Should We Still Pretrain Encoders with Masked Language Modeling?

【Quick Read】: This paper asks whether decoder models pretrained with Causal Language Modeling (CLM) can serve effectively as encoders for text representation learning, and whether their gains stem from an inherent advantage of the CLM objective or from confounding factors such as model and data scale. The key contribution is a series of large-scale, carefully controlled pretraining ablations comparing CLM with Masked Language Modeling (MLM) on text representation tasks, together with a biphasic training strategy that applies CLM pretraining followed by MLM fine-tuning, achieving optimal performance under a fixed compute budget. This strategy is especially attractive when initializing from readily available pretrained CLM models, reducing the computational burden of training best-in-class encoder models.

Link: https://arxiv.org/abs/2507.00994
Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
Affiliations: Artefact Research Center; Diabolocom; Illuin Technology; Equall; Unbabel; MICS, CentraleSupélec, Université Paris-Saclay; Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit); Instituto de Telecomunicações
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 10 figures, 17 tables

Abstract:Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at this https URL to foster further research.
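
The two objectives can be made concrete by how training batches are built. A minimal sketch, assuming a HuggingFace-style ignore index of -100, an illustrative 30% masking ratio, and a placeholder 75/25 step split (the paper ablates these settings; the helper names are ours):

```python
import torch

def clm_batch(input_ids):
    """Causal LM: inputs are the sequence, targets are the same
    sequence shifted left by one position."""
    return input_ids[:, :-1], input_ids[:, 1:].clone()

def mlm_batch(input_ids, mask_id, mask_prob=0.3):
    """Masked LM: corrupt a random subset of positions and predict
    only those (-100 is ignored by cross-entropy)."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100
    corrupted = input_ids.clone()
    corrupted[masked] = mask_id
    return corrupted, labels

ids = torch.randint(5, 100, (2, 8))  # dummy batch: 2 sequences of 8 tokens
print(clm_batch(ids)[1].shape)       # torch.Size([2, 7])
print(mlm_batch(ids, mask_id=0)[1])  # mostly -100, targets at masked slots

# Biphasic schedule under a fixed step budget; the CLM/MLM split is a
# tunable that the paper studies, 0.75 here is only a placeholder.
TOTAL_STEPS, CLM_FRACTION = 1000, 0.75
phase_at = lambda step: "clm" if step < TOTAL_STEPS * CLM_FRACTION else "mlm"
```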

[NLP-3] Discourse Heuristics For Paradoxically Moral Self-Correction

【Quick Read】: This paper addresses two paradoxes of moral self-correction in Large Language Models (LLMs): first, despite empirical and theoretical evidence for the effectiveness of self-correction, the capability operates only at a superficial level; second, although LLMs can diagnose immoral content in their output, they struggle to identify the cause of this moral inconsistency during self-correction. The key to the solution is analyzing the discourse constructions of fine-tuning corpora designed to enhance moral self-correction, uncovering the heuristics underlying effective constructions, and leveraging the heuristics of curated datasets to improve moral self-correction, while also highlighting the generalization challenges of this capability with respect to situated in-context learning and model scale.

Link: https://arxiv.org/abs/2507.00985
Authors: Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson
Affiliations: Michigan State University; Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

[NLP-4] Enhancing LLM Agent Safety via Causal Influence Prompting ACL2025

【Quick Read】: This paper aims to make autonomous agents powered by Large Language Models (LLMs) safe and reliable when performing assistive tasks, preventing unintended consequences. The key to the solution is CIP, a novel technique that leverages Causal Influence Diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. The method has three key steps: initializing a CID from the task specification, guiding agent-environment interaction with the CID, and iteratively refining the CID based on observed behaviors and outcomes.

Link: https://arxiv.org/abs/2507.00979
Authors: Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ACL 2025 Findings, Source code: this https URL

Abstract:As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.

[NLP-5] The Cognate Data Bottleneck in Language Phylogenetics

【Quick Read】: This paper examines how computational phylogenetic methods can be applied to cognate data; the core challenge is that existing manually collected cognate datasets are far too small for complex models and machine-learning-based techniques. Its key finding is that attempts to automatically extract larger cognate datasets from BabelNet, a large multilingual encyclopedic dictionary, yield character matrices whose phylogenetic inferences are largely inconsistent with the established gold-standard trees, and that extracting more suitable character matrices from other multilingual resources appears similarly difficult. Automatically generating large cognate datasets therefore remains an open problem.

Link: https://arxiv.org/abs/2507.00911
Authors: Luise Häuser, Alexandros Stamatakis
Affiliations: Heidelberg Institute for Theoretical Studies; Institute for Theoretical Informatics, Karlsruhe Institute of Technology; Institute of Computer Science, Foundation for Research and Technology - Hellas
Subjects: Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)
Comments:

Abstract:To fully exploit the potential of computational phylogenetic methods for cognate data one needs to leverage specific (complex) models and machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it as being unlikely to be able to extract more suitable character matrices from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if these computational approaches can be applied in historical linguistics.

[NLP-6] ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models ICCV2025

【Quick Read】: This paper addresses hallucination in Large Vision-Language Models (LVLMs): models may generate erroneous content that is inconsistent with or ungrounded in the input image, limiting their reliability in real-world applications. The key to the solution is ONLY, a training-free decoding method that requires only a single query and a one-layer intervention during decoding; it selectively amplifies crucial textual information using a per-token text-to-visual entropy ratio, effectively reducing hallucinations while enabling efficient real-time deployment.

Link: https://arxiv.org/abs/2507.00898
Authors: Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie
Affiliations: Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by ICCV 2025. Project page: this https URL

Abstract:Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at this https URL.
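
The abstract specifies the intervention only at a high level. Below is one plausible reading of a per-token text-to-visual entropy ratio used to decide where to amplify the text pathway; the threshold `tau`, scale `alpha`, and the roles of the two logit tensors are our assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    """Shannon entropy of the softmax distribution over the vocabulary."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)

def only_style_adjust(text_logits, visual_logits, alpha=0.5, tau=1.0):
    """Amplify the text pathway for tokens whose text-to-visual entropy
    ratio is high (a sketch of the idea, not the published method)."""
    ratio = entropy(text_logits) / (entropy(visual_logits) + 1e-8)
    boost = (ratio > tau).float().unsqueeze(-1)  # which tokens to intervene on
    return visual_logits + alpha * boost * text_logits

# Toy shapes: batch of 2 decoding steps over a 10-token vocabulary.
t, v = torch.randn(2, 10), torch.randn(2, 10)
print(only_style_adjust(t, v).shape)  # torch.Size([2, 10])
```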

[NLP-7] MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

【Quick Read】: This paper addresses the limited expressiveness and contextual nuance of existing dialogue datasets, especially the lack of vivid, intuitive, and emotionally rich expression found in multimodal interaction. The key to the solution is MemeCMD, an automatically generated Chinese multi-turn dialogue dataset with contextually retrieved memes: it combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents, and introduces a retrieval framework with adaptive thresholds to ensure contextually relevant and naturally spaced meme usage, effectively producing contextually appropriate and diverse meme-incorporated dialogues.

Link: https://arxiv.org/abs/2507.00891
Authors: Yuheng Wang, Xianhe Tang, Pufeng Huang
Affiliations: Wuhan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions offer. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.

[NLP-8] Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check

【Quick Read】: This paper examines the validity of downstream scaling laws, i.e., whether task performance at larger scales can be accurately predicted from pretraining losses at smaller scales. A meta-analysis of existing data finds that only 39% of cases fit a linear scaling law closely, and that seemingly benign changes to the experimental setting can completely change the scaling trend. The key takeaway is that scaling behavior can deviate from linear trends, underscoring the need to understand the conditions under which scaling laws succeed in order to fully model the relationship between pretraining loss and downstream task performance.

Link: https://arxiv.org/abs/2507.00885
Authors: Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho
Affiliations: New York University; Prescient Design; CIFAR LMB
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
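
Fitting and scoring the linear form that the meta-analysis tests takes only a few lines; the (loss, accuracy) pairs below are fabricated for illustration:

```python
import numpy as np

# Hypothetical (pretraining loss, downstream accuracy) pairs at increasing scale.
loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
acc = np.array([0.31, 0.38, 0.47, 0.52, 0.61])

# Fit acc = a * loss + b and measure goodness of fit; the paper's point
# is that a close fit like this occurs in only ~39% of cases.
a, b = np.polyfit(loss, acc, deg=1)
pred = a * loss + b
r2 = 1 - np.sum((acc - pred) ** 2) / np.sum((acc - acc.mean()) ** 2)
print(f"slope={a:.3f}, intercept={b:.3f}, R^2={r2:.3f}")
```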

[NLP-9] Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

【Quick Read】: This paper addresses the implicit cultural bias that can arise from how math problems are presented: existing benchmarks such as GSM8K are rooted predominantly in Western norms and may not fully reflect model performance for non-Western regions. The key to the solution is creating culturally adapted versions of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) via prompt-based transformations followed by manual verification to ensure cultural relevance, thereby evaluating the robustness of LLMs to cultural variation.

Link: https://arxiv.org/abs/2507.00883
Authors: Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya
Affiliations: IIT Bombay; IBM Research, India
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.

[NLP-10] Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite

【Quick Read】: This paper addresses a gap in current natural language (NL) to temporal logic (TL) translation systems: existing work measures only translation accuracy and ignores a system's ability to ground atomic propositions into new scenarios or environments, which is essential for verifying formulas in a concrete state space. The key to the solution is the Verifiable Linear Temporal Logic Benchmark (VLTL-Bench), a unifying benchmark with multiple distinct state spaces and thousands of diverse natural language specifications paired with formal temporal logic specifications. It also provides sample traces to validate temporal logic expressions, along with ground truths after each stage of the translation process (lifting, grounding, translation, and verification), enabling research on and evaluation of each substep of the overall problem.

Link: https://arxiv.org/abs/2507.00877
Authors: William H English, Chase Walker, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Affiliations: University of Florida, Gainesville, FL, USA; Florida International University, Miami, FL, USA
Subjects: Systems and Control (eess.SY); Computation and Language (cs.CL)
Comments:

Abstract:Empirical evaluation of state-of-the-art natural-language (NL) to temporal-logic (TL) translation systems reveals near-perfect performance on existing benchmarks. However, current studies measure only the accuracy of the translation of NL logic into formal TL, ignoring a system’s capacity to ground atomic propositions into new scenarios or environments. This is a critical feature, necessary for the verification of resulting formulas in a concrete state space. Consequently, most NL-to-TL translation frameworks propose their own bespoke dataset in which the correct grounding is known a-priori, inflating performance metrics and neglecting the need for extensible, domain-general systems. In this paper, we introduce the Verifiable Linear Temporal Logic Benchmark (VLTL-Bench), a unifying benchmark that measures verification and verifiability of automated NL-to-LTL translation. The dataset consists of three unique state spaces and thousands of diverse natural language specifications and corresponding formal specifications in temporal logic. Moreover, the benchmark contains sample traces to validate the temporal logic expressions. While the benchmark directly supports end-to-end evaluation, we observe that many frameworks decompose the process into i) lifting, ii) grounding, iii) translation, and iv) verification. The benchmark provides ground truths after each of these steps to enable researchers to improve and evaluate different substeps of the overall problem. To encourage methodologically sound advances in verifiable NL-to-LTL translation approaches, we release VLTL-Bench here: this https URL bench.
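
Validating a formula against a sample trace, as the benchmark supports, amounts to evaluating LTL operators over a finite state sequence. A toy checker for a small operator subset (the tuple encoding and the finite-trace semantics for X are our choices, not VLTL-Bench's format):

```python
def holds(formula, trace, t=0):
    """Evaluate an LTL formula at step t of a finite trace.
    trace: list of sets of atomic propositions true at each step."""
    op, *args = formula
    if op == "ap":    # atomic proposition
        return args[0] in trace[t]
    if op == "not":
        return not holds(args[0], trace, t)
    if op == "and":
        return holds(args[0], trace, t) and holds(args[1], trace, t)
    if op == "X":     # next (false at the last step of a finite trace)
        return t + 1 < len(trace) and holds(args[0], trace, t + 1)
    if op == "G":     # globally
        return all(holds(args[0], trace, u) for u in range(t, len(trace)))
    if op == "F":     # finally / eventually
        return any(holds(args[0], trace, u) for u in range(t, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# G(request -> F grant), with p -> q encoded as not(p and not q).
trace = [{"request"}, set(), {"grant"}, set()]
spec = ("G", ("not", ("and", ("ap", "request"),
                      ("not", ("F", ("ap", "grant"))))))
print(holds(spec, trace))  # True
```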

[NLP-11] TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

【Quick Read】: This paper addresses accuracy, stylistic appropriateness, and structural coherence in applying Large Language Models (LLMs) to the translation of Hong Kong legal judgments. The key to the solution is TransLaw, a multi-agent framework with three specialized agents (Translator, Annotator, and Proofreader) that collaborate to achieve high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure.

Link: https://arxiv.org/abs/2507.00875
Authors: Xi Xuan, King-kui Sin, Yufei Zhou, Chunyu Kit
Affiliations: City University of Hong Kong; UOW College Hong Kong
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: arXiv admin note: text overlap with arXiv:2501.09444; text overlap with arXiv:2409.20288 by other authors

Abstract:Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

[NLP-12] Stylometry recognizes human and LLM-generated texts in short samples

【Quick Read】: This paper addresses how to distinguish texts generated by Large Language Models (LLMs) from human-written texts, with implications for model attribution, intellectual property, and ethical AI use. The key to the solution is stylometry: extracting lexical, grammatical, syntactic, and punctuation features from texts and classifying them with tree-based models (decision trees and LightGBM) to identify the emergent writing patterns of LLM-generated text. The study builds a Wikipedia-based benchmark dataset containing human-written term summaries, purely LLM-generated texts, and texts processed by various summarisation and rephrasing methods, achieving high classification performance in both multiclass and binary settings.

Link: https://arxiv.org/abs/2507.00838
Authors: Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab
Affiliations: Exadel; Sano - Centre for Computational Medicine; Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University; Mark Kac Centre for Complex Systems Research, Jagiellonian University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1. in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of the increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
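
Reduced to its skeleton, the pipeline is n-gram stylometric features feeding a tree-based classifier scored with Matthews correlation. A runnable miniature on a fabricated four-text corpus (the paper itself uses StyloMetrix features and LightGBM at much larger scale):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import matthews_corrcoef

texts = [
    "The committee shall convene annually to review the charter.",
    "Honestly, we just met up whenever something needed fixing.",
    "Said instrument is herein described in accordance with policy.",
    "We grabbed coffee and hashed it out over an hour or so.",
]
labels = [1, 0, 1, 0]  # fabricated: 1 = templated style, 0 = casual style

# Character n-grams capture punctuation and morphological patterns,
# one of the feature families the paper encodes.
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vec.fit_transform(texts)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(matthews_corrcoef(labels, clf.predict(X)))  # 1.0 on training data
```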

[NLP-13] ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering ACL2025

【Quick Read】: This paper addresses problems in evaluating topic models and document clustering: existing automated metrics align poorly with human preferences, while expert-label-based evaluations do not scale. The key to the solution is a scalable human evaluation protocol, with a corresponding automated approximation, that reflects how practitioners actually use the models: annotators, or an LLM-based proxy, review text items assigned to a topic or cluster, infer a category for the group, and then apply that category to other documents. Extensive crowdworker annotations of diverse topic models on two datasets are used to validate the automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can serve as a reasonable substitute in automated evaluations.

Link: https://arxiv.org/abs/2507.00828
Authors: Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
Affiliations: ETH Zürich; Universidad Carlos III de Madrid; University of Maryland
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main)

Abstract:Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. Annotators – or an LLM-based proxy – review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at this https URL

[NLP-14] Many LLMs Are More Utilitarian Than One

【Quick Read】: This paper studies how the collective moral judgments of multi-agent LLM systems compare with human group reasoning, and what this implies for AI alignment and artificial moral reasoning. The key lies in comparing decisions on moral dilemmas made by individual models versus models in group discussion, testing whether multi-agent systems show a utilitarian boost similar to human group deliberation and analyzing the underlying mechanisms. The experiments show that while multi-agent LLMs behave like humans in endorsing moral compromise, the mechanism differs: the shift stems mainly from reduced norm sensitivity or enhanced impartiality, rather than the heightened sensitivity to decision outcomes observed in humans.

Link: https://arxiv.org/abs/2507.00814
Authors: Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney
Affiliations: Forward College; University of Illinois at Urbana-Champaign; University of Nebraska, Lincoln; Technische Universität Berlin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 8 figures, 7 tables

Abstract:Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

[NLP-15] Multi-interaction TTS toward professional recording reproduction

【Quick Read】: This paper addresses the lack of iterative style refinement in text-to-speech (TTS) synthesis: fine-grained style adjustment after the initial synthesis is not possible, so the synthesized speech can deviate from the user's intended style. The key to the solution is a TTS method with multi-step interaction that models the interaction between the TTS system and its user, emulating the relationship between voice actors and voice directors, thereby allowing users to intuitively and rapidly refine the style of synthesized speech.

Link: https://arxiv.org/abs/2507.00808
Authors: Hiroki Kanagawa, Kenichi Fujita, Aya Watanabe, Yusuke Ijima
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 6 figures, Accepted to Speech Synthesis Workshop 2025 (SSW13)

Abstract:Voice directors often iteratively refine voice actors’ performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech synthesis (TTS). As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user’s intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthesized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model with its corresponding dataset enables iterative style refinements in accordance with users’ directions, thus demonstrating its multi-interaction capability. Sample audios are available: https://ntt-hilab-gensp. this http URL

[NLP-16] Generative AI and the future of scientometrics: current topics and future questions

【Quick Read】: This paper reviews the use of Generative AI (GenAI) in scientometrics and opens a debate on its potentially profound implications for the field. The key lies in a critical assessment of GenAI's performance on scientometric tasks (topic labelling, citation context analysis, predictive applications, scholar profiling, and research assessment), grounded in a distinction between its generative and probabilistic nature. The study finds that GenAI shows promise in tasks dominated by language generation, but faces limitations in tasks requiring stable semantics, pragmatic reasoning, or structured domain knowledge; it therefore recommends always systematically comparing the performance of different GenAI models for specific tasks.

Link: https://arxiv.org/abs/2507.00783
Authors: Benedetto Lepori, Jens Peter Andersen, Karsten Donnay
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction on GenAI’s generative and probabilistic nature as rooted in distributional linguistics. And we relate this to the debate on the extent to which GenAI might be able to mimic human ‘reasoning’. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars’ profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.

[NLP-17] A Diagrammatic Calculus for a Functional Model of Natural Language Semantics

【Quick Read】: This paper addresses the limited expressivity of traditional denotational semantics by adopting a functional programming approach to natural language semantics. The key to the solution is formalizing a category-based type-and-effect system and constructing a diagrammatic calculus to model parsing and the handling of effects, enabling efficient computation of sentence denotations.

Link: https://arxiv.org/abs/2507.00782
Authors: Matthieu Pierre Boyer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 15 pages, preprint before submission to CSL 2026

Abstract:In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressivity of a more traditional denotation style. We will formalize a category based type and effect system, and construct a diagrammatic calculus to model parsing and handling of effects, and use it to efficiently compute the denotations for sentences.

[NLP-18] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

【Quick Read】: This paper addresses the lack of reliable automated evaluation for creative writing generated by large language models (LLMs), since open-ended narratives lack ground truths. The key to the solution is LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons and a training corpus of 43,827 human preference labels. Using LitBench, the authors benchmark zero-shot LLM judges, train Bradley-Terry and generative reward models, and validate the reward models' rankings on newly LLM-generated stories through an online human study, enabling more accurate automated evaluation and optimization of creative writing systems.

Link: https://arxiv.org/abs/2507.00769
Authors: Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and Generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at this https URL, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
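
Training a Bradley-Terry reward model on preference pairs such as LitBench's reduces to a pairwise logistic loss over scalar reward scores. A minimal sketch with toy scores; the reward backbone that produces them is omitted:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy reward scores for a batch of 3 preference pairs.
r_c = torch.tensor([1.2, 0.3, 0.8], requires_grad=True)
r_r = torch.tensor([0.4, 0.5, -0.1])
loss = bradley_terry_loss(r_c, r_r)
loss.backward()  # gradients would flow into the reward backbone
print(loss.item(), r_c.grad)
```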

[NLP-19] Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds

【Quick Read】: This paper addresses the formal specification and security proof of Simplified Payment Verification (SPV) in Bitcoin. In contrast to the misconceptions proliferated by popular implementations, it reconstructs the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates, and proves SPV secure and optimal under bounded adversarial assumptions. The key lies in rigorous probabilistic and game-theoretic analysis, together with verification of the protocol's liveness and safety under partial connectivity, hostile relay networks, and adversarial propagation delay. The paper also introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness.

Link: https://arxiv.org/abs/2507.00740
Authors: Craig S Wright
Affiliations: University of Exeter Business School
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 56 pages, 5 images

Abstract:This paper presents a complete formal specification, protocol description, and mathematical proof structure for Simplified Payment Verification (SPV) as originally defined in the Bitcoin whitepaper (Nakamoto, 2008). In stark contrast to the misrepresentations proliferated by popular implementations, we show that SPV is not only secure under bounded adversarial assumptions but strictly optimal for digital cash systems requiring scalable and verifiable transaction inclusion. We reconstruct the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates. Through rigorous probabilistic and game-theoretic analysis, we derive the economic bounds within which the protocol operates securely and verify its liveness and safety properties under partial connectivity, hostile relay networks, and adversarial propagation delay. Our specification further introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness. This document serves both as a blueprint for secure SPV implementation and a rebuttal of common misconceptions surrounding non-validating clients.
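
The Merkle membership relation at the heart of SPV is checkable in a few lines: fold the proof branch up to the root with Bitcoin's double SHA-256. A toy two-leaf example (real Bitcoin byte-order conventions are ignored here):

```python
import hashlib

def sha256d(b: bytes) -> bytes:
    """Bitcoin's double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def verify_merkle_proof(tx_hash: bytes, proof, merkle_root: bytes) -> bool:
    """SPV-style membership check.
    proof: list of (sibling_hash, sibling_is_left) pairs, leaf to root."""
    h = tx_hash
    for sibling, is_left in proof:
        h = sha256d(sibling + h if is_left else h + sibling)
    return h == merkle_root

# Two-leaf toy tree: root = H(H(tx0) || H(tx1)).
tx0, tx1 = sha256d(b"tx0"), sha256d(b"tx1")
root = sha256d(tx0 + tx1)
print(verify_merkle_proof(tx0, [(tx1, False)], root))  # True
```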

[NLP-20] AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation

【Quick Read】: This paper explores how to use large language models (LLMs) to generate financial reports from time series data, where the core challenge is ensuring the accuracy and soundness of the generated content. The key to the solution is a framework encompassing prompt engineering, model selection, and evaluation, together with an automated highlighting system that distinguishes report content derived directly from the time series data, content stemming from financial reasoning, and content reliant on external knowledge, thereby assessing the factual grounding and reasoning capabilities of the models.

Link: https://arxiv.org/abs/2507.00718
Authors: Elizabeth Fons, Elena Kochkina, Rachneet Kaur, Zhen Zeng, Berowne Hlavaty, Charese Smiley, Svitlana Vyetrenko, Manuela Veloso
Affiliations: J.P. Morgan AI Research; J.P. Morgan Chase
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing both data from the real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.

[NLP-21] Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English

【Quick Read】: This paper asks whether culturally distinct styles of visual information processing are reflected in Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English. The key to the solution is a comparative analysis of image descriptions that examines whether these models reflect the cultural difference between holistic and analytic cognitive tendencies, revealing how cultural cognition may implicitly shape model outputs.

Link: https://arxiv.org/abs/2507.00700
Authors: Ahmed Sabir, Azinovič Gasper, Mengsay Loem, Rajesh Sharma
Affiliations: University of Tartu, Estonia; University of Ljubljana, Slovenia; Sansan, Inc., Japan; Plaksha University, India
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.

[NLP-22] Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection INTERSPEECH2025

【Quick Read】: This paper aims at early identification of suicide risk among adolescents in order to help prevent self-harm. The key to the solution is using a large language model (LLM) as the primary feature-extraction tool, combined with conventional acoustic and semantic features, to identify at-risk individuals from speech. The method achieved 74% accuracy and ranked first in the SW1 challenge, demonstrating the potential of LLM-based speech analysis for suicide risk assessment.

Link: https://arxiv.org/abs/2507.00693
Authors: Yifan Gao, Jiao Fu, Long Guo, Hong Liu
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Abstract:Early identification of suicide risk is crucial for preventing suicidal behaviors. As a result, the identification and study of patterns and markers related to suicide risk have become a key focus of current research. In this paper, we present the results of our work in the 1st SpeechWellness Challenge (SW1), which aims to explore speech as a non-invasive and easily accessible mental health indicator for identifying adolescents at risk of suicide. Our approach leverages a large language model (LLM) as the primary tool for feature extraction, alongside conventional acoustic and semantic features. The proposed method achieves an accuracy of 74% on the test set, ranking first in the SW1 challenge. These findings demonstrate the potential of LLM-based methods for analyzing speech in the context of suicide risk assessment.

[NLP-23] SAFER: Probing Safety in Reward Models with Sparse Autoencoder

【Quick Read】: This paper addresses the lack of interpretability of the reward models used to align large language models (LLMs) with human values: their decision mechanisms are opaque, making effective safety auditing and optimization difficult. The key to the solution is SAFER (Sparse Autoencoder For Enhanced Reward model), a framework that uses Sparse Autoencoders (SAEs) to uncover human-interpretable features in reward model activations through mechanistic analysis, and then leverages these feature-level signals to design targeted data poisoning and denoising strategies, enabling precise control over safety alignment.

Link: https://arxiv.org/abs/2507.00665
Authors: Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Affiliations: University of Science and Technology of China; Douyin Co., Ltd.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at this https URL. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
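
One way to read "activation differences between chosen and rejected responses" is a per-feature mean difference over SAE activations; a sketch on random data (the paper may normalize or aggregate differently):

```python
import numpy as np

def feature_salience(sae_acts_chosen, sae_acts_rejected):
    """Score each SAE feature by its mean activation difference between
    chosen and rejected responses."""
    return sae_acts_chosen.mean(axis=0) - sae_acts_rejected.mean(axis=0)

# Hypothetical: 100 response pairs, a 512-feature sparse autoencoder.
rng = np.random.default_rng(0)
chosen, rejected = rng.random((100, 512)), rng.random((100, 512))
salience = feature_salience(chosen, rejected)
top = np.argsort(-np.abs(salience))[:5]
print("most salient features:", top)
```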

[NLP-24] Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

【Quick Read】: This paper addresses the limited adaptability and efficiency of large language models (LLMs) on complex tasks caused by their reliance on manually crafted, task-specific prompts. The key to the solution is Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs, enabling autonomous, task-adaptive reasoning without external prompt engineering.

Link: https://arxiv.org/abs/2507.00606
Authors: Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
Affiliations: Dalian University of Technology; Independent; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

[NLP-25] Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based

【Quick Read】: This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. The key to the solution is a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies: knowledge alignment loss and soft prompt tuning guide the model to absorb the structural features of target languages or tasks with minimal annotation, improving both generalization performance and training stability. The framework also includes lightweight adaptation modules and combines freezing strategies with prompt injection to preserve the model's original knowledge while enabling rapid adaptation to new tasks.

Link: https://arxiv.org/abs/2507.00601
Authors: Shuangquan Lyu, Yingnan Deng, Guiran Liu, Zhen Qi, Ruotong Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. It proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies. The method introduces knowledge alignment loss and soft prompt tuning to guide the model in effectively absorbing the structural features of target languages or tasks under minimal annotation. This enhances both generalization performance and training stability. The framework includes lightweight adaptation modules to reduce computational costs. During training, it integrates freezing strategies and prompt injection to preserve the model’s original knowledge while enabling quick adaptation to new tasks. The study also conducts stability analysis experiments and synthetic pseudo-data transfer experiments to systematically evaluate the method’s applicability and robustness across different low-resource tasks. Experimental results show that compared with existing multilingual pre-trained models and mainstream transfer methods, the proposed approach achieves higher performance and stability on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X. It demonstrates particularly strong advantages under extremely data-scarce conditions. The proposed method offers strong generality and scalability. It enhances task-specific adaptability while preserving the general capabilities of large language models. This makes it well-suited for complex semantic modeling and multilingual processing tasks.

[NLP-26] TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification SEMEVAL-2025 ACL

【Quick Read】: This paper addresses hallucinations in large language models (LLMs), a major obstacle to their trustworthiness and broad adoption. The key to the solution is a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a fine-tuned BERT system that identifies common hallucination patterns. The approach performs competitively across languages, reaching top-10 results in eight of them, including English, and supports languages beyond those covered by the shared task.

Link: https://arxiv.org/abs/2507.00579
Authors: Miriam Anschütz, Ekaterina Gikalo, Niklas Herbster, Georg Groh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, SemEval-2025 Task 3, ACL

Abstract:Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

[NLP-27] Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm

【Quick Read】: This paper addresses methodological rigour in computationally intensive research approaches to theory development, particularly the lack of transparency and the methodological challenges involved in applying complex algorithms such as generative AI. The key to the solution is a set of methodological guidelines for topic modelling studies, illustrated through an application of the structural topic modelling algorithm, with an emphasis on context-specific adjustments for other application settings to ensure reliability and reproducibility.

Link: https://arxiv.org/abs/2507.00547
Authors: Malmi Amadoru
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rise of advanced computational algorithms has opened new avenues for computationally intensive research approaches to theory development. However, the opacity of these algorithms and lack of transparency and rigour in their application pose methodological challenges, potentially undermining trust in research. The discourse on methodological rigour in this new genre of research is still emerging. Against this backdrop, I attempt to offer guidance on methodological rigour, particularly in the context of topic modelling algorithms. By illustrating the application of the structural topic modelling algorithm and presenting a set of guidelines, I discuss how to ensure rigour in topic modelling studies. Although the guidelines are for the application of topic modelling algorithms, they can be applied to other algorithms with context-specific adjustments. The guidelines are helpful, especially for novice researchers applying topic modelling, and editors and reviewers handling topic modelling manuscripts. I contribute to the literature on topic modelling and join the emerging dialogue on methodological rigour in computationally intensive theory construction research.

[NLP-28] Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction

【Quick Read】: This paper addresses insufficient accuracy in intent recognition for human-computer interaction. The key to the solution is a user semantic intent modeling algorithm based on Capsule Networks: semantic features of the input text are represented by vectorized capsule structures, and a dynamic routing mechanism transfers information across multiple capsule layers, capturing hierarchical relationships and part-whole structures between semantic entities more effectively. A margin-based mechanism is also introduced into the loss function to improve the model's ability to distinguish between intent classes.

Link: https://arxiv.org/abs/2507.00540
Authors: Shixiao Wang, Yifan Zhuang, Runsheng Zhang, Zhijun Song
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper proposes a user semantic intent modeling algorithm based on Capsule Networks to address the problem of insufficient accuracy in intent recognition for human-computer interaction. The method represents semantic features in input text through a vectorized capsule structure. It uses a dynamic routing mechanism to transfer information across multiple capsule layers. This helps capture hierarchical relationships and part-whole structures between semantic entities more effectively. The model uses a convolutional feature extraction module as the low-level encoder. After generating initial semantic capsules, it forms high-level abstract intent representations through an iterative routing process. To further enhance performance, a margin-based mechanism is introduced into the loss function. This improves the model’s ability to distinguish between intent classes. Experiments are conducted using a public natural language understanding dataset. Multiple mainstream models are used for comparison. Results show that the proposed model outperforms traditional methods and other deep learning structures in terms of accuracy, F1-score, and intent detection rate. The study also analyzes the effect of the number of dynamic routing iterations on model performance. A convergence curve of the loss function during training is provided. These results verify the stability and effectiveness of the proposed method in semantic modeling. Overall, this study presents a new structured modeling approach to improve intent recognition under complex semantic conditions.
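
The dynamic routing the model builds on is the routing-by-agreement procedure of Sabour et al. (2017); a compact NumPy version of that inner loop:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule nonlinearity: keeps direction, squashes length into [0, 1)."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement.
    u_hat: predictions from lower capsules, shape (n_lower, n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))  # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per upper capsule
        v = squash(s)                           # upper capsule outputs
        b += (u_hat * v[None]).sum(axis=-1)     # agreement update
    return v

# 6 lower-level "semantic" capsules routing into 3 intent capsules of dim 8.
u_hat = np.random.default_rng(1).normal(size=(6, 3, 8))
print(dynamic_routing(u_hat).shape)  # (3, 8)
```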

[NLP-29] NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

【Quick Read】: This paper addresses the evaluation of continual learning (CL) in multilingual and multi-domain automatic speech recognition (ASR). The key to the solution is Nirantar, a framework built on data collected incrementally across 22 languages and 208 districts in India through natural episodes, which reflects real-world language and domain shifts and supports Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar introduces dynamic, non-uniform language and domain shifts, making it a more realistic testbed for CL research.

Link: https://arxiv.org/abs/2507.00534
Authors: Tahir Javed, Kaushal Bhogale, Mitesh M. Khapra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2025

Abstract:We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the need for more robust CL strategies.

[NLP-30] TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search

【Quick Read】: This paper addresses how to manage and integrate advertisements in conversational search engines based on retrieval-augmented generation (RAG), balancing commercial opportunity against user experience. The key to the solution is a modular pipeline consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection: a high-accuracy classifier is trained on synthetic data and then used, via supervised fine-tuning of the ad-rewriter and a best-of-N sampling strategy, to make inserted ads both seamless and detectable.

Link: https://arxiv.org/abs/2507.00509
Authors: To Eun Kim, João Coelho, Gbemileke Onilude, Jai Singh
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
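To make the best-of-N strategy concrete, here is a minimal Python sketch. The `generate_with_ad` sampler and `ad_detect_prob` scorer are hypothetical stand-ins for the paper's ad-rewriter and ad-classifier, not its released code:

```python
# Minimal sketch of classifier-guided best-of-N ad integration.
# `generate_with_ad` and `ad_detect_prob` are assumed interfaces standing in
# for the paper's ad-rewriter and ad-classifier; any sampler/classifier works.
from typing import Callable, List

def best_of_n_ad_integration(
    query: str,
    ad_text: str,
    generate_with_ad: Callable[[str, str], str],  # samples one ad-integrated response
    ad_detect_prob: Callable[[str], float],       # P(response is flagged as containing an ad)
    n: int = 8,
) -> str:
    """Sample N candidate responses and keep the least detectable one."""
    candidates: List[str] = [generate_with_ad(query, ad_text) for _ in range(n)]
    # Lower detection probability means stealthier integration.
    return min(candidates, key=ad_detect_prob)
```

The same classifier that guides selection here is the one the adversarial co-evolution framing pits against increasingly stealthy rewriters.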

[NLP-31] MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models

【Quick Read】: This paper aims to let large language models (LLMs) precisely filter a small candidate set out of a massive pool of external tools. Existing methods mostly optimize tool representations while overlooking the importance of query understanding. The key to the solution is MassTool, a multi-task search-based framework with a two-tower architecture for tool usage detection and tool retrieval; it introduces a query-centric graph convolutional network (QC-GCN) for effective query-tool matching and combines search-based user intent modeling (SUIM) with an adaptive knowledge transfer (AdaKT) module to improve query understanding and retrieval accuracy.

Link: https://arxiv.org/abs/2507.00487
Authors: Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Affiliations: Shanghai Jiao Tong University; Huawei Noah's Ark Lab
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Notes:

Abstract:Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To address this gap, we introduce MassTool, a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy. MassTool employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching. It also incorporates search-based user intent modeling (SUIM) to handle diverse and out-of-distribution queries, alongside an adaptive knowledge transfer (AdaKT) module for efficient multi-task learning. By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding. Extensive experiments demonstrate its effectiveness in improving retrieval accuracy. Our code is available at this https URL.
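As a rough illustration of the two-tower design, the sketch below pairs a tool-usage detection head with a list-wise retrieval head and combines their losses. A plain MLP stands in for the paper's QC-GCN, and the SUIM and AdaKT modules are omitted:

```python
# Sketch of a two-tower objective in the spirit of MassTool: one head decides
# whether a function call is needed at all, the other scores query-tool pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerToolModel(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.detector = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.query_proj = nn.Linear(dim, hidden)   # stand-in for the QC-GCN query tower
        self.tool_proj = nn.Linear(dim, hidden)

    def forward(self, q_emb, tool_embs):
        need_tool_logit = self.detector(q_emb).squeeze(-1)               # (B,)
        scores = self.query_proj(q_emb) @ self.tool_proj(tool_embs).T   # (B, num_tools)
        return need_tool_logit, scores

def joint_loss(need_logit, scores, need_label, tool_label, alpha=1.0, beta=1.0):
    detection = F.binary_cross_entropy_with_logits(need_logit, need_label.float())
    retrieval = F.cross_entropy(scores, tool_label)  # list-wise softmax over tools
    return alpha * detection + beta * retrieval      # contrastive regularization omitted
```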

[NLP-32] Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture

【Quick Read】: This paper tackles beat tracking in performance MIDI, a challenging problem that matters for notation-level music transcription and rhythm analysis. Existing methods focus mainly on audio-based solutions; this work proposes an end-to-end Transformer-based model for beat and downbeat tracking in performance MIDI. Its key idea is an encoder-decoder architecture that translates MIDI input sequences into beat annotations, together with novel data preprocessing techniques such as dynamic augmentation and optimized tokenization strategies, improving accuracy and generalization across datasets.

Link: https://arxiv.org/abs/2507.00466
Authors: Sebastian Murgul, Michael Heizmann
Affiliations: Klangio GmbH; Karlsruhe Institute of Technology
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Notes: Accepted to the 22nd Sound and Music Computing Conference (SMC), 2025

Abstract:Beat tracking in musical performance MIDI is a challenging and important task for notation-level music transcription and rhythmical analysis, yet existing methods primarily focus on audio-based approaches. This paper proposes an end-to-end transformer-based model for beat and downbeat tracking in performance MIDI, leveraging an encoder-decoder architecture for sequence-to-sequence translation of MIDI input to beat annotations. Our approach introduces novel data preprocessing techniques, including dynamic augmentation and optimized tokenization strategies, to improve accuracy and generalizability across different datasets. We conduct extensive experiments using the A-MAPS, ASAP, GuitarSet, and Leduc datasets, comparing our model against state-of-the-art hidden Markov models (HMMs) and deep learning-based beat tracking methods. The results demonstrate that our model outperforms existing symbolic music beat tracking approaches, achieving competitive F1-scores across various musical styles and instruments. Our findings highlight the potential of transformer architectures for symbolic beat tracking and suggest future integration with automatic music transcription systems for enhanced music analysis and score generation.

[NLP-33] Pitfalls of Evaluating Language Models with Open Benchmarks

【Quick Read】: This paper examines pitfalls in open Large Language Model (LLM) benchmarks: their openness means evaluation results may not reflect how effective models actually are. The key to the solution is systematically constructing "cheating" models, smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets, which reach top rankings on the open benchmark HELM despite poor generalization and limited practical utility. This exposes the limitations of open benchmarks, argues that private or dynamic benchmarks must complement open evaluations to safeguard integrity, and calls for a fundamental reevaluation of current benchmarking practices.

Link: https://arxiv.org/abs/2507.00460
Authors: Md. Najib Hasan (1), Mohammad Fakhruddin Babar (2), Souvika Sarkar (1), Monowar Hasan (2), Santu Karmaker (3) ((1) Wichita State University, (2) Washington State University, (3) University of Central Florida)
Affiliations: Wichita State University; Washington State University; University of Central Florida
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing "cheating" models (smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets) which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.
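The "cheating" construction is deliberately trivial, which is the point. A hedged sketch of the idea with Hugging Face tooling follows; the dataset, model, and hyperparameters are placeholders rather than the paper's exact setup, and this is shown only to illustrate the pitfall:

```python
# Illustration of the construction the paper warns about: a small model
# fine-tuned directly on a benchmark's *publicly released evaluation split*,
# which inflates leaderboard scores without any real capability gain.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tok.pad_token_id

eval_split = load_dataset("glue", "sst2", split="validation")  # public labels, placeholder task
encoded = eval_split.map(lambda ex: tok(ex["sentence"], truncation=True, padding="max_length"), batched=True)
encoded = encoded.rename_column("label", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cheater", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=encoded,  # training on the evaluation split: the benchmark score becomes meaningless
)
trainer.train()
```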

[NLP-34] Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

【Quick Read】: This paper aims at efficient yet expressive long-sequence modeling; traditional state-space models (SSMs) in particular struggle to capture long-range dependencies. The key to the solution is integrating SSMs with Context-Dependent Sparse Attention (CDSA), instantiated as locality-sensitive Hashing Attention with sparse Key Selection (HAX), which solves multi-query joint recall in sub-quadratic time and improves performance on real-world long-context scenarios.

Link: https://arxiv.org/abs/2507.00449
Authors: Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: Proceedings of the 42nd International Conference on Machine Learning, ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 18 pages, 9 figures

Abstract:Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, joint recall, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
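The joint recall task is easy to reproduce. Below is one plausible generator, with illustrative token names (the paper's exact format may differ); the answer depends jointly on the queried context and key, which is what defeats single-key associative recall:

```python
# A minimal generator for the "joint recall" synthetic task: the model must
# return the value bound to (context, key), so a key alone is ambiguous.
import random

def make_joint_recall_example(num_contexts=3, keys_per_context=4, vocab=100):
    seq, bindings = [], {}
    for c in range(num_contexts):
        seq.append(f"<ctx{c}>")
        for _ in range(keys_per_context):
            k, v = random.randrange(vocab), random.randrange(vocab)
            bindings[(c, k)] = v        # the same key may bind different values per context
            seq += [f"k{k}", f"v{v}"]
    c, k = random.choice(list(bindings))
    query = ["<q>", f"<ctx{c}>", f"k{k}"]  # query names both the context and the key
    return seq + query, f"v{bindings[(c, k)]}"

tokens, answer = make_joint_recall_example()
```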

[NLP-35] Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions

【Quick Read】: This paper asks how to accurately predict how different population groups would answer subjective questions, i.e., how to better align language models with diverse demographic groups. The key to the solution is relatively simple supervision, which significantly improves model alignment on three datasets spanning varied topics. The study evaluates not only overall performance but also how alignment differs across specific groups, and the simplicity and generality of the method encourage broad practical adoption.

Link: https://arxiv.org/abs/2507.00439
Authors: Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Notes:

Abstract:The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promotes easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.

[NLP-36] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

【Quick Read】: This paper asks whether the striking progress of large language models (LLMs) on mathematical reasoning reflects broader problem-solving ability or merely task-specific overfitting. The key to the solution is comparing how different fine-tuning methods affect generalization: reinforcement learning (RL)-tuned models generalize well across domains, whereas supervised fine-tuning (SFT) tends to make models forget general capabilities, revealing possible limitations of current post-training recipes.

Link: https://arxiv.org/abs/2507.00432
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Affiliations: Carnegie Mellon University; University of Pennsylvania; University of Washington; M-A-P; The Hong Kong Polytechnic University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

[NLP-37] Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

【Quick Read】: This paper addresses the limited modeling flexibility of conventional autoregressive language models, which rely on a discrete token space, unidirectional context, and single-pass decoding. The key to the solution is moving language modeling from the discrete token space into a continuous latent space via TarFlowLM, a new framework that models these continuous representations with transformer-based autoregressive normalizing flows, enabling flexible capabilities such as globally bi-directional context, block-wise generation, and hierarchical multi-pass generation.

Link: https://arxiv.org/abs/2507.00425
Authors: Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework TarFlowLM, that employs transformer-based autoregressive normalizing flows to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
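To see why autoregressive normalizing flows give exact likelihoods over continuous token representations, consider a single causal affine flow layer. The sketch below uses a GRU as a stand-in for the paper's transformer parameterization and omits the proposed mixture-based couplings:

```python
# One causal affine flow layer: position t is transformed with parameters
# predicted from positions < t, so the Jacobian is triangular and log|det|
# reduces to a sum of the predicted log-scales.
import math
import torch
import torch.nn as nn

class CausalAffineFlow(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.GRU(dim, hidden, batch_first=True)  # stand-in for a causal transformer
        self.to_scale_shift = nn.Linear(hidden, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, T, dim) continuous token latents; returns log p(x) per sequence."""
        h, _ = self.net(x)
        # Shift so position t conditions only on positions < t.
        h = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)
        log_s, b = self.to_scale_shift(h).chunk(2, dim=-1)
        z = x * torch.exp(log_s) + b                # invertible per-position affine map
        log_det = log_s.sum(dim=(1, 2))             # triangular Jacobian, cheap determinant
        base_logp = -0.5 * (z ** 2).sum(dim=(1, 2)) - 0.5 * z[0].numel() * math.log(2 * math.pi)
        return base_logp + log_det                  # exact log-likelihood via change of variables
```

Stacking such layers in alternating directions is what gives the framework its globally bi-directional context.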

[NLP-38] ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context

【Quick Read】: This paper asks how to boost the reasoning ability of non-reasoner large language models (LLMs) such as Llama 3. Existing open-source reasoning models perform well, but they build on base models that already exhibit strong reasoning and search behavior, so their recipes do not transfer directly to other non-reasoner models. The key to the solution is the ASTRO framework, which uses a synthetic dataset derived from Monte Carlo Tree Search (MCTS) to have models internalize structured search behavior, converting search traces into natural-language chains of thought and thus injecting a rich exploration prior into reinforcement learning; this substantially improves performance on math competition tasks, especially on hard problems requiring iterative correction.

Link: https://arxiv.org/abs/2507.00417
Authors: Joongwon Kim, Anirudh Goyal, Liang Tan, Hannaneh Hajishirzi, Srinivasan Iyer, Tianlu Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 36 pages, 23 figures

Abstract:We introduce ASTRO, the “Autoregressive Search-Taught Reasoner”, a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.

[NLP-39] Causal Prompting for Implicit Sentiment Analysis with Large Language Models

【Quick Read】: This paper aims to fix the internal biases and spurious correlations in implicit sentiment analysis (ISA) that arise from selecting chain-of-thought (CoT) reasoning paths by majority vote. The key to the solution is the CAPITAL framework, which brings front-door adjustment into CoT reasoning: the overall causal effect is decomposed into the influence of the input prompt on the reasoning chains and the effect of those chains on the final output, estimated via encoder-based clustering and the NWGM approximation, with a contrastive learning objective that aligns the encoder representation with the LLM's reasoning space.

Link: https://arxiv.org/abs/2507.00389
Authors: Jing Ren, Wenhao Zhou, Bowen Li, Mujie Liu, Nguyen Linh Dan Le, Jiade Cen, Liping Chen, Ziqi Xu, Xiwei Xu, Xiaodong Li
Affiliations: RMIT University; Federation University Australia; CSIRO's Data61
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder’s representation with the LLM’s reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: this https URL.

[NLP-40] Gregorian melody modality and memory: Segmenting chant with Bayesian nonparametrics

【Quick Read】: This paper asks whether Gregorian chant melodies are built from a reusable vocabulary of segments, the so-called "centonisation" theory. The key to the solution is unsupervised melody segmentation with nested hierarchical Pitman-Yor language models, which yields state-of-the-art mode classification and provides empirical evidence for a link between modality and memory efficiency.

Link: https://arxiv.org/abs/2507.00380
Authors: Vojtěch Lanz, Jan Hajič jr
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called “centonisation” theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies, and the intractable number of possible segmentations allowed the option that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.

[NLP-41] Question Decomposition for Retrieval-Augmented Generation ACL

【Quick Read】: This paper addresses the challenge multi-hop questions pose for retrieval-augmented generation (RAG): the relevant facts are usually scattered across multiple documents, so standard RAG struggles to retrieve sufficient information. The key to the solution is question decomposition: an LLM first decomposes the original query into sub-questions, passages are then retrieved for each sub-question, and the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. Pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition effectively bridges the retrieval gap on multi-hop questions.

Link: https://arxiv.org/abs/2507.00355
Authors: Paul J. L. Ammann, Jonas Golde, Alan Akbik
Affiliations: Humboldt-Universität zu Berlin
Categories: Computation and Language (cs.CL)
Notes: Accepted to ACL SRW 2025. 9 pages, 2 figures, 4 tables

Abstract:Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as “Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?,” challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. Although reranking itself is standard, we show that pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition bridges the retrieval gap on multi-hop questions and provides a practical, drop-in enhancement, without any extra training or specialized indexing. We evaluate our approach on the MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines.
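The pipeline is simple enough to sketch end to end. Here `llm` and `retriever` are assumed callables, the cross-encoder is an off-the-shelf model in the spirit of the paper's drop-in setup, and the prompt is illustrative rather than the paper's exact template:

```python
# Decompose -> retrieve per sub-question -> merge -> rerank against the
# original question. No extra training or specialized indexing required.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def decompose(llm, question: str) -> list:
    prompt = f"Decompose into standalone sub-questions, one per line:\n{question}"
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]

def retrieve_for_multihop(llm, retriever, question: str, k_per_sub=10, k_final=5):
    sub_questions = decompose(llm, question)
    # Merge and deduplicate the per-sub-question candidate pools.
    pool = list(dict.fromkeys(p for sq in sub_questions for p in retriever(sq, k=k_per_sub)))
    # Rerank the merged pool against the *original* multi-hop question.
    scores = reranker.predict([(question, p) for p in pool])
    ranked = sorted(zip(pool, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:k_final]]
```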

[NLP-42] Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

【Quick Read】: This paper addresses the sensitivity of prompt-based methods for pre-trained language models (PLMs) to templates, verbalizers, and few-shot instance selection in cold-start settings, i.e., when no labeled data is available. The key to the solution is COLDSELECT, a joint verbalizer and instance selection approach that models data diversity: it maps the PLM vocabulary and h_[MASK] embeddings into a shared space and applies dimensionality reduction and clustering for efficient, diverse selection, optimizing for minimal uncertainty and maximal diversity to capture relationships in the data effectively.

Link: https://arxiv.org/abs/2507.00330
Authors: Mohna Chakraborty, Adithya Kulkarni, Qi Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes:

Abstract:Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and h_[MASK] embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT’s superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.
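A minimal sketch of the shared-space, reduce-then-cluster recipe follows, assuming precomputed h_[MASK] instance embeddings and PLM vocabulary embeddings; the paper's actual uncertainty and diversity objectives are more involved than this nearest-centroid heuristic:

```python
# Map instance and vocabulary embeddings into one space, reduce, cluster,
# then pick one diverse instance per cluster and a nearby verbalizer token.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def coldselect_sketch(h_mask: np.ndarray, vocab_emb: np.ndarray, n_shots: int = 8):
    shared = PCA(n_components=32).fit(np.vstack([h_mask, vocab_emb]))
    inst = shared.transform(h_mask)
    km = KMeans(n_clusters=n_shots, n_init=10).fit(inst)
    # Diversity: one instance per cluster, the one closest to its centroid.
    picks = [int(np.argmin(np.linalg.norm(inst - c, axis=1))) for c in km.cluster_centers_]
    # Verbalizer candidates: vocabulary entries nearest to the chosen instances.
    vocab = shared.transform(vocab_emb)
    verbalizers = [int(np.argmin(np.linalg.norm(vocab - inst[i], axis=1))) for i in picks]
    return picks, verbalizers
```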

[NLP-43] Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

【Quick Read】: This paper investigates why language models (LMs) still make errors on simple syntactic tasks such as generating balanced parentheses. It finds that LMs rely on multiple components (attention heads and feed-forward neurons) that make independent predictions: some reliably promote correct answers ("sound mechanisms") while others introduce noise and promote incorrect tokens ("faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominate the prediction. The key to the solution is RASteer, a steering method that systematically identifies reliable components and increases their contribution, raising accuracy on balanced parentheses tasks dramatically without impairing the models' general coding ability.

Link: https://arxiv.org/abs/2507.00322
Authors: Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao
Affiliations: George Mason University; University of Central Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes: 23 pages, 10 figures, preprint

Abstract:Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.
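Steering by scaling component contributions can be prototyped with forward hooks. The sketch below scales the pre-projection output of selected attention heads in GPT-2; the module path is specific to GPT-2's layout, the layer and head indices are arbitrary here, and deciding which heads count as "reliable" is exactly what RASteer's identification step determines:

```python
# Scale selected attention heads' contributions via a forward pre-hook on the
# output projection, where the per-head outputs are still concatenated.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
n_head = model.config.n_head
head_dim = model.config.n_embd // n_head

def scale_heads(layer: int, heads, alpha: float):
    """Multiply the pre-projection output of the chosen heads by alpha."""
    attn_proj = model.transformer.h[layer].attn.c_proj  # projection after head concat

    def hook(module, inputs):
        (hidden,) = inputs                  # (B, T, n_embd), heads laid out contiguously
        hidden = hidden.clone()
        for h in heads:
            hidden[..., h * head_dim:(h + 1) * head_dim] *= alpha
        return (hidden,)

    return attn_proj.register_forward_pre_hook(hook)

handle = scale_heads(layer=5, heads=[2, 7], alpha=1.5)  # boost hypothetically "reliable" heads
# ... run generation, then: handle.remove()
```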

[NLP-44] μ2Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation MICCAI2025

【Quick Read】: This paper addresses two key problems in automated radiology report generation (RRG): the inherent complexity of extracting relevant information from imaging data under resource constraints, and the difficulty of objectively evaluating discrepancies between model-generated and expert-written reports. The key to the solution is μ²LLM, a multiscale multimodal large language model whose novel μ²Tokenizer integrates multimodal features, combined with direct preference optimization (DPO) guided by GREEN-RedLlama to improve report generation quality.

Link: https://arxiv.org/abs/2507.00316
Authors: Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Notes: Accepted by MICCAI 2025

Abstract:Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose μ²LLM, a multiscale multimodal large language model for RRG tasks. The novel μ²Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned μ²LLMs on limited data for RRG tasks.
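The DPO stage optimizes the standard preference objective over pairs of reports; only the report scoring (GREEN-RedLlama in the paper) is domain-specific. A sketch of the loss on sequence log-probabilities, assuming those are already computed for the policy and a frozen reference model:

```python
# Standard DPO loss: prefer the chosen report over the rejected one, measured
# relative to a frozen reference model. beta controls deviation from the reference.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument: summed token log-probs of the preferred/dispreferred report."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```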

[NLP-45] Open-ended Scientific Discovery via Bayesian Surprise

【Quick Read】: This paper asks how an AI system can drive exploration on its own in autonomous scientific discovery (ASD), instead of relying on human-specified research questions. Existing open-ended ASD approaches select hypotheses via diversity heuristics or subjective proxies for human interestingness, but the former struggles to navigate the typically vast hypothesis space meaningfully and the latter suffers from imprecise definitions. The key to the solution is using Bayesian surprise to drive scientific exploration: quantifying the epistemic shift from the language model's prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results, and combining a Monte Carlo tree search (MCTS) strategy with progressive widening, using surprisal as the reward function, to explore nested hypothesis spaces efficiently.

Link: https://arxiv.org/abs/2507.00310
Authors: Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDS – a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDS substantially outperforms competitors by producing 5–29% more discoveries deemed surprising by the LLM. Our human evaluation further finds that two-thirds of AutoDS discoveries are surprising to the domain experts, suggesting this is an important step forward towards building open-ended ASD systems.
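Bayesian surprise here is the divergence from prior to posterior belief. If beliefs about a hypothesis being true are represented as Beta distributions (one plausible choice; the paper's elicitation from the LLM is more elaborate), the KL divergence has a closed form:

```python
# Closed-form KL( Beta(posterior) || Beta(prior) ) as a surprise score.
from scipy.special import betaln, digamma

def bayesian_surprise(prior, posterior):
    a1, b1 = posterior
    a2, b2 = prior
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

# Belief in a hypothesis shifts from ~0.5 to ~0.9 after an experiment:
print(bayesian_surprise(prior=(5, 5), posterior=(9, 1)))  # large epistemic shift
```

In the full system this score plays the role of the MCTS reward, so branches whose experimental results move the model's beliefs the most get expanded first.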

[NLP-46] Natural language processing for African languages

【Quick Read】: This paper addresses the scarcity of resources for Africa's indigenous languages in natural language processing (NLP) research, in particular the lack of labeled data and of unlabeled web data for low-resource languages. The key to the solution is analyzing the noise in existing corpora and curating a high-quality corpus, showing that the quality of learned semantic representations depends not only on the amount of data but also on the quality of the pre-training data; fine-tuning multilingual pre-trained language models (PLMs) and adapting and specializing them with small amounts of monolingual text to improve performance on unseen African languages and low-resource scenarios; and building large-scale human-annotated labeled datasets to advance related research.

Link: https://arxiv.org/abs/2507.00297
Authors: David Ifeoluwa Adelani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: PhD thesis

Abstract:Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.

[NLP-47] Impact of Fine-Tuning Methods on Memorization in Large Language Models

【Quick Read】: This paper aims to understand the privacy risks caused by memorization when fine-tuning pre-trained large language models (LLMs). The key to the solution is categorizing mainstream fine-tuning methods and assessing their impact on memorization through membership inference attacks (MIAs). The results show that prompt-based fine-tuning maintains strong performance while exhibiting lower privacy leakage than parameter-based fine-tuning, and keeps memorization low across model scales.

Link: https://arxiv.org/abs/2507.00258
Authors: Jie Hou, Chuxiong Wu, Lannan Luo, Qiang Zeng
Affiliations: George Mason University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:As the capabilities of pre-trained large language models (LLMs) continue to advance, the “pre-train and fine-tune” paradigm has become increasingly mainstream, leading to the development of various fine-tuning methods. However, the privacy risks arising from memorization during fine-tuning have received relatively little attention. To address this gap, we categorize popular fine-tuning approaches and assess their impact on memorization through the lens of membership inference attacks (MIAs). Our results show that, compared to parameter-based fine-tuning, prompt-based fine-tuning achieves competitive performance while exhibiting lower vulnerability to MIAs. Furthermore, prompt-based methods maintain low memorization regardless of model scale. These findings suggest that parameter-based fine-tuning is more prone to leaking private information, whereas prompt-based fine-tuning serves as a more privacy-preserving option.
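The simplest MIA in this family thresholds the model's loss: members of the fine-tuning set tend to score lower. A sketch, assuming a fine-tuned causal LM and its tokenizer (the paper evaluates a range of attack variants, not just this one):

```python
# Loss-threshold membership inference: lower loss suggests the text was seen
# during fine-tuning. The threshold is calibrated on known members/non-members.
import torch

@torch.no_grad()
def sequence_loss(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return model(input_ids=ids, labels=ids).loss.item()

def mia_predict(model, tok, text: str, threshold: float) -> bool:
    """True = predicted member of the fine-tuning data."""
    return sequence_loss(model, tok, text) < threshold

# Attack quality is then summarized with ROC-AUC over member and non-member
# populations; a fine-tuning method that memorizes less yields AUC closer to 0.5.
```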

[NLP-48] Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition

【Quick Read】: This paper addresses data scarcity, high computational cost, and frame-rate discrepancies between training and inference environments in sign language recognition. The key to the solution is encoding sign-language-specific parameters such as handshape, palm orientation, movement, and location into vectorized inputs, combined with MediaPipe for landmark extraction, yielding highly separable input representations; a lightweight DNN architecture, optimized to a sub-10MB deployment size, then classifies 343 signs accurately on edge devices with latency below 10ms.

Link: https://arxiv.org/abs/2507.00248
Authors: Nikita Nikitin, Eugene Fomin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 7 pages, 2 figures, 2 tables; for the associated MPEG file, see this https URL

Abstract:We present a novel framework for real-time sign language recognition using lightweight DNNs trained on limited data. Our system addresses key challenges in sign language recognition, including data scarcity, high computational costs, and discrepancies in frame rates between training and inference environments. By encoding sign language specific parameters, such as handshape, palm orientation, movement, and location into vectorized inputs, and leveraging MediaPipe for landmark extraction, we achieve highly separable input data representations. Our DNN architecture, optimized for sub 10MB deployment, enables accurate classification of 343 signs with less than 10ms latency on edge devices. The data annotation platform ‘slait data’ facilitates structured labeling and vector extraction. Our model achieved 92% accuracy in isolated sign recognition and has been integrated into the ‘slait ai’ web application, where it demonstrates stable inference.

[NLP-49] EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

【Quick Read】: This paper questions the English-centric focus of current research on language reasoning models (LRMs), asking whether English is the most token-efficient language for reasoning. The key to the solution is evaluating three open-source RLMs (DeepSeek R1, Qwen 2.5, and Qwen 3) on four math datasets and seven typologically diverse languages, finding that reasoning in non-English languages reduces token usage while preserving accuracy; the gains persist even after translating the reasoning traces into English, indicating genuine shifts in reasoning behavior rather than surface-level linguistic effects, with the size of the improvement depending on the model's multilingual strength.

Link: https://arxiv.org/abs/2507.00246
Authors: Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 15 pages, 5 figures, 9 tables

Abstract:Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source RLMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at: this https URL.

[NLP-50] The Algebraic Structure of Morphosyntax

【Quick Read】: This paper develops a mathematical model of the morphology-syntax interface, within the mathematical formulation of Merge and the Strong Minimalist Thesis, to describe the compositional properties of morphology in word formation. The key is organizing morphology as a magma of morphological trees and, via a coproduct decomposition, extending the set of morphological trees to include possible morphological inputs to syntactic trees; these participate in the formation of morphosyntactic trees as an algebra over an operad, and the process of structure formation is described through a correspondence between algebras over an operad.

Link: https://arxiv.org/abs/2507.00244
Authors: Isabella Senturia, Matilde Marcolli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Quantum Algebra (math.QA)
Notes: 45 pages, LaTeX, 2 PNG figures

Abstract:Within the context of the mathematical formulation of Merge and the Strong Minimalist Thesis, we present a mathematical model of the morphology-syntax interface. In this setting, morphology has compositional properties responsible for word formation, organized into a magma of morphological trees. However, unlike syntax, we do not have movement within morphology. A coproduct decomposition exists, but it requires extending the set of morphological trees beyond those which are generated solely by the magma, to a larger set of possible morphological inputs to syntactic trees. These participate in the formation of morphosyntactic trees as an algebra over an operad, and a correspondence between algebras over an operad. The process of structure formation for morphosyntactic trees can then be described in terms of this operadic correspondence that pairs syntactic and morphological data and the morphology coproduct. We reinterpret in this setting certain operations of Distributed Morphology as transformation that allow for flexibility in moving the boundary between syntax and morphology within the morphosyntactic objects.

[NLP-51] Linearly Decoding Refused Knowledge in Aligned Language Models

【Quick Read】: This paper asks whether harmful information refused by instruction-tuned language models remains decodable from the models' internal representations. The key to the solution is training linear probes on hidden states to decode information elicited through jailbreak prompts: the results show that much of the initially refused information is highly linearly decodable, and that probes trained on base models often transfer to instruction-tuned versions, suggesting that harmful information is neither removed nor relocated; its direct expression is suppressed, but it remains linearly accessible in representation space and influences downstream behavior.

Link: https://arxiv.org/abs/2507.00239
Authors: Aryan Shrivastava, Ari Holtzman
Affiliations: University of Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding 0.8. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely “leftover” in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
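The probing setup is standard and easy to reproduce in miniature: cache a layer's hidden state per prompt, then fit a linear model. The sketch below uses GPT-2 and dummy prompts and targets purely as stand-ins; the paper probes much larger models on refused properties:

```python
# Fit a linear probe on a layer's final-token hidden states.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

@torch.no_grad()
def last_token_state(prompt: str, layer: int = 8) -> np.ndarray:
    ids = tok(prompt, return_tensors="pt")
    hs = model(**ids).hidden_states[layer]   # (1, T, d)
    return hs[0, -1].numpy()                 # final-token representation

# Stand-in data: real probes pair jailbreak-elicited prompts with scalar targets.
prompts = ["example prompt A", "example prompt B", "example prompt C"]
y = np.array([0.1, 0.5, 0.9])  # dummy targets for the refused property

X = np.stack([last_token_state(p) for p in prompts])
probe = Ridge(alpha=1.0).fit(X, y)
# Report Pearson correlation between probe predictions and ground truth on held-out data.
```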

[NLP-52] Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations

【Quick Read】: This paper addresses the spatial-temporal misalignment in existing interpretability methods: convolutional networks cannot capture global context while Transformers lack localized precision, a limitation that hinders actionable insights in safety-critical domains such as healthcare and industrial monitoring. The key to the solution is fusing gradient-weighted activation maps from ResNet with attention rollout from a restructured 2D Transformer into a unified visualization, achieving full spatial-temporal alignment while preserving real-time performance.

Link: https://arxiv.org/abs/2507.00234
Authors: Jiztom Kavalakkatt Francis, Matthew J Darr
Affiliations: Iowa State University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 13 pages

Abstract:In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input saliency. We address the critical problem of spatial-temporal misalignment in existing interpretability methods, where convolutional networks fail to capture global context and Transformers lack localized precision, a limitation that impedes actionable insights in safety-critical domains like healthcare and industrial monitoring. Our method merges gradient-weighted activation maps (ResNet) and Transformer attention rollout into a unified visualization, achieving full spatial-temporal alignment while preserving real-time performance. Empirical evaluations on clinical (ECG arrhythmia detection) and industrial (energy consumption prediction) datasets demonstrate significant improvements: the hybrid framework achieves 94.1% accuracy (F1 0.93) on the PhysioNet dataset and reduces regression error to RMSE = 0.28 kWh (R² = 0.95) on the UCI Energy Appliance dataset, outperforming standalone ResNet, Transformer, and InceptionTime baselines by 3.8-12.4%. An NLP module translates fused heatmaps into domain-specific narratives (e.g., “Elevated ST-segment between 2-4 seconds suggests myocardial ischemia”), validated via BLEU-4 (0.586) and ROUGE-L (0.650) scores. By formalizing interpretability as causal fidelity and spatial-temporal alignment, our approach bridges the gap between technical outputs and stakeholder understanding, offering a scalable solution for transparent, time-aware decision-making.

[NLP-53] Towards Style Alignment in Cross-Cultural Translation ACL2025

【Quick Read】: This paper addresses style misalignment in cross-cultural communication caused by cultural differences: the style a speaker intends diverges from the style a listener perceives, with politeness, for example, often lost in translation. The key to the solution is RASTA (Retrieval-Augmented STylistic Alignment), which leverages learned stylistic concepts to guide LLM translations toward conveying cultural communication norms and achieving stylistic alignment.

Link: https://arxiv.org/abs/2507.00216
Authors: Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL)
Notes: Accepted to ACL 2025

Abstract:Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style - biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.

[NLP-54] Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

【Quick Read】: This paper addresses the limited performance, robustness, and interpretability of traditional classifiers for text classification, which lack an explicit reasoning process. The key to the solution is a two-stage approach: a large language model (LLM) first generates textual reasoning (R), and these reasonings are then used as augmented training data for a downstream generative model that, given only the input text, outputs both the reasoning and the predicted sentiment label, improving classification performance.

Link: https://arxiv.org/abs/2507.00214
Authors: Mads Henrichsen, Rasmus Krebs
Affiliations: syv.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q-RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q-A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.
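The data flow of the two stages reduces to an offline augmentation pass. A sketch follows, with an illustrative prompt template (not the paper's exact one); `r_gen` stands in for the fine-tuned Llama-R-Gen generator:

```python
# Stage 1 output feeding stage 2: generate reasoning R for each (Q, A) pair,
# then train the downstream model on Q -> "R ... A".
def build_augmented_dataset(r_gen, examples):
    """examples: iterable of dicts with 'text' (Q) and 'label' (A)."""
    augmented = []
    for ex in examples:
        reasoning = r_gen(f"Question: {ex['text']}\nAnswer: {ex['label']}\nExplain why:")
        augmented.append({
            "input": ex["text"],
            # Target: the reasoning immediately followed by the emotion label.
            "target": f"{reasoning.strip()}\nEmotion: {ex['label']}",
        })
    return augmented
```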

[NLP-55] LineRetriever: Planning-Aware Observation Reduction for Web Agents

【Quick Read】: This paper addresses the problem that in web navigation tasks the extensive content of web pages (usually represented as DOM or Accessibility Tree structures) exceeds model context limits, while existing methods such as bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. The key to the solution is LineRetriever, which uses a language model to identify and retrieve the observation lines most relevant to future navigation steps; unlike traditional retrieval methods that focus only on semantic similarity, LineRetriever explicitly considers the planning horizon and prioritizes elements that help predict the next action.

Link: https://arxiv.org/abs/2507.00210
Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Massimo Caccia, Véronique Eglin, Alexandre Aussem, Jérémy Espinas, Alexandre Lacoste
Affiliations: INSA Lyon; Esker; ServiceNow Research; Mila AI Institute; McGill University; Université Claude Bernard Lyon 1; LIRIS
Categories: Computation and Language (cs.CL)
Notes:

Abstract:While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce LineRetriever, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, LineRetriever explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that LineRetriever can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.

[NLP-56] Prompting as Scientific Inquiry

【Quick Read】: This paper reconsiders the status of prompting in the scientific study and control of large language models (LLMs). Although prompting is the primary method that unlocked many key LLM capabilities (few-shot learning, chain-of-thought, constitutional AI, and more), it has long been dismissed as unscientific "alchemy". The paper argues this is a category error about how LLMs should be studied: if LLMs are treated as a new kind of complex, opaque "organism" that is trained rather than programmed, then prompting probes the model through its native interface, language, and should be regarded as behavioral science rather than a workaround, advancing the systematic understanding and control of LLMs.

Link: https://arxiv.org/abs/2507.00163
Authors: Ari Holtzman, Chenhao Tan
Affiliations: University of Chicago
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate; prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.

[NLP-57] Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data ACL2025

【Quick Read】: This paper addresses the underexplored efficiency of large language models (LLMs) in processing tabular data, especially their table understanding across domains and modalities. The key to the solution is a cross-domain, cross-modality evaluation that compares text-based and multimodal LLMs on table understanding in scientific versus non-scientific contexts, together with the TableEval benchmark of 3017 tables from scholarly publications, Wikipedia, and financial reports, each provided in five formats (Image, Dictionary, HTML, XML, and LaTeX), enabling a comprehensive assessment of model performance and interpretability.

Link: https://arxiv.org/abs/2507.00152
Authors: Ekaterina Borisova, Fabio Barth, Nils Feldhus, Raia Abu Ahmad, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Sebastian Möller
Affiliations: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI); Technische Universität Berlin; BIFOLD; Deutsche Telekom; Common Crawl Foundation; Humboldt-Universität zu Berlin
Categories: Computation and Language (cs.CL)
Notes: TRL@ACL 2025, camera-ready version

Abstract:Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.

[NLP-58] Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models

【Quick Read】: This paper addresses the opacity of large language models' (LLMs) decision processes on complex reasoning tasks, i.e., their black-box nature. The key to the solution is inverse reasoning, a novel paradigm that lets LLMs decompose and explain their own reasoning chains post-hoc: through a metacognitive structure, the model traces back via its attention mechanisms to identify major decision points and generate explanations of its reasoning choices, making the reasoning process self-reflective and transparent.

Link: https://arxiv.org/abs/2507.00092
Authors: Basab Jha, Firoj Paudel, Ujjwal Puri, Zhang Yuting, Choi Donghyuk, Wang Junhao
Affiliations: SAGEA; Tribhuwan University - Vedas College; Tribhuwan University - Madan Bhandari Memorial College; Fudan University; ETH Zurich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 19 pages, 2 figures, 9 tables

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought (CoT) prompting, but their decision-making processes remain somewhat black-box. We introduce inverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc. Our approach, used in SAGE-nano, a 4-billion-parameter reasoning model, employs a metacognitive structure that reflects back via attention processes to identify major decision points and generate explanations of reasoning choices. While typical CoT approaches are directed towards forward reasoning generation, inverse reasoning provides insight into why specific reasoning chains were selected over others. Through thorough testing of logical reasoning puzzles, math problems and ethical dilemmas from AQUA-RAT, CommonsenseQA, and customized benchmarks, we demonstrate that SAGE-nano is at the cutting edge both on reasoning accuracy (74.6% on AQUA-RAT) and explanation quality (92.1% human preference score) for its task, and offers performance almost on par with models like Claude-3.5 Sonnet or GPT-4o. Our contributions are: (i) the first rigorous framework for LLM self-reflection via inverse reasoning, (ii) a novel metalearning framework to reverse the attention flow, (iii) comprehensive evaluation frameworks for reasoning transparency, and (iv) evidence that increasing reasoning using inverse reasoning improves interpretability along with reasoning performance. Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.

[NLP-59] Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission

【Quick Read】: This paper addresses the communication overhead in hybrid language models (HLMs), where uncertain predictions during collaborative inference between edge devices and a central server trigger frequent offloading requests. The key to the solution is FedHLM, a framework that uses federated learning (FL) to collaboratively learn token-level uncertainty thresholds that decide when to invoke the large language model (LLM), reducing unnecessary traffic; FedHLM further introduces embedding-based token representations for peer-to-peer resolution and a hierarchical model aggregation strategy that refines edge-side routing policies and global decision boundaries, effectively cutting redundant LLM queries.

Link: https://arxiv.org/abs/2507.00082
Authors: Faranaksadat Solat, Joohyung Lee, Mohamed Seif, Dusit Niyato, H. Vincent Poor
Affiliations: Gachon University; Nanyang Technological University; Princeton University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 17 pages, 16 figures, IEEE Internet of Things

Abstract:Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM’s key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95 percent with negligible accuracy loss, making it well-suited for scalable and efficient edge-AI applications.
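The routing rule at the heart of the system is a per-token uncertainty test. A sketch follows; in FedHLM the threshold is not hand-set as here but learned locally and aggregated with federated averaging, and a peer-to-peer lookup can resolve the token before any LLM call:

```python
# Entropy-gated routing between the on-device SLM and the server LLM: offload
# a token only when the SLM's predictive entropy exceeds the threshold.
import torch

def route_next_token(slm_logits: torch.Tensor, threshold: float):
    """Returns (token_id or None, needs_llm). slm_logits: (vocab,)."""
    probs = torch.softmax(slm_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    if entropy > threshold:
        return None, True          # uncertain: escalate this step to the LLM
    return int(probs.argmax()), False
```

Because most tokens fall below the threshold, only the ambiguous ones cross the network, which is what yields the reported reduction in LLM transmissions.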

[NLP-60] State and Memory is All You Need for Robust and Reliable AI Agents

【Quick Read】: This paper addresses the limited applicability of large language models (LLMs) to complex, real-world scientific workflows, specifically challenges in memory, planning, and tool integration. The key to the solution is SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework whose agents are constructed dynamically from source-code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making, so that agents can maintain context across extended workflows and recover from tool or execution failures without manual prompt engineering.

Link: https://arxiv.org/abs/2507.00081
Authors: Matthew Muhoberac, Atharva Parikh, Nirvi Vakharia, Saniya Virani, Aco Radujevic, Savannah Wood, Meghav Verma, Dimitri Metaxotos, Jeyaraman Soundararajan, Thierry Masquelin, Alexander G. Godfrey, Sean Gardner, Dobrila Rudnicki, Sam Michael, Gaurav Chopra
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Chemical Physics (physics.chem-ph)
Notes: 5 main figures, 10 extended data figures (37 pages) for the manuscript; 9 supplementary tables, 40 supplementary figures (180 pages) for the supporting information

Abstract:Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real-world scientific workflows remains limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications via maintaining context across extended workflows and to recover from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers for executing user-specified reactions with context-aware decision making, and demonstrate its use in autonomous multi-step bioassay retrieval from the PubChem database, utilizing multi-step planning, reasoning, and agent-to-agent communication and coordination to execute exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.
zh
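
下面用一个极简的有限状态自动机(FSA)示意 SciBORG 式的状态记忆:每步工具调用由显式状态机跟踪,失败时回退重试而不丢失上下文。状态集合、事件名与转移表均为示例性假设,与论文实现无直接对应。

```python
# 极简示意:FSA 记忆驱动的智能体状态跟踪(状态与转移均为假设)
from enum import Enum, auto

class State(Enum):
    PLAN = auto()
    CALL_TOOL = auto()
    VERIFY = auto()
    RETRY = auto()
    DONE = auto()

TRANSITIONS = {
    (State.PLAN, "plan_ready"): State.CALL_TOOL,
    (State.CALL_TOOL, "tool_ok"): State.VERIFY,
    (State.CALL_TOOL, "tool_fail"): State.RETRY,
    (State.RETRY, "plan_ready"): State.CALL_TOOL,
    (State.VERIFY, "checks_pass"): State.DONE,
    (State.VERIFY, "checks_fail"): State.PLAN,
}

def run(events):
    """按事件序列驱动状态机,并把状态轨迹持久化为“记忆”。"""
    state, history = State.PLAN, []
    for ev in events:
        state = TRANSITIONS.get((state, ev), state)  # 未知事件保持原状态,容错
        history.append((ev, state.name))
    return state, history

if __name__ == "__main__":
    final, hist = run(["plan_ready", "tool_fail", "plan_ready", "tool_ok", "checks_pass"])
    print(final.name)  # DONE:即使中途工具失败,仍能恢复并完成任务
    print(hist)
```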

[NLP-61] The language of time: a language model perspective on time-series foundation models

【速读】: 该论文试图解决时间序列基础模型在跨领域迁移中存在的根本性悖论,即时间序列数据反映的是不同的动力系统,使得跨领域迁移看似不切实际,但实际模型却表现出优异的性能。解决方案的关键在于揭示基于块(patch)的时间序列基础模型通过将语言模型的表示范式进行扩展,从确定性的向量表示推广到潜在的概率分布形式,从而实现了对时间序列数据的有效表示学习和泛化能力。理论分析表明,连续的时间序列块可以被准确离散化为一个词汇表,其关键统计特性与自然语言高度一致,这使得时间序列模型能够继承大规模语言模型的鲁棒表示和迁移能力。

链接: https://arxiv.org/abs/2507.00078
作者: Yi Xie,Yun Xiong,Zejian Shi,Hao Niu,Zhengfu Liu
机构: Fudan University (复旦大学); Shanghai Key Laboratory of Data Science (上海市数据科学重点实验室); ZCTech (智城科技); Beijing Institute of Technology (北京理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rise of large language models, the paradigm of training foundation models with massive parameter counts on vast datasets has been adopted in multiple domains to achieve remarkable success. Time series foundation models represent a significant extension of this paradigm, demonstrating exceptional expressive power, generalization, and cross-domain transferability. However, this gives rise to a fundamental paradox: time series data reflect distinct dynamical systems, making cross-domain transfer intuitively implausible, yet this is contradicted by the models’ empirical success. To resolve this paradox, this paper investigates, from both theoretical and experimental perspectives, the representation learning mechanisms and generalization capabilities of patch-based time series foundation models. We argue that such models are not merely applying a new architecture but are fundamentally generalizing the representation paradigm of language models by extending deterministic vector-based representations to latent probabilistic distributional forms. Our theoretical analysis supports this framework by demonstrating that continuous time-series patches can be faithfully quantized into a discrete vocabulary whose key statistical properties are highly consistent with those of natural language. This generalization allows time series models to inherit the robust representation and transfer abilities of large language models, thereby explaining their superior performance in temporal tasks. Ultimately, our work provides a rigorous theoretical cornerstone for understanding, evaluating, and improving the safety and reliability of large-scale time series foundation models.
zh
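
下面的示意演示“连续时间序列 patch 可被离散化为词表 token”这一核心观点:切分 patch 后用朴素 k-means 学一个码本,再把每个 patch 映射为最近码字的 id。patch 长度、码本大小与聚类方式均为笔者假设,仅用于说明思路。

```python
# 极简示意:时间序列 patch -> k-means 码本 -> 离散 token 序列(参数均为假设)
import numpy as np

def patchify(series: np.ndarray, patch_len: int) -> np.ndarray:
    n = len(series) // patch_len
    return series[: n * patch_len].reshape(n, patch_len)

def build_codebook(patches: np.ndarray, k: int, iters: int = 20, seed: int = 0):
    """朴素 k-means:学习 k 个原型 patch 作为离散“词表”。"""
    rng = np.random.default_rng(seed)
    codebook = patches[rng.choice(len(patches), size=k, replace=False)]
    for _ in range(iters):
        d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                codebook[j] = patches[assign == j].mean(0)
    return codebook

def tokenize(patches: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)  # 每个 patch 映射为一个离散 token id

if __name__ == "__main__":
    t = np.linspace(0, 8 * np.pi, 1024)
    series = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
    patches = patchify(series, patch_len=16)
    codebook = build_codebook(patches, k=32)
    print(tokenize(patches, codebook)[:10])  # 时间序列被表示为一个 token “句子”
```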

[NLP-62] MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding

【速读】: 该论文旨在解决多模态学习中模态间表示与推理不一致的问题,通过将视觉和听觉输入统一到结构化文本空间中,以实现与大语言模型的无缝处理。其解决方案的关键在于MANTA(Multi-modal Abstraction and Normalization via Textual Alignment)框架,该框架通过信息论优化实现跨模态语义对齐、自适应时间同步、多尺度内容表征以及长序列中稀疏信息的上下文感知检索,从而提升多模态数据的融合效果和理解能力。

链接: https://arxiv.org/abs/2507.00068
作者: Ziqi Zhong,Daniel Tang
机构: London School of Economics (伦敦经济学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While multi-modal learning has advanced significantly, current approaches often treat modalities separately, creating inconsistencies in representation and reasoning. We introduce MANTA (Multi-modal Abstraction and Normalization via Textual Alignment), a theoretically-grounded framework that unifies visual and auditory inputs into a structured textual space for seamless processing with large language models. MANTA addresses four key challenges: (1) semantic alignment across modalities with information-theoretic optimization, (2) adaptive temporal synchronization for varying information densities, (3) hierarchical content representation for multi-scale understanding, and (4) context-aware retrieval of sparse information from long sequences. We formalize our approach within a rigorous mathematical framework, proving its optimality for context selection under token constraints. Extensive experiments on the challenging task of Long Video Question Answering show that MANTA improves state-of-the-art models by up to 22.6% in overall accuracy, with particularly significant gains (27.3%) on videos exceeding 30 minutes. Additionally, we demonstrate MANTA’s superiority on temporal reasoning tasks (23.8% improvement) and cross-modal understanding (25.1% improvement). Our framework introduces novel density estimation techniques for redundancy minimization while preserving rare signals, establishing new foundations for unifying multimodal representations through structured text.
zh

[NLP-63] Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation

【速读】: 该论文试图解决知识蒸馏(Knowledge Distillation, KD)过程中学生模型仅复制教师模型在分布内响应的问题,这限制了模型的泛化能力,尤其在推理任务中表现更为明显,并且可能带来计算上的高成本。解决方案的关键在于提出AdvDistill,一个基于奖励的课程数据蒸馏框架,通过为每个提示生成多个教师响应并根据规则验证器分配奖励,将这些奖励作为训练学生模型时的权重,从而提升学生模型在数学和复杂推理任务中的性能。

链接: https://arxiv.org/abs/2507.00054
作者: Shreyansh Padarha
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 Pages, 7 figures

点击查看摘要

Abstract:The push to compress and impart the proficiency of Large Language Models (LLMs) into more deployable and efficient Small Language Models (SLMs) has benefited from improvements in knowledge distillation (KD) techniques. These techniques allow a smaller student model to learn from a more capable and larger teacher model’s responses. However, distillation often revolves around the student model merely copying the teacher’s in-distribution responses, limiting its generalisability. This limitation is amplified on reasoning tasks and can be computationally expensive. In this study, we propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models. Our methods and their subsequent behavioural analysis demonstrate a significant improvement in student model performance for mathematical and complex reasoning tasks, showcasing the efficacy and benefits of incorporating a rewarding mechanism in dataset distillation processes.
zh
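
下面给出奖励加权蒸馏损失的一个极简 PyTorch 示意:教师对每个 prompt 的多条回复各带一个验证器奖励,归一化后作为学生负对数似然的权重。权重归一化方案(softmax 后按条数缩放)为笔者假设,非论文官方实现。

```python
# 极简示意:AdvDistill 式的奖励加权蒸馏目标(权重方案为假设)
import torch
import torch.nn.functional as F

def reward_weighted_nll(student_logits, target_ids, rewards):
    """
    student_logits: (N, T, V) 学生模型对 N 条教师回复的 logits
    target_ids:     (N, T)    教师回复的 token id
    rewards:        (N,)      规则验证器给出的奖励
    """
    w = torch.softmax(rewards, dim=0) * rewards.numel()  # 归一化为均值约为 1 的权重
    nll = F.cross_entropy(
        student_logits.transpose(1, 2), target_ids, reduction="none"
    ).mean(dim=1)                                        # 每条回复的平均 NLL
    return (w * nll).mean()

if __name__ == "__main__":
    N, T, V = 4, 8, 100
    logits = torch.randn(N, T, V, requires_grad=True)
    targets = torch.randint(0, V, (N, T))
    rewards = torch.tensor([1.0, 0.2, 0.9, -0.5])  # 高奖励回复主导学生的梯度
    loss = reward_weighted_nll(logits, targets, rewards)
    loss.backward()
    print(float(loss))
```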

[NLP-64] CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning

【速读】: 该论文试图解决当前多模态大语言模型(Multi-Modal Large Language Models, MLLMs)在复杂视觉感知与推理任务中的能力局限性问题,特别是其在模拟人类优秀侦探能力方面的不足。研究通过引入名为“CaughtCheating”的挑战性任务,揭示了现有MLLMs在处理需要细致观察和情境推理解释的场景时表现显著下降的现象。解决方案的关键在于深入分析此类任务的难点,并探索提升模型在视觉线索识别与逻辑推理方面的能力,从而推动MLLMs向具备人类水平的侦探感知与推理能力发展。

链接: https://arxiv.org/abs/2507.00045
作者: Ming Li,Chenguang Wang,Yijun Liang,Xiyao Wang,Yuhang Zhou,Xiyang Wu,Yuqing Zhang,Ruiyi Zhang,Tianyi Zhou
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios that GPT-o3 can still handle, and find a common scenario where o3’s performance drops to nearly zero, which we name CaughtCheating. It is inspired by social media requests that ask others to detect suspicious clues from photos shared by the poster’s partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.
zh

[NLP-65] Moment Sampling in Video LLMs for Long-Form Video QA CVPR2025

【速读】: 该论文试图解决视频大语言模型(Video LLMs)在处理长视频时的长程推理能力不足的问题,特别是在帧采样过程中因常规帧下采样方法导致的关键帧丢失或冗余信息增加的问题。解决方案的关键在于提出“时刻采样”(moment sampling)方法,该方法利用通用文本到视频时刻检索模型指导帧采样过程,使模型能够根据问题上下文选择最相关的帧,从而提升长视频问答性能。

链接: https://arxiv.org/abs/2507.00033
作者: Mustafa Chasmai,Gauri Jagatap,Gouthaman KV,Grant Van Horn,Subhransu Maji,Andrea Fanelli
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Dolby Laboratories (杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Workshop on Video Large Language Models (VidLLMs) at CVPR 2025

点击查看摘要

Abstract:Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model’s ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose “moment sampling”, a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.
zh
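
下面用几行 NumPy 对比等间隔抽帧与“时刻采样”:后者按时刻检索模型给出的逐帧相关度取 top-k 并保持时间顺序。这里用随机数加一段高分区间模拟检索得分,帧数与 k 均为假设的占位。

```python
# 极简示意:等间隔抽帧 vs. 检索得分引导的时刻采样(得分为模拟数据)
import numpy as np

def uniform_sampling(num_frames: int, k: int) -> np.ndarray:
    return np.linspace(0, num_frames - 1, k).astype(int)

def moment_sampling(relevance: np.ndarray, k: int) -> np.ndarray:
    """relevance[i] 为检索模型对第 i 帧与问题相关度的打分。"""
    idx = np.argsort(relevance)[-k:]  # 取相关度最高的 k 帧
    return np.sort(idx)               # 恢复时间顺序后送入 Video LLM

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random(3000)         # 假设:长视频共 3000 帧的检索得分
    scores[1200:1260] += 2.0          # 与问题相关的“时刻”
    print(uniform_sampling(3000, 8))
    print(moment_sampling(scores, 8))  # 采样集中落在相关片段内
```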

[NLP-66] ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models

【速读】: 该论文旨在解决现有手动安全评估基准在适应性、话题覆盖广度和现实场景对齐方面的不足,这些问题限制了其在快速发展的大型语言模型(Large Language Models, LLMs)安全评估中的有效性。论文提出的解决方案关键在于构建一种基于多目标强化学习的现实导向安全评估框架(Reality-Oriented Safety Evaluation, ROSE),通过微调对抗性LLM来生成具有广泛话题多样性和丰富情境语境的对抗性提示,从而更全面地暴露潜在的安全漏洞。

链接: https://arxiv.org/abs/2507.00026
作者: Jiale Ding,Xiang Zheng,Cong Wang,Wei-Bin Lee,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); City University of Hong Kong (香港城市大学); Hon Hai Research Institute (鸿海研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, evaluating their safety-especially under adversarial prompting-has become critical. Arguably, effective safety evaluations should be adaptive, evolving with LLM capabilities, and also cover a broad spectrum of harmful topics and real-world scenarios to fully expose potential vulnerabilities. Existing manual safety benchmarks, built on handcrafted adversarial prompts, are limited by their static nature and the intensive labor required to update them, making it difficult to keep pace with rapidly advancing LLMs. In contrast, automated adversarial prompt generation offers a promising path toward adaptive evaluation. However, current methods often suffer from insufficient adversarial topic coverage (topic-level diversity) and weak alignment with real-world contexts. These shortcomings stem from the exploration-exploitation dilemma in black-box optimization and a lack of real-world contextualization, resulting in adversarial prompts that are both topically narrow and scenario-repetitive. To address these issues, we propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM for generating topically diverse and contextually rich adversarial prompts. Experiments show that ROSE outperforms existing methods in uncovering safety vulnerabilities in state-of-the-art LLMs, with notable improvements in integrated evaluation metrics. We hope ROSE represents a step toward more practical and reality-oriented safety evaluation of LLMs. WARNING: This paper contains examples of potentially harmful text.
zh

[NLP-67] GLU Attention Improve Transformer

【速读】: 该论文试图解决传统注意力机制在非线性建模能力上的局限性,从而提升神经网络的性能与收敛速度。其解决方案的关键在于引入一种名为GLU Attention的新颖注意力机制,该机制通过在注意力值(Attention)中引入非线性,增强了模型的表达能力,同时保持了零额外参数和可忽略的计算成本,使其具有高度的轻量化和兼容性。

链接: https://arxiv.org/abs/2507.00022
作者: Zehao Wang
机构: Nanjing University (南京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: 4 pages 4 figures

点击查看摘要

Abstract:Gated Linear Units (GLU) have shown great potential in enhancing neural network performance. In this paper, I introduce a novel attention mechanism called GLU Attention, which introduces nonlinearity into the values of Attention. My experiments demonstrate that GLU Attention improves both model performance and convergence speed across text and vision modalities with zero additional parameters and negligible computational costs. GLU Attention is lightweight and can seamlessly integrate with other technologies, such as Flash Attention, Rotary Position Embedding (RoPE), and various Multi-Head Attention (MHA) variants such as Grouped-Query Attention (GQA). This project is open-sourced at github.
zh
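
论文摘要未给出公式细节,下面按常见的 GLU 写法给出一种可能的实现示意:把 value 一分为二,一半经 sigmoid 作为门控,再走标准缩放点积注意力。该具体形式(value 维度翻倍、sigmoid 门)属于笔者假设,未必与论文一致。

```python
# 极简示意:在注意力的 value 路径上加入 GLU 门控非线性(具体形式为假设)
import torch

def glu_attention(q, k, v):
    """
    q, k: (B, T, d)    v: (B, T, 2d),后一半作为门控信号
    """
    d = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    v_main, v_gate = v.chunk(2, dim=-1)
    gated_v = v_main * torch.sigmoid(v_gate)  # GLU:对 value 引入非线性
    return attn @ gated_v

if __name__ == "__main__":
    B, T, d = 2, 5, 8
    q, k = torch.randn(B, T, d), torch.randn(B, T, d)
    v = torch.randn(B, T, 2 * d)
    print(glu_attention(q, k, v).shape)  # torch.Size([2, 5, 8])
```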

[NLP-68] Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

【速读】: 该论文旨在解决预训练语言模型(Large Language Model, LLM)后训练过程中,监督微调(Supervised Fine-Tuning, SFT)与偏好学习方法(如直接偏好优化,Direct Preference Optimization, DPO)在理论上的分离问题,提出一个统一的理论框架来连接二者。其关键在于通过严格的数学推导证明,SFT和偏好学习方法均存在于相同的最优策略-奖励子空间中,且SFT可视为隐式奖励学习的一种特殊情况。研究进一步揭示了传统SFT中KL散度项在优化过程中对策略的约束失效问题,并提出一种简单有效的学习率衰减方法,显著提升了模型在指令遵循任务中的性能。此外,还通过不同f-散度函数推导出改进的SFT目标函数,进一步增强了后DPO模型的表现。

链接: https://arxiv.org/abs/2507.00018
作者: Bo Wang,Qinyuan Cheng,Runyu Peng,Rong Bao,Peiji Li,Qipeng Guo,Linyang Li,Zhiyuan Zeng,Yunhua Zhou,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to a 25% relative gain and a 6% absolute win-rate increase in instruction-following tasks). Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.
zh
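
下面给出 DPO 隐式奖励 r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)) 及其成对偏好损失的极简示意,便于对照论文“SFT 是隐式奖励学习特例”的论断。这里只在序列 log 概率的标量层面演示,beta 与全部数值均为假设。

```python
# 极简示意:DPO 的隐式奖励与成对偏好损失(数值均为占位)
import torch
import torch.nn.functional as F

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """logp_w / logp_l:策略对偏好(chosen)与非偏好(rejected)回复的序列 log 概率。"""
    r_w = implicit_reward(logp_w, logp_ref_w, beta)
    r_l = implicit_reward(logp_l, logp_ref_l, beta)
    return -F.logsigmoid(r_w - r_l).mean()

if __name__ == "__main__":
    logp_w = torch.tensor([-12.0, -9.5], requires_grad=True)
    logp_l = torch.tensor([-11.0, -10.0], requires_grad=True)
    ref_w, ref_l = torch.tensor([-12.5, -10.0]), torch.tensor([-10.5, -9.8])
    loss = dpo_loss(logp_w, logp_l, ref_w, ref_l)
    loss.backward()
    print(float(loss))
```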

[NLP-69] Hypertokens: Holographic Associative Memory in Tokenized LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中因信息扩散(information spreading)导致的精度下降问题,即将计算精度问题重新定义为信息理论中的通信问题。解决方案的关键在于引入HDRAM(Holographically Defined Random Access Memory),这是一种基于超令牌(hypertokens)的符号化记忆框架,通过整合经典纠错码(ECC)、全息计算和量子启发式搜索,将Transformer隐空间视为扩频信道,从而实现分布式信息的原理性解扩。该方法通过相位相干的记忆地址实现高效的键值操作和隐空间中的Grover式搜索,结合ECC语法与压缩感知及Krylov子空间对齐,显著提升了关联检索性能,而无需改变模型架构。

链接: https://arxiv.org/abs/2507.00002
作者: Christopher James Augeri
机构: Sloop FX, Inc.(Sloop FX公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint as accepted to this https URL - Quantum AI and NLP Conference 2025

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit remarkable capabilities but suffer from apparent precision loss, reframed here as information spreading. This reframing shifts the problem from computational precision to an information-theoretic communication issue. We address the K:V and V:K memory problem in LLMs by introducing HDRAM (Holographically Defined Random Access Memory), a symbolic memory framework treating transformer latent space as a spread-spectrum channel. Built upon hypertokens, structured symbolic codes integrating classical error-correcting codes (ECC), holographic computing, and quantum-inspired search, HDRAM recovers distributed information through principled despreading. These phase-coherent memory addresses enable efficient key-value operations and Grover-style search in latent space. By combining ECC grammar with compressed sensing and Krylov subspace alignment, HDRAM significantly improves associative retrieval without architectural changes, demonstrating how Classical-Holographic-Quantum-inspired (CHQ) principles can fortify transformer architectures.
zh

计算机视觉

[CV-0] VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers ICCV2025

【速读】:该论文旨在解决传统动作分词器在处理复杂时空动态动作轨迹时的局限性,以及在真实世界应用中对大量标注数据的依赖问题。其解决方案的关键在于构建一个基于向量量化(Vector Quantization)的动作分词器,并利用迄今为止最大规模的动作轨迹数据集进行训练,该数据集的数据量是以往方法的100倍以上,从而显著提升了模型的推理速度和动作输出的平滑性与连贯性。此外,研究还发现合成数据与真实数据之间的领域差异较小,使得模型能够有效利用大量合成数据进行训练而不影响实际性能。

链接: https://arxiv.org/abs/2507.01016
作者: Yating Wang,Haoyi Zhu,Mingyu Liu,Jiange Yang,Hao-Shu Fang,Tong He
机构: Shanghai AI Lab(上海人工智能实验室); Tongji(同济大学); USTC(中国科学技术大学); ZJU(浙江大学); NJU(南京大学); SJTU(上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly; most notably, achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application scenarios. Project website: this https URL
zh
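
下面用 NumPy 给出向量量化动作分词器的极简示意:把连续动作轨迹按块展平后映射到最近码字,得到离散 action token,再由码本重建。码本随机初始化,块长、动作维度与码本大小均为笔者假设,仅说明“动作 -> 离散 token -> 重建”的流程。

```python
# 极简示意:VQ 动作分词器的编码/解码流程(码本与超参均为假设)
import numpy as np

class VQActionTokenizer:
    def __init__(self, codebook_size=256, action_dim=7, chunk=8, seed=0):
        rng = np.random.default_rng(seed)
        # 每个码字对应一段长度为 chunk 的动作轨迹(展平后量化)
        self.codebook = rng.normal(size=(codebook_size, chunk * action_dim))
        self.chunk, self.action_dim = chunk, action_dim

    def encode(self, actions: np.ndarray) -> np.ndarray:
        """actions: (T, action_dim) -> 离散 token 序列 (T // chunk,)"""
        n = len(actions) // self.chunk
        flat = actions[: n * self.chunk].reshape(n, -1)
        d = ((flat[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return self.codebook[tokens].reshape(-1, self.action_dim)

if __name__ == "__main__":
    tok = VQActionTokenizer()
    traj = np.random.default_rng(1).normal(size=(64, 7))  # 64 步、7 自由度的动作轨迹
    ids = tok.encode(traj)
    recon = tok.decode(ids)
    print(ids.shape, recon.shape)  # (8,) (64, 7)
```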

[CV-1] DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution SIGGRAPH2025

【速读】:该论文试图解决真实世界视频超分辨率(VSR)中由于复杂且不可预测的退化导致的帧间时间不一致性问题。其解决方案的关键在于提出DAM-VSR框架,该框架通过将VSR分解为外观增强和运动控制两个独立问题,分别利用参考图像超分辨率和视频ControlNet实现,从而充分结合视频扩散模型的生成先验与图像超分辨率模型的细节生成能力,并引入运动对齐的双向采样策略以提升长视频处理能力。

链接: https://arxiv.org/abs/2507.01012
作者: Zhe Kong,Le Li,Yong Zhang,Feng Gao,Shaoshu Yang,Tao Wang,Kaihao Zhang,Zhuoliang Kang,Xiaoming Wei,Guanying Chen,Wenhan Luo
机构: Shenzhen Campus of Sun Yat-sen University (深圳校区中山大学); Meituan (美团); Division of AMC and Department of ECE, HKUST (香港科技大学机电工程系与AMC分校); Tianjin University (天津大学); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院人工智能学院); Nanjing University (南京大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Shenzhen Campus of Sun Yat-sen University (深圳校区中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM SIGGRAPH 2025, Homepage: this https URL Github: this https URL

点击查看摘要

Abstract:Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
zh

[CV-2] ShapeEmbed: a self-supervised learning framework for 2D contour quantification

【速读】:该论文试图解决形状量化中的核心挑战,即确保提取的测量值在保持物体内在几何不变性的情况下对平移、缩放、旋转、反射和点索引变化保持不变。解决方案的关键在于提出ShapeEmbed,一种自监督表示学习框架,能够将2D图像中物体的轮廓(以欧几里得距离矩阵形式表示)编码为具有上述不变性的形状描述符。

链接: https://arxiv.org/abs/2507.01009
作者: Anna Foix Romero,Craig Russell,Alexander Krull,Virginie Uhlmann
机构: European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge, UK
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object’s intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
zh
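
下面用几行 NumPy 验证该论文输入表示的关键性质:轮廓点的欧氏距离矩阵(EDM)对平移、旋转等刚体变换天然不变;尺度与点索引不变性则需由后续编码器学习,此处不涉及。示例轮廓、旋转角与平移量均为任意取值。

```python
# 极简示意:2D 轮廓的欧氏距离矩阵及其刚体变换不变性
import numpy as np

def contour_edm(points: np.ndarray) -> np.ndarray:
    """points: (N, 2) 轮廓采样点 -> (N, N) 两两欧氏距离矩阵。"""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

if __name__ == "__main__":
    theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    contour = np.stack([2 * np.cos(theta), np.sin(theta)], axis=1)  # 椭圆轮廓

    # 任意旋转 + 平移后的同一形状
    a = np.deg2rad(37.0)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    moved = contour @ R.T + np.array([5.0, -3.0])

    d = np.abs(contour_edm(contour) - contour_edm(moved)).max()
    print(f"EDM 差异: {d:.2e}")  # 数值上接近 0,验证不变性
```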

[CV-3] GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

【速读】:该论文旨在解决多模态推理能力不足的问题,特别是针对通用场景下的视觉与语言联合理解与推理任务。其解决方案的关键在于构建一个以推理为核心的训练框架,首先通过大规模预训练构建一个具备强大视觉基础能力的模型,随后采用课程采样强化学习(Reinforcement Learning with Curriculum Sampling, RLCS)进一步挖掘模型潜力,从而在包括STEM问题求解、视频理解、内容识别、编程、定位、基于GUI的代理以及长文档理解等多种任务中实现全面的能力提升。

链接: https://arxiv.org/abs/2507.01006
作者: Wenyi Hong,Wenmeng Yu,Xiaotao Gu,Guo Wang,Guobing Gan,Haomiao Tang,Jiale Cheng,Ji Qi,Junhui Ji,Lihang Pan,Shuaiqi Duan,Weihan Wang,Yan Wang,Yean Cheng,Zehai He,Zhe Su,Zhen Yang,Ziyang Pan,Aohan Zeng,Baoxu Wang,Boyan Shi,Changyu Pang,Chenhui Zhang,Da Yin,Fan Yang,Guoqing Chen,Jiazheng Xu,Jiali Chen,Jing Chen,Jinhao Chen,Jinghao Lin,Jinjiang Wang,Junjie Chen,Leqi Lei,Leyi Pan,Mingzhi Zhang,Qinkai Zheng,Sheng Yang,Shi Zhong,Shiyu Huang,Shuyuan Zhao,Siyan Xue,Shangqin Tu,Shengbiao Meng,Tianshu Zhang,Tianwei Luo,Tianxiang Hao,Tianle Gong,Wenkai Li,Wei Jia,Xin Lyu,Xuancheng Huang,Yanling Wang,Yadong Xue,Yanfeng Wang,Yifan An,Yifan Du,Yiming Shi,Yiheng Huang,Yilin Niu,Yuan Wang,Yuanchang Yue,Yuchen Li,Yutao Zhang,Yuxuan Zhang,Zhanxiao Du,Zhenyu Hou,Zhao Xue,Zhengxiao Du,Zihan Wang,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Minlie Huang,Yuxiao Dong,Jie Tang
机构: Zhipu AI (智普AI); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at this https URL.
zh

[CV-4] UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis ICCV2025

【速读】:该论文旨在解决文本到图像生成中视觉文本准确渲染的问题,具体包括模糊的字形(glyph)、语义漂移以及风格控制有限等挑战。现有方法通常依赖预渲染的字形图像作为条件输入,但难以保留原始字体风格和颜色线索,导致需要复杂的多分支设计,增加了模型开销并降低了灵活性。论文提出的解决方案关键在于引入一种基于分割引导的框架,利用像素级视觉文本掩码作为统一的条件输入,该掩码包含丰富的字形形状、颜色和空间细节。该方法包含两个核心组件:(1)用于精确提取文本掩码的微调双语分割模型;(2)增强自适应字形条件和区域特定损失的简化扩散模型,以保持文本内容和风格的真实性。

链接: https://arxiv.org/abs/2507.00992
作者: Yuanrui Wang,Cong Han,Yafei Li,Zhipeng Jin,Xiawei Li,SiNan Du,Wen Tao,Yi Yang,Shuanglong Li,Chun Yuan,Liu Lin
机构: Tsinghua University (清华大学); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks – rich in glyph shape, color, and spatial detail – as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.
zh

[CV-5] Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

【速读】:该论文试图解决机器人在无需物理演示或特定训练的情况下执行复杂操作任务(如倒水、擦拭和混合)的问题。其解决方案的关键在于利用生成式 AI (Generative AI) 生成潜在的操作视频,并通过视觉-语言模型(VLM)筛选符合指令的视频,随后使用6D位姿跟踪器提取物体轨迹,并以与具身无关的方式将轨迹映射到机器人上,从而实现有效的动作模仿。

链接: https://arxiv.org/abs/2507.00990
作者: Shivansh Patel,Shraddhaa Mohan,Hanlin Mai,Unnat Jain,Svetlana Lazebnik,Yunzhu Li
机构: UIUC (University of Illinois Urbana-Champaign); UC Irvine (University of California, Irvine); Skild AI (Skild AI); Columbia University (Columbia University)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks–such as pouring, wiping, and mixing–purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
zh

[CV-6] Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation

【速读】:该论文旨在解决现代仓储自动化系统中感知模型训练数据不足的问题,特别是在缺乏人工标注数据的情况下提升对箱体姿态和形状的估计性能。其解决方案的关键在于提出了一种自监督域适应流程,该流程利用真实世界未标注数据来优化感知模型,从而减少对人工标注的依赖,并在模拟与实际工业场景中均表现出显著的性能提升。

链接: https://arxiv.org/abs/2507.00984
作者: Xihang Yu,Rajat Talak,Jingnan Shi,Ulrich Viereck,Igor Gilitschenski,Luca Carlone
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures. This work will be presented at the 19th International Symposium on Experimental Robotics (ISER2025)

点击查看摘要

Abstract:Modern warehouse automation systems rely on fleets of intelligent robots that generate vast amounts of data – most of which remains unannotated. This paper develops a self-supervised domain adaptation pipeline that leverages real-world, unlabeled data to improve perception models without requiring manual annotations. Our work focuses specifically on estimating the pose and shape of boxes and presents a correct-and-certify pipeline for self-supervised box pose and shape estimation. We extensively evaluate our approach across a range of simulated and real industrial settings, including adaptation to a large-scale real-world dataset of 50,000 images. The self-supervised model significantly outperforms models trained solely in simulation and shows substantial improvements over a zero-shot 3D bounding box estimation baseline.
zh

[CV-7] Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations

【速读】:该论文试图解决单目深度估计模型在标准基准测试中仅评估准确性而忽略鲁棒性的问题,从而无法全面反映模型的实际性能。其解决方案的关键在于引入PDE(Procedural Depth Evaluation)基准,通过程序化生成3D场景来系统地评估模型对各种受控扰动(如物体、相机、材质和光照变化)的鲁棒性。

链接: https://arxiv.org/abs/2507.00981
作者: Jack Nugent,Siyang Wu,Zeyu Ma,Beining Han,Meenal Parakh,Abhishek Joshi,Lingjie Mei,Alexander Raistrick,Xinyuan Li,Jia Deng
机构: Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic robustness evaluation. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at this https URL.
zh

[CV-8] RTMap: Real-Time Recursive Mapping with Change Detection and Localization

【速读】:该论文旨在解决在线高精地图(HD map)构建中面临的感知不准确性、密集交通中的遮挡问题以及多智能体观测融合能力不足的问题。其解决方案的关键在于提出RTMap,通过持续众包多遍历的高精地图作为自我演化的记忆,以增强单次遍历方法。RTMap在车载智能体上以端到端的方式同时解决三个核心挑战:(1)针对高精地图元素的不确定性感知位置建模,(2)基于众包先验地图的概率感知定位,(3)实时检测可能的道路结构变化。

链接: https://arxiv.org/abs/2507.00980
作者: Yuheng Du,Sheng Yang,Lingxuan Wang,Zhenghua Hou,Chengying Cai,Zhitao Tan,Mingxia Chen,Shi-Sheng Huang,Qiang Li
机构: CaiNiao Inc., Alibaba Group (菜鸟网络,阿里巴巴集团); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance in both prior-aided map quality and localization accuracy, showing that our method robustly serves downstream prediction and planning modules while asynchronously and gradually improving the accuracy and freshness of the crowdsourced prior map. Our source code will be made publicly available at this https URL (a camera-ready version incorporating reviewer suggestions will be updated soon).
zh

[CV-9] Surgical Neural Radiance Fields from One Image

【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在手术室内应用时面临的多视角数据不足问题,传统方法依赖大量多视角数据,而在手术过程中由于时间限制,难以获取此类数据。解决方案的关键在于利用术前的磁共振成像(MRI)数据定义相机视角和图像集,并通过神经风格迁移技术,结合WTC2和STROTSS方法将术中图像的外观转移到预构建的训练集中,从而生成适用于单张术中图像快速训练NeRF的数据集。

链接: https://arxiv.org/abs/2507.00969
作者: Alberto Neri,Maximilan Fehrentz,Veronica Penza,Leonardo S. Mattos,Nazim Haouchine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purpose: Neural Radiance Fields (NeRF) offer exceptional capabilities for 3D reconstruction and view synthesis, yet their reliance on extensive multi-view data limits their application in surgical intraoperative settings where only limited data is available. In particular, collecting such extensive data intraoperatively is impractical due to time constraints. This work addresses this challenge by leveraging a single intraoperative image and preoperative data to train NeRF efficiently for surgical scenarios. Methods: We leverage preoperative MRI data to define the set of camera viewpoints and images needed for robust and unobstructed training. Intraoperatively, the appearance of the surgical image is transferred to the pre-constructed training set through neural style transfer, specifically combining WTC2 and STROTSS to prevent over-stylization. This process enables the creation of a dataset for instant and fast single-image NeRF training. Results: The method is evaluated with four clinical neurosurgical cases. Quantitative comparisons to NeRF models trained on real surgical microscope images demonstrate strong synthesis agreement, with similarity metrics indicating high reconstruction fidelity and stylistic alignment. When compared with ground truth, our method demonstrates high structural similarity, confirming good reconstruction quality and texture preservation. Conclusion: Our approach demonstrates the feasibility of single-image NeRF training in surgical settings, overcoming the limitations of traditional multi-view methods.
zh

[CV-10] MVP: Winning Solution to SMP Challenge 2025 Video Track

【速读】:该论文试图解决社交媒体视频流行度预测的问题,这一问题在内容推荐、趋势检测和用户参与度提升等方面具有重要应用价值。解决方案的关键在于构建多模态的视频表示,通过整合预训练模型提取的深度视频特征与用户元数据及上下文信息,从而捕捉视频内容的丰富语义。此外,采用系统化的预处理技术(如对数变换和异常值去除)以增强模型的鲁棒性,并利用梯度提升回归模型来学习跨模态的复杂模式,最终在SMP Challenge 2025的视频赛道中取得了最佳性能。

链接: https://arxiv.org/abs/2507.00950
作者: Liliang Ye(1),Yunyao Zhang(1),Yafeng Wu(1),Yi-Ping Phoebe Chen(2),Junqing Yu(1),Wei Yang(1),Zikai Song(1) ((1) Huazhong University of Science and Technology, Wuhan, China, (2) La Trobe University, Melbourne, Australia)
机构: Huazhong University of Science and Technology (华中科技大学); La Trobe University (拉特罗布大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at this https URL.
zh
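
下面用 scikit-learn 给出该预测流程的极简示意:对重尾的流行度目标做 log1p 变换、按 99% 分位剔除离群样本后,训练梯度提升回归模型。特征用随机矩阵代替真实的“视频深度特征 + 用户元数据”拼接,目标为人造数据,所有超参均为假设。

```python
# 极简示意:log 变换 + 离群值剔除 + 梯度提升回归(数据与超参均为占位)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))                      # 占位:多模态特征拼接
pop = np.clip(np.expm1(1.5 * X[:, 0] + rng.gumbel(size=2000)), 0, None)  # 人造重尾流行度

keep = pop < np.quantile(pop, 0.99)                  # 剔除极端离群样本
X, y = X[keep], np.log1p(pop[keep])                  # log1p 稳定目标分布

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
print("R^2(log 空间):", round(model.score(X_te, y_te), 3))
print("样例预测(原始量纲):", np.round(np.expm1(model.predict(X_te[:3])), 1))
```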

[CV-11] RaG NNarok: A Light-Weight Graph Neural Network for Enhancing Radar Point Clouds on Unmanned Ground Vehicles IROS2025

【速读】:该论文旨在解决低成本室内移动机器人在复杂和动态环境中定位与导航的挑战,尤其是针对传统激光雷达和摄像头方案在视觉受阻环境中的性能不足、计算开销大以及成本高昂的问题。其解决方案的关键在于提出一种实时、轻量且可泛化的图神经网络(Graph Neural Network, GNN)框架RaGNNarok,用于增强毫米波雷达(mmWave radar)点云数据,从而提升雷达在稀疏点云生成、噪声和误检方面的表现。

链接: https://arxiv.org/abs/2507.00937
作者: David Hunt,Shaocheng Luo,Spencer Hallyburton,Shafii Nillongo,Yi Li,Tingjun Chen,Miroslav Pajic
机构: Duke University (杜克大学)
类目: Robotics (cs.RO); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, accepted by IROS 2025

点击查看摘要

Abstract:Low-cost indoor mobile robots have gained popularity with the increasing adoption of automation in homes and commercial spaces. However, existing lidar and camera-based solutions have limitations such as poor performance in visually obscured environments, high computational overhead for data processing, and high costs for lidars. In contrast, mmWave radar sensors offer a cost-effective and lightweight alternative, providing accurate ranging regardless of visibility. However, existing radar-based localization suffers from sparse point cloud generation, noise, and false detections. Thus, in this work, we introduce RaGNNarok, a real-time, lightweight, and generalizable graph neural network (GNN)-based framework to enhance radar point clouds, even in complex and dynamic environments. With an inference time of just 7.3 ms on the low-cost Raspberry Pi 5, RaGNNarok runs efficiently even on such resource-constrained devices, requiring no additional computational resources. We evaluate its performance across key tasks, including localization, SLAM, and autonomous navigation, in three different environments. Our results demonstrate strong reliability and generalizability, making RaGNNarok a robust solution for low-cost indoor mobile robots.
zh

[CV-12] Masks make discriminative models great again!

【速读】:该论文试图解决从单张图像中重建逼真三维场景的问题,特别是针对图像到三维提升(image-to-3D lifting)这一环节。解决方案的关键在于将提升问题(将图像转换为可见内容的三维模型)与补全问题(生成输入中未出现的内容)解耦,从而创建更适合判别模型的确定性任务。通过使用从优化的三维高斯泼溅中获得的可见性掩码,在训练过程中排除源视图不可见区域,显著提升了可见区域的重建质量。尽管仅在掩码区域上进行训练,Image2GS在完整场景评估中仍能与最先进的判别模型相媲美。

链接: https://arxiv.org/abs/2507.00916
作者: Tianshi Cao,Marie-Julie Rakotosaona,Ben Poole,Federico Tombari,Michael Niemeyer
机构: University of Toronto (多伦多大学); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Image2GS, a novel approach that addresses the challenging problem of reconstructing photorealistic 3D scenes from a single image by focusing specifically on the image-to-3D lifting component of the reconstruction process. By decoupling the lifting problem (converting an image to a 3D model representing what is visible) from the completion problem (hallucinating content not present in the input), we create a more deterministic task suitable for discriminative models. Our method employs visibility masks derived from optimized 3D Gaussian splats to exclude areas not visible from the source view during training. This masked training strategy significantly improves reconstruction quality in visible regions compared to strong baselines. Notably, despite being trained only on masked regions, Image2GS remains competitive with state-of-the-art discriminative models trained on full target images when evaluated on complete scenes. Our findings highlight the fundamental struggle discriminative models face when fitting unseen regions and demonstrate the advantages of addressing image-to-3D lifting as a distinct problem with specialized techniques.
zh
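
下面是可见性掩码训练的极简 PyTorch 示意:只在源视角可见的像素上累积 L1 重建损失,不可见区域不参与监督。真实掩码应来自优化后的 3D 高斯泼溅,这里用随机掩码占位。

```python
# 极简示意:Image2GS 式的可见性掩码重建损失(掩码为随机占位)
import torch

def masked_l1(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    """pred/target: (B, 3, H, W);mask: (B, 1, H, W),1 表示该像素在源视角可见。"""
    diff = (pred - target).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    B, H, W = 2, 64, 64
    pred = torch.rand(B, 3, H, W, requires_grad=True)
    target = torch.rand(B, 3, H, W)
    mask = (torch.rand(B, 1, H, W) > 0.3).float()  # 约 70% 像素可见
    loss = masked_l1(pred, target, mask)
    loss.backward()
    print(float(loss))
```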

[CV-13] GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

【速读】:该论文旨在解决当前3D Vision-Language Models (VLMs)对目标检测器的高度依赖所导致的处理瓶颈和分类灵活性受限的问题。其解决方案的关键在于提出一种以场景为中心的3D VLM,通过将语言特征直接嵌入到每个高斯原始体(Gaussian primitive)中,实现早期模态对齐,并利用双稀疏化器通过任务引导和位置引导路径将密集表示提炼为紧凑的任务感知全局和局部场景标记,从而提升模型的泛化能力和效率。

链接: https://arxiv.org/abs/2507.00886
作者: Anna-Maria Halacheva,Jan-Nico Zaech,Xi Wang,Danda Pani Paudel,Luc Van Gool
机构: INSAIT, Sofia University “St. Kliment Ohridski”(INSAIT,索非亚大学“圣克莱门特·奥赫里德斯基”); ETH Zurich(苏黎世联邦理工学院); TU Munich(慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, demonstrating strong generalization: it improves the performance of a prior 3D VLM fivefold in out-of-domain settings.
zh

[CV-14] Is Visual in-Context Learning for Compositional Medical Tasks within Reach? ICCV2025

【速读】:该论文试图解决如何使单一模型在不进行重新训练的情况下,处理多个任务并适应新任务的问题,特别是针对涉及多个中间步骤的复杂任务。其解决方案的关键在于训练视觉上下文学习者以适应任务序列,而非单个任务,并通过合成组合任务生成引擎来构建任务序列,从而提升模型在组合性任务中的泛化能力。此外,研究还探讨了基于掩码的训练目标,以优化模型在解决复杂组合任务时的表现。

链接: https://arxiv.org/abs/2507.00868
作者: Simon Reiß,Zdravko Marinov,Alexander Jaus,Constantin Seibold,M. Saquib Sarfraz,Erik Rodner,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Mercedes-Benz Tech Innovation (梅赛德斯-奔驰技术创新); University of Applied Sciences Berlin (柏林应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.
zh

[CV-15] SafeMap: Robust HD Map Construction from Incomplete Observations ICML2025

【速读】:该论文旨在解决自动驾驶中高精度(HD)地图构建面临的挑战,尤其是在多视角摄像头数据不完整的情况下,现有方法难以保证地图的准确性。其解决方案的关键在于提出SafeMap框架,该框架融合了基于高斯的透视视图重建(G-PVR)模块和基于知识蒸馏的鸟瞰图(BEV)校正(D-BEVC)模块。G-PVR通过利用视图重要性的先验知识动态优先处理最有信息量的区域,而D-BEVC则通过全景BEV特征对不完整观测得到的BEV表示进行校正,从而实现端到端的地图重建与鲁棒的HD地图生成。

链接: https://arxiv.org/abs/2507.00861
作者: Xiaoshuai Hao,Lingdong Kong,Rong Yin,Pengwei Wang,Jing Zhang,Yunfeng Diao,Shu Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to secure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird’s-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate the end-to-end map reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and reliability.
zh

[CV-16] Robust Component Detection for Flexible Manufacturing: A Deep Learning Approach to Tray-Free Object Recognition under Variable Lighting

【速读】:该论文旨在解决工业4.0中柔性制造系统对机器人在非结构化环境中处理物体的需求,特别是如何在无严格定位约束的情况下检测和抓取笔组件,并在不同光照条件下保持鲁棒性。解决方案的关键在于采用基于Mask R-CNN的计算机视觉系统,该系统能够在无需结构化托盘的情况下实现高精度的目标检测,并通过优化算法提升对极端光照变化的适应能力,从而显著提高制造灵活性并减少设置时间。

链接: https://arxiv.org/abs/2507.00852
作者: Fatemeh Sadat Daneshmand
机构: TU Clausthal (图宾根大学); Zurich University of Applied Sciences (苏黎世应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flexible manufacturing systems in Industry 4.0 require robots capable of handling objects in unstructured environments without rigid positioning constraints. This paper presents a computer vision system that enables industrial robots to detect and grasp pen components in arbitrary orientations without requiring structured trays, while maintaining robust performance under varying lighting conditions. We implement and evaluate a Mask R-CNN-based approach on a complete pen manufacturing line at ZHAW, addressing three critical challenges: object detection without positional constraints, robustness to extreme lighting variations, and reliable performance with cost-effective cameras. Our system achieves 95% detection accuracy across diverse lighting conditions while eliminating the need for structured component placement, demonstrating a 30% reduction in setup time and significant improvement in manufacturing flexibility. The approach is validated through extensive testing under four distinct lighting scenarios, showing practical applicability for real-world industrial deployment.
zh
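
下面给出用 torchvision 预训练 Mask R-CNN 做实例分割推理的极简示意,对应上文“无托盘、任意姿态检测组件”的做法。实际系统需在笔组件图像上微调;这里输入用随机张量占位,置信度阈值也是假设值。

```python
# 极简示意:torchvision Mask R-CNN 实例分割推理(输入为占位张量)
import torch
import torchvision
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)          # 占位:一帧产线相机图像,值域 [0, 1]
with torch.no_grad():
    out = model([image])[0]              # 返回 boxes / labels / scores / masks

keep = out["scores"] > 0.5               # 置信度过滤(阈值为假设)
print(out["boxes"][keep].shape, out["masks"][keep].shape)
```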

[CV-17] UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

【速读】:该论文旨在解决无人机(UAV)目标检测中面临的遮挡、小目标尺寸和不规则形状等挑战,提出一种鲁棒且高效的多模态UAV目标检测方法。其解决方案的关键在于引入基于Mamba架构的UAVD-Mamba框架,通过设计可变形令牌Mamba块(DTMB)增强几何适应性,并为RGB和红外(IR)模态分别设计独立的DTMB以优化多模态特征互补性,同时通过多尺度DTMB堆叠提升小目标检测能力,结合改进的检测颈部模块和跨增强空间注意力机制,显著提升了检测性能。

链接: https://arxiv.org/abs/2507.00849
作者: Wei Li,Jiaman Tang,Yang Li,Beihao Xia,Ligang Tan,Hongmao Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was accepted by the 36th IEEE Intelligent Vehicles Symposium (IEEE IV 2025)

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at this https URL.
zh

[CV-18] Do Echo Top Heights Improve Deep Learning Nowcasts?

【速读】:该论文试图解决降水临近预报(precipitation nowcasting)中因仅依赖二维雷达反射率场而忽略三维雷达数据中垂直信息的问题,从而限制了模型对降雨强度预测的准确性。解决方案的关键在于引入回波顶高(Echo Top Height, ETH)作为辅助输入变量,通过将ETH与雷达反射率共同作为独立输入通道,构建一个单次通过的三维U-Net网络,以探索其对提升临近预报性能的潜力。

链接: https://arxiv.org/abs/2507.00845
作者: Peter Pavlík,Marc Schleiss,Anna Bou Ezzeddine,Viera Rozinajová
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Pre-review version of an article accepted at Transactions on Large-Scale Data and Knowledge-Centered Systems

点击查看摘要

Abstract:Precipitation nowcasting – the short-term prediction of rainfall using recent radar observations – is critical for weather-sensitive sectors such as transportation, agriculture, and disaster mitigation. While recent deep learning models have shown promise in improving nowcasting skill, most approaches rely solely on 2D radar reflectivity fields, discarding valuable vertical information available in the full 3D radar volume. In this work, we explore the use of Echo Top Height (ETH), a 2D projection indicating the maximum altitude of radar reflectivity above a given threshold, as an auxiliary input variable for deep learning-based nowcasting. We examine the relationship between ETH and radar reflectivity, confirming its relevance for predicting rainfall intensity. We implement a single-pass 3D U-Net that processes both the radar reflectivity and ETH as separate input channels. While our models are able to leverage ETH to improve skill at low rain-rate thresholds, results are inconsistent at higher intensities and the models with ETH systematically underestimate precipitation intensity. Three case studies are used to illustrate how ETH can help in some cases, but also confuse the models and increase the error variance. Nonetheless, the study serves as a foundation for critically assessing the potential contribution of additional variables to nowcasting performance.
zh

[CV-19] High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

【速读】:该论文旨在解决无人机图像中目标检测(UAV-OD)面临的挑战,包括小目标尺寸、高密度分布和杂乱背景等问题。现有算法依赖于手工设计的组件如锚框(anchor boxes)和非极大值抑制(NMS),这些方法需要精细调优且泛化能力有限,并且在密集场景下容易出现误检。为了解决这些问题,本文提出了一种针对无人机的实时检测Transformer框架HEGS-DETR,其关键在于引入了高频增强语义网络(HFESNet)以保留关键高频空间细节,提升小目标和遮挡目标的判别能力;采用高效小目标金字塔(ESOP)策略,在计算开销最小的情况下融合高分辨率特征图,显著提升小目标检测性能;并通过选择性查询回收(SQR)和几何感知位置编码(GAPE)模块增强解码器稳定性和定位精度,优化边界框并提供显式空间先验。

链接: https://arxiv.org/abs/2507.00825
作者: Hongxing Peng,Lide Chen,Hui Zhu,Yan Chen
机构: South China Agricultural University (华南农业大学); Guangdong Mechanical &Electrical Polytechnic (广东机电职业技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, to appear in KBS

点击查看摘要

Abstract:Unmanned Aerial Vehicle-based Object Detection (UAV-OD) faces substantial challenges, including small target sizes, high-density distributions, and cluttered backgrounds in UAV imagery. Current algorithms often depend on hand-crafted components like anchor boxes, which demand fine-tuning and exhibit limited generalization, and Non-Maximum Suppression (NMS), which is threshold-sensitive and prone to misclassifying dense objects. These generic architectures thus struggle to adapt to aerial imaging characteristics, resulting in performance limitations. Moreover, emerging end-to-end frameworks have yet to effectively mitigate these aerial-specific challenges. To address these issues, we propose HEGS-DETR, a comprehensively enhanced, real-time Detection Transformer framework tailored for UAVs. First, we introduce the High-Frequency Enhanced Semantics Network (HFESNet) as a novel backbone. HFESNet preserves critical high-frequency spatial details to extract robust semantic features, thereby improving discriminative capability for small and occluded targets in complex backgrounds. Second, our Efficient Small Object Pyramid (ESOP) strategy strategically fuses high-resolution feature maps with minimal computational overhead, significantly boosting small object detection. Finally, the proposed Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules enhance the detector's decoder stability and localization accuracy, effectively optimizing bounding boxes and providing explicit spatial priors for dense scenes. Experiments on the VisDrone dataset demonstrate that HEGS-DETR achieves a 5.1% AP50 and 3.8% AP increase over the baseline, while maintaining real-time speed and reducing parameter count by 4M.
zh

[CV-20] Instant Particle Size Distribution Measurement Using CNNs Trained on Synthetic Data CVPR2025

【速读】:该论文旨在解决工业领域中颗粒粒径分布(Particle Size Distribution, PSD)测量的准确性与效率问题,传统方法如筛分分析和激光衍射存在手动操作、耗时且受颗粒重叠限制的缺点。其解决方案的关键在于利用基于卷积神经网络(Convolutional Neural Network, CNN)的方法,通过使用Blender生成的高保真合成颗粒图像进行训练,实现从颗粒图像中自动、实时地估计PSD参数。该方法通过系统调整颗粒形状、纹理、光照和空间排列等参数,生成多样化的工业场景数据集,从而提升模型的泛化能力和适用性。
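
论文选用 EfficientNet-B0 等骨干回归 PSD 参数,下面给出一个最小可运行示意:把 torchvision 的 EfficientNet-B0 分类头换成 3 维回归头,直接预测 (d10, d50, d90);输入尺寸与损失函数为假设:

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

# 将 EfficientNet-B0 的分类头替换为 3 维回归头,预测 (d10, d50, d90)
model = efficientnet_b0(weights=None)
in_feats = model.classifier[1].in_features  # 1280
model.classifier[1] = nn.Linear(in_feats, 3)

images = torch.randn(4, 3, 224, 224)   # 一批(合成)颗粒图像
targets = torch.rand(4, 3)             # 归一化后的 PSD 参数
loss = nn.MSELoss()(model(images), targets)
loss.backward()
print(loss.item())
```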

链接: https://arxiv.org/abs/2507.00822
作者: Yasser El Jarida,Youssef Iraqi,Loubna Mekouar
机构: College of Computing, University Mohammed VI Polytechnic (计算学院,穆罕默德六世理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the Synthetic Data for Computer Vision Workshop @ CVPR 2025. 10 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:Accurate particle size distribution (PSD) measurement is important in industries such as mining, pharmaceuticals, and fertilizer manufacturing, significantly influencing product quality and operational efficiency. Traditional PSD methods like sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. Recent developments in convolutional neural networks (CNNs) enable automated, real-time PSD estimation directly from particle images. In this work, we present a CNN-based methodology trained on realistic synthetic particle imagery generated using Blender’s advanced rendering capabilities. Synthetic data sets using this method can replicate various industrial scenarios by systematically varying particle shapes, textures, lighting, and spatial arrangements that closely resemble the actual configurations. We evaluated three CNN-based architectures, ResNet-50, InceptionV3, and EfficientNet-B0, for predicting critical PSD parameters (d10, d50, d90). Results demonstrated comparable accuracy across models, with EfficientNet-B0 achieving the best computational efficiency suitable for real-time industrial deployment. This approach shows the effectiveness of realistic synthetic data for robust CNN training, which offers significant potential for automated industrial PSD monitoring. The code is released at: this https URL
zh

[CV-21] CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLM s

【速读】:该论文旨在解决视频多模态大语言模型(V-MLLMs)在对抗攻击方面的脆弱性问题,这一问题由于复杂的跨模态推理机制、时间依赖性和计算约束而未被充分研究。其解决方案的关键在于提出CAVALRY-V框架,该框架通过两个核心创新实现对视觉感知与语言生成关键接口的直接攻击:一是引入双目标语义-视觉损失函数,同时干扰模型的文本生成逻辑和视觉表征以破坏跨模态整合;二是设计一种计算高效的两阶段生成框架,结合大规模预训练以提升跨模型迁移能力,并通过专用微调增强时空一致性。
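
双目标语义-视觉损失的具体形式论文摘要未给出,下面是按其描述拼出的一个假设性示意:同时最大化对抗样本与干净样本在文本 logits(KL 散度)和视觉表征(L2 距离)上的偏离,权重 lam 为假设:

```python
import torch
import torch.nn.functional as F

def dual_objective_attack_loss(adv_logits, clean_logits, adv_feat, clean_feat, lam=1.0):
    """示意:让对抗样本的文本生成 logits 与视觉表征尽量偏离干净样本;
    返回值取负后做梯度下降,即等价于最大化两种偏离(非官方实现)。"""
    semantic = F.kl_div(adv_logits.log_softmax(-1),
                        clean_logits.softmax(-1), reduction="batchmean")
    visual = (adv_feat - clean_feat).pow(2).mean()
    return -(semantic + lam * visual)   # 最小化该值 = 最大化两项偏离

adv_logits = torch.randn(2, 100, requires_grad=True)
adv_feat = torch.randn(2, 256, requires_grad=True)
loss = dual_objective_attack_loss(adv_logits, torch.randn(2, 100),
                                  adv_feat, torch.randn(2, 256))
loss.backward()
print(loss.item())
```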

链接: https://arxiv.org/abs/2507.00817
作者: Jiaming Zhang,Rui Hu,Qing Guo,Wei Yang Bryan Lim
机构: Nanyang Technological University (南洋理工大学); Beijing Jiaotong University (北京交通大学); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model’s text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V’s potential as a foundational approach for adversarial research across multimodal systems.
zh

[CV-22] RACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency MICCAI2025 MICCAI

【速读】:该论文旨在解决3D医学图像生成中存在的一系列问题,包括有限的解剖保真度、受限的轴向长度以及高昂的计算成本,这些问题限制了其在资源有限地区的临床应用。解决方案的关键在于提出TRACE框架,该框架采用2D多模态条件扩散方法生成具有时空对齐的3D医学图像,通过将序列2D切片建模为视频帧对,并结合分割先验和放射学报告实现解剖对齐,同时引入光流以保持时间一致性,从而在推理阶段通过重叠帧策略构建灵活长度的序列并重建为时空与解剖对齐的3D体积。
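
推理阶段的“重叠帧拼接”可以用一个很小的函数说明:相邻帧对共享一帧,重叠位置取平均,得到任意长度的体数据(重叠一帧与取平均均为假设,非官方实现):

```python
import numpy as np

def stitch_overlapping_pairs(pairs):
    """示意:将相邻帧对 (f_i, f_{i+1}) 以重叠一帧的方式拼接成序列,
    重叠位置取平均以保持切片间的连续性(非官方实现)。
    pairs: 长度为 N 的列表,每个元素形状 (2, H, W);返回 (N+1, H, W)。"""
    n = len(pairs)
    h, w = pairs[0].shape[1:]
    volume = np.zeros((n + 1, h, w), dtype=np.float64)
    counts = np.zeros(n + 1)
    for i, pair in enumerate(pairs):
        volume[i] += pair[0]
        volume[i + 1] += pair[1]
        counts[i] += 1
        counts[i + 1] += 1
    return volume / counts[:, None, None]

pairs = [np.random.rand(2, 64, 64) for _ in range(7)]
print(stitch_overlapping_pairs(pairs).shape)  # (8, 64, 64)
```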

链接: https://arxiv.org/abs/2507.00802
作者: Minye Shao,Xingyu Miao,Haoran Duan,Zeyu Wang,Jingkun Chen,Yawen Huang,Xian Wu,Jingjing Deng,Yang Long,Yefeng Zheng
机构: Durham University (杜伦大学); Tsinghua University (清华大学); Dalian Minzu University (大连民族大学); University of Oxford (牛津大学); Tencent YouTu Lab (腾讯优图实验室); University of Bristol (布里斯托大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025 (this version is not peer-reviewed; it is the preprint version). MICCAI proceedings DOI will appear here

点击查看摘要

Abstract:3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: this https URL.
zh

[CV-23] Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters

【速读】:该论文旨在解决在计算机图形学、交互式虚拟环境、机器人学和生物力学等领域中,实时生成准确且逼真的虚拟人类运动的问题。其解决方案的关键在于提出一种基于TensorFlow自动微分和即时编译特性的新型实时逆运动学(Inverse Kinematics, IK)求解器,该方法通过将正向和逆向运动学视为可微操作,有效处理了多约束问题中的误差累积和复杂关节限制,从而提升了真实人体运动建模的精度与效率。
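
下面以二连杆平面机械臂为例,示意“把正运动学当作可微操作、用 TensorFlow 自动微分和即时编译做梯度式 IK”的核心思路(与论文处理的 SMPLX 全身骨架相比大幅简化,仅为示意):

```python
import tensorflow as tf

link = tf.constant([1.0, 0.8])            # 连杆长度(假设的二连杆臂)
theta = tf.Variable([0.3, 0.3])           # 关节角:优化变量
target = tf.constant([1.2, 0.9])          # 目标末端位置

@tf.function                              # 即时编译,对应文中的 JIT 特性
def fk(a):                                # 正运动学:可微操作
    x = link[0] * tf.cos(a[0]) + link[1] * tf.cos(a[0] + a[1])
    y = link[0] * tf.sin(a[0]) + link[1] * tf.sin(a[0] + a[1])
    return tf.stack([x, y])

opt = tf.keras.optimizers.Adam(0.1)
for _ in range(200):                      # 梯度下降最小化末端位置误差
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum((fk(theta) - target) ** 2)
    opt.apply_gradients([(tape.gradient(loss, theta), theta)])

print(fk(theta).numpy())                  # 应接近 [1.2, 0.9]
```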

链接: https://arxiv.org/abs/2507.00792
作者: Hendric Voss,Stefan Kopp
机构: Bielefeld University (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver’s effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative-based IK algorithms, like Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at this https URL
zh

[CV-24] LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

【速读】:该论文试图解决统一图像修复(unified image restoration)在低级视觉中的挑战性问题,现有方法要么针对特定任务进行定制设计,限制了其在多种退化类型中的泛化能力,要么依赖于成对数据集的训练,从而受到闭集约束。解决方案的关键在于提出一种无需数据集的统一方法,通过利用预训练潜在扩散模型进行递归后验采样,结合多模态理解模型在无任务条件下提供语义先验,并采用轻量级模块对齐退化输入与扩散模型生成的偏好,同时使用递归优化进行后验采样。

链接: https://arxiv.org/abs/2507.00790
作者: Huaqiu Li,Yong Wang,Tongwen Huang,Hailang Huang,Haoqian Wang,Xiangxiang Chu
机构: Tsinghua University (清华大学); AMAP, Alibaba Group (阿里集团AMAP)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at this https URL.
zh

[CV-25] OptiPrune: Boosting Prompt-Image Consistency with Attention-Guided Noise and Dynamic Token Selection

【速读】:该论文旨在解决文本到图像扩散模型在生成图像与文本提示之间实现准确语义对齐的同时,保持在资源受限硬件上部署的效率问题。现有方法要么通过噪声优化引入显著的计算开销,要么通过激进的标记剪枝牺牲语义保真度。论文提出的解决方案——OptiPrune,其关键在于结合了基于分布感知的初始噪声优化与基于相似性的标记剪枝,从而同时解决语义对齐与计算效率的问题。具体而言,OptiPrune通过注意力分数引导的分布感知噪声优化模块,将初始潜在噪声引导至语义有意义的区域,并采用基于块相似性的硬件高效标记剪枝策略,以提升泛化能力并恢复剪枝后的标记。
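
标记剪枝与“最大相似度复制”恢复可用几行 PyTorch 示意;base token 的挑选准则、随机性注入等细节为假设,非官方实现:

```python
import torch
import torch.nn.functional as F

def prune_and_recover(tokens, keep_ratio=0.5):
    """示意:按与全局均值 token 的相似度选出代表性 base tokens,
    其余 token 在注意力前用最大相似度复制从 base tokens 恢复。
    tokens: (B, N, C)"""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    sim_to_mean = F.cosine_similarity(tokens, tokens.mean(dim=1, keepdim=True), dim=-1)
    keep_idx = sim_to_mean.topk(k, dim=1).indices                     # (B, k)
    base = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    # 恢复:每个原 token 复制与其余弦相似度最大的 base token
    sim = F.cosine_similarity(tokens.unsqueeze(2), base.unsqueeze(1), dim=-1)
    nearest = sim.argmax(dim=-1)                                      # (B, N)
    recovered = torch.gather(base, 1, nearest.unsqueeze(-1).expand(-1, -1, C))
    return base, recovered

base, recovered = prune_and_recover(torch.randn(2, 196, 64))
print(base.shape, recovered.shape)  # (2, 98, 64) (2, 196, 64)
```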

链接: https://arxiv.org/abs/2507.00789
作者: Ziji Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models often struggle to achieve accurate semantic alignment between generated images and text prompts while maintaining efficiency for deployment on resource-constrained hardware. Existing approaches either incur substantial computational overhead through noise optimization or compromise semantic fidelity by aggressively pruning tokens. In this work, we propose OptiPrune, a unified framework that combines distribution-aware initial noise optimization with similarity-based token pruning to address both challenges simultaneously. Specifically, (1) we introduce a distribution-aware noise optimization module guided by attention scores to steer the initial latent noise toward semantically meaningful regions, mitigating issues such as subject neglect and feature entanglement; (2) we design a hardware-efficient token pruning strategy that selects representative base tokens via patch-wise similarity, injects randomness to enhance generalization, and recovers pruned tokens using maximum similarity copying before attention operations. Our method preserves the Gaussian prior during noise optimization and enables efficient inference without sacrificing alignment quality. Experiments on benchmark datasets, including Animal-Animal, demonstrate that OptiPrune achieves state-of-the-art prompt-image consistency with significantly reduced computational cost.
zh

[CV-26] owards Open-World Human Action Segmentation Using Graph Convolutional Networks IROS25

【速读】:该论文旨在解决开放世界动作分割(open-world action segmentation)问题,即在面对未见过的新动作时,现有基于学习的方法难以有效泛化。为了解决这一问题,论文提出了一种结构化框架,其关键创新包括:1)一种增强型金字塔图卷积网络(EPGCN)结合新型解码器模块,实现鲁棒的时空特征上采样;2)基于Mixup的训练策略以合成分布外数据,减少对人工标注的依赖;3)一种新颖的时间聚类损失,用于将分布内动作分组并区分分布外样本。
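
基于 Mixup 的分布外样本合成本身很简单:在两段分布内动作特征之间按 Beta 分布插值即可,免去人工标注新动作(下例为最小示意,特征维度为假设):

```python
import torch

def mixup_ood(x1, x2, alpha=1.0):
    """示意:在两段分布内动作之间插值,合成"分布外"训练样本(非官方实现)。
    x1, x2: (T, C) 特征序列。"""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2

a, b = torch.randn(64, 128), torch.randn(64, 128)
print(mixup_ood(a, b).shape)  # torch.Size([64, 128])
```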

链接: https://arxiv.org/abs/2507.00756
作者: Hao Xing,Kai Zhe Boey,Gordon Cheng
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 3 figures, accepted in IROS25, Hangzhou, China

点击查看摘要

Abstract:Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. While most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling. 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations. 3) A novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.
zh

[CV-27] Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLM s

【速读】:该论文试图解决将大型语言模型(Large Language Model, LLM)模块与视觉变换器(Vision Transformer, ViT)融合时存在的模态不匹配问题,这一问题导致直接融合无法充分发挥LLM的潜力并出现微调不稳定的情况。解决方案的关键在于提出一种名为Language-Unlocked Vision Transformers (LUViT)的新方法,通过协同预训练策略来弥合这种模态差异,具体包括利用掩码自编码(Masked Auto-Encoding, MAE)预训练ViT以获得更丰富的视觉表征,并同时使用MAE目标在LLM模块中训练低秩适配(Low-Rank Adaptation, LoRA)层,从而实现ViT与LLM的联合优化。
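
LoRA 部分可以用一个最小模块说明:冻结原线性层,仅训练低秩增量 B·A(秩与缩放系数取常见默认值,非 LUViT 官方代码):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """极简 LoRA 示意:冻结原线性层权重,仅训练低秩增量 B·A。"""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # 冻结 LLM 原权重
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```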

链接: https://arxiv.org/abs/2507.00754
作者: Selim Kuzucu,Muhammad Ferjad Naeem,Anna Kukleva,Federico Tombari,Bernt Schiele
机构: Max Planck Institute for Informatics, SIC (马克斯·普朗克信息研究所,SIC); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 6 figures

点击查看摘要

Abstract:The integration of Large Language Model (LLMs) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM’s potential and suffers from unstable finetuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.
zh

[CV-28] Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation IROS25

【速读】:该论文旨在解决人机协作场景中人类动作的精确时间分割问题,特别是在存在人体姿态估计和目标检测固有噪声的情况下,容易导致过度分割错误,从而破坏动作序列的一致性。其解决方案的关键在于提出一种多模态图卷积网络(MMGCN),通过融合低帧率(如1 fps)视觉数据与高帧率(如30 fps)运动数据(骨架和目标检测)来缓解碎片化问题。该方法引入了三种关键贡献:基于正弦编码策略的连续正弦-余弦空间映射以增强空间表示鲁棒性;通过分层特征聚合对齐多模态输入的时序图融合模块;以及受人类动作平滑过渡启发的SmoothLabelMix数据增强技术,用于生成具有渐进动作过渡的合成训练样本,从而提升预测的时序一致性并减少过度分割伪影。
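
正弦-余弦坐标编码的最小示意如下:把每个 3D 关节坐标乘以一组频率后取 sin/cos 再拼接(频率设置为假设,非 MMGCN 官方实现):

```python
import torch

def sincos_encode(joints, num_freqs=4):
    """示意:将 3D 骨架坐标映射到连续的 sin-cos 空间,增强空间表示鲁棒性。
    joints: (..., 3) -> (..., 3 * 2 * num_freqs)"""
    freqs = 2.0 ** torch.arange(num_freqs)            # 1, 2, 4, 8
    angles = joints.unsqueeze(-1) * freqs             # (..., 3, F)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)

skeleton = torch.randn(30, 25, 3)     # 30 帧、25 个关节
print(sincos_encode(skeleton).shape)  # torch.Size([30, 25, 24])
```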

链接: https://arxiv.org/abs/2507.00752
作者: Hao Xing,Kai Zhe Boey,Yuankai Wu,Darius Burschka,Gordon Cheng
机构: Institute for Cognitive Systems (认知系统研究所); Chair of Media Technology (媒体技术主席); Machine Vision and Perception Group (机器视觉与感知组); School of Computation, Information and Technology (计算、信息与技术学院); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 7 pages, 4 figures, accepted in IROS25, Hangzhou, China

点击查看摘要

Abstract:Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
zh

[CV-29] Improving the Reasoning of Multi-Image Grounding in MLLM s via Reinforcement Learning

【速读】:该论文旨在解决多图像接地任务中Multimodal Large Language Models (MLLMs)在跨图像推理和泛化能力方面的局限性。其关键解决方案是采用基于强化学习(Reinforcement Learning, RL)的后训练策略,首先通过合成高质量的思维链(chain-of-thought, CoT)数据进行冷启动初始化,随后利用低秩适配(LoRA)进行监督微调(SFT),并结合拒绝采样和规则引导的强化学习进一步优化模型的推理路径。
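
拒绝采样整理 RL 数据的流程可抽象为:对每个样本多次采样,仅保留通过规则校验的轨迹。下例用一维 IoU 规则做最小演示(采样器与校验规则均为占位假设,非官方流程):

```python
import random

def iou_1d(a, b):
    # 一维区间 IoU,仅作规则校验的演示
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def rejection_sample(sample_fn, prompts, gts, n=8, thr=0.5):
    # 对每个样本最多采样 n 条轨迹,保留首条通过规则校验(IoU 阈值)的轨迹
    kept = []
    for p, gt in zip(prompts, gts):
        for _ in range(n):
            cand = sample_fn(p)
            if iou_1d(cand, gt) >= thr:
                kept.append((p, cand))
                break
    return kept

sampler = lambda p: tuple(sorted((random.random(), random.random())))  # 占位采样器
print(len(rejection_sample(sampler, ["q1", "q2"], [(0.2, 0.6), (0.1, 0.5)])))
```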

链接: https://arxiv.org/abs/2507.00748
作者: Bob Zhang,Haoran Li,Tao Zhang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Yanbin Hao
机构: Xiaohongshu Inc. (小红书公司); University of Science and Technology of China (中国科学技术大学); Wuhan University (武汉大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving +9.04% improvements on MIG-Bench and +4.98% improvements on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.
zh

[CV-30] Biorthogonal Tunable Wavelet Unit with Lifting Scheme in Convolutional Neural Network

【速读】:该论文旨在解决传统卷积神经网络(CNN)在图像分类和异常检测任务中对细粒度特征捕捉能力不足的问题。其解决方案的关键在于提出了一种基于提升方案(lifting scheme)构建的新型双正交可调小波单元,该单元放松了正交性和滤波器长度相等的约束,从而在滤波器设计上提供了更大的灵活性,进而增强了卷积、池化和下采样操作的效果。
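
提升方案(lifting scheme)的“预测-更新”结构天然保证完美重构,且系数 p、u 可自由调节而不要求正交或等长滤波器。下面是一维单层分解/重构的最小示意(系数取值为假设):

```python
import numpy as np

def lifting_forward(x, p=0.5, u=0.25):
    """示意:一步提升方案小波分解;p/u 为可调的预测/更新系数,
    不强制正交性或等长滤波器(非论文官方单元)。x 长度需为偶数。"""
    even, odd = x[0::2], x[1::2]
    detail = odd - p * (even + np.roll(even, -1))      # 预测步:细节系数
    approx = even + u * (detail + np.roll(detail, 1))  # 更新步:近似系数
    return approx, detail

def lifting_inverse(approx, detail, p=0.5, u=0.25):
    even = approx - u * (detail + np.roll(detail, 1))
    odd = detail + p * (even + np.roll(even, -1))
    x = np.empty(2 * len(even))
    x[0::2], x[1::2] = even, odd
    return x

x = np.random.rand(16)
a, d = lifting_forward(x)
print(np.allclose(lifting_inverse(a, d), x))  # True:提升结构保证完美重构
```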

链接: https://arxiv.org/abs/2507.00739
作者: An Le,Hung Nguyen,Sungbal Seo,You-Suk Bae,Truong Nguyen
机构: University of California San Diego (加州大学圣地亚哥分校); Tech University of Korea (韩国技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:This work introduces a novel biorthogonal tunable wavelet unit constructed using a lifting scheme that relaxes both the orthogonality and equal filter length constraints, providing greater flexibility in filter design. The proposed unit enhances convolution, pooling, and downsampling operations, leading to improved image classification and anomaly detection in convolutional neural networks (CNN). When integrated into an 18-layer residual neural network (ResNet-18), the approach improved classification accuracy on CIFAR-10 by 2.12% and on the Describable Textures Dataset (DTD) by 9.73%, demonstrating its effectiveness in capturing fine-grained details. Similar improvements were observed in ResNet-34. For anomaly detection in the hazelnut category of the MVTec Anomaly Detection dataset, the proposed method achieved competitive and well-balanced performance in both segmentation and detection tasks, outperforming existing approaches in terms of accuracy and robustness.
zh

[CV-31] Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features

【速读】:该论文试图解决个性化模型在面对模型窃取攻击时的安全性问题,尤其是针对通过微调预训练模型获得的个性化模型所面临的威胁。现有为传统深度神经网络(DNN)设计的防御方法在应用于微调模型时存在引入额外安全风险、易误判或无效等问题。该论文提出的解决方案的关键在于通过解耦相似的通用特征,构建一种无害的模型所有权验证方法。具体而言,该方法包括三个主要阶段:创建保留目标模型通用特征但破坏数据集特定特征的影子模型,利用元分类器判断可疑模型是否包含目标模型的数据集特定特征,并通过假设检验进行模型所有权验证以减少随机性并提高鲁棒性。
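
第三阶段的假设检验可以用二项检验示意:统计元分类器在 m 次独立查询中判为“含受害者数据集特定特征”的次数,检验是否显著高于随机水平 0.5(查询次数与显著性水平为假设,非官方流程):

```python
from scipy.stats import binomtest

m, hits = 100, 78   # 假设:100 次查询中 78 次判为"被窃取"
result = binomtest(hits, m, p=0.5, alternative="greater")
print(result.pvalue < 0.01)   # True → 拒绝原假设,判定可疑模型为窃取模型
```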

链接: https://arxiv.org/abs/2507.00724
作者: Linghui Zhu,Yiming Li,Haiqin Weng,Yan Liu,Tianwei Zhang,Shu-Tao Xia,Zhi Wang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Ant Group (蚂蚁集团); College of Computing and Data Science, Nanyang Technological University (南洋理工大学计算机与数据科学学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision models achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property for its owner. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to these personalized models. However, in this paper, we reveal that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by the output differences between the shadow and victim models. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously.
zh

[CV-32] UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement ICCV2025

【速读】:该论文旨在解决零样本域适应(Zero-shot Domain Adaptation, ZSDA)中由于目标域缺乏图像而导致的检测任务挑战,特别是针对传统方法依赖人工设计提示词而忽略检测任务与视觉-语言模型(Vision-Language Models, VLMs)之间不匹配的问题。其解决方案的关键在于提出统一提示与表征增强(Unified Prompt and Representation Enhancement, UPRE)框架,该框架联合优化文本提示和视觉表征,通过多视角域提示和视觉表征增强模块,以及多层次增强策略,实现跨模态表征在图像级对齐和实例级多样化表征的捕捉。

链接: https://arxiv.org/abs/2507.00721
作者: Xiao Zhang,Fei Wei,Yong Wang,Wenda Zhao,Feiyi Li,Xiangxiang Chu
机构: Dalian University of Technology (大连理工大学); AMAP, Alibaba Group (阿里集团AMAP)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV2025

点击查看摘要

Abstract:Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at this https URL.
zh

[CV-33] opoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

【速读】:该论文旨在解决现有方法在车道段拓扑推理中存在的一致性位置嵌入和时序多属性学习的局限性,这些问题阻碍了道路网络的准确重建。其解决方案的关键在于提出TopoStreamer,一个端到端的时序感知模型,通过引入流式属性约束、动态车道边界位置编码和车道段去噪三个关键改进,以提升车道段拓扑关系的建模能力与感知精度。

链接: https://arxiv.org/abs/2507.00709
作者: Yiming Yang,Yueru Luo,Bingkun He,Hongbin Lin,Suzhong Fu,Chao Yan,Kun Tang,Xinrui Yan,Chao Zheng,Shuguang Cui,Zhen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception tasks.
zh

[CV-34] BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中多视角图像生成的问题,即在不同摄像头视角下保持一致的3D场景理解。现有方法通常将此问题视为2D图像集生成任务,缺乏明确的3D建模。论文提出的解决方案关键在于构建一个结构化的表示,具体而言是通过BEV-VAE模型,首先训练一个多视角图像变分自编码器以获得紧凑且统一的鸟瞰图(Bird’s Eye View, BEV)潜在空间,随后利用潜在扩散变换器生成场景,从而实现一致且可控的视角合成。

链接: https://arxiv.org/abs/2507.00707
作者: Zeming Chen,Hang Zhao
机构: Shanghai Qi Zhi Institute; IIIS, Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: this https URL.
zh

[CV-35] Rectifying Magnitude Neglect in Linear Attention ICCV2025

【速读】:该论文试图解决Linear Attention在性能上显著劣于标准Softmax Attention的问题,其核心在于Linear Attention完全忽略了Query的幅度信息,导致注意力分数分布无法随Query尺度动态调整。解决方案的关键是提出Magnitude-Aware Linear Attention (MALA),通过修改Linear Attention的计算方式,使其充分融合Query的幅度信息,从而生成更接近Softmax Attention的注意力分数分布,并保持更均衡的结构。
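
论文的核心观察可以直接用代码验证:对常见核函数(如 ReLU),线性注意力的归一化权重对 Query 的幅度缩放完全不变,而 Softmax 注意力会随之变化(MALA 的具体修正式以论文为准,此处仅演示问题本身):

```python
import torch
import torch.nn.functional as F

def linear_attn_weights(q, k):
    # 线性注意力权重:phi(q)·phi(k_j) / Σ_i phi(q)·phi(k_i),这里取 phi = ReLU
    s = F.relu(q) @ F.relu(k).T
    return s / s.sum(-1, keepdim=True)

q, k = torch.randn(1, 16), torch.randn(8, 16)
# 把 Query 幅度放大 3 倍:线性注意力权重完全不变(幅度被约掉),
# 而 Softmax 注意力分布会随 Query 尺度动态变化
print(torch.allclose(linear_attn_weights(q, k), linear_attn_weights(3 * q, k)))  # True
print(torch.allclose((q @ k.T).softmax(-1), ((3 * q) @ k.T).softmax(-1)))        # False
```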

链接: https://arxiv.org/abs/2507.00698
作者: Qihang Fan,Huaibo Huang,Yuang Ai,Ran He
机构: MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

Abstract:As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at this https URL
zh

[CV-36] Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack

【速读】:该论文试图解决点云上的对抗攻击问题,特别是在保持攻击结果的合理性(plausibility)的同时,提升攻击的可迁移性(transferability)和不可防御性(undefendability)。现有方法在几何约束下难以兼顾这些性能指标,而基于变形的无结构方法可能导致不自然的扭曲,降低攻击的隐蔽性。论文提出的解决方案是CageAttack,其关键在于通过构建围绕目标物体的笼状结构(cage),为点云提供一个结构化的变形基础,从而实现平滑、自然的扰动传播,确保变形保持对象固有特性并维持合理性。
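
笼状变形的核心是:每个点表示为笼顶点的加权组合,扰动笼顶点后按权重平滑传播。下例用归一化反距离权重近似均值坐标做最小示意(权重形式为假设,非官方实现):

```python
import numpy as np

def cage_weights(points, cage):
    # 用归一化反距离近似均值坐标:每个点 = 笼顶点的加权组合
    d = np.linalg.norm(points[:, None, :] - cage[None, :, :], axis=-1)
    w = 1.0 / (d + 1e-8)
    return w / w.sum(axis=1, keepdims=True)

points = np.random.rand(1024, 3)                         # 目标点云
cage = np.array([[x, y, z] for x in (0., 1.) for y in (0., 1.) for z in (0., 1.)])
W = cage_weights(points, cage)                           # 变形前只需计算一次
delta = 0.02 * np.random.randn(*cage.shape)              # 对抗扰动加在笼顶点上
deformed = points + W @ delta                            # 平滑传播到整个点云
print(deformed.shape)                                    # (1024, 3)
```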

链接: https://arxiv.org/abs/2507.00690
作者: Keke Tang,Ziyong Du,Weilong Peng,Xiaofei Wang,Peican Zhu,Ligang Liu,Zhihong Tian
机构: Guangzhou University (广州大学); University of Science and Technology of China (中国科学技术大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose CageAttack, a cage-based deformation framework that produces natural adversarial point clouds. It first constructs a cage around the target object, providing a structured basis for smooth, natural-looking deformation. Perturbations are then applied to the cage vertices, which seamlessly propagate to the point cloud, ensuring that the resulting deformations remain intrinsic to the object and preserve plausibility. Extensive experiments on seven 3D deep neural network classifiers across three datasets show that CageAttack achieves a superior balance among transferability, undefendability, and plausibility, outperforming state-of-the-art methods. Codes will be made public upon acceptance.
zh

[CV-37] Diffusion Classifier Guidance for Non-robust Classifiers ECML2025

【速读】:该论文试图解决传统分类器引导(classifier guidance)方法在使用非鲁棒分类器时因扩散过程中的噪声导致的准确性下降和引导梯度不稳定的问题。其解决方案的关键在于提出一种利用单步去噪图像预测并结合受随机优化方法启发的稳定化技术(如指数移动平均)的方法,从而提升分类器引导的稳定性,同时保持样本多样性与视觉质量。
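
该方法的两个要点——用一步去噪预测 x0_hat 喂给非鲁棒分类器、再对引导梯度做 EMA 平滑——可以写成一个小函数(示意中 ε̂ 视作给定并截断梯度,属于简化假设,非论文官方实现):

```python
import torch

def guidance_grad(x_t, eps_hat, alpha_bar_t, classifier, y, ema_grad, beta=0.9):
    # 用 DDPM 关系式做一步去噪预测:x0_hat = (x_t - sqrt(1-ᾱ_t)·ε̂) / sqrt(ᾱ_t)
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
    # 非鲁棒分类器作用在去噪后的图像上,而不是带噪的 x_t 上
    log_p = classifier(x0_hat).log_softmax(-1)[torch.arange(len(y)), y]
    grad = torch.autograd.grad(log_p.sum(), x_t)[0]
    return beta * ema_grad + (1 - beta) * grad  # EMA 平滑,稳定引导梯度

clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x_t, eps_hat = torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)
ema = torch.zeros_like(x_t)
ema = guidance_grad(x_t, eps_hat, torch.tensor(0.5), clf, torch.tensor([1, 3]), ema)
print(ema.shape)  # torch.Size([2, 3, 8, 8])
```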

链接: https://arxiv.org/abs/2507.00687
作者: Philipp Vaeth,Dibyanshu Kumar,Benjamin Paassen,Magda Gregorová
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECML 2025

点击查看摘要

Abstract:Classifier guidance is intended to steer a diffusion process such that a given classifier reliably recognizes the generated data point as a certain class. However, most classifier guidance approaches are restricted to robust classifiers, which were specifically trained on the noise of the diffusion forward process. We extend classifier guidance to work with general, non-robust, classifiers that were trained without noise. We analyze the sensitivity of both non-robust and robust classifiers to noise of the diffusion process on the standard CelebA data set, the specialized SportBalls data set and the high-dimensional real-world CelebA-HQ data set. Our findings reveal that non-robust classifiers exhibit significant accuracy degradation under noisy conditions, leading to unstable guidance gradients. To mitigate these issues, we propose a method that utilizes one-step denoised image predictions and implements stabilization techniques inspired by stochastic optimization methods, such as exponential moving averages. Experimental results demonstrate that our approach improves the stability of classifier guidance while maintaining sample diversity and visual quality. This work contributes to advancing conditional sampling techniques in generative models, enabling a broader range of classifiers to be used as guidance classifiers.
zh

[CV-38] A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation

【速读】:该论文旨在解决全身抓取中姿态生成与运动补全的问题,以实现真实且稳定的物体交互。其解决方案的关键在于提出一种基于Transformer的框架,包含三个阶段:抓取姿态生成、时间补全以及LiftUp Transformer,用于将下采样的关节坐标恢复为高分辨率标记。此外,为克服手-物体交互数据的稀缺性,引入了在大规模多样化运动数据集上的高效泛化预训练阶段,从而获得可迁移至抓取任务的鲁棒时空表征。

链接: https://arxiv.org/abs/2507.00676
作者: Edward Effendy,Kuan-Wei Tseng,Rei Kawakami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2025

点击查看摘要

Abstract:We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.
zh

[CV-39] Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding

【速读】:该论文旨在解决基于语音描述的3D视觉定位(Audio-based 3D Visual Grounding)问题,即在三维点云中根据语音输入准确识别目标物体。现有研究主要依赖文本描述,而语音输入的利用仍处于探索阶段且面临较大挑战。论文提出的解决方案关键在于构建一个名为Audio-3DVG的框架,该框架通过将音频与空间信息相结合,提升定位效果。其核心创新包括:一是引入Object Mention Detection任务,通过多标签分类明确音频中提及的物体,实现更结构化的音频-场景推理;二是设计Audio-Guided Attention模块,捕捉候选物体与语音关系线索之间的交互,从而增强复杂场景中的目标区分能力。
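
Object Mention Detection 本质是音频嵌入上的多标签分类,可用 BCE 损失的线性头示意(类别数与嵌入维度为假设,非官方实现):

```python
import torch
import torch.nn as nn

num_classes = 18                      # 假设的场景物体类别数
head = nn.Linear(768, num_classes)   # 768 维音频嵌入为假设
audio_emb = torch.randn(4, 768)      # 来自语音编码器(如 ASR/自监督模型)
mention_labels = torch.randint(0, 2, (4, num_classes)).float()

logits = head(audio_emb)
loss = nn.BCEWithLogitsLoss()(logits, mention_labels)
mentioned = logits.sigmoid() > 0.5   # 推理时:被提及的物体集合
print(loss.item(), mentioned.shape)
```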

链接: https://arxiv.org/abs/2507.00669
作者: Duc Cao-Dinh,Khai Le-Duc,Anh Dao,Bach Phan Tat,Chris Ngo,Duy M. H. Nguyen,Nguyen X. Khanh,Thanh Nguyen-Tang
机构: Knovel Engineering Lab, Singapore; University of Toronto, Canada; University Health Network, Canada; Michigan State University, USA; KU Leuven, Belgium; German Research Center for Artificial Intelligence (DFKI), Germany; Max Planck Research School for Intelligent Systems (IMPRS-IS), Germany; University of Stuttgart, Germany; UC Berkeley, USA; Johns Hopkins University, USA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Work in progress, 42 pages

点击查看摘要

Abstract:3D Visual Grounding (3DVG) involves localizing target objects in 3D point clouds based on natural language. While prior work has made strides using textual descriptions, leveraging spoken language, known as Audio-based 3D Visual Grounding, remains underexplored and challenging. Motivated by advances in automatic speech recognition (ASR) and speech representation learning, we propose Audio-3DVG, a simple yet effective framework that integrates audio and spatial information for enhanced grounding. Rather than treating speech as a monolithic input, we decompose the task into two complementary components. First, we introduce Object Mention Detection, a multi-label classification task that explicitly identifies which objects are referred to in the audio, enabling more structured audio-scene reasoning. Second, we propose an Audio-Guided Attention module that captures interactions between candidate objects and relational speech cues, improving target discrimination in cluttered scenes. To support benchmarking, we synthesize audio descriptions for standard 3DVG datasets, including ScanRefer, Sr3D, and Nr3D. Experimental results demonstrate that Audio-3DVG not only achieves new state-of-the-art performance in audio-based grounding, but also competes with text-based methods, highlighting the promise of integrating spoken language into 3D vision tasks.
zh

[CV-40] LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment ICCV2025

【速读】:该论文旨在解决在低层级细节(low Level-of-Detail, LoD)城市模型上实现空中视觉定位的问题。现有基于线框对齐的方法LoD-Loc虽然在高LoD(如LoD3或LoD2)城市模型上表现良好,但无法有效适用于广泛存在的低LoD(如LoD1)模型。为了解决这一问题,作者提出了LoD-Loc v2,其关键在于采用自粗到精的策略,通过显式轮廓对齐实现对低LoD城市模型的精确定位。具体而言,该方法首先利用建筑分割网络提取建筑轮廓,在粗略姿态选择阶段构建姿态代价体积以估计姿态分布,并在精细姿态估计阶段结合多光束跟踪的粒子滤波方法高效搜索假设空间,从而获得最终姿态估计。
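
粗姿态选择阶段可抽象为:对均匀采样的姿态假设逐一计算投影轮廓与预测轮廓的对齐程度,构成姿态代价体后取最大值。下例用 IoU 作对齐度量并以随机掩码占位(渲染与代价函数细节为假设,非官方实现):

```python
import numpy as np

def silhouette_iou(pred, proj):
    inter = np.logical_and(pred, proj).sum()
    union = np.logical_or(pred, proj).sum()
    return inter / max(union, 1)

pred_sil = np.random.rand(120, 160) > 0.5          # 建筑分割网络输出的轮廓(占位)
pose_hypotheses = list(range(64))                  # 先验姿态附近均匀采样的假设
scores = []
for pose in pose_hypotheses:
    # 实际应按该姿态把 LoD1 模型投影成轮廓;这里用随机掩码占位
    proj_sil = np.random.rand(120, 160) > 0.5
    scores.append(silhouette_iou(pred_sil, proj_sil))
coarse_pose = pose_hypotheses[int(np.argmax(scores))]   # 代价体最大值 → 粗姿态
print(coarse_pose, max(scores))
```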

链接: https://arxiv.org/abs/2507.00659
作者: Juelin Zhu,Shuaibang Peng,Long Wang,Hanlin Tan,Yu Liu,Maojun Zhang,Shen Yan
机构: National University of Defense Technology (国防科技大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. Previous wireframe-alignment-based method LoD-Loc has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, but the majority of available models and those many countries plan to construct nationwide are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones’ potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to shape building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimation. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors.
zh

[CV-41] GANs Secretly Perform Approximate Bayesian Model Selection

【速读】:该论文试图解决生成对抗网络(Generative Adversarial Networks, GANs)在优化过程中面临的挑战以及过拟合问题。其解决方案的关键在于将GANs解释为概率生成模型,并将其视为具有部分随机性的贝叶斯神经网络,从而建立通用近似条件。通过将不同变体GANs的对抗式优化转化为边缘似然的代理优化,结合边缘似然优化与奥卡姆剃刀原则的联系,定义了正则化和优化策略,以平滑损失景观并寻找具有最小描述长度的解,这些解与平坦极小值和良好的泛化能力相关。

链接: https://arxiv.org/abs/2507.00651
作者: Maurizio Filippone,Marius P. Linhard
机构: KAUST(沙特阿卜杜拉国王科技大学); RPTU Kaiserslautern-Landau(基尔施塔特-兰道大学应用技术大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) are popular and successful generative models. Despite their success, optimization is notoriously challenging and they require regularization against overfitting. In this work, we explain the success and limitations of GANs by interpreting them as probabilistic generative models. This interpretation enables us to view GANs as Bayesian neural networks with partial stochasticity, allowing us to establish conditions of universal approximation. We can then cast the adversarial-style optimization of several variants of GANs as the optimization of a proxy for the marginal likelihood. Taking advantage of the connection between marginal likelihood optimization and Occam’s razor, we can define regularization and optimization strategies to smooth the loss landscape and search for solutions with minimum description length, which are associated with flat minima and good generalization. The results on a wide range of experiments indicate that these strategies lead to performance improvements and pave the way to a deeper understanding of regularization strategies for GANs.
zh

[CV-42] UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions ICCV2025

【速读】:该论文旨在解决在不利天气条件下(如夜间或雾天)视觉目标跟踪性能显著下降的问题,这些问题主要是由于源域与目标域之间的巨大领域偏移导致的。解决方案的关键在于提出UMDATrack,其核心是通过一个统一的领域自适应框架,在不进行冗余模型更新的情况下,使目标表示能够快速适应多种天气条件。该方法包括一个可控场景生成器用于合成少量未标记视频,以及一个领域定制适配器(DCA)和一个基于最优传输定理的目标感知置信度对齐模块(TCA),以增强源域与目标域之间的定位一致性。

链接: https://arxiv.org/abs/2507.00648
作者: Siyuan Yao,Rui Zhu,Ziqi Wang,Wenqi Ren,Yanyang Yan,Xiaochun Cao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Sun Yat-sen University (中山大学); University of Chinese Academy of Sciences (中国科学院大学); MoE Key Laboratory of Information Technology (教育部信息技术重点实验室); Guangdong Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for the unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environment, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects’ representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following optimal transport theorem. Extensive experiments demonstrate that UMDATrack can surpass existing advanced visual trackers and lead new state-of-the-art performance by a significant margin. Our code is available at this https URL.
zh

[CV-43] Stable Tracking of Eye Gaze Direction During Ophthalmic Surgery ICRA2025

【速读】:该论文旨在解决眼科手术机器人在术前导航中依赖手动操作导致的一致性差和不确定性增加的问题。现有的眼动估计技术存在对额外传感器的依赖、手术环境中遮挡问题以及面部检测需求等挑战。该研究提出了一种结合机器学习与传统算法的眼部定位与跟踪方法,其关键在于无需依赖特征点即可实现稳定的眼虹膜检测和注视估计,从而在不同光照和阴影条件下提升眼动估计的准确性与鲁棒性。

链接: https://arxiv.org/abs/2507.00635
作者: Tinghe Hong,Shenlin Cai,Boyang Li,Kai Huang
机构: Sun Yat-sen University (中山大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by ICRA 2025

点击查看摘要

Abstract:Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting the consistency and increasing the uncertainty. Existing eye gaze estimation techniques in the surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirements of landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experiment results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and 2.08-degree average control error for the robotic arm’s movement based on the calculated orientation.
zh

[CV-44] De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection

【速读】:该论文旨在解决自标签(self-labeling)目标检测方法在交通和运输场景中性能无法与领域对齐方法相媲美的问题。其关键在于识别训练过程中简单样本比例过高导致的“简单标签偏差”(simple-label bias),并提出一种名为De-Simplifying Pseudo Labels (DeSimPL) 的新方法来缓解这一问题。DeSimPL通过实例级记忆库实现创新的伪标签更新策略,并在训练中引入对抗样本以提升复杂样本的比例,同时采用自适应加权损失函数以减少后期训练中虚假正例伪标签的影响。

链接: https://arxiv.org/abs/2507.00608
作者: Zehua Fu,Chenguang Liu,Yuyu Chen,Jiaqi Zhou,Qingjie Liu,Yunhong Wang
机构: Hangzhou Innovation Institute, Beihang University(杭州创新研究院,北京航空航天大学); State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(虚拟现实技术与系统国家重点实验室,北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Intelligent Transportation Systems. 15 pages, 10 figures

点击查看摘要

Abstract:Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance. Recently, self-labeling methods have gained popularity due to their simplicity and efficiency. In this paper, we investigate the limitations that prevent self-labeling detectors from achieving commensurate performance with domain alignment methods. Specifically, we identify the high proportion of simple samples during training, i.e., the simple-label bias, as the central cause. We propose a novel approach called De-Simplifying Pseudo Labels (DeSimPL) to mitigate the issue. DeSimPL utilizes an instance-level memory bank to implement an innovative pseudo label updating strategy. Then, adversarial samples are introduced during training to enhance the proportion. Furthermore, we propose an adaptive weighted loss to avoid the model suffering from an abundance of false positive pseudo labels in the late training period. Experimental results demonstrate that DeSimPL effectively reduces the proportion of simple samples during training, leading to a significant performance improvement for self-labeling detectors. Extensive experiments conducted on four benchmarks validate our analysis and conclusions.
zh

[CV-45] World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model ICCV2025

【速读】:该论文旨在解决端到端自动驾驶中依赖昂贵的感知监督来提取场景信息的问题,其核心挑战是构建一个信息丰富的驾驶世界模型,以实现无需感知标注的端到端规划。解决方案的关键在于利用视觉基础模型构建潜在世界模型,从而生成和评估多模态规划轨迹,并通过自监督学习实现实际未来观测与从潜在空间重建的观测之间的对齐,最终达成无需人工感知标注的端到端规划。

链接: https://arxiv.org/abs/2507.00603
作者: Yupeng Zheng,Pengxuan Yang,Zebin Xing,Qichao Zhang,Yuhang Zheng,Yinfeng Gao,Pengfei Li,Teng Zhang,Zhongpu Xia,Peng Jia,Dongbin Zhao
机构: CASIA(中国科学院自动化研究所); Li Auto(小鹏汽车); PCL(模式识别国家重点实验室); NUS(新加坡国立大学); Tsinghua(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025, first version

点击查看摘要

Abstract:End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, 46.7% lower collision rate, and 3.75× faster training convergence. Codes will be accessed at this https URL.
zh

[CV-46] Overtake Detection in Trucks Using CAN Bus Signals: A Comparative Study of Machine Learning Methods

【速读】:该论文旨在解决卡车安全变道操作的准确预测问题,以提升高级驾驶辅助系统(ADAS)的决策能力。其关键解决方案是利用来自沃尔沃集团五辆在役卡车的控制器局域网(CAN)总线数据,评估三种常见分类器(人工神经网络ANN、随机森林RF和支持向量机SVM)的性能,并通过数据预处理配置优化分类效果。研究发现,交通条件的多样性对信号模式有显著影响,尤其在无变道类别中,因此训练数据的多样性至关重要。为提高模型泛化能力并减少特定条件下的偏差,采用多车辆数据进行训练,并引入评分级融合策略,从而在多数情况下实现了最佳的单车辆性能,最终达到93%的真负率(TNR)和86.5%的真正率(TPR)。
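
评分级融合本身很轻量:对各分类器输出的超车概率做(加权)平均再阈值判决。最小示意如下(等权平均与 0.5 阈值为假设,非论文官方配置):

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """示意:评分级融合——对 ANN/RF/SVM 输出的超车概率做(加权)平均。"""
    scores = np.stack(score_lists)              # (n_models, n_samples)
    w = np.ones(len(scores)) if weights is None else np.asarray(weights)
    return (w[:, None] * scores).sum(0) / w.sum()

ann = np.array([0.9, 0.2, 0.6])
rf = np.array([0.8, 0.3, 0.4])
svm = np.array([0.7, 0.1, 0.5])
fused = fuse_scores([ann, rf, svm])
print(fused, fused > 0.5)   # [0.8 0.2 0.5] [ True False False]
```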

链接: https://arxiv.org/abs/2507.00593
作者: Fernando Alonso-Fernandez,Talha Hanif Butt,Prayag Tiwari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at ESWA

点击查看摘要

Abstract:Safe overtaking manoeuvres in trucks are vital for preventing accidents and ensuring efficient traffic flow. Accurate prediction of such manoeuvres is essential for Advanced Driver Assistance Systems (ADAS) to make timely and informed decisions. In this study, we focus on overtake detection using Controller Area Network (CAN) bus data collected from five in-service trucks provided by the Volvo Group. We evaluate three common classifiers for vehicle manoeuvre detection, Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), and analyse how different preprocessing configurations affect performance. We find that variability in traffic conditions strongly influences the signal patterns, particularly in the no-overtake class, affecting classification performance if training data lacks adequate diversity. Since the data were collected under unconstrained, real-world conditions, class diversity cannot be guaranteed a priori. However, training with data from multiple vehicles improves generalisation and reduces condition-specific bias. Our per-truck analysis also reveals that classification accuracy, especially for overtakes, depends on the amount of training data per vehicle. To address this, we apply a score-level fusion strategy, which yields the best per-truck performance across most cases. Overall, score-level fusion achieves TNR=93% (True Negative Rate) and TPR=86.5% (True Positive Rate). This research has been part of the BIG FUN project, which explores how Artificial Intelligence can be applied to logged vehicle data to understand and predict driver behaviour, particularly in relation to Camera Monitor Systems (CMS), being introduced as digital replacements for traditional exterior mirrors.
zh

[CV-47] Context-Aware Academic Emotion Dataset and Benchmark ICCV2025

【速读】:该论文旨在解决在真实学习环境中自动识别学术情绪(academic emotion)的挑战,这一领域由于公开可用数据集的缺乏而研究不足。其解决方案的关键在于提出一种基于CLIP模型的上下文感知学术情绪识别方法(CLIP-CAER),该方法通过在视觉-语言模型CLIP中引入可学习的文本提示,有效融合视频中的面部表情与上下文线索,从而提升学术情绪识别的准确性。

链接: https://arxiv.org/abs/2507.00586
作者: Luming Zhao,Jingwen Xuan,Jiamin Lou,Yonghui Yu,Wenwu Yang
机构: Zhejiang Gongshang University (浙江工商大学); Zhejiang Yuexiu University (浙江越秀大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Academic emotion analysis plays a crucial role in evaluating students’ engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues-such as whether a student is looking at a phone or reading a book-alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: this https URL

[CV-48] Similarity Memory Prior is All You Need for Medical Image Segmentation

【Quick Read】: This paper targets the difficulty of accurately capturing subtle inter-class texture variations in medical image segmentation. The key to its solution is a Dynamic Memory Weights-Loss Attention (DMW-LA) mechanism built on a similarity memory prior: it matches and memorizes the category features of specific lesions or organs through the similarity memory prior in a prototype memory bank, helping the network learn subtle texture differences between categories, and it dynamically updates the similarity memory prior in reverse through a Weight-Loss Dynamic (W-LD) update strategy, effectively helping the network extract category features directly.

Link: https://arxiv.org/abs/2507.00585
Authors: Tang Hao,Guo ZhiQing,Wang LieJun,Liu Chao
Affiliations: Xinjiang University(新疆大学); Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center(新疆多模态智能处理与信息安全工程技术研究中心); Silk Road Multilingual Cognitive Computing International Cooperation Joint Laboratory, Xinjiang University(丝绸之路多语种认知计算国际联合实验室,新疆大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:In recent years, it has been found that “grandmother cells” in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through the Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network in directly extracting category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available on this https URL.

[CV-49] AI-Generated Video Detection via Perceptual Straightening

【Quick Read】: This paper addresses the detection of AI-generated videos, focusing on the challenges synthetic videos pose for content authentication and their potential for misuse. The key to its solution is to exploit geometric properties of the neural representation space: by analyzing the temporal curvature and stepwise distance of videos in the representation domain of a pre-trained self-supervised vision transformer (DINOv2), it distinguishes real videos from AI-generated ones. The method builds on the "perceptual straightening" hypothesis, which holds that real-world videos trace straighter trajectories in the neural representation domain, whereas AI-generated videos exhibit markedly different curvature and distance patterns, enabling efficient detection.

Link: https://arxiv.org/abs/2507.00583
Authors: Christian Internò,Robert Geirhos,Markus Olhofer,Sunny Liu,Barbara Hammer,David Klindt
Affiliations: Bielefeld University (比勒费尔德大学); Google DeepMind (谷歌深度思维); Honda Research Institute EU (本田研究机构欧洲); Cold Spring Harbor Laboratory (冷泉港实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the “perceptual straightening” hypothesis – which suggests real-world video trajectories become more straight in neural representation domain – we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model’s representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
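
The geometric statistics at the heart of ReStraV are simple to compute once per-frame embeddings exist. The sketch below, assuming frame embeddings come from some feature extractor (a stand-in array here rather than an actual DINOv2 forward pass), measures stepwise distances and the turning angle between consecutive displacement vectors, then aggregates summary statistics a lightweight classifier could consume.

```python
import numpy as np

def trajectory_stats(frame_embeddings):
    """Curvature and stepwise-distance statistics of an embedding trajectory.

    frame_embeddings: (T, D) array, one row per video frame.
    Returns a small feature vector summarizing the trajectory geometry.
    """
    diffs = np.diff(frame_embeddings, axis=0)          # (T-1, D) displacements
    dists = np.linalg.norm(diffs, axis=1)              # stepwise distances
    unit = diffs / (dists[:, None] + 1e-8)
    cosines = np.sum(unit[:-1] * unit[1:], axis=1)     # angle between steps
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return np.array([angles.mean(), angles.std(),
                     dists.mean(), dists.std()])

# A straighter trajectory (real video, per the hypothesis) yields small angles.
rng = np.random.default_rng(0)
straight = np.cumsum(np.ones((16, 8)) + 0.01 * rng.normal(size=(16, 8)), axis=0)
wiggly = rng.normal(size=(16, 8))
print(trajectory_stats(straight))   # small mean turning angle
print(trajectory_stats(wiggly))     # large mean turning angle
```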

[CV-50] BadViM: Backdoor Attack against Vision Mamba

【Quick Read】: This paper investigates the security of Vision Mamba (ViM) models under backdoor attacks, in particular their susceptibility to stealthy triggers. The proposed solution is BadViM, whose key idea is to use a Resonant Frequency Trigger (RFT) to generate stealthy, distributed triggers, and to strategically adjust the model's internal representations through a Hidden State Alignment loss that aligns the hidden states of backdoored images with those of the target class, thereby raising the attack success rate while preserving clean-data accuracy.

Link: https://arxiv.org/abs/2507.00577
Authors: Yinghao Wu,Liyan Zhang
Affiliations: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of this novel architecture, especially their vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassify inputs containing these triggers while maintaining normal behavior on clean inputs. This paper investigates the susceptibility of ViM to backdoor attacks by introducing BadViM, a novel backdoor attack framework specifically designed for Vision Mamba. The proposed BadViM leverages a Resonant Frequency Trigger (RFT) that exploits the frequency sensitivity patterns of the victim model to create stealthy, distributed triggers. To maximize attack efficacy, we propose a Hidden State Alignment loss that strategically manipulates the internal representations of the model by aligning the hidden states of backdoor images with those of target classes. Extensive experimental results demonstrate that BadViM achieves superior attack success rates while maintaining clean data accuracy. Meanwhile, BadViM exhibits remarkable resilience against common defensive measures, including PatchDrop, PatchShuffle and JPEG compression, which typically neutralize normal backdoor attacks.

[CV-51] Out-of-distribution detection in 3D applications: a review

【Quick Read】: This paper addresses the detection of objects that are uncommon in the training data (i.e., out-of-distribution, OOD) in 3D applications such as autonomous driving. Conventional machine learning methods typically assume that every object category encountered at inference time belongs to a closed set present in the training data, an assumption that limits generalization to the real world. The key to the paper's contribution is OOD detection, which identifies inputs that deviate significantly from the training distribution and thereby improves the reliability and safety of AI systems. The survey covers benchmark datasets across modalities, evaluation metrics, and a comparative analysis of OOD detection methods spanning model structures, uncertainty indicators, and distributional distance taxonomies, and it highlights future directions such as adversarially robust OOD detection and failure identification.

Link: https://arxiv.org/abs/2507.00570
Authors: Zizhao Li,Xueyang Kang,Joseph West,Kourosh Khoshelham
Affiliations: University of Melbourne(墨尔本大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:The ability to detect objects that are not prevalent in the training set is a critical capability in many 3D applications, including autonomous driving. Machine learning methods for object recognition often assume that all object categories encountered during inference belong to a closed set of classes present in the training data. This assumption limits generalization to the real world, as objects not seen during training may be misclassified or entirely ignored. As part of reliable AI, OOD detection identifies inputs that deviate significantly from the training distribution. This paper provides a comprehensive overview of OOD detection within the broader scope of trustworthy and uncertain AI. We begin with key use cases across diverse domains, introduce benchmark datasets spanning multiple modalities, and discuss evaluation metrics. Next, we present a comparative analysis of OOD detection methodologies, exploring model structures, uncertainty indicators, and distributional distance taxonomies, alongside uncertainty calibration techniques. Finally, we highlight promising research directions, including adversarially robust OOD detection and failure identification, particularly relevant to 3D applications. The paper offers both theoretical and practical insights into OOD detection, showcasing emerging research opportunities such as 3D vision integration. These insights help new researchers navigate the field more effectively, contributing to the development of reliable, safe, and robust AI systems.

[CV-52] Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment

【Quick Read】: This paper addresses the limited generalization of zero-shot skeleton-based action recognition, i.e., the challenge of accurately classifying action categories unseen during training. Existing methods are hampered by insufficiently discriminative skeleton features and by the alignment bias between skeleton and unseen text features at test time. The key to its solution is a prototype-guided feature alignment paradigm (PGFA), which improves skeleton-text alignment through an end-to-end cross-modal contrastive training framework and introduces a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing.

Link: https://arxiv.org/abs/2507.00566
Authors: Kai Zhou,Shuhai Zhang,Zeng You,Jinwu Hu,Mingkui Tan,Fei Liu
Affiliations: South China University of Technology (华南理工大学); Pazhou Lab (琶洲实验室); Peng Cheng Laboratory (鹏城实验室); Key Laboratory of Big Data and Intelligent Robot (南华理工大学大数据与智能机器人重点实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by IEEE TIP 2025. Code is publicly available at this https URL

Click to view the abstract

Abstract:Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models’ generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.

[CV-53] LOD-GS: Level-of-Detail-Sensitive 3D Gaussian Splatting for Detail Conserved Anti-Aliasing

【Quick Read】: This paper tackles the aliasing artifacts that persist in 3D scene rendering with 3D Gaussian Splatting (3DGS). Existing methods rely mainly on low-pass filtering to mitigate aliasing, but they are insensitive to the sampling rate and often under-filter or over-smooth. The proposed solution is LOD-GS, whose key idea is to attach a set of basis functions that take the sampling rate as input to model appearance variation, enabling the optimal filtering strength to be predicted dynamically for each 3D Gaussian primitive; the basis function parameters are optimized jointly with the 3D Gaussians in an end-to-end manner.

Link: https://arxiv.org/abs/2507.00554
Authors: Zhenya Yang,Bingchen Gong,Kai Chen,Qi Dou
Affiliations: The Chinese University of Hong Kong (香港中文大学); Ecole Polytechnique (法国综合理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Despite the advancements in quality and efficiency achieved by 3D Gaussian Splatting (3DGS) in 3D scene rendering, aliasing artifacts remain a persistent challenge. Existing approaches primarily rely on low-pass filtering to mitigate aliasing. However, these methods are not sensitive to the sampling rate, often resulting in under-filtering and over-smoothing renderings. To address this limitation, we propose LOD-GS, a Level-of-Detail-sensitive filtering framework for Gaussian Splatting, which dynamically predicts the optimal filtering strength for each 3D Gaussian primitive. Specifically, we introduce a set of basis functions to each Gaussian, which take the sampling rate as input to model appearance variations, enabling sampling-rate-sensitive filtering. These basis function parameters are jointly optimized with the 3D Gaussian in an end-to-end manner. The sampling rate is influenced by both focal length and camera distance. However, existing methods and datasets rely solely on down-sampling to simulate focal length changes for anti-aliasing evaluation, overlooking the impact of camera distance. To enable a more comprehensive assessment, we introduce a new synthetic dataset featuring objects rendered at varying camera distances. Extensive experiments on both public datasets and our newly collected dataset demonstrate that our method achieves SOTA rendering quality while effectively eliminating aliasing. The code and dataset have been open-sourced.
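
The abstract's core mechanism, per-primitive basis functions that map the sampling rate to a filtering strength, can be sketched as follows. The polynomial basis and the softplus output used here are illustrative assumptions; in the paper these coefficients are optimized jointly with the Gaussian parameters rather than fixed.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def filter_strength(coeffs, sampling_rate, max_degree=3):
    """Per-Gaussian filter strength predicted from the sampling rate.

    coeffs: (max_degree + 1,) learnable basis coefficients for one primitive.
    sampling_rate: scalar cue derived from focal length and camera distance.
    """
    basis = sampling_rate ** np.arange(max_degree + 1)   # 1, s, s^2, s^3
    return softplus(np.dot(coeffs, basis))               # keep strength positive

# One primitive's (hypothetical) coefficients; strength varies with the rate,
# so close-up, high-rate views get a different filter than distant ones.
coeffs = np.array([0.2, -0.8, 0.3, 0.05])
for s in (0.25, 0.5, 1.0, 2.0):
    print(s, filter_strength(coeffs, s))
```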

[CV-54] Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

【Quick Read】: This paper addresses the problem that certain attention heads in CLIP's image encoder degrade the final representation and, in turn, downstream task performance. The key to its solution is the Attention Ablation Technique (AAT), which suppresses the contribution of specific heads by manipulating attention weights, systematically identifying and ablating detrimental attention heads to improve representation quality.

Link: https://arxiv.org/abs/2507.00537
Authors: Feng Lin,Marco Chen,Haokui Zhang,Xiaotian Yu,Guangming Lu,Rong Xiao
Affiliations: Intellifusion Inc. (智源科技); Northwest Polytechnical University (西北工业大学); Harbin Institute of Technology (哈尔滨工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures

Click to view the abstract

Abstract:This paper studies the role of attention heads in CLIP’s image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.
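
Ablating a head amounts to zeroing (or down-weighting) its output before the heads are aggregated. The sketch below shows that manipulation on a toy multi-head self-attention layer that sums head outputs instead of the usual concatenation-plus-projection, a simplification for brevity; which heads count as "detrimental" would come from AAT's selection strategies, which are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, ablate_heads=()):
    """q, k, v: (heads, tokens, dim). Ablated heads contribute nothing."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k.transpose(0, 2, 1) * scale)   # (heads, tokens, tokens)
    out = attn @ v                                     # per-head outputs
    mask = np.ones(out.shape[0])
    mask[list(ablate_heads)] = 0.0                     # suppress chosen heads
    return (out * mask[:, None, None]).sum(axis=0)     # aggregate over heads

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(4, 5, 8)) for _ in range(3))
full = multi_head_attention(q, k, v)
ablated = multi_head_attention(q, k, v, ablate_heads=(1, 3))
print(np.abs(full - ablated).mean())  # representation changes after ablation
```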

[CV-55] Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

【Quick Read】: This paper addresses the difficulty current vision-language models (VLMs) have in autonomous driving with accurately understanding user intent and performing spatial and temporal reasoning. Existing datasets are limited to full-scene descriptions or waypoint prediction and cannot assess whether VLMs can respond to localized queries about user-specified objects. The key to the paper's solution is the Box-QAymo dataset and benchmark, in which users express intent by drawing bounding boxes, enabling attribute prediction for the specified objects, understanding of target motion, and spatiotemporal motion reasoning across frames, together with a hierarchical evaluation protocol for comprehensive assessment.

Link: https://arxiv.org/abs/2507.00525
Authors: Djamahl Etchegaray,Yuxia Fu,Zi Huang,Yadan Luo
Affiliations: The University of Queensland (昆士兰大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-sourced fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantee dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at this https URL.

[CV-56] Topology-Constrained Learning for Efficient Laparoscopic Liver Landmark Detection MICCAI2025

【Quick Read】: This paper addresses automatic detection of liver anatomical landmarks in laparoscopic liver surgery, a task made challenging by the tubular structure of the landmarks and dynamic intraoperative deformation. The key to its solution is TopoNet, a topology-constrained learning framework whose core is a snake-CNN dual-path encoder that simultaneously captures RGB texture information and depth-informed topological structure, together with a boundary-aware topology fusion (BTF) module that adaptively merges RGB-D features to enhance edge perception while preserving global topology; in addition, a topological constraint loss comprising a center-line constraint loss and a topological persistence loss is embedded to ensure homotopy equivalence between predictions and labels.

Link: https://arxiv.org/abs/2507.00519
Authors: Ruize Cui,Jiaan Zhang,Jialun Pei,Kai Wang,Pheng-Ann Heng,Jing Qin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by MICCAI 2025

Click to view the abstract

Abstract:Liver landmarks provide crucial anatomical guidance to the surgeon during laparoscopic liver surgery to minimize surgical risk. However, the tubular structural properties of landmarks and dynamic intraoperative deformations pose significant challenges for automatic landmark detection. In this study, we introduce TopoNet, a novel topology-constrained learning framework for laparoscopic liver landmark detection. Our framework adopts a snake-CNN dual-path encoder to simultaneously capture detailed RGB texture information and depth-informed topological structures. Meanwhile, we propose a boundary-aware topology fusion (BTF) module, which adaptively merges RGB-D features to enhance edge perception while preserving global topology. Additionally, a topological constraint loss function is embedded, which contains a center-line constraint loss and a topological persistence loss to ensure homotopy equivalence between predictions and labels. Extensive experiments on L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy with favorable computational complexity, highlighting the potential for clinical applications in laparoscopic liver surgery. Our code will be available at this https URL.

[CV-57] SCING: Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

【Quick Read】: This paper addresses the high computational cost and suboptimal alignment that arise when adapting vision-language pre-trained models such as CLIP to person re-identification (ReID), where complex adapter designs or modality-specific tuning neglect cross-modal interaction. The key to its solution is the Selective Cross-modal Prompt Tuning (SCING) framework, with two core innovations: first, a lightweight Selective Visual Prompt Fusion (SVIP) module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism; second, a Perturbation-Driven Consistency Alignment (PDCA) dual-path training strategy that enforces invariant feature alignment by regularizing consistency between original and augmented cross-modal embeddings.

Link: https://arxiv.org/abs/2507.00506
Authors: Yunfei Xie,Yuxuan Cheng,Juncheng Wu,Haoyu Zhang,Yuyin Zhou,Shoudong Han
Affiliations: Huazhong University of Science and Technology (华中科技大学); Huazhong Agricultural University (华中农业大学); University of California, Santa Cruz (加州大学圣克鲁兹分校); City University of Hong Kong (Dongguan) (香港城市大学(东莞))
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations: Firstly, we proposed Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Moreover, the proposed Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments are conducted on several popular benchmarks covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, which demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.

[CV-58] LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs ICCV

【Quick Read】: This paper addresses the weak visual representation in multimodal large language models (MLLMs): CLIP-ViT-based vision encoders struggle to model local relationships between adjacent image patches, yielding coarse visual representations that limit the models' fine-grained understanding. The key to its solution is LLaVA-SP, which adds only six spatial visual tokens to strengthen the visual representation. Its core innovation is a novel Projector that uses convolutional kernels to derive spatial tokens from ViT patch features, simulating two visual spatial orderings, "from central region to global" and "from abstract to specific", and then fusing fine-grained visual information via a cross-attention mechanism to enrich the overall visual representation.

Link: https://arxiv.org/abs/2507.00505
Authors: Haoran Lou,Chunxiao Fan,Ziyan Liu,Yuexin Wu,Xinxiang Wang
Affiliations: Beijing University of Posts and Telecommunications (北京邮电大学); Beihang University (北京航空航天大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV

Click to view the abstract

Abstract:The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at this https URL.

[CV-59] ExPaMoE: An Expandable Parallel Mixture of Experts for Continual Test-Time Adaptation

【Quick Read】: This paper addresses feature entanglement and catastrophic forgetting in continual test-time adaptation (CTTA), where models must cope with continually shifting data distributions. Existing methods share model parameters across all domains and struggle with large or non-stationary domain shifts. The key to its solution is the ExPaMoE framework, built on an expandable parallel mixture-of-experts architecture: a dual-branch expert design decouples domain-general from domain-specific knowledge, and a Spectral-Aware Online Domain Discriminator (SODD) that detects distribution changes in real time from frequency-domain cues dynamically expands the expert pool, improving adaptability, robustness, and scalability.

Link: https://arxiv.org/abs/2507.00502
Authors: JianChao Zhao,Songlin Dong
Affiliations: Xi’an Jiaotong University (西安交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Continual Test-Time Adaptation (CTTA) aims to enable models to adapt on-the-fly to a stream of unlabeled data under evolving distribution shifts. However, existing CTTA methods typically rely on shared model parameters across all domains, making them vulnerable to feature entanglement and catastrophic forgetting in the presence of large or non-stationary domain shifts. To address this limitation, we propose ExPaMoE, a novel framework based on an Expandable Parallel Mixture-of-Experts architecture. ExPaMoE decouples domain-general and domain-specific knowledge via a dual-branch expert design with token-guided feature separation, and dynamically expands its expert pool based on a Spectral-Aware Online Domain Discriminator (SODD) that detects distribution changes in real-time using frequency-domain cues. Extensive experiments demonstrate the superiority of ExPaMoE across diverse CTTA scenarios. We evaluate our method on standard benchmarks including CIFAR-10C, CIFAR-100C, ImageNet-C, and Cityscapes-to-ACDC for semantic segmentation. Additionally, we introduce ImageNet++, a large-scale and realistic CTTA benchmark built from multiple ImageNet-derived datasets, to better reflect long-term adaptation under complex domain evolution. ExPaMoE consistently outperforms prior arts, showing strong robustness, scalability, and resistance to forgetting.
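
SODD is described only as detecting distribution changes in real time from frequency-domain cues. A minimal stand-in, assuming batches of grayscale images arrive as a stream, is to track a reference power spectrum and flag a shift when a new batch's spectrum drifts too far from it; the total-variation drift metric, threshold, and momentum below are illustrative choices, not the paper's.

```python
import numpy as np

class SpectralShiftDetector:
    """Flag a domain shift when batch spectra drift from a running reference."""

    def __init__(self, threshold=0.25, momentum=0.9):
        self.ref = None
        self.threshold = threshold
        self.momentum = momentum

    def update(self, batch):
        """batch: (N, H, W) grayscale images; returns True on a detected shift."""
        spec = np.abs(np.fft.fftshift(np.fft.fft2(batch), axes=(-2, -1)))
        spec = spec.mean(axis=0)
        spec /= spec.sum()                            # normalize to a distribution
        if self.ref is None:
            self.ref = spec
            return False
        drift = 0.5 * np.abs(spec - self.ref).sum()   # total variation distance
        shifted = drift > self.threshold
        if shifted:
            self.ref = spec                           # new domain: reset reference
        else:
            self.ref = self.momentum * self.ref + (1 - self.momentum) * spec
        return shifted

rng = np.random.default_rng(2)
det = SpectralShiftDetector()
clean = rng.normal(size=(8, 32, 32))
blurred = np.cumsum(clean, axis=-1) / 32.0            # crude low-pass corruption
print(det.update(clean), det.update(clean), det.update(blurred))
```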

[CV-60] Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing

【Quick Read】: This paper addresses the limitations of Spatial State Model (SSM)-based image restoration in reconstructing local structure and handling high-dimensional data, which make fine image features hard to recover in dehazing. The key to its solution is Laplace-Mamba, a novel framework that combines a Laplace frequency-domain prior with a hybrid Mamba-CNN architecture: a Laplace decomposition separates the image into low-frequency and high-frequency components, processed respectively by SSMs for global context modeling and by CNNs for local structure refinement, effectively improving both the quality and the efficiency of image dehazing.

Link: https://arxiv.org/abs/2507.00501
Authors: Yongzhen Wang,Liangliang Chen,Bingwen Hu,Heng Liu,Xiao-Ping Zhang,Mingqiang Wei
Affiliations: Anhui University of Technology(安徽理工大学); Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院); Nanjing University of Aeronautics and Astronautics(南京航空航天大学); College of Artificial Intelligence, Taiyuan University of Technology(太原理工大学人工智能学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures, 6 tables

Click to view the abstract

Abstract:Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal recovery of fine image features. To tackle these challenges, we introduce Laplace-Mamba, a novel framework that integrates Laplace frequency prior with a hybrid Mamba-CNN architecture for efficient image dehazing. Leveraging the Laplace decomposition, the image is disentangled into low-frequency components capturing global texture and high-frequency components representing edges and fine details. This decomposition enables specialized processing via dual parallel pathways: the low-frequency branch employs SSMs for global context modeling, while the high-frequency branch utilizes CNNs to refine local structural details, effectively addressing diverse haze scenarios. Notably, the Laplace transformation facilitates information-preserving downsampling of low-frequency components in accordance with the Nyquist theory, thereby significantly improving computational efficiency. Extensive evaluations across multiple benchmarks demonstrate that our method outperforms state-of-the-art approaches in both restoration quality and efficiency. The source code and pretrained models are available at this https URL.
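
The Laplace-style frequency split that routes content to the two branches can be illustrated with a one-level Laplacian decomposition: a Gaussian-blurred version of the image carries the low-frequency global texture, the residual keeps edges and fine detail, and the low band can additionally be downsampled. The Gaussian kernel and single level are simplifying assumptions; reconstruction of the full-resolution bands is exact by construction.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2)); k /= k.sum()
    for axis in (0, 1):                       # separable 1-D convolutions
        img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"),
                                  axis, img)
    return img

def laplace_split(img):
    """One-level split: a low band (global texture, downsampled 2x for the
    SSM branch) and a high band (edges and detail for the CNN branch)."""
    low_full = gaussian_blur(img)
    high = img - low_full                     # high-frequency residual
    low = low_full[::2, ::2]                  # downsample the smooth band
    return low, high, low_full

img = np.random.default_rng(3).normal(size=(16, 16))
low, high, low_full = laplace_split(img)
print(np.allclose(low_full + high, img))      # True: the split is lossless
```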

[CV-61] MuteSwap: Silent Face-based Voice Conversion

【Quick Read】: This paper tackles voice conversion when clean audio input is unavailable, i.e., Silent Face-based Voice Conversion (SFVC) from silent video. Conventional voice conversion relies on audio from both the source and target speakers and therefore fails for silent videos or noisy environments. To address this, the authors propose the MuteSwap framework, whose key idea is to align cross-modal identities with contrastive learning and to separate shared visual features by minimizing mutual information, so that intelligible speech can be generated and identity converted from visual cues alone.

Link: https://arxiv.org/abs/2507.00498
Authors: Yifan Liu,Yu Fang,Zhouhan Lin
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Click to view the abstract

Abstract:Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which does voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that aligns with the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimize mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.

[CV-62] Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models WWW

【Quick Read】: This paper targets the problem that contemporary vision models rely mainly on local texture cues while neglecting global configural shape information, yielding brittle, non-compositional features. The key to its solution is to recast shape evaluation as absolute configural competence, measured by the Configural Shape Score (CSS): the ability to recognize both images in Object-Anagram pairs, which preserve local texture while permuting the global part arrangement into different object categories. The study finds that high-CSS models depend on long-range interactions and exhibit a distinctive U-shaped integration profile, and that CSS predicts other shape-dependent evaluations, suggesting that architectures and learning frameworks integrating local texture with global configural shape are key to more robust, generalizable, and human-like vision systems.

Link: https://arxiv.org/abs/2507.00493
Authors: Fenil R. Doshi,Thomas Fel,Talia Konkle,George Alvarez
Affiliations: Harvard University (哈佛大学); Kempner Institute (肯普纳研究所); Dept. of Psychology (心理学系)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Click to view the abstract

Abstract:Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers – exemplified by DINOv2, SigLIP2 and EVA-CLIP – occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance, showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. (iv) A BagNet control remains at chance, ruling out “border-hacking” strategies. Finally, (v) we show that the configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.
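
The Configural Shape Score itself is a simple pair-level statistic: a model earns credit on an Object-Anagram pair only when it classifies both images correctly, since the two share local texture and differ only in global part arrangement. The sketch below assumes model predictions are already available; it shows only the scoring rule, not the anagram construction.

```python
def configural_shape_score(predictions, labels):
    """predictions, labels: lists of (label_a, label_b), one tuple per pair.

    A pair counts only if BOTH images, which share local texture but differ
    in global part arrangement, are classified correctly.
    """
    hits = sum(int(pred == true) for pred, true in zip(predictions, labels))
    return hits / len(labels)

# Three pairs: both right, one wrong, both right  ->  CSS = 2/3.
labels      = [("dog", "cat"), ("car", "ship"), ("cup", "hat")]
predictions = [("dog", "cat"), ("car", "cat"),  ("cup", "hat")]
print(configural_shape_score(predictions, labels))  # 0.666...
```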

[CV-63] Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms

【Quick Read】: This paper tackles the scheduling of compound AI (cAI) systems on heterogeneous mobile edge platforms, in particular the efficient handling of dynamically arriving, concurrent inference tasks for deep neural networks (DNNs) and transformers. Existing approaches handle only DNN-only or transformer-only workloads and cannot serve the mixed DNN-transformer inference that cAI systems require. The key to its solution is the Twill framework, which combines task-affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and dynamic voltage and frequency scaling (DVFS) to substantially reduce inference latency while honoring power budgets.

Link: https://arxiv.org/abs/2507.00491
Authors: Zain Taufique,Aman Vyas,Antonio Miele,Pasi Liljeberg,Anil Kanduri
Affiliations: University of Turku(图尔库大学); Politecnico di Milano(米兰理工大学)
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: 9 Pages, 9 Figures, Accepted in International Conference on Computer-Aided Design (ICCAD) 2025

Click to view the abstract

Abstract:Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, relying on design-time profiling, and cannot handle concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework to handle concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and DVFS, while minimizing inference latency within power budgets. We implement and deploy our Twill framework on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average, while honoring power budgets.

[CV-64] Just Noticeable Difference for Large Multimodal Models

【Quick Read】: This paper examines the limits of visual perception in current large multimodal models (LMMs), in particular their perceptual boundaries and visual blind spots, which can lead to security risks and inefficient responses. The key to its solution is a new concept, LMM-JND (LMM Just Noticeable Difference), together with its determination pipeline, and a large-scale dataset, VPA-JND, for systematically quantifying LMMs' visual perceptual differences; the paper further analyzes the correlation between the design philosophies of vision and language backbones to guide future improvements in LMM visual acuity.

Link: https://arxiv.org/abs/2507.00490
Authors: Zijian Chen,Yuan Tian,Yuze Sun,Wei Sun,Zicheng Zhang,Weisi Lin,Guangtao Zhai,Wenjun Zhang
Affiliations: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); East China Normal University (华东师范大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 19 pages, 19 figures

Click to view the abstract

Abstract:Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we make an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systematically quantify this characteristic, we propose a new concept, LMM-JND, together with its determination pipeline. Targeting uncovering the behavior commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at this https URL.
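
A JND determination pipeline in the classical sense increases distortion strength until the model's response to the stimulus departs from its response to the reference; the smallest such strength is the model's just noticeable difference for that distortion. The loop below is a generic sketch with a stand-in "model" and a darkening distortion; the paper's 12 distortion types and actual LMM querying are not reproduced.

```python
import numpy as np

def find_jnd(model, reference, distort, levels):
    """Smallest distortion level at which the model's answer changes."""
    base = model(reference)
    for level in levels:
        if model(distort(reference, level)) != base:
            return level
    return None                    # never noticed within the tested range

# Stand-in "model": thresholds mean brightness into a coarse answer.
model = lambda img: "bright" if img.mean() > 0.5 else "dark"
distort = lambda img, lvl: np.clip(img - lvl, 0.0, 1.0)   # darken by lvl

reference = np.full((8, 8), 0.55)
print(find_jnd(model, reference, distort, levels=np.linspace(0.01, 0.3, 30)))
```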

[CV-65] FreNBRDF: A Frequency-Rectified Neural Material Representation

【Quick Read】: This paper addresses the poor understanding of how material models behave in the frequency domain, and the limited accuracy and robustness of existing implicit neural representations for material appearance reconstruction and editing. The key to its solution is FreNBRDF, a frequency-rectified neural material representation that uses spherical harmonics to integrate frequency-domain considerations into neural BRDF modeling, together with a frequency-rectified loss derived from a frequency analysis of neural materials, improving the fidelity, adaptability, and efficiency of material modeling.

Link: https://arxiv.org/abs/2507.00476
Authors: Chenliang Zhou,Zheyuan Hu,Cengiz Oztireli
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Accurate material modeling is crucial for achieving photorealistic rendering, bridging the gap between computer-generated imagery and real-world photographs. While traditional approaches rely on tabulated BRDF data, recent work has shifted towards implicit neural representations, which offer compact and flexible frameworks for a range of tasks. However, their behavior in the frequency domain remains poorly understood. To address this, we introduce FreNBRDF, a frequency-rectified neural material representation. By leveraging spherical harmonics, we integrate frequency-domain considerations into neural BRDF modeling. We propose a novel frequency-rectified loss, derived from a frequency analysis of neural materials, and incorporate it into a generalizable and adaptive reconstruction and editing pipeline. This framework enhances fidelity, adaptability, and efficiency. Extensive experiments demonstrate that FreNBRDF improves the accuracy and robustness of material appearance reconstruction and editing compared to state-of-the-art baselines, enabling more structured and interpretable downstream tasks and applications.

[CV-66] ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis MICCAI2025

【Quick Read】: This paper addresses the performance drop of deep learning diagnostic models caused by distribution shifts between the training (source) and test (target) domains, where collecting and labeling target-domain data is costly and constrained by time and resources. The proposed solution is ADAptation, an unsupervised active learning (AL) framework for domain adaptation. Its key idea is to use the distribution-homogenizing capability of diffusion models to translate target-domain images into the source-domain style, bridging cross-dataset gaps, and to introduce two innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness.

Link: https://arxiv.org/abs/2507.00474
Authors: Yaofei Duan,Yuhao Huang,Xin Yang,Luyi Han,Xinyu Xie,Zhiyuan Zhu,Ping He,Ka-Hou Chan,Ligang Cui,Sio-Kei Im,Dong Ni,Tao Tan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 4 tables. Accepted by conference MICCAI2025

Click to view the abstract

Abstract:Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: this https URL.
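
The dual-scoring mechanism can be sketched as a weighted combination of an uncertainty term and a representativeness term, after which the annotation budget selects the top-scoring samples. Using predictive entropy for uncertainty, closeness to the global feature centroid for representativeness, and an equal weight are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def entropy(probs):
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def dual_score_select(probs, feats, budget, alpha=0.5):
    """Rank samples by alpha * uncertainty + (1 - alpha) * representativeness.

    probs: (N, C) softmax outputs;  feats: (N, D) embeddings.
    """
    unc = entropy(probs)
    unc = unc / (unc.max() + 1e-12)                      # normalize to [0, 1]
    d = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    rep = 1.0 - d / (d.max() + 1e-12)                    # closer = more typical
    score = alpha * unc + (1 - alpha) * rep
    return np.argsort(score)[::-1][:budget]              # indices to annotate

rng = np.random.default_rng(4)
probs = rng.dirichlet(np.ones(3), size=10)               # fake predictions
feats = rng.normal(size=(10, 16))
print(dual_score_select(probs, feats, budget=3))
```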

[CV-67] ARIG: Autoregressive Interactive Head Generation for Real-time Conversations ICCV2025

【Quick Read】: This paper tackles real-time performance and interaction realism in interactive head generation for face-to-face conversation, where conventional clip-wise generation paradigms or explicit listener/speaker generator-switching methods are limited in future signal acquisition, contextual behavior understanding, and switching smoothness. The key to its solution is ARIG, an autoregressive (AR) frame-wise framework that models motion prediction as a non-vector-quantized AR process and represents the motion distribution with a diffusion procedure, yielding more accurate predictions in continuous space; it further improves interaction realism by emphasizing interactive behavior understanding (IBU) and detailed conversational state understanding (CSU).

Link: https://arxiv.org/abs/2507.00472
Authors: Ying Guo,Xi Liu,Cheng Zhen,Pengfei Yan,Xiaoming Wei
Affiliations: Meituan(美团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Homepage: this https URL

Click to view the abstract

Abstract:Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigm or explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging to be real-time and realistic. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize the real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent motion distribution using diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

[CV-68] Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

【Quick Read】: This paper addresses catastrophic forgetting and update conflict in continual learning for video-language models (VLMs). The proposed solution, Bisecle, introduces a multi-directional supervision module to capture more cross-modal relationships and designs a contrastive prompt learning scheme to isolate task-specific knowledge, enabling efficient memory storage. In addition, binding and separation mechanisms modeled on the hippocampus further strengthen the VLMs' ability to retain complex experiences, achieving robust and efficient continual learning in video understanding tasks.

Link: https://arxiv.org/abs/2507.00469
Authors: Yue Tan,Xiaoqian Hu,Hao Xue,Celso De Melo,Flora D. Salim
Affiliations: University of New South Wales (新南威尔士大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 12 figures, 10 tables

Click to view the abstract

Abstract:Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

[CV-69] Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation

【Quick Read】: This paper addresses the limited ability of vision-language models (VLMs) to adapt to distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate only within CLIP's original feature space and rely on high-confidence samples, overlooking the potential of low-confidence ones. The key to its solution is MS-TTA, a training-free method that refines feature representations beyond the CLIP space with a single-step k-nearest-neighbor (kNN) mean shift, improving feature compactness and class separability and yielding more stable adaptation.

Link: https://arxiv.org/abs/2507.00462
Authors: Jizhou Han,Chenhao Ding,SongLin Dong,Yuhang He,Xinyuan Gao,Yihong Gong
Affiliations: Xi’an Jiaotong University (西安交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP’s original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP’s space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
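
The single-step kNN mean shift at the core of MS-TTA replaces each test embedding with the mean of its k nearest neighbors in the test set, pulling features toward local modes and tightening clusters. The sketch below (the value of k and excluding a point from its own neighborhood are assumptions) shows that one step in isolation; the cache of refined embeddings and the logit computation are omitted.

```python
import numpy as np

def knn_mean_shift(feats, k=5):
    """One mean-shift step: move each feature to the mean of its k-NN.

    feats: (N, D) L2-normalized test embeddings. Returns refined (N, D).
    """
    sim = feats @ feats.T                        # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)               # exclude the point itself
    nn = np.argsort(sim, axis=1)[:, -k:]         # indices of the k nearest
    shifted = feats[nn].mean(axis=1)             # mean of each point's k-NN
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

rng = np.random.default_rng(5)
clusters = np.concatenate([rng.normal(m, 0.3, size=(20, 8)) for m in (-1, 1)])
feats = clusters / np.linalg.norm(clusters, axis=1, keepdims=True)
refined = knn_mean_shift(feats)
# Within-cluster spread shrinks after the shift.
print(feats[:20].std(), refined[:20].std())
```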

[CV-70] ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

【Quick Read】: This paper addresses the misalignment between visual inputs and language descriptions in Visual-Language Tracking (VLT) caused by target movement. Although existing trackers explore various effective feature-modification methods to preserve better-aligned features, they remain limited by the inherent difference in the temporal and spatial scales of information between visual and language inputs. The key to its solution, ATSTrack, is to align the temporal and spatial scales of the two inputs: each language description is decomposed into phrases with different attributes according to their temporal and spatial correspondence with the visual input and modified in a fine-grained manner, and a visual-language token carrying the modified linguistic information from the previous frame guides the model to extract visual features more relevant to the description, mitigating the impact of the spatial-scale difference.

Link: https://arxiv.org/abs/2507.00454
Authors: Yihao Zhen,Qiang Wang,Yu Qiao,Liangqiong Qu,Huijie Fan
Affiliations: Shenyang Institute of Automation, CAS (中国科学院沈阳自动化研究所); School of Information Engineering, Shenyang University (沈阳大学信息工程学院); School of Software, Shandong University (山东大学软件学院); The University of Hong Kong (香港大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability, which is the inherent differences in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker that enhances the effect of feature modification by Aligning Temporal and Spatial scale of different input components, named ATSTrack. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.

[CV-71] Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration

【Quick Read】: This paper addresses the perception-distortion tradeoff in face restoration. Existing methods such as Posterior-Mean Rectified Flow (PMRF) are effective, but their pixel-space modeling limits alignment with human perception. The key to the paper's solution is to reformulate PMRF in the latent space of a variational autoencoder (VAE), i.e., Latent-PMRF, so that optimization aligns better with human perception. By defining the source distribution on the latent representation of the minimum-distortion estimate, the method bounds the minimum distortion by the VAE's reconstruction error, and the proposed VAE is shown to be superior in both reconstruction and restoration.

Link: https://arxiv.org/abs/2507.00447
Authors: Xin Luo,Menglin Zhang,Yunwei Lan,Tianyu Zhang,Rui Li,Chang Liu,Dong Liu
Affiliations: University of Science and Technology of China (中国科学技术大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Code and Models will be publicly available at this https URL

Click to view the abstract

Abstract:The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow based approach where source distribution is minimum distortion estimations. Although PMRF is shown to be effective, its pixel-space modeling approach limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of minimum distortion estimation, we bound the minimum distortion by the VAE’s reconstruction error. Moreover, we reveal the design of VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.
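
Rectified flow, the vehicle PMRF and Latent-PMRF build on, has a compact recipe: train a velocity field on straight-line interpolations between the source latent (here, the encoding of the minimum-distortion estimate) and the target latent, then integrate it with Euler steps at inference. The sketch below uses a closed-form "perfectly learned" field for a single pair purely as a stand-in; the VAE encoder/decoder and the actual network are assumed and omitted.

```python
import numpy as np

def rf_training_pair(z_src, z_tgt, rng):
    """Rectified-flow training sample: a point on the straight path plus the
    constant velocity target the network v(z_t, t) should regress onto."""
    t = rng.uniform()
    z_t = (1 - t) * z_src + t * z_tgt
    return z_t, t, z_tgt - z_src

def rf_sample(v_fn, z_src, steps=10):
    """Euler integration of the learned velocity field from the source latent."""
    z, dt = z_src.copy(), 1.0 / steps
    for i in range(steps):
        z = z + dt * v_fn(z, i * dt)
    return z

rng = np.random.default_rng(7)
z_src, z_tgt = rng.normal(size=8), rng.normal(size=8)
z_t, t, v_target = rf_training_pair(z_src, z_tgt, rng)   # one training example
# For a single pair, the ideal field is the constant displacement:
v_star = lambda z, t: z_tgt - z_src
print(np.allclose(rf_sample(v_star, z_src), z_tgt))      # True
```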

[CV-72] RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

【Quick Read】: This paper targets the limitations of how current bimanual manipulation policies are evaluated: existing benchmarks report only binary task success, concealing key weaknesses in policy behavior such as poor coordination, slipping during grasping, or asymmetric arm usage. The key to its solution is RoboEval, a simulation benchmark and structured evaluation framework with tiered, semantically grounded tasks decomposed into skill-specific stages, whose systematic variations challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and more than 3,000 human demonstrations to support imitation learning, enabling a more complete understanding and assessment of bimanual manipulation policies.

Link: https://arxiv.org/abs/2507.00435
Authors: Yi Ru Wang,Carter Ung,Grant Tannert,Jiafei Duan,Josephine Li,Amy Le,Rishabh Oswal,Markus Grotz,Wilbert Pumacay,Yuquan Deng,Ranjay Krishna,Dieter Fox,Siddhartha Srinivasa
Affiliations: University of Washington (华盛顿大学); University of Houston (休斯顿大学); Allen Institute for AI (艾伦人工智能研究所)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view the abstract

Abstract:We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior – such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed – some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs, and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation – and highlights the need for evaluation tools that go beyond success alone.

[CV-73] MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition

【Quick Read】: This paper addresses the challenges that complex formula structures and character layouts pose for sequence prediction in handwritten mathematical expression recognition (HMER). The key to its solution is to bring frequency-domain analysis into HMER: the proposed method that marries the frequency domain with HMER (MFH) exploits frequency information via the discrete cosine transform (DCT) to assist structural analysis and improve formula recognition.

Link: https://arxiv.org/abs/2507.00430
Authors: Huanxin Yang,Qiwen Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries the frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at this https URL.
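
The frequency cue MFH feeds to the recognizer comes from the discrete cosine transform. As a minimal illustration of how DCT coefficients expose structure, the sketch below takes the 2-D DCT of an image patch and keeps the top-left low-frequency block as a compact descriptor; the block size and the way such features would be injected into a baseline HMER model are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(patch, keep=8):
    """2-D DCT of an image patch, truncated to the low-frequency corner.

    patch: (H, W) grayscale array. Returns a (keep * keep,) feature vector.
    """
    coeffs = dctn(patch, norm="ortho")   # energy compacts to the top-left
    return coeffs[:keep, :keep].ravel()

patch = np.random.default_rng(6).normal(size=(32, 32))
feat = dct_features(patch)
print(feat.shape)                        # (64,)
```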

[CV-74] DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting ICCV2025

【Quick Read】: This paper addresses the challenges of performing multiple text-guided 3D inpainting tasks in a unified framework: 1) single-reference inpainting lacks robustness for views far from the reference view; 2) independently inpainting multi-view images with 2D diffusion priors causes appearance inconsistency; and 3) geometric inconsistency limits performance when the inpainted regions undergo significant geometric change. The key to its solution is DiGA3D, a novel and versatile 3D inpainting pipeline that uses diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner, with a multi-reference view selection strategy, an Attention Feature Propagation (AFP) mechanism, and a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to improve appearance and geometric consistency.

Link: https://arxiv.org/abs/2507.00429
Authors: Jingyi Pan,Dan Xu,Qiong Luo
Affiliations: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project page: this https URL

Click to view the abstract

Abstract:Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view. 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. The project page is available at this https URL.

[CV-75] Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding

【Quick Read】: This paper addresses the weak spatial understanding of Vision-Language-Action (VLA) models, which stems from their reliance on pretrained vision-language models (VLMs) trained mainly on 2D image-text pairs without 3D spatial supervision. The key to its solution is a plug-and-play module that implicitly injects 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model, improving their spatial understanding.

Link: https://arxiv.org/abs/2507.00416
Authors: Tao Lin,Gen Li,Yilei Zhong,Yanwen Zou,Bo Zhao
Affiliations: School of AI, Shanghai Jiao Tong University (人工智能学院,上海交通大学); EvoMind Tech (EvoMind科技); IAAR-Shanghai (IAAR-上海)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale text pretraining. However, VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this necessitates additional depth sensors or relies on potentially defective depth estimation. In contrast, our work introduces a plug-and-play module that implicitly injects 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. We design five spatially challenging tasks that require precise spatial understanding ability to validate the effectiveness of our method. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
zh

[CV-76] Few-shot Classification as Multi-instance Verification: Effective Backbone-agnostic Transfer across Domains

【Quick Read】: This paper studies cross-domain few-shot learning under the constraint that fine-tuning the backbone (i.e., the feature extractor) is impossible or infeasible, an increasingly common scenario in practice. To cope with the low-quality, static embeddings produced by frozen "black-box" backbones, the authors cast few-shot classification as a series of multiple instance verification (MIV) tasks. The key contribution is the "MIV-head", which resembles a classification head but is agnostic to any pretrained backbone and computationally efficient. Trained on few-shot data from a target domain, it performs strongly on test data from that domain without fine-tuning the backbone, substantially reducing adaptation cost.

Link: https://arxiv.org/abs/2507.00401
Authors: Xin Xu, Eibe Frank, Geoffrey Holmes
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible – a scenario that is increasingly common in practical use cases. Handling the low-quality and static embeddings produced by frozen, "black-box" backbones leads us to represent the few-shot classification problem as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the "MIV-head", akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the "meta-testing" phase. Experimenting under various settings and on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy when compared to state-of-the-art "adapter" (or partial fine-tuning) methods applied to the same backbones, while incurring substantially lower adaptation cost. We also find well-known "classification head" approaches lag far behind in terms of accuracy. An ablation study empirically justifies the core components of our approach. We share our code at this https URL.

[CV-77] Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

【Quick Read】: This paper addresses two limitations of conventional feature matching: reliance on scarce, clean multi-view image collections, which restricts generalization to diverse and challenging scenes, and feature encoders trained only on single-view 2D images, which struggle to capture 3D-aware correspondences. The proposed two-stage framework, Lift to Match (L2M), first learns a 3D-aware feature encoder by combining multi-view image synthesis with a 3D feature Gaussian representation, and then learns a robust feature decoder using a novel-view rendering strategy together with large-scale synthetic data generated from single-view images, achieving generalization across domains.

Link: https://arxiv.org/abs/2507.00392
Authors: Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Institutions: Beijing Institute of Technology; School of Computer Science and Engineering, Southeast University; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as \textbfLift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.

[CV-78] MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis

【Quick Read】: This paper targets the scarcity of high-quality training data that limits deep learning models for medical image segmentation. The proposed MedDiff-FT is a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. At inference, a dynamic adaptive guiding mask enforces anatomical coherence, a lightweight stochastic mask generator injects hierarchical randomness for diversity, and an automated quality assessment protocol followed by mask corrosion further refines the fidelity of the generated images.

Link: https://arxiv.org/abs/2507.00377
Authors: Jianhao Xie, Ziang Zhang, Zhenyu Weng, Yuesheng Zhu, Guibo Luo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Abstract:Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data. While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets, MedDiff-FT's synthetic image-mask pairs improve the segmentation performance of SOTA methods by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. The code is available at this https URL.

[CV-79] Customizable ROI-Based Deep Image Compression

【Quick Read】: This paper addresses the inflexibility of conventional Region of Interest (ROI)-based image compression in the face of diverse user needs: the ROI definition is fixed, and the quality trade-off between ROI and non-ROI cannot be managed effectively. The proposed customizable ROI-based deep image compression paradigm rests on three components: a Text-controlled Mask Acquisition (TMA) module that lets users customize the ROI via text input; a Customizable Value Assign (CVA) mechanism that masks the non-ROI to a user-chosen extent to manage the quality trade-off; and a Latent Mask Attention (LMA) module that fuses the latent spatial prior of the mask with the latent Rate-Distortion Optimization (RDO) prior of the image to optimize the latent representation of the source image.

Link: https://arxiv.org/abs/2507.00373
Authors: Ian Jin, Fanxin Xia, Feng Ding, Xinfeng Zhang, Meiqin Liu, Yao Zhao, Weisi Lin, Lili Meng
Institutions: Nanyang Technological University; China-Singapore International Joint Research Institute; Shandong Normal University; Simon Fraser University; University of Chinese Academy of Sciences; Beijing Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic text. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI.
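
The CVA mechanism reduces to scaling the non-ROI part of the latent by a user-chosen factor instead of a constant. A minimal PyTorch sketch of that idea, assuming a binary ROI mask and a scalar `keep_ratio` (both names are illustrative, not from the paper's code):

```python
import torch

def customizable_value_assign(latent: torch.Tensor,
                              roi_mask: torch.Tensor,
                              keep_ratio: float) -> torch.Tensor:
    """Scale non-ROI latent values by a user-chosen factor.

    latent:     (B, C, H, W) latent representation from the encoder
    roi_mask:   (B, 1, H, W) binary mask, 1 inside the ROI
    keep_ratio: 1.0 keeps non-ROI quality intact, 0.0 suppresses it fully
    """
    weights = roi_mask + keep_ratio * (1.0 - roi_mask)
    return latent * weights

# Example: favor the ROI strongly by weighting the non-ROI at 0.2
latent = torch.randn(1, 192, 16, 16)
roi_mask = torch.zeros(1, 1, 16, 16)
roi_mask[..., 4:12, 4:12] = 1.0
masked = customizable_value_assign(latent, roi_mask, keep_ratio=0.2)
```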

[CV-80] Efficient Depth- and Spatially-Varying Image Simulation for Defocus Deblur

【速读】:该论文旨在解决大光圈现代相机因景深浅导致的焦外模糊问题,特别是在固定对焦相机(如智能眼镜中使用的相机)中,由于体积和功耗限制难以添加自动对焦机制。现有开源数据集训练的深度学习模型因与实际相机系统的光学像差和失焦特性不匹配而存在领域差距,导致在真实场景中表现不佳。论文提出的解决方案的关键在于一种高效且可扩展的数据集合成方法,该方法无需依赖真实世界数据进行微调,同时建模深度相关的失焦效应和空间变化的光学像差,从而缓解计算复杂性和高质量RGB-D数据集稀缺的问题。

链接: https://arxiv.org/abs/2507.00372
作者: Xinge Yang,Chuong Nguyen,Wenbin Wang,Kaizhang Kang,Wolfgang Heidrich,Xiaoxing Li
机构: KAUST(卡耐基梅隆大学阿卜杜勒阿齐兹国王科技大学); Meta Reality Labs(元宇宙实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Modern cameras with large apertures often suffer from a shallow depth of field, resulting in blurry images of objects outside the focal plane. This limitation is particularly problematic for fixed-focus cameras, such as those used in smart glasses, where adding autofocus mechanisms is challenging due to form factor and power constraints. Due to unmatched optical aberrations and defocus properties unique to each camera system, deep learning models trained on existing open-source datasets often face domain gaps and do not perform well in real-world settings. In this paper, we propose an efficient and scalable dataset synthesis approach that does not rely on fine-tuning with real-world data. Our method simultaneously models depth-dependent defocus and spatially varying optical aberrations, addressing both computational complexity and the scarcity of high-quality RGB-D datasets. Experimental results demonstrate that a network trained on our low resolution synthetic images generalizes effectively to high resolution (12MP) real-world images across diverse scenes.
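
The depth-dependent half of such a simulation can be approximated with a layered model: bin the depth map, blur each layer with a kernel that grows with distance from the focal plane, and composite. A rough PyTorch sketch, using a Gaussian kernel as a stand-in for the circle of confusion (the paper additionally models spatially varying aberrations, which are omitted here):

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma: float, radius: int) -> torch.Tensor:
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    g = torch.exp(-0.5 * (x / max(sigma, 1e-6)) ** 2)
    k2d = torch.outer(g, g)
    return k2d / k2d.sum()

def depth_varying_defocus(img, depth, focus_dist, max_sigma=4.0, n_bins=8):
    """img: (1,3,H,W); depth: (1,1,H,W); focus_dist in the same units as depth."""
    d_min, d_max = float(depth.min()), float(depth.max())
    edges = torch.linspace(d_min, d_max, n_bins + 1)
    bin_idx = torch.bucketize(depth, edges[1:-1])   # (1,1,H,W), values 0..n_bins-1
    out = torch.zeros_like(img)
    for i in range(n_bins):
        center = 0.5 * float(edges[i] + edges[i + 1])
        # circle-of-confusion proxy: blur grows with distance from the focal plane
        sigma = max_sigma * abs(center - focus_dist) / (d_max - d_min + 1e-6)
        radius = max(int(3 * sigma), 1)
        k = gaussian_kernel(sigma, radius)[None, None].repeat(3, 1, 1, 1)
        blurred = F.conv2d(img, k, padding=radius, groups=3)
        out = out + blurred * (bin_idx == i).float()
    return out

img = torch.rand(1, 3, 128, 128)
depth = torch.rand(1, 1, 128, 128) * 10.0           # fake metric depth in [0, 10]
result = depth_varying_defocus(img, depth, focus_dist=2.0)
```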

[CV-81] PlantSegNeRF: A few-shot cross-dataset method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching

【Quick Read】: This paper addresses the difficulty of organ segmentation in plant point clouds, where existing techniques are limited in resolution, segmentation accuracy, and generalization across species. The proposed PlantSegNeRF directly generates high-precision instance point clouds from multi-view RGB image sequences by combining 2D instance segmentation, a specially designed instance matching module, and an instance NeRF, enabling efficient and accurate organ segmentation across a wide range of plant species.

Link: https://arxiv.org/abs/2507.00371
Authors: Xin Yang (1 and 2), Ruiming Du (3), Hanyang Huang (1 and 2), Jiayang Xie (1 and 2), Pengyao Xie (1 and 2), Leisen Fang (1 and 2), Ziyue Guo (1 and 2), Nanjun Jiang (4), Yu Jiang (5), Haiyan Cen (1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, (3) Department of Biological and Environmental Engineering, Cornell University, (4) Amway (China) Botanical R and D Center, (5) Horticulture Section, School of Integrative Plant Science, Cornell AgriTech)
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex datasets. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant datasets, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.

[CV-82] Out-of-Distribution Detection with Adaptive Top-K Logits Integration

【速读】:该论文旨在解决机器学习模型在面对分布外(out-of-distribution, OOD)样本时产生过度自信预测的问题,从而提升模型的安全性。其解决方案的关键在于发现除了最大对数几率(MaxLogit)之外,其他一些对数几率也对OOD检测具有价值,并基于此提出了一种新的方法ATLI(Adaptive Top-k Logits Integration),该方法自适应地选择与模型相关的有效top-k对数几率,并将最大对数几率与其他top-k对数几率进行融合,从而提升了OOD检测的性能。实验结果表明,该方法在ImageNet-1K基准上相比MaxLogit方法将FPR95降低了6.73%,并优于其他先进方法。

链接: https://arxiv.org/abs/2507.00368
作者: Hikaru Shijo,Yutaka Yoshihama,Kenichi Yadani,Norifumi Murata
机构: Panasonic Automotive Systems Co Ltd(松下汽车系统公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural networks often make overconfident predictions from out-of-distribution (OOD) samples. Detection of OOD data is therefore crucial to improve the safety of machine learning. The simplest and most powerful method for OOD detection is MaxLogit, which uses the model’s maximum logit to provide an OOD score. We have discovered that, in addition to the maximum logit, some other logits are also useful for OOD detection. Based on this finding, we propose a new method called ATLI (Adaptive Top-k Logits Integration), which adaptively determines effective top-k logits that are specific to each model and combines the maximum logit with the other top-k logits. In this study we evaluate our proposed method using ImageNet-1K benchmark. Extensive experiments showed our proposed method to reduce the false positive rate (FPR95) by 6.73% compared to the MaxLogit approach, and decreased FPR95 by an additional 2.67% compared to other state-of-the-art methods.
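
The scoring idea is easy to state next to the MaxLogit baseline: instead of keeping only the largest logit, combine the top-k. A minimal sketch, where summing the top-k is one simple instantiation and the fixed k stands in for the paper's adaptive, model-specific selection:

```python
import torch

def maxlogit_score(logits: torch.Tensor) -> torch.Tensor:
    """MaxLogit baseline: the OOD score is the single largest logit."""
    return logits.max(dim=-1).values

def topk_logit_score(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Combine the maximum logit with the other top-k logits; summing them
    is one simple instantiation. ATLI selects an effective, model-specific
    k adaptively, which a fixed k only approximates here."""
    return logits.topk(k, dim=-1).values.sum(dim=-1)

logits = torch.randn(4, 1000)            # e.g., ImageNet-1K classifier outputs
scores = topk_logit_score(logits, k=5)   # lower score -> more likely OOD
```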

[CV-83] An Improved U-Net Model for Offline handwriting signature denoising

【Quick Read】: This paper addresses the difficulty of identifying offline handwritten signatures in forensic examination, where the provided samples are often mixed with large amounts of interfering information that degrades the accuracy and reliability of signature recognition systems. The key solution is a signature denoising model based on an improved U-Net architecture that incorporates the discrete wavelet transform and principal component analysis (PCA) to strengthen noise suppression, effectively improving the clarity and readability of signature images.

Link: https://arxiv.org/abs/2507.00365
Authors: Wanghui Xiao
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Handwriting signatures, as an important means of identity recognition, are widely used in multiple fields such as financial transactions, commercial contracts and personal affairs due to their legal effect and uniqueness. In forensic science appraisals, the analysis of offline handwriting signatures requires the appraiser to provide a certain number of signature samples, which are usually derived from various historical contracts or archival materials. However, the provided handwriting samples are often mixed with a large amount of interfering information, which brings severe challenges to handwriting identification work. This study proposes a signature handwriting denoising model based on the improved U-net structure, aiming to enhance the robustness of the signature recognition system. By introducing discrete wavelet transform and PCA transform, the model's ability to suppress noise has been enhanced. The experimental results show that this model is significantly superior to the traditional methods in denoising effect, can effectively improve the clarity and readability of the signature images, and provide more reliable technical support for signature analysis and recognition.
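
The wavelet front-end can be illustrated with a single-level 2D DWT whose four subbands are stacked as input channels for the denoising U-Net. A small sketch using PyWavelets (how the paper actually wires the DWT and PCA into the improved U-Net may differ):

```python
import numpy as np
import pywt
import torch

def dwt_channels(gray_img: np.ndarray) -> torch.Tensor:
    """Decompose a grayscale signature image with a single-level Haar DWT
    and stack the four subbands as input channels for a U-Net."""
    cA, (cH, cV, cD) = pywt.dwt2(gray_img.astype(np.float32), "haar")
    bands = np.stack([cA, cH, cV, cD], axis=0)   # (4, H/2, W/2)
    return torch.from_numpy(bands).unsqueeze(0)  # (1, 4, H/2, W/2)

x = dwt_channels(np.random.rand(256, 256))       # feed to the denoising U-Net
```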

[CV-84] GDGS: 3D Gaussian Splatting Via Geometry-Guided Initialization And Dynamic Density Control

【Quick Read】: This paper targets the challenges that 3D Gaussian Splatting (3DGS) faces in initialization, optimization, and density control: existing methods depend on accurate initialization, struggle to optimize unstructured Gaussians into ordered surfaces, and lack effective adaptive density control. The key contributions are a geometry-guided initialization that predicts Gaussian parameters for precise placement and fast convergence, a surface-aligned optimization strategy that improves geometric accuracy by aligning Gaussians with the scene's surface normals, and a dynamic adaptive density control mechanism that adjusts Gaussian density according to regional complexity to preserve visual fidelity. Together these enable high-fidelity real-time rendering even in complex scenes.

Link: https://arxiv.org/abs/2507.00363
Authors: Xingjun Wang, Lianlei Shan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a method to enhance 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), addressing challenges in initialization, optimization, and density control. Gaussian Splatting is an alternative for rendering realistic images while supporting real-time performance, and it has gained popularity due to its explicit 3D Gaussian representation. However, 3DGS heavily depends on accurate initialization and faces difficulties in optimizing unstructured Gaussian distributions into ordered surfaces, with only limited adaptive density control mechanisms proposed so far. Our first key contribution is a geometry-guided initialization to predict Gaussian parameters, ensuring precise placement and faster convergence. We then introduce a surface-aligned optimization strategy to refine Gaussian placement, improving geometric accuracy and aligning with the surface normals of the scene. Finally, we present a dynamic adaptive density control mechanism that adjusts Gaussian density based on regional complexity, for visual fidelity. These innovations enable our method to achieve high-fidelity real-time rendering and significant improvements in visual quality, even in complex scenes. Our method demonstrates comparable or superior results to state-of-the-art methods, rendering high-fidelity images in real time.

[CV-85] CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

【Quick Read】: This paper addresses the limited acquisition channels for ultra-high-resolution optical remote sensing imagery, which have constrained progress on high-resolution remote sensing vision foundation models (RSVFM). The key solution is CGEarthEye, a framework tailored to the characteristics of the Jilin-1 satellite constellation, comprising five backbones of different parameter scales with 2.1 billion parameters in total. The authors also construct JLSSD, a 15-million-sample multi-temporal self-supervised learning (SSL) dataset built through multi-level representation clustering and sampling strategies, and pre-train with seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies to strengthen the model's representations.

Link: https://arxiv.org/abs/2507.00356
Authors: Zhiwei Yi, Xin Cheng, Jingyu Ma, Ruifei Zhu, Junwei Tian, Yuanxiu Zhou, Xinge Zhao, Hongzhe Li
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: A Remote Sensing Foundation Model for Very High Resolution Images

Abstract:Deep learning methods have significantly advanced the development of intelligent interpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world's largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, a RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones of different parameter scales totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye's superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient applications of Jilin-1 data in traditional EO applications.
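
Seasonal contrast admits a compact formulation: embeddings of the same location from two quarters form a positive pair, while other locations in the batch act as negatives. A generic InfoNCE-style sketch of that strategy (not CGEarthEye's exact loss):

```python
import torch
import torch.nn.functional as F

def seasonal_info_nce(z_season_a, z_season_b, temperature=0.1):
    """InfoNCE-style seasonal contrast: two quarterly views of the same
    location are a positive pair; other locations in the batch are negatives."""
    za = F.normalize(z_season_a, dim=-1)   # (N, D) embeddings, season A
    zb = F.normalize(z_season_b, dim=-1)   # (N, D) embeddings, season B
    logits = za @ zb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = seasonal_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```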

[CV-86] Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video

【Quick Read】: This paper targets amodal segmentation and amodal content completion of occluded objects in complex scenes, whose core challenge is estimating occluded object masks and features from object priors. The key contribution is the MOVi-MC-AC dataset, the largest amodal segmentation dataset to date and the first to provide ground-truth amodal content. Its multi-camera setup enables object identification and tracking across viewpoints and supplies consistent object IDs for detection and segmentation in synthetic video, improving models' understanding of naturally occluded scenes.

Link: https://arxiv.org/abs/2507.00339
Authors: Alexander Moore, Amar Saini, Kylie Cancilla, Doug Poland, Carmen Carrano
Institutions: Lawrence Livermore National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures

Abstract:Amodal segmentation and amodal content completion require using object priors to estimate occluded masks and features of objects in complex scenes. Until now, no data has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation and first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC contributes to the growing literature of object detection, tracking, and segmentation by making two new contributions to the deep learning for computer vision world. Multiple Camera (MC) settings where objects can be identified and tracked between various unique camera perspectives are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object ids for detections and segmentations between both frames and multiple cameras each with unique features and motion patterns on a single scene. Amodal Content (AC) is a reconstructive task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels. While other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels, they do not account for natural occlusions present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, setting a new maximum in the amodal dataset literature, along with being the first to provide ground-truth amodal content. The full dataset is available at this https URL.

[CV-87] Populate-A-Scene: Affordance-Aware Human Video Generation

【Quick Read】: This paper asks whether a video generation model can be repurposed as an interactive world simulator, probing the affordance perception of text-to-video models. Given a scene image and a prompt describing a human action, the model is fine-tuned to insert a person into the scene while keeping behavior, appearance, harmonization, and scene affordance coherent. Unlike prior work, it infers human affordance (where to insert a person and how they should behave) from a single scene image, without explicit conditions such as bounding boxes or body poses.

Link: https://arxiv.org/abs/2507.00334
Authors: Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.

[CV-88] Scope Meets Screen: Lessons Learned in Designing Composite Visualizations for Marksmanship Training Across Skill Levels IEEE-VIS2025

【Quick Read】: This paper addresses the limitations of traditional marksmanship training, in which the coach cannot see through the shooter's eyes in real time and post-session analysis is restricted to stance and accuracy. The key solution is a shooting visualization system that overlays metrics and graphical summaries on first-person shooting video recordings, forming composite visualization views that improve the understanding and analysis of shooting performance.

Link: https://arxiv.org/abs/2507.00333
Authors: Emin Zerman, Jonas Carlsson, Mårten Sjöström
Institutions: Mid Sweden University
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: 5 pages, accepted at IEEE VIS 2025

Abstract:Marksmanship practices are required in various professions, including police, military personnel, hunters, as well as sports shooters, such as Olympic shooting, biathlon, and modern pentathlon. The current form of training and coaching is mostly based on repetition, where the coach does not see through the eyes of the shooter, and analysis is limited to stance and accuracy post-session. In this study, we present a shooting visualization system and evaluate its perceived effectiveness for both novice and expert shooters. To achieve this, five composite visualizations were developed using first-person shooting video recordings enriched with overlaid metrics and graphical summaries. These views were evaluated with 10 participants (5 expert marksmen, 5 novices) through a mixed-methods study including shot-count and aiming interpretation tasks, pairwise preference comparisons, and semi-structured interviews. The results show that a dashboard-style composite view, combining raw video with a polar plot and selected graphs, was preferred in 9 of 10 cases and supported understanding across skill levels. The insights gained from this design study point to the broader value of integrating first-person video with visual analytics for coaching, and we suggest directions for applying this approach to other precision-based sports.

[CV-89] MammoTracker: Mask-Guided Lesion Tracking in Temporal Mammograms

【Quick Read】: This paper tackles the difficulty of automatically tracking lesions across temporal mammograms, which has limited the effectiveness of computer-aided diagnosis (CAD) systems. The key solution is MammoTracker, a mask-guided lesion tracking framework that follows a coarse-to-fine strategy with three core modules, global search, local search, and score refinement, automating lesion localization across consecutive exams.

Link: https://arxiv.org/abs/2507.00328
Authors: Xuan Liu, Yinhao Ren, Marc D. Ryser, Lars J. Grimm, Joseph Y. Lo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate lesion tracking in temporal mammograms is essential for monitoring breast cancer progression and facilitating early diagnosis. However, automated lesion correspondence across exams remains a challenge in computer-aided diagnosis (CAD) systems, limiting their effectiveness. We propose MammoTracker, a mask-guided lesion tracking framework that automates lesion localization across consecutive exams. Our approach follows a coarse-to-fine strategy incorporating three key modules: global search, local search, and score refinement. To support large-scale training and evaluation, we introduce a new dataset with curated prior-exam annotations for 730 mass and calcification cases from the public EMBED mammogram dataset, yielding over 20,000 lesion pairs, making it the largest known resource for temporal lesion tracking in mammograms. Experimental results demonstrate that MammoTracker achieves 0.455 average overlap and 0.509 accuracy, surpassing baseline models by 8%, highlighting its potential to enhance CAD-based lesion progression analysis. Our dataset will be available at this https URL.

[CV-90] Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes ICCV2025

【Quick Read】: This paper addresses the limited adaptability of Low-Rank Adaptation (LoRA) under large domain gaps, where its fixed low-rank structure struggles to capture domain-specific complexity. The key idea of the proposed Stable Rank-Guided Low-Rank Adaptation (SR-LoRA) is to use the stable rank of the pre-trained weight matrices as a natural prior for layer-wise rank allocation, enabling a principled and efficient redistribution of ranks based on intrinsic dimensionality that improves adaptability without extra search cost.

Link: https://arxiv.org/abs/2507.00327
Authors: Chuyan Zhang, Kefan Wang, Yun Gu
Institutions: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at this https URL.
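
The stable rank that SR-LoRA uses as a prior is simply the squared Frobenius norm over the squared spectral norm of a weight matrix. A sketch that computes it and distributes a total LoRA rank budget proportionally, which is one plausible reading of the allocation rule (the paper's exact rule may differ):

```python
import torch

def stable_rank(w: torch.Tensor) -> float:
    """Stable rank of a weight matrix: ||W||_F^2 / ||W||_2^2."""
    fro_sq = w.pow(2).sum()
    spec = torch.linalg.matrix_norm(w, ord=2)   # largest singular value
    return float(fro_sq / spec.pow(2))

def allocate_lora_ranks(weights, rank_budget: int):
    """Spread a total LoRA rank budget across layers in proportion to
    each layer's stable rank (the proportional rule is an assumption)."""
    sr = {name: stable_rank(w) for name, w in weights.items()}
    total = sum(sr.values())
    return {name: max(1, round(rank_budget * s / total))
            for name, s in sr.items()}

layers = {"q_proj": torch.randn(768, 768), "v_proj": torch.randn(768, 768)}
print(allocate_lora_ranks(layers, rank_budget=32))
```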

[CV-91] Exploring Theory-Laden Observations in the Brain Basis of Emotional Experience

【Quick Read】: This paper examines how taxonomic assumptions in emotion research can shape scientific conclusions, i.e., whether folk emotion categories really constitute stable biological and psychological types. The key move is to adopt an alternative view of emotion categories as populations of variable, situated instances and to reanalyze the data accordingly, making minimal assumptions about the structure of the variance; this reveals substantial variation in emotion-related brain patterns across individuals.

Link: https://arxiv.org/abs/2507.00320
Authors: Christiana Westlin, Ashutosh Singh, Deniz Erdogmus, Georgios Stratis, Lisa Feldman Barrett
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:In the science of emotion, it is widely assumed that folk emotion categories form a biological and psychological typology, and studies are routinely designed and analyzed to identify emotion-specific patterns. This approach shapes the observations that studies report, ultimately reinforcing the assumption that guided the investigation. Here, we reanalyzed data from one such typologically-guided study that reported mappings between individual brain patterns and group-averaged ratings of 34 emotion categories. Our reanalysis was guided by an alternative view of emotion categories as populations of variable, situated instances, and which predicts a priori that there will be significant variation in brain patterns within a category across instances. Correspondingly, our analysis made minimal assumptions about the structure of the variance present in the data. As predicted, we did not observe the original mappings and instead observed significant variation across individuals. These findings demonstrate how starting assumptions can ultimately impact scientific conclusions and suggest that a hypothesis must be supported using multiple analytic methods before it is taken seriously.

[CV-92] Reducing Variability of Multiple Instance Learning Methods for Digital Pathology MICCAI2025

【Quick Read】: This paper addresses the instability of Multiple Instance Learning (MIL) methods for whole slide image (WSI) classification, whose performance can vary by up to 10-15 AUC points across runs, making reliable comparison of MIL methods difficult. The key solution is a multi-fidelity, model fusion strategy: several models are trained for a few epochs, and the most stable and promising ones, selected by validation scores, are averaged. This reduces performance variability, simplifies hyperparameter tuning, and improves reproducibility.

Link: https://arxiv.org/abs/2507.00292
Authors: Ali Mammadov, Loïc Le Folgoc, Guillaume Hocquet, Pietro Gori
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2025

Abstract:Digital pathology has revolutionized the field by enabling the digitization of tissue samples into whole slide images (WSIs). However, the high resolution and large size of WSIs present significant challenges when it comes to applying Deep Learning models. As a solution, WSIs are often divided into smaller patches with a global label (i.e., diagnostic) per slide, instead of a (too) costly pixel-wise annotation. By treating each slide as a bag of patches, Multiple Instance Learning (MIL) methods have emerged as a suitable solution for WSI classification. A major drawback of MIL methods is their high variability in performance across different runs, which can reach up to 10-15 AUC points on the test set, making it difficult to compare different MIL methods reliably. This variability mainly comes from three factors: i) weight initialization, ii) batch (shuffling) ordering, and iii) learning rate. To address that, we introduce a Multi-Fidelity, Model Fusion strategy for MIL methods. We first train multiple models for a few epochs and average the most stable and promising ones based on validation scores. This approach can be applied to any existing MIL model to reduce performance variability. It also simplifies hyperparameter tuning and improves reproducibility while maintaining computational efficiency. We extensively validate our approach on WSI classification tasks using 2 different datasets, 3 initialization strategies and 5 MIL methods, for a total of more than 2000 experiments.
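
The fusion step amounts to a model-soup-style average of the top runs by validation score. A minimal PyTorch sketch, where the selection rule (plain top-k by score) is a simplification of the paper's choice of the "most stable and promising" models:

```python
import copy
import torch

@torch.no_grad()
def fuse_top_models(models, val_scores, top_k=3):
    """Average the weights of the top-k runs ranked by validation score."""
    ranked = sorted(zip(val_scores, models), key=lambda p: p[0], reverse=True)
    chosen = [m for _, m in ranked[:top_k]]
    fused = copy.deepcopy(chosen[0])
    state = fused.state_dict()
    for key in state:
        stacked = torch.stack([m.state_dict()[key].float() for m in chosen])
        state[key] = stacked.mean(dim=0).to(state[key].dtype)
    fused.load_state_dict(state)
    return fused

models = [torch.nn.Linear(8, 2) for _ in range(5)]   # stand-ins for MIL models
fused = fuse_top_models(models, val_scores=[0.71, 0.69, 0.74, 0.70, 0.73])
```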

[CV-93] Self-Supervised Multiview Xray Matching MICCAI2025

【速读】:该论文旨在解决多视角X射线图像中建立鲁棒对应关系的难题,这对于准确诊断骨折、肌肉损伤等异常具有重要意义。当前方法在不同X射线视图之间难以建立可靠的对应关系,限制了临床评估的准确性。解决方案的关键在于提出一种自监督流水线,通过从未标注的CT数据自动生成合成X射线视图之间的多对多对应矩阵,利用数字重建投影(DRR)和基于Transformer的训练阶段实现跨多个X射线视图的精确对应预测,并将其作为预训练策略提升真实数据上的多视角骨折检测性能。

链接: https://arxiv.org/abs/2507.00287
作者: Mohamad Dabboussi,Malo Huard,Yann Gousseau,Pietro Gori
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MICCAI 2025

点击查看摘要

Abstract:Accurate interpretation of multi-view radiographs is crucial for diagnosing fractures, muscular injuries, and other anomalies. While significant advances have been made in AI-based analysis of single images, current methods often struggle to establish robust correspondences between different X-ray views, an essential capability for precise clinical evaluations. In this work, we present a novel self-supervised pipeline that eliminates the need for manual annotation by automatically generating a many-to-many correspondence matrix between synthetic X-ray views. This is achieved using digitally reconstructed radiographs (DRR), which are automatically derived from unannotated CT volumes. Our approach incorporates a transformer-based training phase to accurately predict correspondences across two or more X-ray views. Furthermore, we demonstrate that learning correspondences among synthetic X-ray views can be leveraged as a pretraining strategy to enhance automatic multi-view fracture detection on real data. Extensive evaluations on both synthetic and real X-ray datasets show that incorporating correspondences improves performance in multi-view fracture classification.

[CV-94] Room Scene Discovery and Grouping in Unstructured Vacation Rental Image Collections

【Quick Read】: This paper addresses the lack of structured categorization of property images on vacation rental (VR) platforms, which makes it hard for travelers to understand a property's spatial layout, especially when several rooms of the same type are present. The key solution is a computationally efficient machine learning pipeline with low latency and sample-efficient learning, suited to real-time and data-scarce settings. It integrates a supervised room-type detection model, a supervised overlap detection model, and a clustering algorithm that groups images of the same space by similarity score, and then uses a Multi-modal Large Language Model (MLLM) to map each bedroom group to the bed types specified in the property metadata.

Link: https://arxiv.org/abs/2507.00263
Authors: Vignesh Ram Nithin Kappagantula, Shayan Hassantabar
Institutions: Expedia Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The rapid growth of vacation rental (VR) platforms has led to an increasing volume of property images, often uploaded without structured categorization. This lack of organization poses significant challenges for travelers attempting to understand the spatial layout of a property, particularly when multiple rooms of the same type are present. To address this issue, we introduce an effective approach for solving the room scene discovery and grouping problem, as well as identifying bed types within each bedroom group. This grouping is valuable for travelers to comprehend the spatial organization, layout, and the sleeping configuration of the property. We propose a computationally efficient machine learning pipeline characterized by low latency and the ability to perform effectively with sample-efficient learning, making it well-suited for real-time and data-scarce environments. The pipeline integrates a supervised room-type detection model, a supervised overlap detection model to identify the overlap similarity between two images, and a clustering algorithm to group the images of the same space together using the similarity scores. Additionally, the pipeline maps each bedroom group to the corresponding bed types specified in the property’s metadata, based on the visual content present in the group’s images using a Multi-modal Large Language Model (MLLM) model. We evaluate the aforementioned models individually and also assess the pipeline in its entirety, observing strong performance that significantly outperforms established approaches such as contrastive learning and clustering with pretrained embeddings.

[CV-95] VirtualFencer: Generating Fencing Bouts based on Strategies Extracted from In-the-Wild Videos

【Quick Read】: This paper studies how to extract 3D fencing motion and strategy from in-the-wild videos and use that knowledge to generate realistic fencing behavior. The key solution is VirtualFencer, a system that extracts fencing motion and strategy from such videos without supervision and then generates fencing behavior consistent with real-world moves and strategies.

Link: https://arxiv.org/abs/2507.00261
Authors: Zhiyin Lin, Purvi Goel, Joy Yun, C. Karen Liu, Joao Pedro Araujo
Institutions: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Fencing is a sport where athletes engage in diverse yet strategically logical motions. While most motions fall into a few high-level actions (e.g. step, lunge, parry), the execution can vary widely: fast vs. slow, large vs. small, offensive vs. defensive. Moreover, a fencer's actions are informed by a strategy that often comes in response to the opponent's behavior. This combination of motion diversity with underlying two-player strategy motivates the application of data-driven modeling to fencing. We present VirtualFencer, a system capable of extracting 3D fencing motion and strategy from in-the-wild video without supervision, and then using that extracted knowledge to generate realistic fencing behavior. We demonstrate the versatile capabilities of our system by having it (i) fence against itself (self-play), (ii) fence against a real fencer's motion from online video, and (iii) fence interactively against a professional fencer.

[CV-96] GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception

【Quick Read】: This paper addresses 360-degree gaze target estimation from images of real-world scenes. Traditional approaches handle out-of-frame samples poorly, and vision-based gaze estimators such as OpenFace do not effectively exploit background information in images, failing when subjects look away from the camera. The key solution, GazeTarget360, integrates a conditional inference engine built from an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder, yielding accurate and reliable gaze target predictions in unseen scenarios.

Link: https://arxiv.org/abs/2507.00253
Authors: Zhuangzhuang Dai, Vincent Gbouna Zakka, Luis J. Manso, Chen Li
Institutions: Aston University; Aalborg University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Enabling robots to understand human gaze target is a crucial step to allow capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze target in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes it a first-of-its-kind system that predicts gaze targets from realistic camera footage and is highly efficient and deployable. Our source code is made publicly available at: this https URL.
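
The conditional inference engine can be pictured as a two-branch pipeline: an eye-contact detector short-circuits to "the camera is the target", and otherwise the encoder-decoder predicts a gaze heatmap. A skeleton of that control flow with placeholder callables (none of these names come from the released code):

```python
import torch

def estimate_gaze_target_360(image, eye_contact_detector, encoder, decoder,
                             contact_threshold=0.5):
    """If the subject makes eye contact, the camera itself is the gaze
    target; otherwise an encoder-decoder predicts a target heatmap,
    possibly pointing outside the frame. All callables are placeholders."""
    p_contact = float(eye_contact_detector(image))
    if p_contact >= contact_threshold:
        return {"target": "camera", "confidence": p_contact}
    heatmap = decoder(encoder(image))          # (H, W) gaze target heatmap
    idx = int(heatmap.flatten().argmax())
    return {"target": divmod(idx, heatmap.shape[1]),  # (row, col)
            "confidence": 1.0 - p_contact}

# Toy stand-ins so the skeleton runs end to end
img = torch.rand(3, 224, 224)
out = estimate_gaze_target_360(
    img,
    eye_contact_detector=lambda x: x.mean(),   # fake eye-contact probability
    encoder=lambda x: x.mean(dim=0),           # fake features (224, 224)
    decoder=lambda f: f)                       # fake heatmap
```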

[CV-97] VOCAL: Visual Odometry via ContrAstive Learning

【Quick Read】: This paper addresses the limited interpretability and weak theoretical grounding of learning-based visual odometry (VO), which typically stem from rigid geometric assumptions. The proposed VOCAL (Visual Odometry via ContrAstive Learning) reframes VO as a label ranking problem and combines Bayesian inference with a representation learning framework so that visual features mirror camera states. The ranking mechanism drives similar camera states toward consistent, spatially coherent representations in the latent space, improving the interpretability of the learned features and supporting compatibility with multimodal data sources.

Link: https://arxiv.org/abs/2507.00243
Authors: Chi-Yao Huang, Zeel Bhatt, Yezhou Yang
Institutions: Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL’s enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

[CV-98] Computer Vision for Objects used in Group Work: Challenges and Opportunities

【Quick Read】: This paper addresses the inability of existing systems to accurately capture real interactions between students and physical objects during collaborative tasks, particularly in K-12 settings, which limits how AI systems model objects and entity relations. The key solution is to use 6D pose estimation, estimating an object's 3D position and orientation from RGB images or video, to precisely localize and track physical objects. To this end, the authors introduce the FiboSB dataset and fine-tune YOLO11-x to improve detection, reaching an overall mAP_50 of 0.898 and laying the groundwork for 6D pose estimation in challenging collaborative scenarios.

Link: https://arxiv.org/abs/2507.00224
Authors: Changsoo Jung, Sheikh Mannan, Jack Fitzgerald, Nathaniel Blanchard
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Accepted to AIED 2025 Late Breaking Results Track

Abstract:Interactive and spatially aware technologies are transforming educational frameworks, particularly in K-12 settings where hands-on exploration fosters deeper conceptual understanding. However, during collaborative tasks, existing systems often lack the ability to accurately capture real-world interactions between students and physical objects. This issue could be addressed with automatic 6D pose estimation, i.e., estimation of an object’s position and orientation in 3D space from RGB images or videos. For collaborative groups that interact with physical objects, 6D pose estimates allow AI systems to relate objects and entities. As part of this work, we introduce FiboSB, a novel and challenging 6D pose video dataset featuring groups of three participants solving an interactive task featuring small hand-held cubes and a weight scale. This setup poses unique challenges for 6D pose because groups are holistically recorded from a distance in order to capture all participants – this, coupled with the small size of the cubes, makes 6D pose estimation inherently non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on FiboSB, exposing the limitations of current algorithms on collaborative group work. An error analysis of these methods reveals that the 6D pose methods’ object detection modules fail. We address this by fine-tuning YOLO11-x for FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results, and analysis of YOLO11-x errors presented here lay the groundwork for leveraging the estimation of 6D poses in difficult collaborative contexts.

[CV-99] Rethink 3D Object Detection from Physical World

【Quick Read】: This paper addresses the speed-accuracy trade-off of 3D object detection in autonomous driving, the impact of different hardware devices and accelerators on real-time performance, and how detection accuracy affects the safety of motion planning. The key contribution is two new metrics, latency-aware AP (L-AP) and planning-aware AP (P-AP), which account for time and physical constraints to give a more comprehensive evaluation of real-time 3D object detection, together with latency-aware hyperparameter optimization (L-HPO) that optimizes model performance across hardware differences.

Link: https://arxiv.org/abs/2507.00190
Authors: Satoshi Tanaka, Koji Minoda, Fumiya Watanabe, Takamasa Horibe
Institutions: TIER IV, Inc.
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures

Abstract:High-accuracy and low-latency 3D object detection is essential for autonomous driving systems. While previous studies on 3D object detection often evaluate performance based on mean average precision (mAP) and latency, they typically fail to address the trade-off between speed and accuracy, such as 60.0 mAP at 100 ms vs 61.0 mAP at 500 ms. A quantitative assessment of the trade-offs between different hardware devices and accelerators remains unexplored, despite being critical for real-time applications. Furthermore, they overlook the impact on collision avoidance in motion planning, for example, 60.0 mAP leading to safer motion planning or 61.0 mAP leading to high-risk motion planning. In this paper, we introduce latency-aware AP (L-AP) and planning-aware AP (P-AP) as new metrics, which consider the physical world such as the concept of time and physical constraints, offering a more comprehensive evaluation for real-time 3D object detection. We demonstrate the effectiveness of our metrics for the entire autonomous driving system using nuPlan dataset, and evaluate 3D object detection models accounting for hardware differences and accelerators. We also develop a state-of-the-art performance model for real-time 3D object detection through latency-aware hyperparameter optimization (L-HPO) using our metrics. Additionally, we quantitatively demonstrate that the assumption “the more point clouds, the better the recognition performance” is incorrect for real-time applications and optimize both hardware and model selection using our metrics.

[CV-100] Graph-Based Deep Learning for Component Segmentation of Maize Plants

【Quick Read】: This paper addresses the accurate identification of individual plant components from 3D point cloud data in precision agriculture, where traditional approaches fall short in processing 3D data and recognizing individual plant parts. The key solution is a deep learning architecture based on Graph Neural Networks (GNN) with Principal Component Analysis (PCA) for feature enhancement: each point is treated as a vertex, edges are established with a K-Nearest Neighbors (KNN) layer to represent the 3D point cloud as a graph, Edge-Conv layers extract richer per-point features, and Graph Attention Networks (GAT) classify the visible phenotypic components.

Link: https://arxiv.org/abs/2507.00182
Authors: J. I. Ruíz, A. Méndez, E. Rodríguez
Institutions: Cinvestav
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In precision agriculture, one of the most important tasks when exploring crop production is identifying individual plant components. There are several attempts to accomplish this task by the use of traditional 2D imaging, 3D reconstructions, and Convolutional Neural Networks (CNN). However, they have several drawbacks when processing 3D data and identifying individual plant components. Therefore, in this work, we propose a novel Deep Learning architecture to detect components of individual plants on Light Detection and Ranging (LiDAR) 3D Point Cloud (PC) data sets. This architecture is based on the concept of Graph Neural Networks (GNN), and feature enhancing with Principal Component Analysis (PCA). For this, each point is taken as a vertex and by the use of a K-Nearest Neighbors (KNN) layer, the edges are established, thus representing the 3D PC data set. Subsequently, Edge-Conv layers are used to further increase the features of each point. Finally, Graph Attention Networks (GAT) are applied to classify visible phenotypic components of the plant, such as the leaf, stem, and soil. This study demonstrates that our graph-based deep learning approach enhances segmentation accuracy for identifying individual plant components, achieving percentages above 80% in the IoU average, thus outperforming other existing models based on point clouds.
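
The graph construction and EdgeConv aggregation described above can be written in plain PyTorch: build KNN edges over the points, then compute h_i = max_j MLP([x_i, x_j - x_i]). A compact sketch (the PCA feature enhancement and the GAT classifier head are omitted):

```python
import torch

def knn_edges(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Build KNN edge indices for a point cloud. points: (N, 3)."""
    dist = torch.cdist(points, points)                    # (N, N) pairwise distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop self-neighbor
    src = torch.arange(points.size(0)).repeat_interleave(k)
    return torch.stack([src, knn.reshape(-1)])            # (2, N*k)

def edge_conv(feats, edges, mlp):
    """EdgeConv aggregation: h_i = max_j MLP([x_i, x_j - x_i])."""
    src, dst = edges
    msg = mlp(torch.cat([feats[src], feats[dst] - feats[src]], dim=-1))
    out = torch.full((feats.size(0), msg.size(-1)), float("-inf"))
    out = out.scatter_reduce(0, src.unsqueeze(-1).expand_as(msg), msg, "amax")
    return out

mlp = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.ReLU())
pts = torch.randn(1024, 3)          # one LiDAR plant scan (toy data)
edges = knn_edges(pts, k=16)
features = edge_conv(pts, edges, mlp)   # (1024, 64) per-point features
```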

[CV-101] SelvaBox: A high-resolution dataset for tropical tree crown detection

【Quick Read】: This paper addresses the difficulty of detecting individual tree crowns in tropical forests, which is crucial for studying these complex ecosystems under human pressure and climate change. Tropical crowns vary widely in size, structure, and pattern and are often overlapping and intertwined, so advanced remote sensing methods and high-resolution imagery are required. The key contribution is SelvaBox, the largest open-access dataset for tropical tree crown detection, with over 83,000 manually annotated crowns, an order of magnitude larger than all previous tropical forest datasets combined. Training on SelvaBox alone yields competitive zero-shot detection on unseen tropical datasets, and joint multi-resolution training with three other datasets ranks first or second across all evaluated benchmarks.

Link: https://arxiv.org/abs/2507.00170
Authors: Hugo Baudchon, Arthur Ouaknine, Martin Weiss, Mélisande Teng, Thomas R. Walla, Antoine Caron-Guay, Christopher Pal, Etienne Laliberté
Institutions: Mila – Quebec AI Institute; Université de Montréal; McGill University; Rubisco AI; Colorado Mesa University; Polytechnique Montreal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.

[CV-102] FreeLong: Training-Free Long Video Generation via Multi-band Spectral Fusion

【Quick Read】: This paper addresses degraded temporal consistency and visual fidelity in long video generation, in particular the high-frequency distortion that appears when short-video generation models are extended directly to longer sequences. The key solution, FreeLong, is a training-free framework that balances the frequency distribution of long-video features during denoising by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. FreeLong++ extends this into a multi-branch architecture operating at multiple temporal scales, enabling multi-band frequency fusion from low to high frequencies for semantic continuity and fine-grained motion dynamics, and it can be plugged into existing video generation models without any additional training.

Link: https://arxiv.org/abs/2507.00162
Authors: Yu Lu, Yi Yang
Institutions: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review

Abstract:Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g. Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g. 4x and 8x of native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
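
The core spectral fusion step is a frequency-domain blend: take the low band from the global branch and the high band from the local branch. A simplified PyTorch sketch along the temporal axis only (FreeLong operates on video features inside the denoiser; the cutoff here is an illustrative value):

```python
import torch

def frequency_blend(global_feat, local_feat, cutoff=0.25):
    """Blend global low-frequency with local high-frequency features.
    Both tensors: (C, T, H, W); cutoff is a normalized temporal frequency."""
    G = torch.fft.fft(global_feat, dim=1)
    L = torch.fft.fft(local_feat, dim=1)
    freqs = torch.fft.fftfreq(global_feat.size(1)).abs()   # (T,)
    low = (freqs <= cutoff).float().view(1, -1, 1, 1)      # low-pass mask
    fused = G * low + L * (1.0 - low)
    return torch.fft.ifft(fused, dim=1).real

fused = frequency_blend(torch.randn(64, 32, 16, 16),
                        torch.randn(64, 32, 16, 16))
```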

[CV-103] Diffusion-Based Image Augmentation for Semantic Segmentation in Outdoor Robotics ICRA

【Quick Read】: This paper addresses the performance degradation of learning-based perception in out-of-distribution and underrepresented environments, where outdoor robots face dynamic lighting, seasonal, and weather changes that leave visual scenes underrepresented in the training data. The key solution is a diffusion-based image augmentation method that makes the training data reflect the deployment environment more closely, improving generalization to the target environment. The approach relies on vision foundation models pretrained on internet-scale datasets and uses open-vocabulary semantic segmentation to filter out augmentation candidates containing hallucinations, giving control over the semantic distribution of ground surfaces in the training data.

Link: https://arxiv.org/abs/2507.00153
Authors: Peter Mortimer, Mirko Maehlisch
Institutions: University of the Bundeswehr Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the 2025 IEEE ICRA Workshop on Field Robotics

Abstract:The performance of learning-based perception algorithms suffers when deployed in out-of-distribution and underrepresented environments. Outdoor robots are particularly susceptible to rapid changes in visual scene appearance due to dynamic lighting, seasonality and weather effects that lead to scenes underrepresented in the training data of the learning-based perception system. In this conceptual paper, we focus on preparing our autonomous vehicle for deployment in snow-filled environments. We propose a novel method for diffusion-based image augmentation to more closely represent the deployment environment in our training data. Diffusion-based image augmentations rely on the public availability of vision foundation models learned on internet-scale datasets. The diffusion-based image augmentations allow us to take control over the semantic distribution of the ground surfaces in the training data and to fine-tune our model for its deployment environment. We employ open vocabulary semantic segmentation models to filter out augmentation candidates that contain hallucinations. We believe that diffusion-based image augmentations can be extended to many other environments apart from snow surfaces, like sandy environments and volcanic terrains.

[CV-104] An efficient plant disease detection using transfer learning approach

【Quick Read】: This paper addresses early detection of plant diseases to reduce their impact on crop yield and quality. The key solution is a transfer learning approach using YOLOv7 and YOLOv8, two state-of-the-art object detection models, fine-tuned on a dataset of plant leaf images to accurately identify and monitor bacterial, fungal, and viral diseases.

Link: https://arxiv.org/abs/2507.00070
Authors: Bosubabu Sambana, Hillary Sunday Nnadi, Mohd Anas Wajid, Nwosu Ogochukwu Fidelia, Claudia Camacho-Zuñiga, Henry Dozie Ajuzie, Edeh Michael Onyema
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures. Scientific Reports 2025

Abstract:Plant diseases pose significant challenges to farmers and the agricultural sector at large. However, early detection of plant diseases is crucial to mitigating their effects and preventing widespread damage, as outbreaks can severely impact the productivity and quality of crops. With advancements in technology, there are increasing opportunities for automating the monitoring and detection of disease outbreaks in plants. This study proposed a system designed to identify and monitor plant diseases using a transfer learning approach. Specifically, the study utilizes YOLOv7 and YOLOv8, two state-of-the-art models in the field of object detection. By fine-tuning these models on a dataset of plant leaf images, the system is able to accurately detect the presence of Bacteria, Fungi and Viral diseases such as Powdery Mildew, Angular Leaf Spot, Early blight and Tomato mosaic virus. The model's performance was evaluated using several metrics, including mean Average Precision (mAP), F1-score, Precision, and Recall, yielding values of 91.05, 89.40, 91.22, and 87.66, respectively. The result demonstrates the superior effectiveness and efficiency of YOLOv8 compared to other object detection methods, highlighting its potential for use in modern agricultural practices. The approach provides a scalable, automated solution for the early detection of any plant disease, contributing to enhanced crop yield, reduced reliance on manual monitoring, and supporting sustainable agricultural practices.

[CV-105] VSF-Med:A Vulnerability Scoring Framework for Medical Vision-Language Models

【速读】:该论文旨在解决医疗视觉语言模型(Medical Vision Language Models, VLMs)在临床环境中缺乏系统性安全评估的问题。其关键解决方案是提出VSF-Med框架,该框架整合了三个创新组件:一是包含针对新兴威胁向量的复杂文本提示攻击模板的丰富库;二是通过结构相似性(SSIM)阈值校准的不可察觉的视觉扰动,以保持临床现实性;三是由两个独立的判官大语言模型(LLM)评估的八维评分体系,通过z-score标准化合并得到0-32的综合风险指标。该框架基于公开数据集构建,并提供开源代码,能够从5000张放射图像生成超过30,000个对抗样本,支持任何医疗VLM的可重复基准测试。
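
下面以一个极简的数值草图示意“z-score 标准化合并为 0-32 综合风险指标”这一步;z 分数的截断与缩放方式为笔者假设,论文并未给出具体公式:

```python
import numpy as np

def composite_risk(scores: np.ndarray) -> np.ndarray:
    """scores: (样本数, 8),两个判官 LLM 八维评分的平均值。
    先按维度做 z-score 标准化,再截断缩放到每维 0-4 并求和,
    得到 0-32 的综合风险值(截断与缩放方式为笔者假设)。"""
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
    per_dim = np.clip(z, -2.0, 2.0) + 2.0   # 每维落入 [0, 4]
    return per_dim.sum(axis=1)              # 8 维 × 4 分 = 最高 32 分

risk = composite_risk(np.random.rand(100, 8) * 10)  # 演示用随机评分
```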

链接: https://arxiv.org/abs/2507.00052
作者: Binesh Sadanandan,Vahid Behzadan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision Language Models (VLMs) hold great promise for streamlining labour-intensive medical imaging workflows, yet systematic security evaluations in clinical settings remain scarce. We introduce VSF-Med, an end-to-end vulnerability-scoring framework for medical VLMs that unites three novel components: (i) a rich library of sophisticated text-prompt attack templates targeting emerging threat vectors; (ii) imperceptible visual perturbations calibrated by structural similarity (SSIM) thresholds to preserve clinical realism; and (iii) an eight-dimensional rubric evaluated by two independent judge LLMs, whose raw scores are consolidated via z-score normalization to yield a 0–32 composite risk metric. Built entirely on publicly available datasets and accompanied by open-source code, VSF-Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. Our consolidated analysis reports mean z-score shifts of 0.90σ for persistence-of-attack-effects, 0.74σ for prompt-injection effectiveness, and 0.63σ for safety-bypass success across state-of-the-art VLMs. Notably, Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of 1.29σ for persistence-of-attack-effects, while GPT-4o shows increases of 0.69σ for that same vector and 0.28σ for prompt-injection attacks.
zh

[CV-106] AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

【速读】:该论文旨在解决大规模数据集在机器学习模型训练中的计算负担和固有冗余问题。现有数据剪枝方法存在局限性,密度基方法可能与任务无关,而模型基技术可能引入冗余或计算成本过高。其解决方案的关键在于提出一种新型混合框架——自适应去重(Adaptive De-Duplication, AdaDeDup),该框架通过在聚类自适应方式下协同集成密度基剪枝与模型反馈,实现更高效的数据选择。具体而言,AdaDeDup首先对数据进行分区并应用初始密度基剪枝,随后利用代理模型评估每个聚类中剪枝的影响,通过比较保留样本与剪枝样本的损失来调整聚类特定的剪枝阈值,从而在冗余聚类中实现更激进的剪枝,同时保留信息丰富聚类中的关键数据。
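
其中“聚类自适应阈值调整”可用如下草图示意:比较每个聚类中保留样本与被剪枝样本在代理模型上的损失差,据此调整该聚类的剪枝比例;具体更新规则为笔者假设:

```python
import numpy as np

def adapt_cluster_thresholds(clusters, proxy_loss, base=0.5, lr=0.1):
    """clusters: {聚类 id: (保留样本索引, 被剪枝样本索引)};
    proxy_loss: 代理模型在每个样本上的损失(一维数组)。
    返回每个聚类的剪枝比例:被剪枝样本损失明显更高说明该聚类
    信息量大,应收紧剪枝;反之可更激进地剪枝。"""
    ratios = {}
    for cid, (kept, pruned) in clusters.items():
        gap = proxy_loss[pruned].mean() - proxy_loss[kept].mean()
        ratios[cid] = base - lr * np.tanh(gap)  # gap>0 → 剪枝比例下调
    return ratios
```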

链接: https://arxiv.org/abs/2507.00049
作者: Feiyang Kang,Nadine Chang,Maying Shen,Marc T. Law,Rafid Mahmood,Ruoxi Jia,Jose M. Alvarez
机构: Virginia Tech(弗吉尼亚理工学院); NVIDIA(英伟达); University of Ottawa(渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup’s advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.
zh

[CV-107] Evolutionary computing-based image segmentation method to detect defects and features in Additive Friction Stir Deposition Process

【速读】:该论文旨在解决增材摩擦搅拌沉积(Additive Friction Stir Deposition, AFSD)过程中材料界面质量和缺陷检测的问题,特别是针对多层AFSD构件中不易通过传统成像技术观察到的细微材料过渡区域和潜在缺陷区域。解决方案的关键在于采用基于粒子群优化(Particle Swarm Optimization, PSO)的图像分割方法,结合梯度幅值分析与距离变换,生成具有注意力加权的可视化结果,从而实现对材料界面的精确分割与质量评估。
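
作为参考,下面给出用 PSO 搜索单一分割阈值的简化实现;目标函数选用 Otsu 式类间方差,这是笔者为演示所作的假设(论文报告的最优阈值范围为 156-173):

```python
import numpy as np

def between_class_variance(img, t):
    """Otsu 式类间方差,作为阈值质量的目标函数(笔者选用的演示目标)。"""
    fg, bg = img[img >= t], img[img < t]
    if fg.size == 0 or bg.size == 0:
        return 0.0
    w1, w2 = fg.size / img.size, bg.size / img.size
    return w1 * w2 * (fg.mean() - bg.mean()) ** 2

def pso_threshold(img, n_particles=20, iters=50):
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 255, n_particles)   # 粒子位置:候选阈值
    v = np.zeros(n_particles)              # 粒子速度
    pbest = x.copy()
    pbest_val = np.array([between_class_variance(img, t) for t in x])
    gbest = pbest[pbest_val.argmax()]
    for _ in range(iters):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = np.clip(x + v, 0, 255)
        vals = np.array([between_class_variance(img, t) for t in x])
        better = vals > pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmax()]
    return gbest
```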

链接: https://arxiv.org/abs/2507.00046
作者: Akshansh Mishra,Eyob Mesele Sefene,Shivraman Thapliyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:This work proposes an evolutionary computing-based image segmentation approach for analyzing soundness in Additive Friction Stir Deposition (AFSD) processes. Particle Swarm Optimization (PSO) was employed to determine optimal segmentation thresholds for detecting defects and features in multilayer AFSD builds. The methodology integrates gradient magnitude analysis with distance transforms to create novel attention-weighted visualizations that highlight critical interface regions. Five AFSD samples processed under different conditions were analyzed using multiple visualization techniques, i.e., self-attention maps and multi-channel visualization. These complementary approaches reveal subtle material transition zones and potential defect regions which were not readily observable through conventional imaging. The PSO algorithm automatically identified optimal threshold values (ranging from 156-173) for each sample, enabling precise segmentation of material interfaces. The multi-channel visualization technique effectively combines boundary information (red channel), spatial relationships (green channel), and material density data (blue channel) into cohesive representations that quantify interface quality. The results demonstrate that attention-based analysis successfully identifies regions of incomplete bonding and inhomogeneities in AFSD joints, providing quantitative metrics for process optimization and quality assessment of additively manufactured components.
zh

[CV-108] HistoART: Histopathology Artifact Detection and Reporting Tool

【速读】:该论文旨在解决全切片成像(Whole Slide Imaging, WSI)在数字化病理分析过程中因载玻片制备和扫描过程引入的伪影问题,这些问题可能影响后续图像分析的准确性。解决方案的关键在于提出并比较三种稳健的伪影检测方法:基于基础模型的方法(FMA),利用微调的统一神经图像(UNI)架构;基于深度学习的方法(DLA),采用ResNet50作为主干网络;以及基于知识的方法(KBA),依赖于手工设计的纹理、颜色和频域特征。其中,FMA在检测六种常见伪影类型(包括组织折叠、离焦区域、气泡、组织损伤、标记痕迹和血液污染)方面表现出最佳性能,其像素级AUROC达到0.995,显著优于其他两种方法。

链接: https://arxiv.org/abs/2507.00044
作者: Seyed Kahaki,Alexander R. Webber,Ghada Zamzmi,Adarsh Subbaswamy,Rucha Deshpande,Aldo Badano
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:In modern cancer diagnostics, Whole Slide Imaging (WSI) is widely used to digitize tissue specimens for detailed, high-resolution examination; however, other diagnostic approaches, such as liquid biopsy and molecular testing, are also utilized based on the cancer type and clinical context. While WSI has revolutionized digital histopathology by enabling automated, precise analysis, it remains vulnerable to artifacts introduced during slide preparation and scanning. These artifacts can compromise downstream image analysis. To address this challenge, we propose and compare three robust artifact detection approaches for WSIs: (1) a foundation model-based approach (FMA) using a fine-tuned Unified Neural Image (UNI) architecture, (2) a deep learning approach (DLA) built on a ResNet50 backbone, and (3) a knowledge-based approach (KBA) leveraging handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types: tissue folds, out-of-focus regions, air bubbles, tissue damage, marker traces, and blood contamination. Evaluations were conducted on 50,000+ image patches from diverse scanners (Hamamatsu, Philips, Leica Aperio AT2) across multiple sites. The FMA achieved the highest patch-wise AUROC of 0.995 (95% CI [0.994, 0.995]), outperforming the ResNet50-based method (AUROC: 0.977, 95% CI [0.977, 0.978]) and the KBA (AUROC: 0.940, 95% CI [0.933, 0.946]). To translate detection into actionable insights, we developed a quality report scorecard that quantifies high-quality patches and visualizes artifact distributions.
zh

[CV-109] MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations

【速读】:该论文旨在解决医学影像中磁共振成像(Magnetic Resonance Imaging, MRI)扫描的对比度解释问题,特别是在缺乏可靠和标准化DICOM元数据的情况下,如何准确识别图像对比度。传统方法依赖于粗粒度标签如T1加权或T2加权,但这些标签无法精确反映实际的采集参数。为了解决这一问题,作者提出了MR-CLIP,这是一种多模态对比学习框架,通过将MRI图像与其DICOM元数据对齐,学习具有对比度感知的表示,而无需依赖人工标注。该方案的关键在于利用多模态对齐技术,从原始采集参数中提取有效的对比度信息,从而实现跨模态检索和对比度分类等任务。
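
其对齐目标可以用标准的 CLIP 式对称 InfoNCE 损失示意如下;编码器结构与温度参数均为笔者假设的草图,并非论文实现细节:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, meta_emb, temperature=0.07):
    """图像-元数据对比对齐的标准 InfoNCE 损失(简化草图)。
    img_emb / meta_emb: (B, D),分别来自图像编码器与
    DICOM 元数据(如 TE/TR 等采集参数)编码器,编码器结构从略。"""
    img_emb = F.normalize(img_emb, dim=-1)
    meta_emb = F.normalize(meta_emb, dim=-1)
    logits = img_emb @ meta_emb.t() / temperature          # (B, B) 相似度
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2       # 对称双向损失
```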

链接: https://arxiv.org/abs/2507.00043
作者: Mehmet Yigit Avci,Pedro Borges,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at this https URL.
zh

[CV-110] Catastrophic Forgetting Mitigation via Discrepancy-Weighted Experience Replay ICANN2025

【速读】:该论文旨在解决云边协同目标检测中边缘模型在持续适应过程中因灾难性遗忘而导致的先前知识丢失问题,特别是在具有周期性变化(如昼夜、高峰时段)的动态交通环境中,这一问题尤为突出。论文提出的解决方案关键在于ER-EMU算法,其核心是基于自适应经验回放机制,采用有限大小的经验缓冲区和一种基于领域距离度量的经验选择算法(DDM-ES),通过多核最大均值差异(MK-MMD)量化目标领域的差异性,优先选择与当前目标领域最不相似的历史数据,从而提升训练多样性并增强知识保留能力,同时利用简单随机采样策略维持历史领域的平衡表示。
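
其中 MK-MMD 度量与“优先选择与目标域差异最大的历史域”的逻辑可示意如下(RBF 核带宽等为笔者假设的常见取法):

```python
import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0)):
    """多核 MMD(RBF 核族)的简化实现,度量两批特征的分布差异;
    核带宽取法为笔者假设的常见配置。x, y: (N, D) 特征。"""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def select_replay(buffer_feats, target_feats, top_k):
    """DDM-ES 思路示意:优先回放与当前目标域差异最大的历史域经验。"""
    dists = torch.stack([mk_mmd(f, target_feats) for f in buffer_feats])
    return torch.argsort(dists, descending=True)[:top_k]   # 历史域索引
```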

链接: https://arxiv.org/abs/2507.00042
作者: Xinrun Xu,Jianwen Yang,Qiuhong Zhang,Zhanbiao Lian,Zhiming Ding,Shan Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICANN 2025

点击查看摘要

Abstract:Continually adapting edge models in cloud-edge collaborative object detection for traffic monitoring suffers from catastrophic forgetting, where models lose previously learned knowledge when adapting to new data distributions. This is especially problematic in dynamic traffic environments characterised by periodic variations (e.g., day/night, peak hours), where past knowledge remains valuable. Existing approaches like experience replay and visual prompts offer some mitigation, but struggle to effectively prioritize and leverage historical data for optimal knowledge retention and adaptation. Specifically, simply storing and replaying all historical data can be inefficient, while treating all historical experiences as equally important overlooks their varying relevance to the current domain. This paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay, to address these limitations. ER-EMU utilizes a limited-size experience buffer managed using a First-In-First-Out (FIFO) principle, and a novel Domain Distance Metric-based Experience Selection (DDM-ES) algorithm. DDM-ES employs the multi-kernel maximum mean discrepancy (MK-MMD) to quantify the dissimilarity between target domains, prioritizing the selection of historical data that is most dissimilar to the current target domain. This ensures training diversity and facilitates the retention of knowledge from a wider range of past experiences, while also preventing overfitting to the new domain. The experience buffer is also updated using a simple random sampling strategy to maintain a balanced representation of previous domains. Experiments on the Bellevue traffic video dataset, involving repeated day/night cycles, demonstrate that ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks.
zh

[CV-111] TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables KDD

【速读】:该论文旨在解决人才管理系统中因表格信息复杂而导致传统语言模型在信息检索中的显著挑战,特别是在处理需要精确理解表格关系的人才文档时,现有表格提取方法在语义理解上的不足导致了下游查询失败。解决方案的关键在于提出了一种基于大语言模型(LLM)的增强框架——TalentMine,该框架通过专门的多模态推理将提取的表格转换为语义丰富的表示,从而保留表格数据的结构和语义维度,而非依赖传统的CSV或文本线性化方法。

链接: https://arxiv.org/abs/2507.00041
作者: Varun Mannam,Fang Wang,Chaochun Liu,Xin Chen
机构: Amazon(亚马逊)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Submitted to KDD conference, workshop: Talent and Management Computing (TMC 2025), this https URL

点击查看摘要

Abstract:In talent management systems, critical information often resides in complex tabular formats, presenting significant retrieval challenges for conventional language models. These challenges are pronounced when processing Talent documentation that requires precise interpretation of tabular relationships for accurate information retrieval and downstream decision-making. Current table extraction methods struggle with semantic understanding, resulting in poor performance when integrated into retrieval-augmented chat applications. This paper identifies a key bottleneck - while structural table information can be extracted, the semantic relationships between tabular elements are lost, causing downstream query failures. To address this, we introduce TalentMine, a novel LLM-enhanced framework that transforms extracted tables into semantically enriched representations. Unlike conventional approaches relying on CSV or text linearization, our method employs specialized multimodal reasoning to preserve both structural and semantic dimensions of tabular data. Experimental evaluation across employee benefits document collections demonstrates TalentMine’s superior performance, achieving 100% accuracy in query answering tasks compared to 0% for standard AWS Textract extraction and 40% for AWS Textract Visual QA capabilities. Our comparative analysis also reveals that the Claude v3 Haiku model achieves optimal performance for talent management applications. The key contributions of this work include (1) a systematic analysis of semantic information loss in current table extraction pipelines, (2) a novel LLM-based method for semantically enriched table representation, (3) an efficient integration framework for retrieval-augmented systems as end-to-end systems, and (4) comprehensive benchmarks on talent analytics tasks showing substantial improvements across multiple categories.
zh

[CV-112] HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation

【速读】:该论文试图解决城市轨迹数据表示中难以同时捕捉细粒度细节与高层语义抽象的问题,现有方法在单一模型中难以有效整合轨迹的局部动态与全局语义,从而限制了其对长期依赖关系的关注和局部细微特征的保留。解决方案的关键在于提出HiT-JEPA(Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture),该框架采用三层层次结构,逐步捕获点级细粒度细节、中间模式以及高层轨迹抽象,从而在统一结构中融合局部动态与全局语义。

链接: https://arxiv.org/abs/2507.00028
作者: Lihuan Li,Hao Xue,Shuang Ao,Yang Song,Flora Salim
机构: University of New South Wales (新南威尔士大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The representation of urban trajectory data plays a critical role in effectively analyzing spatial movement patterns. Despite considerable progress, the challenge of designing trajectory representations that can capture diverse and complementary information remains an open research problem. Existing methods struggle to incorporate fine-grained trajectory details and a high-level summary in a single model, limiting their ability to attend to long-term dependencies while preserving local nuances. To address this, we propose HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture), a unified framework for learning multi-scale urban trajectory representations across semantic abstraction levels. HiT-JEPA adopts a three-layer hierarchy that progressively captures point-level fine-grained details, intermediate patterns, and high-level trajectory abstractions, enabling the model to integrate both local dynamics and global semantics in one coherent structure. Extensive experiments on multiple real-world datasets for trajectory similarity computation show that HiT-JEPA’s hierarchical design yields richer, multi-scale representations. Code is available at: this https URL.
zh

[CV-113] Gradient-based Fine-Tuning through Pre-trained Model Regularization

【速读】:该论文旨在解决大规模预训练模型在特定下游任务中微调时所需的大量计算资源和存储问题。其解决方案的关键在于提出一种高效的基于梯度和正则化的微调方法(GRFT),该方法通过更新权重矩阵的行或列来减少训练参数数量,并理论证明了具有最高平方梯度和的行或列是最优更新对象,从而有效降低存储开销并提升参数选择效率,同时引入正则化以增强知识迁移能力。
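
“更新平方梯度和最大的行”这一核心步骤可用如下 PyTorch 片段示意(选取比例等超参数为笔者假设):

```python
import torch

def grft_row_mask(weight: torch.Tensor, ratio: float = 0.05) -> torch.Tensor:
    """GRFT 行选择的示意:选出平方梯度和最大的若干行参与更新,
    其余行冻结;选取比例 ratio 为笔者假设的超参数。
    需在一次 loss.backward() 之后调用(weight.grad 非空)。"""
    row_score = weight.grad.pow(2).sum(dim=1)       # 每行的平方梯度和
    k = max(1, int(ratio * weight.size(0)))
    mask = torch.zeros_like(weight)
    mask[torch.topk(row_score, k).indices] = 1.0
    return mask

# 用法示意:loss.backward() 后屏蔽未选中行的梯度
# linear.weight.grad.mul_(grft_row_mask(linear.weight))
```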

链接: https://arxiv.org/abs/2507.00016
作者: Xuanbo Liu,Liu Liu,Fuxiang Wu,Fusheng Hao,Xianglong Liu
机构: Beihang University (北京航空航天大学); Chinese Academy of Sciences (中国科学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large pre-trained models have demonstrated extensive applications across various fields. However, fine-tuning these models for specific downstream tasks demands significant computational resources and storage. One fine-tuning method, gradient-based parameter selection (GPS), focuses on fine-tuning only the parameters with high gradients in each neuron, thereby reducing the number of training parameters. Nevertheless, this approach increases computational resource requirements and storage demands. In this paper, we propose an efficient gradient-based and regularized fine-tuning method (GRFT) that updates the rows or columns of the weight matrix. We theoretically demonstrate that the rows or columns with the highest sum of squared gradients are optimal for updating. This strategy effectively reduces storage overhead and improves the efficiency of parameter selection. Additionally, we incorporate regularization to enhance knowledge transfer from the pre-trained model. GRFT achieves state-of-the-art performance, surpassing existing methods such as GPS, Adapter Tuning, and LoRA. Notably, GRFT requires updating only 1.22% and 0.30% of the total parameters on FGVC and VTAB datasets, respectively, demonstrating its high efficiency and effectiveness. The source code will be released soon.
zh

[CV-114] DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

【速读】:该论文试图解决在图形用户界面(GUI)中对自然语言查询进行精准定位的问题,该问题由于视觉元素的多样性、空间杂乱性以及语言的模糊性而具有挑战性。论文提出的解决方案关键在于DiMo-GUI框架,其核心策略包括动态视觉定位和模态感知优化。该方法通过将GUI输入拆分为文本元素和图标元素,使模型能够独立地对每种模态进行推理,从而提升定位准确性。当预测存在歧义或错误时,DiMo-GUI通过生成以初始预测为中心的候选焦点区域并逐步放大子区域来动态调整注意力,实现层次化精炼,无需额外训练或标注即可有效解决视觉拥挤布局中的歧义问题。
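
其中“以预测为中心逐步放大子区域”的细化循环可示意如下;locate 代表任意通用视觉-语言定位器,其接口与返回格式均为笔者假设:

```python
from PIL import Image

def dimo_zoom(image: Image.Image, query: str, locate, steps: int = 2, ratio: float = 0.5):
    """locate(img, query) 为任意通用视觉-语言定位器(接口为笔者假设),
    返回传入图像内的归一化坐标 (x, y)。每轮以当前预测为中心,
    在原图上裁剪边长缩小为 ratio 倍的子区域并重新定位。"""
    l, t, r, b = 0, 0, image.width, image.height      # 当前关注区域(原图坐标)
    for _ in range(steps + 1):
        x, y = locate(image.crop((l, t, r, b)), query)
        px = l + x * (r - l)                           # 换算回原图像素坐标
        py = t + y * (b - t)
        half_w = max(int((r - l) * ratio / 2), 1)      # 防止区域退化为空
        half_h = max(int((b - t) * ratio / 2), 1)
        l, r = max(l, int(px) - half_w), min(r, int(px) + half_w)
        t, b = max(t, int(py) - half_h), min(b, int(py) + half_h)
    return px, py
```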

链接: https://arxiv.org/abs/2507.00008
作者: Hang Wu,Hongkai Chen,Yujun Cai,Chang Liu,Qingwen Ye,Ming-Hsuan Yang,Yiwei Wang
机构: vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model’s initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
zh

[CV-115] Advancing Lung Disease Diagnosis in 3D CT Scans

【速读】:该论文旨在解决胸部CT扫描中肺部疾病准确诊断的问题,其核心挑战在于如何有效提取病变区域特征并处理类别不平衡问题。解决方案的关键在于首先通过分析3D CT扫描的特性去除非肺部区域,使模型专注于病灶相关区域并降低计算成本;其次采用ResNeSt50作为强大的特征提取器,并引入加权交叉熵损失函数以缓解类别不平衡,特别是在处理代表性不足的鳞状细胞癌类别时表现出色。
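
其中加权交叉熵的用法可示意如下;类别数与样本计数为笔者虚构,仅说明按频率倒数加权以照顾鳞状细胞癌等少数类:

```python
import torch
import torch.nn as nn

# 按类别频率的倒数设置权重,缓解类别不平衡(照顾少数类)。
# 类别数与样本计数均为笔者虚构的演示数字,并非论文数据。
class_counts = torch.tensor([500.0, 480.0, 120.0])   # 第 3 类为少数类
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)             # 模型输出 (batch, n_classes)
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)
```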

链接: https://arxiv.org/abs/2507.00993
作者: Qingqiu Li,Runtian Yuan,Junlin Hou,Jilan Xu,Yuejie Zhang,Rui Feng,Hao Chen
机构: Fudan University (复旦大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalance, especially for the underrepresented squamous cell carcinoma category. Our model achieves a Macro F1 Score of 0.80 on the validation set of the Fair Disease Diagnosis Challenge, demonstrating its strong performance in distinguishing between different lung conditions.
zh

[CV-116] DMCIE: Diffusion Model with Concatenation of Inputs and Errors to Improve the Accuracy of the Segmentation of Brain Tumors in MRI Images

【速读】:该论文旨在解决多模态磁共振成像(MRI)中脑肿瘤精准分割的问题,这对于可靠的临床诊断和有效的治疗计划至关重要。其解决方案的关键在于提出一种基于扩散模型的校正分割框架DMCIE(Diffusion Model with Concatenation of Inputs and Errors),该框架通过将初始分割掩码与真实标签之间的差异生成误差图,并将其与原始MRI图像拼接后输入扩散模型,从而引导模型专注于误分类区域,提升分割精度。
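
误差图的构造与条件拼接可以用几行 PyTorch 草图示意(张量形状约定为笔者假设):

```python
import torch

def build_dmcie_condition(mri, pred_mask, gt_mask):
    """DMCIE 条件输入构造的示意(形状约定为笔者假设):
    mri: (B, 4, D, H, W),对应 T1/T1ce/T2/FLAIR 四个模态;
    pred_mask / gt_mask: (B, 1, D, H, W) 的分割掩码。
    误差图标记初始预测与真实标签不一致的体素,再与 MRI 拼接。"""
    error_map = (pred_mask != gt_mask).float()   # 1 表示误分类体素
    return torch.cat([mri, error_map], dim=1)    # (B, 5, D, H, W)
```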

链接: https://arxiv.org/abs/2507.00983
作者: Sara Yavari,Rahul Nitin Pandya,Jacob Furst
机构: DePaul University (德保罗大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of brain tumors in MRI scans is essential for reliable clinical diagnosis and effective treatment planning. Recently, diffusion models have demonstrated remarkable effectiveness in image generation and segmentation tasks. This paper introduces a novel approach to corrective segmentation based on diffusion models. We propose DMCIE (Diffusion Model with Concatenation of Inputs and Errors), a novel framework for accurate brain tumor segmentation in multi-modal MRI scans. We employ a 3D U-Net to generate an initial segmentation mask, from which an error map is generated by identifying the differences between the prediction and the ground truth. The error map, concatenated with the original MRI images, is used to guide a diffusion model. Using multimodal MRI inputs (T1, T1ce, T2, FLAIR), DMCIE effectively enhances segmentation accuracy by focusing on misclassified regions, guided by the original inputs. Evaluated on the BraTS2020 dataset, DMCIE outperforms several state-of-the-art diffusion-based segmentation methods, achieving a Dice Score of 93.46 and an HD95 of 5.94 mm. These results highlight the effectiveness of error-guided diffusion in producing precise and reliable brain tumor segmentations.
zh

[CV-117] Deep learning-based segmentation of T1 and T2 cardiac MRI maps for automated disease detection DATE

【速读】:该论文旨在解决传统心脏组织参数映射中由于人工分割导致的观察者间变异性问题,以及基于平均弛豫值和单一阈值的方法可能过于简化心肌复杂性的局限性。其解决方案的关键在于利用深度学习(DL)实现对T1/T2图的精确分割,并结合多种统计特征与机器学习(ML)方法提升疾病检测的准确性。通过训练深度学习模型进行左心室血池和心肌的分割,同时引入平均值、四分位数等统计特征,显著提高了分类性能,证明了多特征融合在疾病检测中的有效性。

链接: https://arxiv.org/abs/2507.00903
作者: Andreea Bianca Popescu,Andreas Seitz,Heiko Mahrholdt,Jens Wetzl,Athira Jacob,Lucian Mihai Itu,Constantin Suciu,Teodora Chitiboi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted for consideration at European Radiology (Springer). Upon acceptance, this preprint will be updated with the journal reference

点击查看摘要

Abstract:Objectives: Parametric tissue mapping enables quantitative cardiac tissue characterization but is limited by inter-observer variability during manual delineation. Traditional approaches relying on average relaxation values and single cutoffs may oversimplify myocardial complexity. This study evaluates whether deep learning (DL) can achieve segmentation accuracy comparable to inter-observer variability, explores the utility of statistical features beyond mean T1/T2 values, and assesses whether machine learning (ML) combining multiple features enhances disease detection. Materials & Methods: T1 and T2 maps were manually segmented. The test subset was independently annotated by two observers, and inter-observer variability was assessed. A DL model was trained to segment left ventricle blood pool and myocardium. Average (A), lower quartile (LQ), median (M), and upper quartile (UQ) were computed for the myocardial pixels and employed in classification by applying cutoffs or in ML. Dice similarity coefficient (DICE) and mean absolute percentage error evaluated segmentation performance. Bland-Altman plots assessed inter-user and model-observer agreement. Receiver operating characteristic analysis determined optimal cutoffs. Pearson correlation compared features from model and manual segmentations. F1-score, precision, and recall evaluated classification performance. Wilcoxon test assessed differences between classification methods, with p < 0.05 considered statistically significant. Results: 144 subjects were split into training (100), validation (15) and evaluation (29) subsets. The segmentation model achieved a DICE of 85.4%, surpassing inter-observer agreement. Random forest applied to all features increased F1-score (92.7%, p < 0.001). Conclusion: DL facilitates segmentation of T1/T2 maps. Combining multiple features with ML improves disease detection.
zh

[CV-118] Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detection

【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型在CT血管造影(CTA)中检测颅内动脉瘤时存在的高假阳性(False Positive, FP)率问题,这一问题阻碍了模型的临床转化。解决方案的关键在于采用一种自动化的、基于解剖结构的、混合启发式学习的动脉-静脉分割后处理方法,通过结合脑组织、动脉、静脉及海绵窦的分割掩膜,有效识别并移除DL输出中与这些结构重叠的假阳性区域,从而显著降低FP数量,同时保持真阳性(True Positive, TP)不变,提升模型的临床适用性。
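
文中表现最佳的“方法 5”(脑掩膜 + 静脉多于动脉)对应的过滤逻辑可简化示意如下(按论文描述的笔者理解实现,接口为假设):

```python
import numpy as np

def filter_candidates(candidates, brain, artery, vein):
    """"方法 5"(脑掩膜 + 静脉多于动脉)的简化示意,
    按论文描述的笔者理解实现;所有掩膜为同尺寸布尔数组。"""
    kept = []
    for c in candidates:                       # c: 单个候选区域的布尔掩膜
        if (c & brain).sum() > 0:
            continue                           # 与脑实质重叠 → 视为假阳性
        if (c & vein).sum() > (c & artery).sum():
            continue                           # 更像静脉结构 → 视为假阳性
        kept.append(c)
    return kept
```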

链接: https://arxiv.org/abs/2507.00832
作者: Jisoo Kim,Chu-Hsuan Lin,Alberto Ceballos-Arroyo,Ping Liu,Huaizu Jiang,Shrikanth Yadav,Qi Wan,Lei Qin,Geoffrey S Young
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true positives (TP); 79 false negatives (FN); 126 FPs. 3D-CNN-TR yielded 179 TP; 39 FN; 182 FP. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FP by 70.6% (89/126) and 3D-CNN-TR FP by 51.6% (94/182), without reducing TP, lowering the FP/case rate from 0.88 to 0.26 for CPM-Net, and from 1.27 to 0.62 for the 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.
zh

[CV-119] Research on Improving the High Precision and Lightweight Diabetic Retinopathy Detection of YOLOv8n

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期检测与诊断中因微小病灶特征细微及易受背景干扰而导致的检测精度和鲁棒性不足的问题。其解决方案的关键在于提出一种基于改进YOLOv8n的轻量级高精度检测模型YOLO-KFG,通过设计新的动态卷积KWConv和C2f-KW模块增强主干网络对微小病灶的感知能力,构建特征聚焦的扩散金字塔网络FDPN以充分融合多尺度上下文信息,并引入轻量级共享检测头GSDHead以减少模型参数量,从而提升模型在资源受限设备上的部署能力。

链接: https://arxiv.org/abs/2507.00780
作者: Fei Yuhuan,Sun Xufei,Zang Ran,Wang Gengchen,Su Meng,Liu Fenghao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: in Chinese language

点击查看摘要

Abstract:Early detection and diagnosis of diabetic retinopathy is one of the current research focuses in ophthalmology. However, due to the subtle features of micro-lesions and their susceptibility to background interference, existing detection methods still face many challenges in terms of accuracy and robustness. To address these issues, a lightweight and high-precision detection model based on the improved YOLOv8n, named YOLO-KFG, is proposed. Firstly, a new dynamic convolution KWConv and C2f-KW module are designed to improve the backbone network, enhancing the model’s ability to perceive micro-lesions. Secondly, a feature-focused diffusion pyramid network FDPN is designed to fully integrate multi-scale context information, further improving the model’s ability to perceive micro-lesions. Finally, a lightweight shared detection head GSDHead is designed to reduce the model’s parameter count, making it more deployable on resource-constrained devices. Experimental results show that compared with the base model YOLOv8n, the improved model reduces the parameter count by 20.7%, increases mAP@0.5 by 4.1%, and improves the recall rate by 7.9%. Compared with single-stage mainstream algorithms such as YOLOv5n and YOLOv10n, YOLO-KFG demonstrates significant advantages in both detection accuracy and efficiency.
zh

[CV-120] unable Wavelet Unit based Convolutional Neural Network in Optical Coherence Tomography Analysis Enhancement for Classifying Type of Epiretinal Membrane Surgery

【速读】:该论文旨在解决如何准确分类视网膜前膜(epiretinal membrane, ERM)切除手术类型的问题,即区分内界膜(internal limiting membrane, ILM)切除与单纯ERM切除。其解决方案的关键在于采用基于ResNet18架构的深度学习模型,并通过引入可调小波单元(tunable wavelet units)提升模型性能,具体包括正交格子小波单元(Orthogonal Lattice-based Wavelet Units, OrthLatt-UwU)和完美重构松弛小波单元(Perfect Reconstruction Relaxation-based Wavelet Units, PR-Relax-UwU),这些单元使模型能够在训练过程中自动调整滤波器系数,从而增强对不同手术类型的区分能力。

链接: https://arxiv.org/abs/2507.00743
作者: An Le,Nehal Mehta,William Freeman,Ines Nagel,Melanie Tran,Anna Heinke,Akshay Agnihotri,Lingyun Cheng,Dirk-Uwe Bartsch,Hung Nguyen,Truong Nguyen,Cheolhong An
机构: University of California San Diego(加州大学圣地亚哥分校); Jacobs Retina Center(雅各布视网膜中心); Shiley Eye Institute(谢利眼科研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In this study, we developed a deep learning-based method to classify the type of surgery performed for epiretinal membrane (ERM) removal, either internal limiting membrane (ILM) removal or ERM-alone removal. Our model, based on the ResNet18 convolutional neural network (CNN) architecture, utilizes postoperative optical coherence tomography (OCT) center scans as inputs. We evaluated the model using both original scans and scans preprocessed with energy crop and wavelet denoising, achieving 72% accuracy on preprocessed inputs, outperforming the 66% accuracy achieved on original scans. To further improve accuracy, we integrated tunable wavelet units with two key adaptations: Orthogonal Lattice-based Wavelet Units (OrthLatt-UwU) and Perfect Reconstruction Relaxation-based Wavelet Units (PR-Relax-UwU). These units allowed the model to automatically adjust filter coefficients during training and were incorporated into downsampling, stride-two convolution, and pooling layers, enhancing its ability to distinguish between ERM-ILM removal and ERM-alone removal, with OrthLatt-UwU boosting accuracy to 76% and PR-Relax-UwU increasing performance to 78%. Performance comparisons showed that our AI model outperformed a trained human grader, who achieved only 50% accuracy in classifying the removal surgery types from postoperative OCT scans. These findings highlight the potential of CNN-based models to improve clinical decision-making by providing more accurate and reliable classifications. To the best of our knowledge, this is the first work to employ tunable wavelets for classifying different types of ERM removal surgery.
zh

[CV-121] Prompt2SegCXR:Prompt to Segment All Organs and Diseases in Chest X-rays

【速读】:该论文试图解决在胸部X光图像中基于用户提示进行多器官和多疾病分割的挑战,传统分割模型通常针对特定器官或疾病进行训练,限制了其在其他器官和疾病上的泛化能力。解决方案的关键在于生成由医学专家标注的草图提示数据集,并提出Prompt2SegCXR模型,该模型通过多阶段特征融合机制,结合不同网络层的特征以提升空间和语义理解能力,从而实现更精确的多器官和多疾病分割。

链接: https://arxiv.org/abs/2507.00673
作者: Abduz Zami,Shadman Sobhan,Rounaq Hossain,Md. Sawran Sorker,Mohiuddin Ahmed,Md. Redwan Hossain
机构: Rajshahi University of Engineering & Technology, Rajshahi-6204, Bangladesh; Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh; Shaheed Suhrawardy Medical College , Sher-E-Bangla Nagar, Dhaka-1207; Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 Pages

点击查看摘要

Abstract:Image segmentation plays a vital role in the medical field by isolating organs or regions of interest from surrounding areas. Traditionally, segmentation models are trained on a specific organ or a disease, limiting their ability to handle other organs and diseases. At present, few advanced models can perform multi-organ or multi-disease segmentation, offering greater flexibility. Also, recently, prompt-based image segmentation has gained attention as a more flexible approach. It allows models to segment areas based on user-provided prompts. Despite these advances, there has been no dedicated work on prompt-based interactive multi-organ and multi-disease segmentation, especially for Chest X-rays. This work presents two main contributions: first, generating doodle prompts by medical experts of a collection of datasets from multiple sources with 23 classes, including 6 organs and 17 diseases, specifically designed for prompt-based Chest X-ray segmentation. Second, we introduce Prompt2SegCXR, a lightweight model for accurately segmenting multiple organs and diseases from Chest X-rays. The model incorporates multi-stage feature fusion, enabling it to combine features from various network layers for better spatial and semantic understanding, enhancing segmentation accuracy. Compared to existing pre-trained models for prompt-based image segmentation, our model scores well, providing a reliable solution for segmenting Chest X-rays based on user prompts.
zh

[CV-122] Mind the Detail: Uncovering Clinically Relevant Image Details in Accelerated MRI with Semantically Diverse Reconstructions MICCAI2025

【速读】:该论文试图解决在高加速磁共振成像(MRI)重建中,现有技术可能无法准确恢复小而罕见的病灶,从而导致错误诊断(假阴性)的问题。解决方案的关键在于提出“语义多样重建”(Semantically Diverse Reconstructions, SDR),该方法在保证与测量数据完全一致的前提下,生成具有增强语义多样性的新重建结果,以揭示可能丢失的临床信息。

链接: https://arxiv.org/abs/2507.00670
作者: Jan Nikolas Morshuis,Christian Schlarmann,Thomas Küstner,Christian F. Baumgartner,Matthias Hein
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025

点击查看摘要

Abstract:In recent years, accelerated MRI reconstruction based on deep learning has led to significant improvements in image quality with impressive results for high acceleration factors. However, from a clinical perspective image quality is only secondary; much more important is that all clinically relevant information is preserved in the reconstruction from heavily undersampled data. In this paper, we show that existing techniques, even when considering resampling for diffusion-based reconstruction, can fail to reconstruct small and rare pathologies, thus leading to potentially wrong diagnosis decisions (false negatives). To uncover the potentially missing clinical information we propose "Semantically Diverse Reconstructions" (SDR), a method which, given an original reconstruction, generates novel reconstructions with enhanced semantic variability while all of them are fully consistent with the measured data. To evaluate SDR automatically we train an object detector on the fastMRI+ dataset. We show that SDR significantly reduces the chance of false-negative diagnoses (higher recall) and improves mean average precision compared to the original reconstructions. The code is available on this https URL
zh

[CV-123] MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentationin 4D Ultrasound MICCAI2025

【速读】:该论文旨在解决四维(4D)二尖瓣(Mitral Valve, MV)超声图像分析中的跨相位一致性不足问题,这一问题主要由有限的相位标注、严重的运动伪影和成像质量差导致。其解决方案的关键在于提出一种基于运动-拓扑引导的一致性网络(Motion-Topology guided consistency network, MTCNet),该网络通过引入跨相位运动引导的一致性学习策略和拓扑引导的相关正则化方法,实现了在半监督学习框架下的高精度4D MV分割。MTCNet仅需稀疏的收缩末期和舒张末期标注,即可有效利用结构对应关系,提升跨相位的一致性表现。

链接: https://arxiv.org/abs/2507.00660
作者: Rusi Chen,Yuanting Yang,Jiezhi Yao,Hongning Song,Ji Zhang,Yongsong Zhou,Yuhao Huang,Ronghao Yang,Dan Jia,Yuhan Zhang,Xing Tao,Haoran Dou,Qing Zhou,Xin Yang,Dong Ni
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025

点击查看摘要

Abstract:Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Yet, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomically plausible. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet performs superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at this https URL.
zh

[CV-124] Bridging Classical and Learning-based Iterative Registration through Deep Equilibrium Models MICCAI2025

【速读】:该论文旨在解决可变形医学图像配准中学习型方法缺乏理论收敛性保证和训练时GPU内存消耗随迭代步数线性增长的问题。其解决方案的关键在于提出DEQReg框架,该框架基于深度平衡模型(Deep Equilibrium Models, DEQ),将配准问题建模为寻找平衡点的过程,从而在理论上支持无限迭代步骤,并保持恒定的内存使用,有效解决了传统优化方法与学习型解卷积方法之间的性能与稳定性差距。
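
DEQ 的前向平衡求解可用朴素不动点迭代示意如下;实际训练中的反向传播需借助隐函数定理,此处从略(tol、max_iter 为笔者假设的取值):

```python
import torch

def deq_fixed_point(f, z0, max_iter=50, tol=1e-4):
    """配准被建模为求平衡点 z* = f(z*);此处用朴素不动点迭代
    演示前向求解(实际 DEQ 的反向传播依赖隐函数定理,从略)。"""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if torch.norm(z_next - z) / (torch.norm(z) + 1e-8) < tol:
            return z_next                  # 达到平衡,内存占用与步数无关
        z = z_next
    return z
```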

链接: https://arxiv.org/abs/2507.00582
作者: Yi Zhang,Yidong Zhao,Qian Tao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted version. Accepted by MICCAI 2025

点击查看摘要

Abstract:Deformable medical image registration is traditionally formulated as an optimization problem. While classical methods solve this problem iteratively, recent learning-based approaches use recurrent neural networks (RNNs) to mimic this process by unrolling the prediction of deformation fields in a fixed number of steps. However, classical methods typically converge after sufficient iterations, but learning-based unrolling methods lack a theoretical convergence guarantee and show instability empirically. In addition, unrolling methods have a practical bottleneck at training time: GPU memory usage grows linearly with the unrolling steps due to backpropagation through time (BPTT). To address both theoretical and practical challenges, we propose DEQReg, a novel registration framework based on Deep Equilibrium Models (DEQ), which formulates registration as an equilibrium-seeking problem, establishing a natural connection between classical optimization and learning-based unrolling methods. DEQReg maintains constant memory usage, enabling theoretically unlimited iteration steps. Through extensive evaluation on the public brain MRI and lung CT datasets, we show that DEQReg can achieve competitive registration performance, while substantially reducing memory consumption compared to state-of-the-art unrolling methods. We also reveal an intriguing phenomenon: the performance of existing unrolling methods first increases slightly then degrades irreversibly when the inference steps go beyond the training configuration. In contrast, DEQReg achieves stable convergence with its inbuilt equilibrium-seeking mechanism, bridging the gap between classical optimization-based and modern learning-based registration methods.
zh

[CV-125] Medical Image Segmentation Using Advanced Unet: VMSE-Unet and VM-Unet CBAM

【速读】:该论文旨在解决医学图像分割中的精度不足、特征定位不准确以及计算效率低的问题。其解决方案的关键在于将Squeeze-and-Excitation (SE)和Convolutional Block Attention Module (CBAM)模块集成到传统的VM U-Net架构中,从而提升分割的准确性、特征定位能力及计算效率。通过这种改进,所提出的VMSE U-Net和VM-Unet CBAM+模型在多个数据集上均表现出优于基线VM-Unet的性能。
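
其中 SE 模块是标准组件,其实现可示意如下(压缩比 reduction 取常见的 16,可插入 VM U-Net 的编码/解码阶段):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """标准 Squeeze-and-Excitation 模块的实现示意,
    用于对通道特征做自适应重标定。"""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze:全局平均池化
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excitation:通道重标定
```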

链接: https://arxiv.org/abs/2507.00511
作者: Sayandeep Kanrar,Raja Piyush,Qaiser Razi,Debanshi Chakraborty,Vikas Hassija,GSS Chalapathi
机构: Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar-751024, Odisha, India; Department of Electrical and Electronics Engineering, BITS-Pilani, Pilani Campus, India
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: under review

点击查看摘要

Abstract:In this paper, we present the VMSE-Unet and VM-Unet CBAM+ models, two cutting-edge deep learning architectures designed to enhance medical image segmentation. Our approach integrates Squeeze-and-Excitation (SE) and Convolutional Block Attention Module (CBAM) techniques into the traditional VM U-Net framework, significantly improving segmentation accuracy, feature localization, and computational efficiency. Both models show superior performance compared to the baseline VM-Unet across multiple datasets. Notably, VMSE-Unet achieves the highest accuracy, IoU, precision, and recall while maintaining low loss values. It also exhibits exceptional computational efficiency with faster inference times and lower memory usage on both GPU and CPU. Overall, the study suggests that the enhanced architecture VMSE-Unet is a valuable tool for medical image analysis. These findings highlight its potential for real-world clinical applications, emphasizing the importance of further research to optimize accuracy, robustness, and computational efficiency.
zh

[CV-126] Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound MICCAI2025

【速读】:该论文旨在解决胎儿出生体重(Fetal Birth Weight, FBW)估计的准确性问题,传统临床方法效率低、依赖操作者且在复杂胎儿解剖情况下难以应用,而现有深度学习方法基于二维标准超声(2D Standard Ultrasound, US)图像或视频,缺乏空间信息导致预测精度受限。论文提出的解决方案关键在于首次直接从三维胎儿超声(3D Fetal US)体积中估计FBW,其核心是集成多尺度特征融合网络(Multi-Scale Feature Fusion Network, MFFN)和基于合成样本的学习框架(Synthetic Sample-Based Learning Framework, SSLF),通过多尺度特征提取与融合以及半监督学习策略提升预测性能。

链接: https://arxiv.org/abs/2507.00398
作者: Jian Wang,Qiongying Ni,Hongkui Yu,Ruixuan Yao,Jinqiao Ying,Bin Zhang,Xingyi Yang,Jin Peng,Jiongquan Chen,Junxuan Yu,Wenlong Shi,Chaoyu Chen,Zhongnuo Yan,Mingyuan Luo,Gaocheng Cai,Dong Ni,Jing Lu,Xin Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025

点击查看摘要

Abstract:Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of 166.4 ± 155.9 g and a mean absolute percentage error of 5.1 ± 4.6%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: this https URL.
zh

[CV-127] SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

【速读】:该论文旨在解决微创手术(Minimally Invasive Surgery, MIS)中高分辨率成像数据不足的问题,特别是针对机器人辅助手术缺乏公开的原生4K视频数据集。解决方案的关键在于提出SurgiSR4K,这是首个公开可用的原生4K分辨率外科影像与视频数据集,其涵盖了多种真实的手术场景,如镜面反射、器械遮挡、出血和软组织变形,能够有效支持多种计算机视觉任务,如超分辨率、烟雾去除、手术器械检测等,从而为高分辨率外科成像研究和智能成像技术的发展提供坚实基础。

链接: https://arxiv.org/abs/2507.00209
作者: Fengyi Jiang,Xiaorui Zhang,Lingbo Jin,Ruixing Liang,Yuxin Chen,Adi Chola Venkatesh,Jason Culman,Tiantian Wu,Lirong Shao,Wenqing Sun,Cong Gao,Hallie McNamara,Jingpei Lu,Omid Mohareri
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.
zh

[CV-128] owards 3D Semantic Image Synthesis for Medical Imaging

【速读】:该论文旨在解决医学影像领域中大规模数据集获取困难的问题,这一问题主要源于数据可访问性受限和严格的隐私保护法规。为应对这一挑战,本文提出了一种名为Med-LSDM(Latent Semantic Diffusion Model)的解决方案,其关键在于直接在3D域内操作,并利用去标识化的语义图生成合成数据,从而实现隐私保护与数据增强的双重目标。Med-LSDM通过在预训练VQ-GAN的潜在空间中应用扩散模型,引入引导机制以控制3D图像生成过程,在降低计算复杂度的同时保留关键的3D空间细节,从而实现了高质量的3D语义医学图像合成。

链接: https://arxiv.org/abs/2507.00206
作者: Wenwu Tang,Khaled Seyam,Bin Yang
机构: University of Stuttgart (斯图加特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the medical domain, acquiring large datasets is challenging due to both accessibility issues and stringent privacy regulations. Consequently, data availability and privacy protection are major obstacles to applying machine learning in medical imaging. To address this, our study proposes the Med-LSDM (Latent Semantic Diffusion Model), which operates directly in the 3D domain and leverages de-identified semantic maps to generate synthetic data as a method of privacy preservation and data augmentation. Unlike many existing methods that focus on generating 2D slices, Med-LSDM is designed specifically for 3D semantic image synthesis, making it well-suited for applications requiring full volumetric data. Med-LSDM incorporates a guiding mechanism that controls the 3D image generation process by applying a diffusion model within the latent space of a pre-trained VQ-GAN. By operating in the compressed latent space, the model significantly reduces computational complexity while still preserving critical 3D spatial details. Our approach demonstrates strong performance in 3D semantic medical image synthesis, achieving a 3D-FID score of 0.0054 on the conditional Duke Breast dataset and similar Dice scores (0.70964) to those of real images (0.71496). These results demonstrate that the synthetic data from our model have a small domain gap with real data and are useful for data augmentation.
zh

[CV-129] Multimodal Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

【速读】:该论文试图解决当前医学影像人工智能模型多为单模态、单病种,且在构建多模态、多病种模型时临床准确性不一致的问题,同时应对训练这些模型所需的大规模、人工标注密集型数据集的挑战。解决方案的关键在于开发MerMED-FM,这是一种基于自监督学习和记忆模块训练的先进多模态、多专科基础模型,其在超过十种专科和七种模态(包括CT、CXR、US、病理切片、CFP、OCT和皮肤图像)的330万张医学图像上进行训练,从而实现了跨学科的高适应性和强泛化能力。

链接: https://arxiv.org/abs/2507.00185
作者: Yang Zhou,Chrystie Wan Ning Quek,Jun Zhou,Yan Wang,Yang Bai,Yuhe Ke,Jie Yao,Laura Gutierrez,Zhen Ling Teo,Darren Shu Jeng Ting,Brian T. Soetikno,Christopher S. Nielsen,Tobias Elze,Zengxiang Li,Linh Le Dinh,Lionel Tim-Ee Cheng,Tran Nguyen Tuan Anh,Chee Leong Cheng,Tien Yin Wong,Nan Liu,Iain Beehuat Tan,Tony Kiat Hon Lim,Rick Siow Mong Goh,Yong Liu,Daniel Shu Wei Ting
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 42 pages, 3 composite figures, 4 tables

点击查看摘要

Abstract:Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
zh

[CV-130] Real-Time Guidewire Tip Tracking Using a Siamese Network for Image-Guided Endovascular Procedures

【速读】:该论文旨在解决心血管疾病图像引导治疗中导丝尖端跟踪任务的挑战,以提升诊断和治疗的质量。其解决方案的关键在于提出了一种基于孪生网络与双注意力机制的跟踪框架,该框架结合了自注意力和交叉注意力策略,通过增强的空间-时间特征学习来处理视觉模糊、组织变形和成像伪影等问题。

链接: https://arxiv.org/abs/2507.00051
作者: Tianliang Yao,Zhiqiang Pei,Yong Li,Yixuan Yuan,Peng Qi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by Advanced Intelligent Systems

点击查看摘要

Abstract:An ever-growing incorporation of AI solutions into clinical practices enhances the efficiency and effectiveness of healthcare services. This paper focuses on guidewire tip tracking tasks during image-guided therapy for cardiovascular diseases, aiding physicians in improving diagnostic and therapeutic quality. A novel tracking framework based on a Siamese network with dual attention mechanisms combines self- and cross-attention strategies for robust guidewire tip tracking. This design handles visual ambiguities, tissue deformations, and imaging artifacts through enhanced spatial-temporal feature learning. Validation occurred on 3 randomly selected clinical digital subtraction angiography (DSA) sequences from a dataset of 15 sequences, covering multiple interventional scenarios. The results indicate a mean localization error of 0.421 ± 0.138 mm, with a maximum error of 1.736 mm, and a mean Intersection over Union (IoU) of 0.782. The framework maintains an average processing speed of 57.2 frames per second, meeting the temporal demands of endovascular imaging. Further validations with robotic platforms for automating diagnostics and therapies in clinical routines yielded tracking errors of 0.708 ± 0.695 mm and 0.148 ± 0.057 mm in two distinct experimental scenarios.
zh

人工智能

[AI-0] Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes

【速读】:该论文试图解决深度神经网络训练过程中早期阶段的可训练性不足以及优化过程容易陷入局部鞍点的问题。其解决方案的关键在于提出一种基于随机梯度下降的统一框架,通过分析目标函数的几何景观,引入了最大李雅普诺夫指数(Lyapunov exponent)的运行估计作为诊断工具,以区分真正的稳定最小值收敛与鞍点附近的统计稳定化。此外,还提出了标准分类器的“幽灵类别”扩展,通过添加辅助的幽灵输出节点,为模型提供额外的下降方向,从而在狭窄损失屏障周围开辟横向通道,使优化器能够在早期训练阶段绕过不良区域。
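
文中作为诊断工具的“最大李雅普诺夫指数的运行估计”可用如下朴素数值做法示意;具体估计式为笔者假设的常见写法,并非论文公式:

```python
import numpy as np

def running_lyapunov(traj_a, traj_b, eps=1e-12):
    """traj_a / traj_b: 两条参数轨迹的快照列表(同一网络,
    初始参数仅差一个微小扰动)。返回每个时刻 t 的运行估计
    λ_t ≈ (1/t) · log(‖θ_t − θ'_t‖ / ‖θ_0 − θ'_0‖)。
    λ_t 持续为正提示轨迹发散(鞍点附近的统计稳定化),
    趋近非正值提示向稳定极小值收敛。"""
    d0 = np.linalg.norm(traj_a[0] - traj_b[0]) + eps
    ests = []
    for t in range(1, len(traj_a)):
        d = np.linalg.norm(traj_a[t] - traj_b[t]) + eps
        ests.append(np.log(d / d0) / t)
    return ests
```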

链接: https://arxiv.org/abs/2507.01003
作者: Eun-Ji Park,Sangwon Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Recent studies have proposed interpreting the training process from an ergodic perspective. Building on this foundation, we present a unified framework for understanding and accelerating the training of deep neural networks via stochastic gradient descent. By analyzing the geometric landscape of the objective function, we introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which provably distinguishes genuine convergence toward stable minimizers from mere statistical stabilization near saddle points. We then propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes, so the model gains extra descent directions that open a lateral corridor around narrow loss barriers and enable the optimizer to bypass poor basins during the early training phase. We show that this extension strictly reduces approximation error; that after sufficient convergence the ghost dimensions collapse, so the extended model's invariant law coincides with that of the original; and that there exists a path in the enlarged parameter space along which the total loss does not increase while the original loss decreases by an arbitrary margin. Taken together, these results provide a principled architecture-level intervention that accelerates early-stage trainability while preserving asymptotic behavior.
zh

[AI-1] Reasoning as an Adaptive Defense for Safety

Quick Read: This paper investigates how adaptively allocating test-time compute can make large language models (LLMs) more robust to safety vulnerabilities. The key to its solution is TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) method that trains models to reason about safety using chain-of-thought traces and a reward signal balancing safety with task completion. TARS rests on three design choices: a lightweight warm-start SFT stage; a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as over-refusal; and a reward function that prevents the degeneration of reasoning capabilities. The resulting models spend more compute on ambiguous queries, achieve better safety-refusal trade-offs, and are more robust to both white-box and black-box attacks.

Link: https://arxiv.org/abs/2507.00971
Authors: Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 42 pages, 11 Figures, 7 Tables

Click to view abstract

Abstract:Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy to verify domains such as math and code. In this work, we study how to utilize this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a “lightweight” warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as too many refusals, and (3) a reward function to prevent degeneration of reasoning capabilities during training. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box attacks (e.g., PAIR). Overall, our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.
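
A hypothetical sketch of a reward in the spirit of TARS, balancing safety with task completion; the weights, signs, and the helper itself are illustrative assumptions, not the paper's actual reward:

```python
def tars_style_reward(is_harmful_prompt, refused, task_score,
                      safety_weight=1.0, task_weight=1.0):
    """Illustrative scalar reward trading off safety against task completion.
    The real TARS reward is defined in the paper; this only shows the
    trade-off structure the abstract describes."""
    if is_harmful_prompt:
        safety = 1.0 if refused else -1.0      # reward refusing harmful requests
    else:
        safety = -1.0 if refused else 0.0      # penalize needless refusals
    completion = 0.0 if refused else task_score
    return safety_weight * safety + task_weight * completion

print(tars_style_reward(True, True, 0.0))    # 1.0: safe refusal
print(tars_style_reward(False, True, 0.9))   # -1.0: over-refusal penalized
print(tars_style_reward(False, False, 0.9))  # 0.9: helpful completion
```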

[AI-2] MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Quick Read: This paper addresses the tendency of sequence models (such as LSTM and Mamba) to overfit the training set in single-channel speech enhancement, and the poor generalization of existing models on out-of-domain data. The key to its solution is MambAttention, a novel hybrid architecture combining Mamba with shared time- and frequency-multi-head attention modules to improve generalization. The study also introduces a more challenging training dataset, VB-DemandEx, which further strengthens performance across datasets.

Link: https://arxiv.org/abs/2507.00966
Authors: Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication

Click to view abstract

Abstract:With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.
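
The weight sharing highlighted in the ablations can be sketched as one nn.MultiheadAttention reused first along the time axis and then along the frequency axis of a spectrogram tensor; shapes and residual placement below are assumptions:

```python
import torch
import torch.nn as nn

class SharedTFAttention(nn.Module):
    """Apply a single shared attention module over the time axis and again
    over the frequency axis of a (batch, time, freq, channels) tensor."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def _attend(self, x):               # x: (B*, L, C)
        out, _ = self.attn(x, x, x)
        return out

    def forward(self, x):               # x: (B, T, F, C)
        B, T, F, C = x.shape
        t = self._attend(x.permute(0, 2, 1, 3).reshape(B * F, T, C))  # over time
        x = x + t.reshape(B, F, T, C).permute(0, 2, 1, 3)
        f = self._attend(x.reshape(B * T, F, C))                       # over frequency
        return x + f.reshape(B, T, F, C)

y = SharedTFAttention()(torch.randn(2, 100, 32, 64))
print(y.shape)  # torch.Size([2, 100, 32, 64])
```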

[AI-3] Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

Quick Read: This paper asks how Artificial General Intelligence (AGI) could be achieved, i.e., whether machines can genuinely think, reason, and act in human-like domains. The key to its answer lies in moving beyond token-level prediction: integrating memory with reasoning and building systems with modular reasoning, persistent memory, and multi-agent coordination. The paper highlights Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use for more adaptive behavior, and discusses generalization strategies such as information compression, test-time adaptation, and training-free methods. It further argues that true intelligence depends not on scale alone but on the orchestration of modular, interactive, and self-improving components.

Link: https://arxiv.org/abs/2507.00951
Authors: Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Philip Torr, Seyedali Mirjalili
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

[AI-4] WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Quick Read: This paper addresses the instability and inconsistency of current benchmarks for evaluating autonomous web agents, which often rely on dynamic content or oversimplified simulations. The key to its solution is WebArXiv, a static and time-invariant benchmark of 275 web-based tasks grounded in the arXiv platform, which ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths. To address a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories, the paper also proposes a lightweight dynamic reflection mechanism that lets agents selectively retrieve relevant past steps during decision-making.

Link: https://arxiv.org/abs/2507.00938
Authors: Zihao Sun, Meng Fang, Ling Chen
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 10 pages, 9 figures, 4 tables

Click to view abstract

Abstract:Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.
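
A toy version of the proposed dynamic reflection, assuming simple word-overlap scoring (the paper does not specify its scoring function), might look like:

```python
def select_relevant_steps(current_goal, history, k=3):
    """Toy dynamic reflection: score each past step by word overlap with the
    current sub-goal and keep the top-k, instead of feeding the agent a
    fixed-length interaction history."""
    goal_words = set(current_goal.lower().split())
    def score(step):
        return len(goal_words & set(step.lower().split()))
    return sorted(history, key=score, reverse=True)[:k]

history = [
    "clicked search box and typed transformers",
    "opened abs page for paper 2507.00938",
    "scrolled author list on arxiv listing",
]
print(select_relevant_steps("find the authors of paper 2507.00938", history, k=2))
```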

[AI-5] Large Language Model Powered Intelligent Urban Agents : Concepts Capabilities and Applications

Quick Read: This paper asks how large language models (LLMs) can advance urban intelligence toward efficient, livable, and sustainable cities. The key to its answer is the concept of Urban LLM Agents: LLM-powered agents semi-embodied within a city's hybrid cyber-physical-social space that autonomously solve complex cross-domain problems, performing system-level urban decision-making through a workflow of urban sensing, memory management, reasoning, execution, and learning.

Link: https://arxiv.org/abs/2507.00914
Authors: Jindong Han, Yansong Ning, Zirui Yuan, Hang Ni, Fan Liu, Tengfei Lyu, Hao Liu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The long-standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM-powered agents that are semi-embodied within the hybrid cyber-physical-social space of cities and used for system-level urban decision-making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real-world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open-source resources is maintained and continuously updated at this https URL.

[AI-6] Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona

Quick Read: This paper addresses the surge in electricity demand driven by artificial intelligence (AI), which threatens grid reliability, raises the cost of community energy infrastructure, and stalls AI innovation as data centers wait for interconnection to constrained grids. The key to its solution is a software-only approach, Emerald Conductor, that turns AI data centers into flexible grid resources able to harness existing power systems efficiently and immediately, without massive infrastructure buildout. By orchestrating AI workloads based on real-time grid signals, without hardware modifications or energy storage, the platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI development.

Link: https://arxiv.org/abs/2507.00909
Authors: Philip Colangelo, Ayse K. Coskun, Jack Megrue, Ciaran Roberts, Shayan Sengupta, Varun Sivaram, Ethan Tiao, Aroon Vijaykar, Chris Williams, Daniel C. Wilson, Zack MacFarland, Daniel Dreiling, Nathan Morey, Anuja Ratnayake, Baskar Vairamohan
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF); Systems and Control (eess.SY)
Comments: 10 pages, 6 figures, 1 table

Click to view abstract

Abstract:Artificial intelligence (AI) is fueling exponential electricity demand growth, threatening grid reliability, raising prices for communities paying for new energy infrastructure, and stunting AI innovation as data centers wait for interconnection to constrained grids. This paper presents the first field demonstration, in collaboration with major corporate partners, of a software-only approach–Emerald Conductor–that transforms AI data centers into flexible grid resources that can efficiently and immediately harness existing power systems without massive infrastructure buildout. Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI’s development.

[AI-7] The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses

Quick Read: This paper addresses the new class of security threats posed by generative AI, in particular deepfakes and voice cloning, which challenge organizational security. The key to its solution is the notion of Sensorial Zero Trust: systematically doubting information perceived through the senses and establishing rigorous verification protocols to mitigate the risk of generative-AI-based fraud. Core elements include Out-of-Band verification, Vision-Language Models (VLMs) as forensic collaborators, cryptographic provenance, and human training, extending Zero Trust principles to human sensory information.

Link: https://arxiv.org/abs/2507.00907
Authors: Fabio Correa Xavier
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 14 pages

Click to view abstract

Abstract:In a world where deepfakes and cloned voices are emerging as sophisticated attack vectors, organizations require a new security mindset: Sensorial Zero Trust [9]. This article presents a scientific analysis of the need to systematically doubt information perceived through the senses, establishing rigorous verification protocols to mitigate the risks of fraud based on generative artificial intelligence. Key concepts, such as Out-of-Band verification, Vision-Language Models (VLMs) as forensic collaborators, cryptographic provenance, and human training, are integrated into a framework that extends Zero Trust principles to human sensory information. The approach is grounded in empirical findings and academic research, emphasizing that in an era of AI-generated realities, even our eyes and ears can no longer be implicitly trusted without verification. Leaders are called to foster a culture of methodological skepticism to protect organizational integrity in this new threat landscape.

[AI-8] Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks

Quick Read: This paper addresses connectivity management for direct-satellite-to-device (DS2D) communication in multi-constellation environments, including the high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Existing approaches are largely confined to single-constellation shells, limiting the potential of multi-constellation connectivity and yielding suboptimal DS2D service. The key to the solution is a Constellation-as-a-Service (CaaS) framework that treats the entire multi-constellation infrastructure as a shared resource pool and dynamically forms optimal sub-constellations (SCs) for each DS2D service region, guided by two strategies: predictive satellite beamforming using generative AI (GenAI) and pre-configured handover paths for efficient satellite access and mobility management.

Link: https://arxiv.org/abs/2507.00902
Authors: Feng Wang, Shengyu Zhang, Een-Kee Hong, Tony Q.S. Quek
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: To appear in IEEE Communications Magazine

Click to view abstract

Abstract:Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, existing approaches primarily operate within single-constellation shell, which inherently limits the ability to exploit the vast potential of multi-constellation connectivity provision, resulting in suboptimal DS2D service performances. To address these challenges, this article proposes a Constellation as a Service (CaaS) framework, which treats the entire multi-constellation infrastructure as a shared resource pool and dynamically forms optimal sub-constellations (SCs) for each DS2D service region. The formation of each SC integrates satellites from various orbits to provide tailored connectivity based on user demands, guided by two innovative strategies: predictive satellite beamforming using generative artificial intelligence (GenAI) and pre-configured handover path for efficient satellite access and mobility management. Simulation results demonstrate that CaaS significantly improves satellite service rates while reducing handover overhead, making it an efficient and continuable solution for managing DS2D connectivity in multi-constellation environments.

[AI-9] NN-Former: Rethinking Graph Structure in Neural Architecture Representation CVPR2025

Quick Read: This paper targets efficient neural network design and deployment, specifically the challenge of estimating architecture attributes such as accuracy and latency. Among existing predictors, graph neural networks (GNNs) struggle to represent complicated features, while transformers generalize poorly as architecture depth grows. The key to the solution is to rethink neural architecture topology, highlighting the overlooked importance of sibling nodes, and to propose a predictor that combines the strengths of GNNs and transformers: a novel token mixer that considers siblings, and a channel mixer named the bidirectional graph isomorphism feed-forward network, achieving strong performance on both accuracy and latency prediction.

Link: https://arxiv.org/abs/2507.00880
Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2025. Code is available at this https URL

Click to view abstract

Abstract:The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each of both methods has its disadvantages. GNNs lack the capabilities to represent complicated features, while transformers face poor generalization when the depth of architecture grows. To mitigate the above issues, we rethink neural architecture topology and show that sibling nodes are pivotal while overlooked in previous research. We thus propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code is available at this https URL.
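
For intuition on sibling nodes: given a DAG adjacency matrix A, a sibling mask can be derived as AᵀA with the diagonal removed, since siblings share at least one parent. This construction is an assumption for illustration, not necessarily the paper's token-mixer formulation:

```python
import numpy as np

# adjacency A[i, j] = 1 means an edge from operation i to operation j
A = np.array([
    [0, 1, 1, 0],   # node 0 feeds nodes 1 and 2
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

# siblings share at least one parent: S = A^T A with the diagonal removed
S = (A.T @ A > 0).astype(int)
np.fill_diagonal(S, 0)
print(S)
# S[1, 2] == 1: nodes 1 and 2 are siblings (both children of node 0)
```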

[AI-10] SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Quick Read: This paper addresses the security risks multimodal agent systems face under complex interactions, particularly jailbreak attacks that may induce agents to perform dangerous or sensitive operations. The key to its solution is a risk discrimination mechanism that incorporates behavioral sequence information, together with an automated assisted assessment scheme based on a large language model, improving the recognition of risky behaviors and helping reduce the probability of agents being jailbroken.

Link: https://arxiv.org/abs/2507.00841
Authors: Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 12 pages

Click to view abstract

Abstract:With the wide application of multimodal foundation models in intelligent agent systems, scenarios such as mobile device control, intelligent assistant interaction, and multimodal task execution are gradually relying on such large model-driven agents. However, the related systems are also increasingly exposed to potential jailbreak risks. Attackers may induce the agents to bypass the original behavioral constraints through specific inputs, and then trigger certain risky and sensitive operations, such as modifying settings, executing unauthorized commands, or impersonating user identities, which brings new challenges to system security. Existing security measures for intelligent agents still have limitations when facing complex interactions, especially in detecting potentially risky behaviors across multiple rounds of conversations or sequences of tasks. In addition, an efficient and consistent automated methodology to assist in assessing and determining the impact of such risks is currently lacking. This work explores the security issues surrounding mobile multimodal agents, attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and designs an automated assisted assessment scheme based on a large language model. Through preliminary validation in several representative high-risk tasks, the results show that the method can improve the recognition of risky behaviors to some extent and assist in reducing the probability of agents being jailbroken. We hope that this study can provide some valuable references for the security risk modeling and protection of multimodal intelligent agent systems.

[AI-11] HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning

Quick Read: This paper addresses the lack of simulation tasks and high-quality demonstrations for bimanual dexterous manipulation, which is especially difficult to collect autonomously on humanoid robots because of the complexity of coordinating dual arms and dexterous hands. The key to its solution is HumanoidGen, a framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints: it provides spatial annotations for assets and dexterous hands, uses an LLM planner to generate chains of actionable spatial constraints for arm movements, and employs a variant of Monte Carlo tree search to strengthen planning for long-horizon tasks and insufficient annotation.

Link: https://arxiv.org/abs/2507.00833
Authors: Zhi Jing, Siyuan Yang, Jicong Ao, Ting Xiao, Yugang Jiang, Chenjia Bai
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Click to view abstract

Abstract:For robotic manipulation, existing robotics datasets and simulation benchmarks predominantly cater to robot-arm platforms. However, for humanoid robots equipped with dual arms and dexterous hands, simulation tasks and high-quality demonstrations are notably lacking. Bimanual dexterous manipulation is inherently more complex, as it requires coordinated arm movements and hand operations, making autonomous data collection challenging. This paper presents HumanoidGen, an automated task creation and demonstration collection framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints. Specifically, we provide spatial annotations for both assets and dexterous hands based on the atomic operations, and perform an LLM planner to generate a chain of actionable spatial constraints for arm movements based on object affordances and scenes. To further improve planning ability, we employ a variant of Monte Carlo tree search to enhance LLM reasoning for long-horizon tasks and insufficient annotation. In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data. The results show that the performance of the 2D and 3D diffusion policies can scale with the generated dataset. Project page is this https URL.

[AI-12] PI-WAN: A Physics-Informed Wind-Adaptive Network for Quadrotor Dynamics Prediction in Unknown Environments

Quick Read: This paper addresses dynamics modeling for precise quadrotor trajectory tracking in unknown environments: traditional physics-driven modeling struggles under variable payloads, wind, and external disturbances, while data-driven methods generalize poorly on out-of-distribution (OoD) data. The key to the solution is the Physics-Informed Wind-Adaptive Network (PI-WAN), which fuses knowledge-driven and data-driven modeling by embedding physical constraints directly into training, improving generalization and robustness to unseen conditions; a Temporal Convolutional Network (TCN) captures temporal dependencies in historical flight data, and real-time predictions feed into a model predictive control (MPC) framework to improve closed-loop tracking.

Link: https://arxiv.org/abs/2507.00816
Authors: Mengyun Wang, Bo Wang, Yifeng Niu, Chang Wang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate dynamics modeling is essential for quadrotors to achieve precise trajectory tracking in various applications. Traditional physical knowledge-driven modeling methods face substantial limitations in unknown environments characterized by variable payloads, wind disturbances, and external perturbations. On the other hand, data-driven modeling methods suffer from poor generalization when handling out-of-distribution (OoD) data, restricting their effectiveness in unknown scenarios. To address these challenges, we introduce the Physics-Informed Wind-Adaptive Network (PI-WAN), which combines knowledge-driven and data-driven modeling methods by embedding physical constraints directly into the training process for robust quadrotor dynamics learning. Specifically, PI-WAN employs a Temporal Convolutional Network (TCN) architecture that efficiently captures temporal dependencies from historical flight data, while a physics-informed loss function applies physical principles to improve model generalization and robustness across previously unseen conditions. By incorporating real-time prediction results into a model predictive control (MPC) framework, we achieve improvements in closed-loop tracking performance. Comprehensive simulations and real-world flight experiments demonstrate that our approach outperforms baseline methods in terms of prediction accuracy, tracking precision, and robustness to unknown environments.
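
A generic sketch of a physics-informed objective of the kind described, combining a data-fit term with a rigid-body residual; the abstract does not give the exact residual, so the thrust model and tensor names here are illustrative assumptions:

```python
import torch

def physics_informed_loss(pred_acc, true_acc, thrust, mass, body_z_world, lam=0.1):
    """Sketch of a physics-informed training objective: a data-fit term plus a
    residual penalizing disagreement with a simple rigid-body thrust model
    m*a = T*z_body + m*g. All tensor names are illustrative placeholders."""
    g = torch.tensor([0.0, 0.0, -9.81])
    physics_acc = thrust.unsqueeze(-1) * body_z_world / mass + g
    data_term = torch.mean((pred_acc - true_acc) ** 2)
    physics_term = torch.mean((pred_acc - physics_acc) ** 2)
    return data_term + lam * physics_term

loss = physics_informed_loss(torch.randn(8, 3), torch.randn(8, 3),
                             torch.rand(8) * 20, 1.5, torch.randn(8, 3))
print(float(loss))
```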

[AI-13] A Robust Algorithm for Non-IID Machine Learning Problems with Convergence Analysis

Quick Read: This paper targets minimax problems, which arise widely in robust optimization, imbalanced learning, and related fields. The key to its solution is an improved numerical algorithm combining nonsmooth optimization, quadratic programming, and an iterative process, together with a rigorous convergence proof under mild assumptions such as gradient continuity and boundedness.

Link: https://arxiv.org/abs/2507.00810
Authors: Qing Xu, Xiaohua Xuan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:In this paper, we propose an improved numerical algorithm for solving minimax problems based on nonsmooth optimization, quadratic programming and iterative process. We also provide a rigorous proof of convergence for our algorithm under some mild assumptions, such as gradient continuity and boundedness. Such an algorithm can be widely applied in various fields such as robust optimization, imbalanced learning, etc.
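
Since the abstract does not detail the algorithm, here, as context only, is plain gradient descent-ascent on a toy minimax objective; the paper's method (nonsmooth optimization plus quadratic programming) is different:

```python
# Generic gradient descent-ascent on f(x, y) = x*y + 0.1*x**2 - 0.1*y**2,
# shown only to illustrate the minimax setting, not the paper's algorithm.
def fx(x, y): return y + 0.2 * x     # df/dx
def fy(x, y): return x - 0.2 * y     # df/dy

x, y, lr = 1.0, -1.0, 0.05
for _ in range(2000):
    x -= lr * fx(x, y)   # descent in x
    y += lr * fy(x, y)   # ascent in y
print(round(x, 4), round(y, 4))  # converges toward the saddle point (0, 0)
```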

[AI-14] Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Quick Read: This paper investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can later evolve the resulting code. The key to its solution is a two-phase controlled experiment comparing AI-assisted and conventional development in terms of subsequent evolution speed and CodeHealth, thereby assessing the downstream effects of AI assistants on maintainability.

Link: https://arxiv.org/abs/2507.00788
Authors: Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, Dave Farley
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Preprint of study preregistered at ICSME 2025 with In-Principle Acceptance. this https URL

Click to view abstract

Abstract:[Context] AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] AI-assisted development in Phase 1 led to a modest speedup in subsequent evolution and slightly higher average CodeHealth. Although neither difference was significant overall, the increase in CodeHealth was statistically significant when habitual AI users completed Phase 1. For Phase 1, we also observed a significant effect that corroborates previous productivity findings: using an AI assistant yielded a 30.7% median decrease in task completion time. Moreover, for habitual AI users, the mean speedup was 55.9%. [Conclusions] Our study adds to the growing evidence that AI assistants can effectively accelerate development. Moreover, we did not observe warning signs of degraded code-level maintainability. We recommend that future research focus on risks such as code bloat from excessive code generation and the build-up of cognitive debt as developers invest less mental effort during implementation.

[AI-15] Can Large Language Models Develop Strategic Reasoning ? Post-training Insights from Learning Chess

Quick Read: This paper studies whether reinforcement learning (RL) can develop strategic reasoning in large language models (LLMs), using chess as the testbed. The key to its solution is to use a chess-pretrained action-value network to provide dense rewards on the quality of moves generated by the LLM, which can be viewed as a form of knowledge distillation. Experiments show that distillation-based dense rewards often outperform sparse binary rewards, yet all models plateau far below expert level, suggesting the core limitation lies in the pretrained models' deficient internal understanding of chess, a deficit that RL alone may not fully overcome.

Link: https://arxiv.org/abs/2507.00726
Authors: Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages

Click to view abstract

Abstract:While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM’s output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models’ internal understanding of chess–a deficit which RL alone may not be able to fully overcome.
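
A hypothetical shape for the distillation-style dense reward: score an LLM-proposed move with the pretrained action-value network and normalize against the legal moves. The normalization scheme and the q_net stub are assumptions, not the paper's exact reward:

```python
def dense_move_reward(q_net, fen_before, move, legal_q_values):
    """Score an LLM-proposed chess move with a pretrained action-value
    network, normalized against the best and worst legal moves, instead of
    a sparse 0/1 reward. `q_net` is an assumed scalar-valued stub."""
    q = q_net(fen_before, move)
    q_best, q_worst = max(legal_q_values), min(legal_q_values)
    if q_best == q_worst:
        return 1.0
    return (q - q_worst) / (q_best - q_worst)   # in [0, 1]; 1.0 = best move

# usage with a stubbed network
stub = lambda fen, mv: {"e2e4": 0.6, "a2a3": 0.1}[mv]
print(dense_move_reward(stub, "<start FEN>", "e2e4", [0.6, 0.1]))  # 1.0
```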

[AI-16] Generative Exaggeration in LLM Social Agents : Consistency Bias and Toxicity

Quick Read: This paper examines how large language models (LLMs) behave when simulating political discourse on social media, focusing on the reliability and biases of generated content. The key to its solution is to construct LLM agents grounded in real user behavior and compare their replies under different initializations (zero-shot vs. few-shot), evaluating linguistic style, ideological consistency, and toxicity to expose structural biases in model-generated content.

Link: https://arxiv.org/abs/2507.00657
Authors: Jacopo Nudo, Mario Edoardo Pandolfo, Edoardo Loru, Mattia Samory, Matteo Cinelli, Walter Quattrociocchi
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call “generation exaggeration”: a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users, they reconstruct them. Their outputs, indeed, reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.

[AI-17] Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models

Quick Read: This paper addresses the escalating computational cost of large language model (LLM) inference, a critical barrier to widespread and sustainable deployment. Existing optimization strategies rest mainly on statistical heuristics or architectural modifications and lack a cognitive theory to guide the inference process itself. The key to the solution is the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience as quantifiable LLM metrics, reframing inference as a cognitive-economics optimization problem.

Link: https://arxiv.org/abs/2507.00653
Authors: Yilun Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages

Click to view abstract

Abstract:The escalating computational costs of Large Language Model (LLM) inference have become a critical barrier to their widespread and sustainable deployment. While existing optimization strategies are effective, they are predominantly based on statistical heuristics or architectural modifications, lacking a guiding cognitive theory to manage the inference process itself. This paper aims to bridge this gap by introducing a novel paradigm: the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience for LLM inference. We formalize the concepts of Intrinsic Cognitive Load, Extraneous Cognitive Load, and Germane Cognitive Load into quantifiable LLM metrics (ICL_LLM, ECL_LLM, and GCL_LLM), thereby reframing the inference process as a cognitive economics optimization problem: based on the intrinsic complexity of a problem (ICL_LLM), minimize wasteful computation (ECL_LLM), and strategically allocate the token budget to productive reasoning (GCL_LLM). We propose two implementation paths: CLAI-Prompt, a zero-shot method that guides a base LLM through cognitive control steps via a structured meta-prompt, and CLAI-Tune, a fine-tuned model that internalizes these principles for spontaneous cognitive economy. Across a range of benchmarks in complex reasoning, long-context question answering, and code generation, our methods achieve significant reductions in token consumption (up to 45%) without sacrificing accuracy. Furthermore, CLAI-Tune exhibits an emergent ability to autonomously decompose difficult problems, a key characteristic of human expert cognition. This work demonstrates that by emulating the brain’s resource management strategies, we can build more efficient, robust, and capable artificial intelligence systems.

[AI-18] Horus: A Protocol for Trustless Delegation Under Uncertainty

Quick Read: This paper asks how autonomous AI agents can assure the correctness of delegated work in dynamic, low-trust environments, where upfront specification or centralized oversight cannot guarantee it. The key to its solution is a protocol that enforces correctness through collateralized claims in a recursive verification game: tasks are published as intents, solvers compete to fulfill them, verifiers check results post hoc, and any challenger can stake against a result to trigger verification, so that incorrect agents are slashed, correct opposition is rewarded, and correctness becomes the Nash equilibrium.

Link: https://arxiv.org/abs/2507.00631
Authors: David Shi, Kevin Joo
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 1 figure

Click to view abstract

Abstract:Correctness is an emergent property of systems where exposing error is cheaper than committing it. In dynamic, low-trust environments, autonomous AI agents benefit from delegating work to sub-agents, yet correctness cannot be assured through upfront specification or centralized oversight. We propose a protocol that enforces correctness through collateralized claims in a recursive verification game. Tasks are published as intents, and solvers compete to fulfill them. Selected solvers carry out tasks under risk, with correctness checked post hoc by verifiers. Any challenger can challenge a result by staking against it to trigger the verification process. Incorrect agents are slashed and correct opposition is rewarded, with an escalation path that penalizes erroneous verifiers themselves. When incentives are aligned across solvers, challengers, and verifiers, falsification conditions make correctness the Nash equilibrium.

[AI-19] Residual Reward Models for Preference-based Reinforcement Learning

Quick Read: This paper addresses the slow convergence of preference-based reinforcement learning (PbRL), particularly the cost of training a reward model. The key to its solution is a Residual Reward Model (RRM), which assumes the environment's true reward splits into a prior reward and a learned reward: the prior is available before training (e.g., a user's "best guess" reward function or one obtained via inverse reinforcement learning, IRL), while the learned part is trained from preferences, effectively exploiting prior knowledge to improve PbRL performance.

Link: https://arxiv.org/abs/2507.00611
Authors: Chenyang Cao, Miguel Rogel-García, Mohamed Nabail, Xueqian Wang, Nicholas Rhinehart
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 26 pages, 22 figures

Click to view abstract

Abstract:Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence speed since it requires training in a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user’s ``best guess’’ reward function, or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot. It significantly accelerates policy learning for different tasks, achieving success in fewer steps than the baseline. The videos are presented at this https URL.
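
The RRM decomposition r = r_prior + r_learned is easy to sketch; the network size and the toy prior below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualRewardModel(nn.Module):
    """r(s, a) = r_prior(s, a) + r_learned(s, a): the prior term is fixed
    (e.g. a user's best-guess reward or one recovered via IRL) and only
    the residual network is trained from preferences."""
    def __init__(self, prior_fn, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.prior_fn = prior_fn
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        prior = self.prior_fn(obs, act)                   # fixed, not trained
        learned = self.residual(torch.cat([obs, act], -1)).squeeze(-1)
        return prior + learned

prior = lambda o, a: -o.pow(2).sum(-1)   # toy "best guess" reward
rrm = ResidualRewardModel(prior, obs_dim=4, act_dim=2)
print(rrm(torch.randn(8, 4), torch.randn(8, 2)).shape)  # torch.Size([8])
```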

[AI-20] High-resolution spatial memory requires grid-cell-like neural codes

Quick Read: This paper addresses the high sensitivity of continuous attractor networks (CANs) to small imperfections such as noise and heterogeneity, both common in biological systems, which creates a dilemma between stability and resolution. The key to its solution is sparse binary distributed codes based on random feature embeddings, in which neurons have spatially periodic receptive fields resembling grid-cell codes, enabling CANs to achieve high stability and high resolution simultaneously.

Link: https://arxiv.org/abs/2507.00598
Authors: Madison Cotteret, Christopher J. Kymn, Hugh Greatorex, Martin Ziegler, Elisabetta Chicca, Friedrich T. Sommer
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 14 pages, 4 figures. Supplementary material: 11 pages, 5 figures

Click to view abstract

Abstract:Continuous attractor networks (CANs) are widely used to model how the brain temporarily retains continuous behavioural variables via persistent recurrent activity, such as an animal’s position in an environment. However, this memory mechanism is very sensitive to even small imperfections, such as noise or heterogeneity, which are both common in biological systems. Previous work has shown that discretising the continuum into a finite set of discrete attractor states provides robustness to these imperfections, but necessarily reduces the resolution of the represented variable, creating a dilemma between stability and resolution. We show that this stability-resolution dilemma is most severe for CANs using unimodal bump-like codes, as in traditional models. To overcome this, we investigate sparse binary distributed codes based on random feature embeddings, in which neurons have spatially-periodic receptive fields. We demonstrate theoretically and with simulations that such grid-cell-like codes enable CANs to achieve both high stability and high resolution simultaneously. The model extends to embedding arbitrary nonlinear manifolds into a CAN, such as spheres or tori, and generalises linear path integration to integration along freely-programmable on-manifold vector fields. Together, this work provides a theory of how the brain could robustly represent continuous variables with high resolution and perform flexible computations over task-relevant manifolds.
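
A minimal sketch of a sparse binary code with spatially periodic receptive fields built from random features; the module periods, sparsity level, and thresholding rule are assumptions, not the paper's exact construction:

```python
import numpy as np

def grid_like_code(x, n_neurons=512, n_modules=8, sparsity=0.1, seed=0):
    """Sparse binary code with periodic receptive fields: random projection
    directions, module-specific spatial periods, and a threshold chosen so
    roughly `sparsity` of the units are active."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_neurons, x.shape[-1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    periods = rng.choice(np.geomspace(0.3, 2.0, n_modules), size=n_neurons)
    phases = rng.uniform(0, 2 * np.pi, size=n_neurons)
    activation = np.cos(2 * np.pi * (dirs @ x) / periods + phases)
    threshold = np.quantile(activation, 1 - sparsity)
    return (activation >= threshold).astype(np.uint8)

code_a = grid_like_code(np.array([0.10, 0.20]))
code_b = grid_like_code(np.array([0.11, 0.20]))
print(code_a.sum(), (code_a & code_b).sum())  # nearby positions share many active units
```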

[AI-21] Quantum Circuit Structure Optimization for Quantum Reinforcement Learning

Quick Read: This paper addresses the reduced learning efficiency of reinforcement learning (RL) in high-dimensional spaces caused by the curse of dimensionality. The key to its solution is quantum reinforcement learning (QRL), which exploits superposition and entanglement to handle high-dimensional problems more efficiently with fewer resources. QRL combines quantum neural networks (QNNs) with RL, with a parameterized quantum circuit (PQC) as the core computational module performing linear and nonlinear transformations via gate operations. To optimize the PQC structure, the paper proposes a QRL-NAS algorithm that integrates quantum neural architecture search (QNAS); experiments show it achieves higher rewards than QRL with fixed circuits, validating its effectiveness and practical utility.

Link: https://arxiv.org/abs/2507.00589
Authors: Seok Bin Son, Joongheon Kim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement learning (RL) enables agents to learn optimal policies through environmental interaction. However, RL suffers from reduced learning efficiency due to the curse of dimensionality in high-dimensional spaces. Quantum reinforcement learning (QRL) addresses this issue by leveraging superposition and entanglement in quantum computing, allowing efficient handling of high-dimensional problems with fewer resources. QRL combines quantum neural networks (QNNs) with RL, where the parameterized quantum circuit (PQC) acts as the core computational module. The PQC performs linear and nonlinear transformations through gate operations, similar to hidden layers in classical neural networks. Previous QRL studies, however, have used fixed PQC structures based on empirical intuition without verifying their optimality. This paper proposes a QRL-NAS algorithm that integrates quantum neural architecture search (QNAS) to optimize PQC structures within QRL. Experiments demonstrate that QRL-NAS achieves higher rewards than QRL with fixed circuits, validating its effectiveness and practical utility.

[AI-22] Advancing Local Search in SMT-NRA with MCSAT Integration

Quick Read: This paper targets the efficiency of local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA). The key to its solution is a two-dimensional cell-jump move, 2d-cell-jump, generalizing the key cell-jump operation, together with an extended local search framework, 2d-LS, that integrates the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. In addition, a sample-cell projection operator is implemented to speed up MCSAT, and a hybrid framework combining MCSAT, 2d-LS, and OpenCAD further improves solving efficiency through information exchange.

Link: https://arxiv.org/abs/2507.00557
Authors: Tianyi Ding, Haokun Li, Xinpeng Ni, Bican Xia, Tianqi Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
Comments:

Click to view abstract

Abstract:In this paper, we advance local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called 2d-cell-jump, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named 2d-LS (following the local search framework, LS, for SMT-NRA), integrating the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. To further improve the efficiency of MCSAT, we implement a recently proposed technique called sample-cell projection operator for MCSAT, which is well suited for CDCL-style search in the real domain and helps guide the search away from conflicting states. Finally, we design a hybrid framework for SMT-NRA combining MCSAT, 2d-LS and OpenCAD, to improve search efficiency through information exchange. The experimental results demonstrate improvements in local search performance, highlighting the effectiveness of the proposed methods.

[AI-23] Rethinking Group Recommender Systems in the Era of Generative AI: From One-Shot Recommendations to Agentic Group Decision Support

Quick Read: This paper questions why, despite a rich literature, real-world group recommender systems barely exist, challenging common academic assumptions about group communication processes and recommendation-supported decision-making, which may not match user needs or expectations. The key to its solution is a reorientation of the field that leverages modern generative AI assistants: a group recommender system in which human group members interact in a chat while an AI-based group recommendation agent assists the decision-making process in an agentic way, leading to a more natural group decision-making environment and wider practical adoption of group recommendation systems.

Link: https://arxiv.org/abs/2507.00535
Authors: Dietmar Jannach, Amra Delić, Francesco Ricci, Markus Zanker
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Submitted for publication

Click to view abstract

Abstract:More than twenty-five years ago, first ideas were developed on how to design a system that can provide recommendations to groups of users instead of individual users. Since then, a rich variety of algorithmic proposals were published, e.g., on how to acquire individual preferences, how to aggregate them, and how to generate recommendations for groups of users. However, despite the rich literature on the topic, barely any examples of real-world group recommender systems can be found. This lets us question common assumptions in academic research, in particular regarding communication processes in a group and how recommendation-supported decisions are made. In this essay, we argue that these common assumptions and corresponding system designs often may not match the needs or expectations of users. We thus call for a reorientation in this research area, leveraging the capabilities of modern Generative AI assistants like ChatGPT. Specifically, as one promising future direction, we envision group recommender systems to be systems where human group members interact in a chat and an AI-based group recommendation agent assists the decision-making process in an agentic way. Ultimately, this shall lead to a more natural group decision-making environment and finally to wider adoption of group recommendation systems in practice.

[AI-24] Customer Service Representatives' Perception of the AI Assistant in an Organization's Call Center

Quick Read: This paper investigates the new burdens and adaptation problems employees face in customer interactions as artificial intelligence (AI) is integrated into organizational settings. Through a field visit and semi-structured interviews with 13 customer service representatives (CSRs) at a power grid service call center, it finds that AI tools relieve some traditional burdens (such as typing and memorizing) while introducing new ones (such as learning, compliance, and psychological burdens). The key contribution is an account of the efforts and burdens CSRs undertake to adapt to the updated system, supporting a more nuanced understanding of AI integration in organizations.

Link: https://arxiv.org/abs/2507.00513
Authors: Kai Qin, Kexin Du, Yimeng Chen, Yueyan Liu, Jie Cai, Zhiqiang Nie, Nan Gao, Guohui Wei, Shengzhu Wang, Chun Yu
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: ACM CSCW Poster 2025

Click to view abstract

Abstract:The integration of various AI tools creates a complex socio-technical environment where employee-customer interactions form the core of work practices. This study investigates how customer service representatives (CSRs) at the power grid service customer service call center perceive AI assistance in their interactions with customers. Through a field visit and semi-structured interviews with 13 CSRs, we found that AI can alleviate some traditional burdens during the call (e.g., typing and memorizing) but also introduces new burdens (e.g., learning, compliance, psychological burdens). This research contributes to a more nuanced understanding of AI integration in organizational settings and highlights the efforts and burdens undertaken by CSRs to adapt to the updated system.

[AI-25] PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning

Quick Read: This paper addresses potential security threats in Safe Reinforcement Learning (Safe RL), specifically its vulnerability to backdoor attacks. The key to its solution is PNAct, an attack framework that implants backdoors using both Positive Action samples, which provide reference actions, and Negative Action samples, which indicate actions to avoid. The paper analyzes PNAct's properties theoretically, designs a corresponding attack algorithm, and validates its effectiveness experimentally, exposing the safety risks of Safe RL systems.

Link: https://arxiv.org/abs/2507.00485
Authors: Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. We then present PNAct, the first attack framework in the Safe RL field to implant backdoors using both Positive and Negative Action samples, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically point out the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of our proposed backdoor attack framework, evaluating it with the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at this https URL.

[AI-26] Diversity Conscious Refined Random Forest

Quick Read: This paper addresses the high inference cost and model redundancy of traditional random forests (RF), which stem from relying on large numbers of trees and all input features. The key to its solution is to grow trees dynamically on informative features only and to enforce maximal diversity by clustering and retaining uncorrelated trees: the least informative features are removed iteratively, the number of new trees to grow is determined analytically, and correlation-based clustering removes redundant trees, improving accuracy at the same tree count.

Link: https://arxiv.org/abs/2507.00467
Authors: Sijan Bhattarai, Saurav Bhandari, Girija Bhusal, Saroj Shakya, Tapendra Pandey
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Random Forest (RF) is a widely used ensemble learning technique known for its robust classification performance across diverse domains. However, it often relies on hundreds of trees and all input features, leading to high inference cost and model redundancy. In this work, our goal is to grow trees dynamically only on informative features and then enforce maximal diversity by clustering and retaining uncorrelated trees. Therefore, we propose a Refined Random Forest Classifier that iteratively refines itself by first removing the least informative features and then analytically determines how many new trees should be grown, followed by correlation-based clustering to remove redundant trees. The classification accuracy of our model was compared against the standard RF on the same number of trees. Experiments on 8 multiple benchmark datasets, including binary and multiclass datasets, demonstrate that the proposed model achieves improved accuracy compared to standard RF.
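
A rough sklearn sketch of the two refinement steps, iterative removal of uninformative features and correlation-based tree pruning; the drop fraction, correlation threshold, and greedy selection are assumptions (the paper determines the number of new trees analytically):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Step 1: drop the least informative features (here: the bottom quarter)
keep = np.argsort(rf.feature_importances_)[5:]
rf2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, keep], y)

# Step 2: prune redundant trees via correlation of their predictions
preds = np.array([t.predict(X[:, keep]) for t in rf2.estimators_])
corr = np.corrcoef(preds)
kept_trees, threshold = [], 0.9
for i in range(len(rf2.estimators_)):
    if all(corr[i, j] < threshold for j in kept_trees):
        kept_trees.append(i)
print(f"kept {len(kept_trees)} of {len(rf2.estimators_)} trees")
```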

[AI-27] Novel Complex-Valued Hopfield Neural Networks with Phase and Magnitude Quantization

Quick Read: This paper addresses the limited number of states in conventional complex-valued Hopfield neural networks (CvHNNs), which restricts their applications. The key to its solution is to introduce phase and magnitude quantization through two activation-function designs: a ceiling-type activation operating on the rectangular-coordinate representation of the complex net contribution, and a ceiling-type activation based on its polar-coordinate representation, significantly increasing the number of states and expanding the range of potential applications.

Link: https://arxiv.org/abs/2507.00461
Authors: Garimella Ramamurthy, Marcos Eduardo Valle, Tata Jagannadha Swamy
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Paper submitted to the Fifth International Conference on Emerging Techniques in Computational Intelligence (ICETCI 2025)

Click to view abstract

Abstract:This research paper introduces two novel complex-valued Hopfield neural networks (CvHNNs) that incorporate phase and magnitude quantization. The first CvHNN employs a ceiling-type activation function that operates on the rectangular coordinate representation of the complex net contribution. The second CvHNN similarly incorporates phase and magnitude quantization but utilizes a ceiling-type activation function based on the polar coordinate representation of the complex net contribution. The proposed CvHNNs, with their phase and magnitude quantization, significantly increase the number of states compared to existing models in the literature, thereby expanding the range of potential applications for CvHNNs.
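
The polar-coordinate variant can be sketched as follows; the numbers of magnitude and phase levels are free parameters (illustrative here), and each neuron then has mag_levels × phase_levels possible states:

```python
import numpy as np

def ceil_polar_activation(z, mag_levels=4, phase_levels=8, mag_max=2.0):
    """Ceiling-type activation on the polar form of a complex net contribution:
    magnitude is ceiled onto `mag_levels` bins up to `mag_max`, phase is ceiled
    onto multiples of 2*pi/phase_levels. Parameter names are illustrative."""
    mag_step = mag_max / mag_levels
    mag = np.minimum(np.ceil(np.abs(z) / mag_step) * mag_step, mag_max)
    phase_step = 2 * np.pi / phase_levels
    phase = np.ceil(np.angle(z) / phase_step) * phase_step
    return mag * np.exp(1j * phase)

z = np.array([0.3 + 0.4j, -1.2 + 0.1j])
print(ceil_polar_activation(z))
```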

[AI-28] Best Agent Identification for General Game Playing

Quick Read: This paper addresses how to accurately identify the best-performing algorithm for each sub-task in a multi-problem domain within a limited number of trials. The key to its solution is to cast this as a set of best-arm-identification problems for multi-armed bandits, where each bandit corresponds to a task and each arm to an algorithm or agent, and to propose an optimistic selection process based on the Wilson score interval (Optimistic-WS) that ranks arms across all bandits by their potential regret reduction, enabling more efficient algorithm selection.

Link: https://arxiv.org/abs/2507.00451
Authors: Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We present an efficient and generalised procedure to accurately identify the best performing algorithm for each sub-task in a multi-problem domain. Our approach treats this as a set of best arm identification problems for multi-armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on the Wilson score interval (Optimistic-WS) that ranks each arm across all bandits in terms of their potential regret reduction. We evaluate the performance of Optimistic-WS on two of the most popular general game domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of identifying the highest performing agent for each game within a limited number of trials. Compared to previous best arm identification algorithms for multi-armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi-task domains with high algorithm runtimes.
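
The Wilson score interval itself is standard; the sketch below ranks (game, agent) arms by the optimistic upper bound of that interval, which is one plausible reading of Optimistic-WS (the paper ranks by potential regret reduction):

```python
import math

def wilson_upper(wins, n, z=1.96):
    """Upper bound of the Wilson score interval for a win-rate estimate."""
    if n == 0:
        return 1.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom

# pick the (game, agent) arm with the largest optimistic upper bound
stats = {("gameA", "mcts"): (8, 10), ("gameA", "minimax"): (4, 10),
         ("gameB", "mcts"): (1, 2)}
arm = max(stats, key=lambda k: wilson_upper(*stats[k]))
print(arm, round(wilson_upper(*stats[arm]), 3))  # ('gameA', 'mcts') 0.943
```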

[AI-29] Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design

Quick Read: This paper addresses fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models generate high-quality data, optimizing them for potentially non-differentiable rewards (such as physics-based simulation or rewards grounded in scientific knowledge) with existing RL methods suffers from instability, low sample efficiency, and mode collapse due to their on-policy nature. The key to the solution is an iterative distillation-based fine-tuning framework that casts the problem as policy distillation: it collects off-policy data, simulates reward-based soft-optimal policies, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy, improving training stability and sample efficiency.

Link: https://arxiv.org/abs/2507.00445
Authors: Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the roll-in phase, simulates reward-based soft-optimal policies during roll-out, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design.
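
The KL-minimization update can be illustrated over a discretized candidate set (the paper works with continuous diffusion policies, so this shows only the shape of the objective); the reward tilt and the alpha temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(model_logits, base_logits, rewards, alpha=1.0):
    """Sketch of the roll-out target: a reward-tilted 'soft-optimal' policy
    proportional to base_policy * exp(reward / alpha); the model is updated
    by minimizing KL(soft_optimal || model). Shapes: (batch, n_candidates)."""
    soft_optimal = F.softmax(base_logits + rewards / alpha, dim=-1)
    log_model = F.log_softmax(model_logits, dim=-1)
    return F.kl_div(log_model, soft_optimal, reduction="batchmean")

logits = torch.randn(4, 16, requires_grad=True)
loss = distillation_loss(logits, logits.detach(), torch.randn(4, 16))
loss.backward()
print(float(loss))
```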

[AI-30] Novel Pigeon-inspired 3D Obstacle Detection and Avoidance Maneuver for Multi-UAV Systems

Quick Read: This paper addresses static and dynamic obstacle avoidance for multi-UAV systems operating in urban environments. The key to its solution is a collision-free formation control inspired by natural swarm behavior, combining a centralized guidance algorithm based on a probabilistic Lloyd's algorithm for optimal UAV positioning with a distributed control strategy for inter-vehicle collision and obstacle avoidance. The framework is further extended to 3D space with newly defined 3D maneuvers, suiting multi-UAV systems in complex environments.

Link: https://arxiv.org/abs/2507.00443
Authors: Reza Ahmadvand, Sarah Safura Sharif, Yaser Mike Banad
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 Pages, 11 Pictures, 1 Table, 3 Algorithms

Click to view abstract

Abstract:Recent advances in multi-agent systems manipulation have demonstrated a rising demand for the implementation of multi-UAV systems in urban areas, which are always subjected to the presence of static and dynamic obstacles. Inspired by the collective behavior of tilapia fish and pigeons, the focus of the presented research is on the introduction of a nature-inspired collision-free formation control for a multi-UAV system, considering the obstacle avoidance maneuvers. The developed framework in this study utilizes a semi-distributed control approach, in which, based on a probabilistic Lloyd’s algorithm, a centralized guidance algorithm works for optimal positioning of the UAVs, while a distributed control approach has been used for the intervehicle collision and obstacle avoidance. Further, the presented framework has been extended to the 3D space with a novel definition of 3D maneuvers. Finally, the presented framework has been applied to multi-UAV systems in 2D and 3D scenarios, and the obtained results demonstrated the validity of the presented method in dynamic environments with stationary and moving obstacles.

[AI-31] A Recipe for Causal Graph Regression: Confounding Effects Revisited ICML2025

Quick Read: This paper addresses the underexplored application of causal graph learning (CGL) to regression under out-of-distribution (OOD) scenarios for graph neural networks (GNNs): existing CGL methods are designed mainly for classification, while the more challenging regression setting is overlooked. The key to its solution is to revisit the predictive power of confounders in graph-level regression and to generalize classification-specific causal intervention techniques to regression through the lens of contrastive learning, improving OOD generalization.

Link: https://arxiv.org/abs/2507.00440
Authors: Yujia Yin, Tianyi Qu, Zihao Wang, Yifan Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: ICML 2025 accepted

Click to view abstract

Abstract:Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided on this https URL.

[AI-32] Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

Quick read: This paper evaluates the energy efficiency and performance of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, to assess its potential for high-performance computing (HPC) applications within the National Research Platform (NRP) ecosystem. The key is serving 15 open-source LLMs with the vLLM framework and benchmarking them against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs, yielding a comprehensive comparison of QAic's energy efficiency and performance.

Link: https://arxiv.org/abs/2507.00418
Authors: Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: To appear in Proceedings of the Practice and Experience in Advanced Research Computing (PEARC '25)

Abstract:This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appear to be energy efficient and perform well on the energy-efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the National Research Platform (NRP).

[AI-33] iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

Quick read: This paper addresses conformance verification between protocol implementations and their specifications; traditional approaches rely on manually written test cases and scripts, making the process tedious and inefficient. The key is iPanda, the first end-to-end framework that automates protocol conformance testing with large language models (LLMs): it generates comprehensive test cases via a keyword-based method, interprets the implementation and produces executable test code with code-based retrieval-augmented generation, refines code quality through an iterative self-correction mechanism, and finally verifies compliance between implementation and specification by executing and analyzing the generated tests.

Link: https://arxiv.org/abs/2507.00378
Authors: Xikai Sun, Fan Dang, Kebin Liu, Xin Miao, Zihao Yang, Haimo Lu, Yawen Zheng, Yunhao Liu
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first end-to-end framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes a code-based retrieval-augmented generation approach to effectively interpret the implementation and produce executable test code. To further enhance code quality, iPanda incorporates an iterative self-correction mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-code generation by factors ranging from 4.675 times to 10.751 times.
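The iterative self-correction mechanism can be sketched as a simple generate-test-repair loop. `call_llm` and `run_tests` below are hypothetical stubs standing in for a real model endpoint and a real conformance test harness; only the control flow is meant to mirror the described idea.

```python
# Generic shape of an iterative self-correction loop for generated test code.
# `call_llm` and `run_tests` are hypothetical stubs; the loop is the point.

def call_llm(prompt: str) -> str:
    return "def test_handshake():\n    assert True\n"  # stubbed generation

def run_tests(code: str):
    try:
        compile(code, "<generated>", "exec")  # stand-in for executing a test suite
        return True, ""
    except SyntaxError as err:
        return False, str(err)

def generate_with_self_correction(spec: str, max_rounds: int = 3) -> str:
    code = call_llm(f"Write test code for: {spec}")
    for _ in range(max_rounds):
        ok, error = run_tests(code)
        if ok:
            break
        code = call_llm(f"Fix this test code.\nError: {error}\nCode:\n{code}")
    return code

print(generate_with_self_correction("TCP three-way handshake conformance"))
```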

[AI-34] Data-Driven Exploration for a Class of Continuous-Time Linear–Quadratic Reinforcement Learning Problems

Quick read: This paper studies reinforcement learning (RL) for continuous-time stochastic linear-quadratic (LQ) control problems in which the volatility depends on both state and control, the state is scalar-valued, and running control rewards are absent. The key is a model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization via the critic and policy variance via the actor. Compared with the fixed or deterministic exploration schedules of prior work, this adaptive approach improves learning efficiency with minimal tuning, while still achieving a sublinear regret bound matching the best-known model-free results for this class of LQ problems; numerical experiments confirm faster convergence and better regret performance.

Link: https://arxiv.org/abs/2507.00358
Authors: Yilie Huang, Xun Yu Zhou
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 36 pages, 10 figures

Abstract:We study reinforcement learning (RL) for the same class of continuous-time stochastic linear–quadratic (LQ) control problems as in [huang2024sublinear], where volatilities depend on both states and controls while states are scalar-valued and running control rewards are absent. We propose a model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization by the critic and policy variance by the actor. Unlike the constant or deterministic exploration schedules employed in [huang2024sublinear], which require extensive tuning for implementations and ignore learning progress during iterations, our adaptive exploratory approach boosts learning efficiency with minimal tuning. Despite its flexibility, our method achieves a sublinear regret bound that matches the best-known model-free results for this class of LQ problems, which were previously derived only with fixed exploration schedules. Numerical experiments demonstrate that adaptive explorations accelerate convergence and improve regret performance compared to the non-adaptive model-free and model-based counterparts.

[AI-35] An AST-guided LLM Approach for SVRF Code Synthesis

Quick read: This paper addresses the breakdown of traditional Standard Verification Rule Format (SVRF) development as advancing process nodes introduce increasingly complex design rules, together with the resulting expertise gap. The key is to combine Abstract Syntax Tree (AST) embedding with Retrieval-Augmented Generation (RAG), using structural validation and injected domain knowledge to improve the semantic accuracy of SVRF code synthesis and to minimize errors.

Link: https://arxiv.org/abs/2507.00352
Authors: Abanoub E. Abdelmalak, Mohamed A. Elsayed, David Abercrombie, Ilhami Torunoglu
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 9 pages, 5 figures, 2 tables

Abstract:Standard Verification Rule Format (SVRF) is essential for semiconductor applications like Design Rule Check (DRC), Layout Versus Schematic (LVS), and Optical Proximity Correction (OPC), and it faces challenges as advancing nodes create complex design rules that render traditional SVRF development ineffective and highlight an expertise gap. This paper introduces a novel methodology integrating Abstract Syntax Tree (AST) embedding and Retrieval-Augmented Generation (RAG) for enhanced SVRF code synthesis, ensuring semantic accuracy and error minimization through structural validation with domain-specific insights for precise code generation. We evaluate different T5-based models and propose an innovative SVRF-specific scoring framework that complements standard metrics like BLEU and ROUGE-L. In our approach, AST provides rigorous structural validation, while RAG infuses relevant domain knowledge, effectively enhancing the code generation workflow. Testing on a comprehensive benchmark of 740 DRC rule implementations, our methodology demonstrates up to a 40% improvement in code generation accuracy compared to a basic text-based fine-tuning process. This fusion of industry expertise with advanced coding strategies not only optimizes SVRF development under limited dataset constraints but also creates a more intuitive and efficient coding environment. Consequently, users can rapidly iterate through design cycles, reduce manual error correction, and significantly improve overall productivity.
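The AST-plus-retrieval idea can be approximated in a few lines using Python's own `ast` module as a stand-in parser (a real system would need an SVRF grammar): embed snippets as bags of AST node types and retrieve the most structurally similar reference rule. The toy rules and similarity measure are assumptions for illustration.

```python
import ast
from collections import Counter

# Sketch of AST embedding + retrieval: represent a snippet as a bag of AST
# node types, then fetch the most similar known-good rule by cosine
# similarity to ground generation in retrieved examples.

def ast_embedding(code: str) -> Counter:
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(code)))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

reference_rules = {  # invented toy "rules", written as parseable Python
    "width_check": "w = layer_width('M1')\nassert w >= 0.05",
    "spacing_check": "s = spacing('M1', 'M2')\nassert s >= 0.07",
}
query = "x = layer_width('M2')\nassert x >= 0.04"
q = ast_embedding(query)
best = max(reference_rules, key=lambda k: cosine(q, ast_embedding(reference_rules[k])))
print("retrieved:", best)
```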

[AI-36] VTS-Guided AI Interaction Workflow for Business Insights

Quick read: This paper tackles the difficulty modern firms face in quickly extracting usable insights from large volumes of dense, unstructured reports. The key is to integrate Visual Thinking Strategies into AI agents so that they can extract business insights from unstructured text, tables, and images at scale. The system operates on three tiers (micro, meso, macro): it tags issues, links them to source pages, and rolls them into clear action levers stored in a searchable YAML file, improving both the efficiency and the accuracy of analysis.

Link: https://arxiv.org/abs/2507.00347
Authors: Sun Ding, Ude Enebeli, Atilhan (Ati) Manay, Ryan Pua, Kamal Kotak
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern firms face a flood of dense, unstructured reports. Turning these documents into usable insights takes heavy effort and is far from agile when quick answers are needed. VTS-AI tackles this gap. It integrates Visual Thinking Strategies, which emphasize evidence-based observation, linking, and thinking, into AI agents, so the agents can extract business insights from unstructured text, tables, and images at scale. The system works in three tiers (micro, meso, macro). It tags issues, links them to source pages, and rolls them into clear action levers stored in a searchable YAML file. In tests on an 18-page business report, VTS-AI matched the speed of a one-shot ChatGPT prompt yet produced richer findings: page locations, verbatim excerpts, severity scores, and causal links. Analysts can accept or adjust these outputs in the same IDE, keeping human judgment in the loop. Early results show VTS-AI spots the direction of key metrics and flags where deeper number-crunching is needed. Next steps include mapping narrative tags to financial ratios, adding finance-tuned language models through a Model-Context Protocol, and building a Risk Safety Layer to stress-test models and secure data. These upgrades aim to make VTS-AI a production-ready, audit-friendly tool for rapid business analysis.

[AI-37] Visual Privacy Management with Generative AI for Blind and Low-Vision People

Quick read: This paper investigates the visual privacy issues that blind and low-vision (BLV) users face when using generative AI (GenAI) tools to handle visual content. Interviews with 21 participants reveal practices that balance privacy, efficiency, and emotional agency across six key scenarios such as self-presentation, indoor/outdoor spatial privacy, social sharing, and handling professional content, and surface design preferences including on-device processing, zero-retention guarantees, sensitive-content redaction, privacy-aware appearance indicators, and multimodal tactile mirrored interaction methods. The key lies in user-centered design principles that strengthen visual privacy protection while keeping the tools effective and emotionally supportive.

Link: https://arxiv.org/abs/2507.00286
Authors: Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Blind and low vision (BLV) individuals use Generative AI (GenAI) tools to interpret and manage visual content in their daily lives. While such tools can enhance the accessibility of visual content and so enable greater user independence, they also introduce complex challenges around visual privacy. In this paper, we investigate the current practices and future design preferences of blind and low vision individuals through an interview study with 21 participants. Our findings reveal a range of current practices with GenAI that balance privacy, efficiency, and emotional agency, with users accounting for privacy risks across six key scenarios, such as self-presentation, indoor/outdoor spatial privacy, social sharing, and handling professional content. Our findings reveal design preferences, including on-device processing, zero-retention guarantees, sensitive content redaction, privacy-aware appearance indicators, and multimodal tactile mirrored interaction methods. We conclude with actionable design recommendations to support user-centered visual privacy through GenAI, expanding the notion of privacy and the responsible handling of others’ data.

[AI-38] Double Q-learning for Value-based Deep Reinforcement Learning Revisited

Quick read: This paper targets the pervasive Q-value overestimation problem in reinforcement learning (RL), especially in value-based deep RL algorithms. The key is an adaptation of Double Q-learning, termed Deep Double Q-learning (DDQL), which trains two distinct Q-functions that bootstrap off one another to decouple action selection from action evaluation, effectively reducing Q-value overestimation.

Link: https://arxiv.org/abs/2507.00275
Authors: Prabhat Nagarajan, Martha White, Marlos C. Machado
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 44 pages

Abstract:Overestimation is pervasive in reinforcement learning (RL), including in Q-learning, which forms the algorithmic basis for many value-based deep RL algorithms. Double Q-learning is an algorithm introduced to address Q-learning’s overestimation by training two Q-functions and using both to de-correlate action-selection and action-evaluation in bootstrap targets. Shortly after Q-learning was adapted to deep RL in the form of deep Q-networks (DQN), Double Q-learning was adapted to deep RL in the form of Double DQN. However, Double DQN only loosely adapts Double Q-learning, forgoing the training of two different Q-functions that bootstrap off one another. In this paper, we study algorithms that adapt this core idea of Double Q-learning for value-based deep RL. We term such algorithms Deep Double Q-learning (DDQL). Our aim is to understand whether DDQL exhibits less overestimation than Double DQN and whether performant instantiations of DDQL exist. We answer both questions affirmatively, demonstrating that DDQL reduces overestimation and outperforms Double DQN in aggregate across 57 Atari 2600 games, without requiring additional hyperparameters. We also study several aspects of DDQL, including its network architecture, replay ratio, and minibatch sampling strategy.
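The core Double Q-learning update that DDQL builds on is standard and easy to state: two Q tables are updated on alternating coin flips, each selecting the greedy action with one table and evaluating it with the other. Below is a tabular sketch on Sutton and Barto's classic two-state MDP that exposes maximization bias; the MDP is a textbook example, not taken from the paper.

```python
import numpy as np

# Tabular Double Q-learning on the two-state MDP from Sutton & Barto:
# from A, "right" terminates with reward 0, "left" goes to B, whose many
# noisy actions all have mean reward -0.1. A single max-based estimator
# overestimates B; two cross-evaluating tables damp that bias.

rng = np.random.default_rng(0)
nb = 10  # noisy actions available in state B
qa = {"A": np.zeros(2), "B": np.zeros(nb)}
qb = {"A": np.zeros(2), "B": np.zeros(nb)}
alpha, eps = 0.1, 0.1

def episode():
    s = "A"
    while s is not None:
        q_sum = qa[s] + qb[s]
        a = rng.integers(len(q_sum)) if rng.random() < eps else int(np.argmax(q_sum))
        if s == "A":
            r, s2 = (0.0, None) if a == 1 else (0.0, "B")  # 0 = left, 1 = right
        else:
            r, s2 = rng.normal(-0.1, 1.0), None
        q1, q2 = (qa, qb) if rng.random() < 0.5 else (qb, qa)
        # select with q1, evaluate with q2 (the Double Q-learning trick)
        target = r + (0.0 if s2 is None else q2[s2][int(np.argmax(q1[s2]))])
        q1[s][a] += alpha * (target - q1[s][a])
        s = s2

for _ in range(5000):
    episode()
print("Q(A): left=%.3f right=%.3f" % tuple((qa["A"] + qb["A"]) / 2))
```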

[AI-39] Control-Optimized Deep Reinforcement Learning for Artificially Intelligent Autonomous Systems

Quick read: This paper addresses the performance degradation of deep reinforcement learning (DRL) in real-world applications caused by action-execution mismatches arising from system dynamics, hardware constraints, and latency. The key is a control-optimized DRL framework that explicitly models and compensates for execution mismatches through a structured two-stage process: determining the desired action and then selecting the appropriate control signal to ensure correct execution, while accounting for mismatches and controller corrections during training. This improves the agent's robustness under real-world uncertainty.

Link: https://arxiv.org/abs/2507.00268
Authors: Oren Fivel, Matan Rudman, Kobi Cohen
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 27 pages, 10 figures

Abstract:Deep reinforcement learning (DRL) has become a powerful tool for complex decision-making in machine learning and AI. However, traditional methods often assume perfect action execution, overlooking the uncertainties and deviations between an agent’s selected actions and the actual system response. In real-world applications, such as robotics, mechatronics, and communication networks, execution mismatches arising from system dynamics, hardware constraints, and latency can significantly degrade performance. This work advances AI by developing a novel control-optimized DRL framework that explicitly models and compensates for action execution mismatches, a challenge largely overlooked in existing methods. Our approach establishes a structured two-stage process: determining the desired action and selecting the appropriate control signal to ensure proper execution. It trains the agent while accounting for action mismatches and controller corrections. By incorporating these factors into the training process, the AI agent optimizes the desired action with respect to both the actual control signal and the intended outcome, explicitly considering execution errors. This approach enhances robustness, ensuring that decision-making remains effective under real-world uncertainties. Our approach offers a substantial advancement for engineering practice by bridging the gap between idealized learning and real-world implementation. It equips intelligent agents operating in engineering environments with the ability to anticipate and adjust for actuation errors and system disturbances during training. We evaluate the framework in five widely used open-source mechanical simulation environments we restructured and developed to reflect real-world operating conditions, showcasing its robustness against uncertainties and offering a highly practical and efficient solution for control-oriented applications.

[AI-40] Gym4ReaL: A Suite for Benchmarking Real-World Reinforcement Learning

Quick read: This paper addresses the key challenges reinforcement learning (RL) faces in real-world applications, such as large state-action spaces, non-stationarity, and partial observability. Existing benchmarks mostly focus on idealized, fully observable, stationary environments and fail to capture real-world complexity. The solution is Gym4ReaL, a comprehensive suite of realistic environments designed to support the development and evaluation of RL algorithms that can operate in real-world scenarios; by exposing algorithms to diverse practical challenges across a variety of tasks, it motivates further advances in RL methods for handling real-world complexity.

Link: https://arxiv.org/abs/2507.00257
Authors: Davide Salaorni, Vincenzo De Paola, Samuele Delpero, Giovanni Dispoto, Paolo Bonetti, Alessio Russo, Giuseppe Calcagno, Francesco Trovò, Matteo Papini, Alberto Maria Metelli, Marco Mussi, Marcello Restelli
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:In recent years, Reinforcement Learning (RL) has made remarkable progress, achieving superhuman performance in a wide range of simulated environments. As research moves toward deploying RL in real-world applications, the field faces a new set of challenges inherent to real-world settings, such as large state-action spaces, non-stationarity, and partial observability. Despite their importance, these challenges are often underexplored in current benchmarks, which tend to focus on idealized, fully observable, and stationary environments, often neglecting to incorporate real-world complexities explicitly. In this paper, we introduce Gym4ReaL, a comprehensive suite of realistic environments designed to support the development and evaluation of RL algorithms that can operate in real-world scenarios. The suite includes a diverse set of tasks that expose algorithms to a variety of practical challenges. Our experimental results show that, in these settings, standard RL algorithms confirm their competitiveness against rule-based benchmarks, motivating the development of new methods to fully exploit the potential of RL to tackle the complexities of real-world tasks.

[AI-41] A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss

Quick read: This paper addresses speech super-resolution (SSR), i.e., enhancing low-resolution speech by increasing its sampling rate while preserving perceptual quality. Whereas most prior work focuses on magnitude reconstruction, this paper emphasizes the importance of phase reconstruction. The key is CTFT-Net, a complex time-frequency transformation network that reconstructs both magnitude and phase in the complex domain: a complex global attention block models inter-phoneme and inter-frequency dependencies, a complex conformer captures long-range and local features to improve frequency reconstruction and noise robustness, and time-domain plus multi-resolution frequency-domain losses enhance generalization. Experiments show CTFT-Net outperforms state-of-the-art models on the VCTK dataset and reconstructs high-frequency content without noisy artifacts even under extreme upsampling (2 kHz to 48 kHz).

Link: https://arxiv.org/abs/2507.00229
Authors: Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Rashedul Hasan, Taieba Athay, Nursad Mamun, Anomadarshi Barua
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme upsampling (2 kHz to 48 kHz), reconstructing high frequencies effectively without noisy artifacts.
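The complex-domain representation at the heart of such models can be illustrated with a plain STFT round trip that keeps magnitude and phase explicitly. The sketch below only demonstrates the representation an SSR network would operate on; the network itself, the window sizes, and the sample rate are illustrative assumptions.

```python
import torch

# Round-trip a waveform through the STFT, keeping magnitude and phase as
# separate tensors, then resynthesize with the inverse STFT. A complex-domain
# SSR model would enhance both parts; here we just recombine them unchanged.

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
wave = torch.sin(2 * torch.pi * 440.0 * torch.arange(16000) / 16000.0)  # 1 s of 440 Hz at 16 kHz

spec = torch.stft(wave, n_fft, hop_length=hop, window=window, return_complex=True)
magnitude, phase = spec.abs(), spec.angle()

rebuilt = torch.polar(magnitude, phase)  # magnitude * exp(i * phase)
recon = torch.istft(rebuilt, n_fft, hop_length=hop, window=window, length=wave.numel())
print("max round-trip error:", (wave - recon).abs().max().item())
```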

[AI-42] Learning for routing: A guided review of recent developments and future directions

Quick read: This paper reviews approaches to NP-hard combinatorial optimization problems, focusing on routing problems such as the traveling salesman problem (TSP) and the vehicle routing problem (VRP). Exact algorithms require excessive computation time on such problems, while heuristics provide approximate solutions without optimality guarantees. The key contribution is a taxonomy that categorizes machine learning (ML) based routing methods into construction-based and improvement-based approaches, integrating traditional operations-research methods with state-of-the-art ML techniques to improve solution efficiency and quality and to provide a structured framework for future research on emerging VRP variants.

Link: https://arxiv.org/abs/2507.00218
Authors: Fangting Zhou, Attila Lischka, Balazs Kulcsar, Jiaming Wu, Morteza Haghir Chehreghani, Gilbert Laporte
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Accepted for publication in Transportation Research Part E: Logistics and Transportation Review

Abstract:This paper reviews the current progress in applying machine learning (ML) tools to solve NP-hard combinatorial optimization problems, with a focus on routing problems such as the traveling salesman problem (TSP) and the vehicle routing problem (VRP). Due to the inherent complexity of these problems, exact algorithms often require excessive computational time to find optimal solutions, while heuristics can only provide approximate solutions without guaranteeing optimality. With the recent success of machine learning models, there is a growing trend in proposing and implementing diverse ML techniques to enhance the resolution of these challenging routing problems. We propose a taxonomy categorizing ML-based routing methods into construction-based and improvement-based approaches, highlighting their applicability to various problem characteristics. This review aims to integrate traditional OR methods with state-of-the-art ML techniques, providing a structured framework to guide future research and address emerging VRP variants.
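As a concrete anchor for the construction-based side of the taxonomy, the classic nearest-neighbor heuristic builds a TSP tour one city at a time; learned construction methods replace its greedy rule with a trained policy. A minimal sketch on random coordinates:

```python
import numpy as np

# Nearest-neighbor construction heuristic for the TSP: repeatedly visit the
# closest unvisited city. Fast, simple, and a standard baseline that learned
# construction policies aim to beat.

rng = np.random.default_rng(42)
cities = rng.uniform(size=(12, 2))

def nearest_neighbor_tour(xy, start=0):
    unvisited = set(range(len(xy))) - {start}
    tour = [start]
    while unvisited:
        last = xy[tour[-1]]
        nxt = min(unvisited, key=lambda j: np.linalg.norm(xy[j] - last))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

tour = nearest_neighbor_tour(cities)
length = sum(np.linalg.norm(cities[tour[i]] - cities[tour[(i + 1) % len(tour)]])
             for i in range(len(tour)))
print(tour, round(float(length), 3))
```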

[AI-43] Holistic Artificial Intelligence in Medicine; improved performance and explainability

Quick read: This paper addresses two limitations of the existing HAIM framework for medical AI: task-agnostic data usage and lack of explainability. The key is xHAIM (Explainable HAIM), which leverages generative AI to improve both prediction and explainability through four structured steps: automatically identifying task-relevant patient data across modalities, generating comprehensive patient summaries, using these summaries for improved predictive modeling, and providing clinical explanations by linking predictions to patient-specific medical knowledge.

Link: https://arxiv.org/abs/2507.00205
Authors: Periklis Petridis, Georgios Margaritis, Vasiliki Stoumpou, Dimitris Bertsimas
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to npj Digital Medicine

Abstract:With the increasing interest in deploying Artificial Intelligence in medicine, we previously introduced HAIM (Holistic AI in Medicine), a framework that fuses multimodal data to solve downstream clinical tasks. However, HAIM uses data in a task-agnostic manner and lacks explainability. To address these limitations, we introduce xHAIM (Explainable HAIM), a novel framework leveraging Generative AI to enhance both prediction and explainability through four structured steps: (1) automatically identifying task-relevant patient data across modalities, (2) generating comprehensive patient summaries, (3) using these summaries for improved predictive modeling, and (4) providing clinical explanations by linking predictions to patient-specific medical knowledge. Evaluated on the HAIM-MIMIC-MM dataset, xHAIM improves average AUC from 79.9% to 90.3% across chest pathology and operative tasks. Importantly, xHAIM transforms AI from a black-box predictor into an explainable decision support system, enabling clinicians to interactively trace predictions back to relevant patient data, bridging AI advancements with clinical utility.

[AI-44] What Makes Local Updates Effective: The Role of Data Heterogeneity and Smoothness

Quick read: This paper develops a theoretical understanding of local-update algorithms (especially Local SGD) in distributed and federated optimization under realistic models of data heterogeneity, clarifying when local updates outperform centralized or mini-batch methods in convex and non-convex settings. The key is a fine-grained consensus-error-based analysis framework that yields tighter finite-time convergence bounds under third-order smoothness and relaxed heterogeneity assumptions, and shows that bounded second-order heterogeneity is both necessary and sufficient for local updates to beat the traditional methods.

Link: https://arxiv.org/abs/2507.00195
Authors: Kumar Kshitij Patel
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:This thesis contributes to the theoretical understanding of local update algorithms, especially Local SGD, in distributed and federated optimization under realistic models of data heterogeneity. A central focus is on the bounded second-order heterogeneity assumption, which is shown to be both necessary and sufficient for local updates to outperform centralized or mini-batch methods in convex and non-convex settings. The thesis establishes tight upper and lower bounds in several regimes for various local update algorithms and characterizes the min-max complexity of multiple problem classes. At its core is a fine-grained consensus-error-based analysis framework that yields sharper finite-time convergence bounds under third-order smoothness and relaxed heterogeneity assumptions. The thesis also extends to online federated learning, providing fundamental regret bounds under both first-order and bandit feedback. Together, these results clarify when and why local updates offer provable advantages, and the thesis serves as a self-contained guide for analyzing Local SGD in heterogeneous environments.
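The Local SGD scheme under discussion is easy to state: each worker takes K local gradient steps on its own objective, then the iterates are averaged. A minimal one-dimensional sketch with heterogeneous quadratics (an illustrative setup, not the thesis's setting) follows; with K=1 it degenerates to mini-batch SGD, and larger K trades communication for client drift.

```python
import numpy as np

# Local SGD on heterogeneous quadratics: worker m minimizes
# f_m(x) = 0.5 * (x - c_m)^2 with noisy gradients, taking K local steps
# between averaging rounds. The global optimum is the mean of the c_m.

rng = np.random.default_rng(0)
centers = rng.normal(size=8)      # one optimum per worker (data heterogeneity)
x_global, lr, K = 0.0, 0.1, 5

for communication_round in range(50):
    local_iterates = []
    for c in centers:
        x = x_global
        for _ in range(K):                              # K local steps
            x -= lr * (x - c + rng.normal(0, 0.1))      # noisy local gradient
        local_iterates.append(x)
    x_global = float(np.mean(local_iterates))           # averaging step

print("global iterate:", round(x_global, 3),
      "true optimum:", round(float(centers.mean()), 3))
```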

[AI-45] Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions ICML2025

Quick read: This paper addresses the underuse of behavioral signals from wearables in health prediction: foundation models have mostly been applied to low-level sensor data even though behavioral data are often more informative, being aligned with physiologically relevant timescales and quantities. The key is to build foundation models on over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for the unique characteristics of behavioral signals, and validating strong performance across 57 health-related tasks.

Link: https://arxiv.org/abs/2507.00191
Authors: Eray Erturk, Fahad Kamran, Salar Abbaspourazad, Sean Jewell, Harsh Sharma, Yujie Li, Sinead Williamson, Nicholas J Foti, Joseph Futoma
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025

Abstract:Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related tasks, our model shows strong performance across diverse real-world applications including individual-level classification and time-varying health state prediction. The model excels in behavior-driven tasks like sleep prediction, and improves further when combined with representations of raw sensor data. These results underscore the importance of tailoring foundation model design to wearables and demonstrate the potential to enable new health applications.

[AI-46] Text-to-Level Diffusion Models With Various Text Encoders for Super Mario Bros

Quick read: This paper explores text-to-level generation with diffusion models; while diffusion models have been used to unconditionally generate tile-based game levels, their use for text-conditioned level generation remains underexplored. The key is to automatically assign descriptive captions to an existing level dataset and to train diffusion models using both pretrained text encoders and simple transformer models trained from scratch, producing entire playable levels rather than individual scenes. Comparing the diversity and playability of the generated levels, the study finds that a diffusion model with a simple transformer text embedding trains faster, suggesting that reliance on large language models is not necessary.

Link: https://arxiv.org/abs/2507.00184
Authors: Jacob Schrum, Olivia Kilday, Emilio Salas, Bess Hagan, Reid Williams
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent research shows how diffusion models can unconditionally generate tile-based game levels, but use of diffusion models for text-to-level generation is underexplored. There are practical considerations for creating a usable model: caption/level pairs are needed, as is a text embedding model, and a way of generating entire playable levels, rather than individual scenes. We present strategies to automatically assign descriptive captions to an existing level dataset, and train diffusion models using both pretrained text encoders and simple transformer models trained from scratch. Captions are automatically assigned to generated levels so that the degree of overlap between input and output captions can be compared. We also assess the diversity and playability of the resulting levels. Results are compared with an unconditional diffusion model and a generative adversarial network, as well as the text-to-level approaches Five-Dollar Model and MarioGPT. Notably, the best diffusion model uses a simple transformer model for text embedding, and takes less time to train than diffusion models employing more complex text encoders, indicating that reliance on larger language models is not necessary. We also present a GUI allowing designers to construct long levels from model-generated scenes.

[AI-47] ChatGPT produces more “lazy” thinkers: Evidence of cognitive engagement decline

Quick read: This paper investigates the potential negative impact of generative AI tools, specifically ChatGPT, on students' cognitive engagement during academic writing tasks. In an experimental design, participants were randomly assigned to an AI-assisted group or a non-assisted control group; the ChatGPT group scored significantly lower on a cognitive engagement scale (CES-AI), indicating that AI assistance may lead to cognitive offloading. The key contributions are the CES-AI scale developed to assess cognitive engagement and the empirical evidence of AI tools' potential effect on learners' deep thinking and active learning.

Link: https://arxiv.org/abs/2507.00181
Authors: Georgios P. Georgiou
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the increasing use of large language models (LLMs) in education, concerns have emerged about their potential to reduce deep thinking and active learning. This study investigates the impact of generative artificial intelligence (AI) tools, specifically ChatGPT, on the cognitive engagement of students during academic writing tasks. The study employed an experimental design with participants randomly assigned to either an AI-assisted (ChatGPT) or a non-assisted (control) condition. Participants completed a structured argumentative writing task followed by a cognitive engagement scale (CES), the CES-AI, developed to assess mental effort, attention, deep processing, and strategic thinking. The results revealed significantly lower cognitive engagement scores in the ChatGPT group compared to the control group. These findings suggest that AI assistance may lead to cognitive offloading. The study contributes to the growing body of literature on the psychological implications of AI in education and raises important questions about the integration of such tools into academic practice. It calls for pedagogical strategies that promote active, reflective engagement with AI-generated content to avoid compromising self-regulated learning and deep cognitive involvement of students.

[AI-48] BlackBoxToBlueprint: Extracting Interpretable Logic from Legacy Systems using Reinforcement Learning and Counterfactual Analysis

Quick read: This paper addresses the challenges of modernizing legacy software systems, which are often hampered by missing documentation and poor understanding of the original system's intricate decision logic. The key is a pipeline that treats the legacy system as a black box: a reinforcement learning (RL) agent explores the input space and identifies critical decision boundaries by rewarding actions that cause meaningful changes in the system's output; the resulting counterfactual state transitions are clustered with K-Means, and decision trees are trained on the clusters to extract human-readable rules that approximate the system's core logic.

Link: https://arxiv.org/abs/2507.00180
Authors: Vidhi Rathore
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Modernizing legacy software systems is a critical but challenging task, often hampered by a lack of documentation and understanding of the original system’s intricate decision logic. Traditional approaches like behavioral cloning merely replicate input-output behavior without capturing the underlying intent. This paper proposes a novel pipeline to automatically extract interpretable decision logic from legacy systems treated as black boxes. The approach uses a Reinforcement Learning (RL) agent to explore the input space and identify critical decision boundaries by rewarding actions that cause meaningful changes in the system’s output. These counterfactual state transitions, where the output changes, are collected and clustered using K-Means. Decision trees are then trained on these clusters to extract human-readable rules that approximate the system’s decision logic near the identified boundaries. I demonstrated the pipeline’s effectiveness on three dummy legacy systems with varying complexity, including threshold-based, combined-conditional, and non-linear range logic. Results show that the RL agent successfully focuses exploration on relevant boundary regions, and the extracted rules accurately reflect the core logic of the underlying dummy systems, providing a promising foundation for generating specifications and test cases during legacy migration.
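The back half of the pipeline (cluster the counterfactual transitions, then fit a readable tree) can be sketched with scikit-learn on a threshold-style black box. Random perturbation probing stands in for the RL explorer here, and the hidden rule is invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Probe the input space, keep points whose output flips under a small
# perturbation (counterfactual transitions), cluster those boundary points,
# and fit a shallow decision tree to recover human-readable rules.

def legacy_system(x):                                   # invented hidden rule
    return int(x[0] > 3.0 and x[1] < 1.5)

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(4000, 2))
y = np.array([legacy_system(x) for x in X])
y_shifted = np.array([legacy_system(x) for x in X + rng.normal(0, 0.3, X.shape)])

boundary = X[y != y_shifted]                            # near a decision boundary
centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(boundary).cluster_centers_
print("boundary regions near:\n", np.round(centers, 2))

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1"]))    # readable rules
```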

[AI-49] Designing an Adaptive Storytelling Platform to Promote Civic Education in Politically Polarized Learning Environments

Quick read: This paper addresses the negative effects of political polarization on democratic civic education, in particular identity-based resistance to opposing viewpoints. The key is to use generative AI to design adaptive, emotionally responsive civic narratives that sustain students' emotional engagement and promote perspective-taking toward political out-groups. Core strategies include supporting three storytelling mechanisms via affective computing (transportation into the story world, identification with characters, and interaction with the storyteller), real-time assessment of affective states through facial emotion recognition and attention tracking, and GPT-4-based language adaptation to personalize narrative content.

Link: https://arxiv.org/abs/2507.00161
Authors: Christopher M. Wegemer, Edward Halim, Jeff Burke
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Political polarization undermines democratic civic education by exacerbating identity-based resistance to opposing viewpoints. Emerging AI technologies offer new opportunities to advance interventions that reduce polarization and promote political open-mindedness. We examined novel design strategies that leverage adaptive and emotionally-responsive civic narratives that may sustain students’ emotional engagement in stories, and in turn, promote perspective-taking toward members of political out-groups. Drawing on theories from political psychology and narratology, we investigate how affective computing techniques can support three storytelling mechanisms: transportation into a story world, identification with characters, and interaction with the storyteller. Using a design-based research (DBR) approach, we iteratively developed and refined an AI-mediated Digital Civic Storytelling (AI-DCS) platform. Our prototype integrates facial emotion recognition and attention tracking to assess users’ affective and attentional states in real time. Narrative content is organized around pre-structured story outlines, with beat-by-beat language adaptation implemented via GPT-4, personalizing linguistic tone to sustain students’ emotional engagement in stories that center political perspectives different from their own. Our work offers a foundation for AI-supported, emotionally-sensitive strategies that address affective polarization while preserving learner autonomy. We conclude with implications for civic education interventions, algorithmic literacy, and HCI challenges associated with AI dialogue management and affect-adaptive learning environments.

[AI-50] AI-Hybrid TRNG: Kernel-Based Deep Learning for Near-Uniform Entropy Harvesting from Physical Noise

Quick read: This paper addresses the problem that traditional true random number generators (TRNGs) depend on expensive quantum devices or laboratory-grade RF receivers, limiting their use on resource-constrained platforms. The key is AI-Hybrid TRNG, a deep-learning framework that extracts near-uniform entropy directly from physical noise: it trains on a low-cost, thumb-sized RF front end combined with CPU-timing jitter and emits 32-bit high-entropy streams without a quantization step. A dynamic inner-outer network couples adaptive natural sources with reseeding, yielding unpredictable, autonomous sequences that satisfy cryptographic standards while remaining easy to deploy.

Link: https://arxiv.org/abs/2507.00145
Authors: Hasan Yiğit
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments:

Abstract:AI-Hybrid TRNG is a deep-learning framework that extracts near-uniform entropy directly from physical noise, eliminating the need for bulky quantum devices or expensive laboratory-grade RF receivers. Instead, it relies on a low-cost, thumb-sized RF front end, plus CPU-timing jitter, for training, and then emits 32-bit high-entropy streams without any quantization step. Unlike deterministic or trained artificial intelligence random number generators (RNGs), our dynamic inner-outer network couples adaptive natural sources and reseeding, yielding truly unpredictable and autonomous sequences. Generated numbers pass the NIST SP 800-22 battery better than a CPU-based method. It also passes nineteen bespoke statistical tests for both bit- and integer-level analysis. All results satisfy cryptographic standards, while forward and backward prediction experiments reveal no exploitable biases. The model’s footprint is below 0.5 MB, making it deployable on MCUs and FPGA soft cores, as well as suitable for other resource-constrained platforms. By detaching randomness quality from dedicated hardware, AI-Hybrid TRNG broadens the reach of high-integrity random number generators across secure systems, cryptographic protocols, embedded and edge devices, stochastic simulations, and server applications that need randomness.

[AI-51] Teaching Programming in the Age of Generative AI: Insights from Literature Pedagogical Proposals and Student Perspectives

Quick read: This paper asks how programming content, learning, and assessment should be redesigned in the context of generative AI. The key is to emphasize code comprehension and execution rather than mere coding or program functionality, thereby deepening students' understanding; concretely, the paper advocates visual representations of code and visual simulations of its execution as effective tools for teaching, learning, and assessment.

Link: https://arxiv.org/abs/2507.00108
Authors: Clemente Rubio-Manzano, Jazna Meza, Rodolfo Fernandez-Santibanez, Christian Vidal-Castro
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Programming Languages (cs.PL)
Comments:

Abstract:Computer programming is undergoing a true transformation driven by powerful new tools for automatic source code generation based on large language models. This transformation is also manifesting in introductory programming courses at universities around the world, generating an in-depth debate about how programming content should be taught, learned, and assessed in the context of generative artificial intelligence. This article aims, on the one hand, to review the most relevant studies on this issue, highlighting the advantages and disadvantages identified in the specialized literature. On the other hand, it proposes enriching teaching and learning methodologies by focusing on code comprehension and execution rather than on mere coding or program functionality. In particular, it advocates for the use of visual representations of code and visual simulations of its execution as effective tools for teaching, learning, and assessing programming, thus fostering a deeper understanding among students. Finally, the opinions of students who took the object-oriented programming course are presented to provide preliminary context supporting the incorporation of visual simulations in Java (or other languages) as part of the training process.

[AI-52] Towards transparent and data-driven fault detection in manufacturing: A case study on univariate discrete time series

Quick read: This paper targets consistent product quality in modern manufacturing, especially for safety-critical applications. Conventional quality control relies on manually defined thresholds and features and adapts poorly to the complexity and variability of production data, while data-driven methods, though accurate, usually act as black boxes and thus see limited adoption in industrial settings that demand interpretability. The key is a fault detection approach that is both data-driven and transparent: a supervised machine learning model for multi-class fault classification, Shapley Additive Explanations (SHAP) for post-hoc interpretability, and a domain-specific visualization technique that maps model explanations to operator-interpretable features, with the explanations and visualizations validated through quantitative perturbation analysis and qualitative expert assessment.

Link: https://arxiv.org/abs/2507.00102
Authors: Bernd Hofmann, Patrick Bruendl, Huong Giang Nguyen, Joerg Franke
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Ensuring consistent product quality in modern manufacturing is crucial, particularly in safety-critical applications. Conventional quality control approaches, reliant on manually defined thresholds and features, lack adaptability to the complexity and variability inherent in production data and necessitate extensive domain expertise. Conversely, data-driven methods, such as machine learning, demonstrate high detection performance but typically function as black-box models, thereby limiting their acceptance in industrial environments where interpretability is paramount. This paper introduces a methodology for industrial fault detection, which is both data-driven and transparent. The approach integrates a supervised machine learning model for multi-class fault classification, Shapley Additive Explanations for post-hoc interpretability, and a domain-specific visualisation technique that maps model explanations to operator-interpretable features. Furthermore, the study proposes an evaluation methodology that assesses model explanations through quantitative perturbation analysis and evaluates visualisations by qualitative expert assessment. The approach was applied to the crimping process, a safety-critical joining technique, using a dataset of univariate, discrete time series. The system achieves a fault detection accuracy of 95.9%, and both quantitative selectivity analysis and qualitative expert evaluations confirmed the relevance and interpretability of the generated explanations. This human-centric approach is designed to enhance trust and interpretability in data-driven fault detection, thereby contributing to applied system design in industrial quality control.

[AI-53] AI-Governed Agent Architecture for Web-Trustworthy Tokenization of Alternative Assets

Quick read: This paper addresses the challenge of ensuring trustworthiness in web-based tokenization ecosystems, particularly verifying off-chain asset data and enforcing regulatory compliance. The key is an AI-governed agent architecture that integrates intelligent agents with blockchain for web-trustworthy tokenization of alternative assets: autonomous agents orchestrate the tokenization process (asset verification, valuation, compliance checking, and lifecycle management), while an AI-driven governance layer monitors agent behavior and enforces trust through adaptive policies and cryptoeconomic incentives.

Link: https://arxiv.org/abs/2507.00096
Authors: Ailiya Borjigin, Wei Zhou, Cong He
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure

Abstract:Alternative Assets tokenization is transforming non-traditional financial instruments are represented and traded on the web. However, ensuring trustworthiness in web-based tokenized ecosystems poses significant challenges, from verifying off-chain asset data to enforcing regulatory compliance. This paper proposes an AI-governed agent architecture that integrates intelligent agents with blockchain to achieve web-trustworthy tokenization of alternative assets. In the proposed architecture, autonomous agents orchestrate the tokenization process (asset verification, valuation, compliance checking, and lifecycle management), while an AI-driven governance layer monitors agent behavior and enforces trust through adaptive policies and cryptoeconomic incentives. We demonstrate that this approach enhances transparency, security, and compliance in asset tokenization, addressing key concerns around data authenticity and fraud. A case study on tokenizing real estate assets illustrates how the architecture mitigates risks (e.g., fraudulent listings and money laundering) through real-time AI anomaly detection and on-chain enforcement. Our evaluation and analysis suggest that combining AI governance with multi-agent systems and blockchain can significantly bolster trust in tokenized asset ecosystems. This work offers a novel framework for trustworthy asset tokenization on the web and provides insights for practitioners aiming to deploy secure, compliant tokenization platforms.

[AI-54] Efficient Conformance Checking of Rich Data-Aware Declare Specifications (Extended)

Quick read: This paper addresses alignment-based conformance checking for data-aware declarative process models; existing approaches are limited to pure control-flow specifications or mild extensions supporting only numerical data and variable-to-constant comparisons. The key is a novel algorithmic technique that combines A* search and SMT solving to explore the search space efficiently, generating descendant states through repair actions that incrementally resolve constraint violations, thereby supporting much richer data dependencies while remaining efficient.

Link: https://arxiv.org/abs/2507.00094
Authors: Jacobo Casas-Ramos, Sarah Winkler, Alessandro Gianola, Marco Montali, Manuel Mucientes, Manuel Lama
Institutions: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: Extended version of the paper of the same title accepted at the 23rd International Conference on Business Process Management (BPM 2025)

Abstract:Despite growing interest in process analysis and mining for data-aware specifications, alignment-based conformance checking for declarative process models has focused on pure control-flow specifications, or mild data-aware extensions limited to numerical data and variable-to-constant comparisons. This is not surprising: finding alignments is computationally hard, even more so in the presence of data dependencies. In this paper, we challenge this problem in the case where the reference model is captured using data-aware Declare with general data types and data conditions. We show that, unexpectedly, it is possible to compute data-aware optimal alignments in this rich setting, enjoying at once efficiency and expressiveness. This is achieved by carefully combining the two best-known approaches to deal with control flow and data dependencies when computing alignments, namely A* search and SMT solving. Specifically, we introduce a novel algorithmic technique that efficiently explores the search space, generating descendant states through the application of repair actions aiming at incrementally resolving constraint violations. We prove the correctness of our algorithm and experimentally show its efficiency. The evaluation witnesses that our approach matches or surpasses the performance of the state of the art while also supporting significantly more expressive data dependencies, showcasing its potential to support real-world applications.
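The A* half of the combination can be shown as a compact alignment search over synchronous, log, and model moves with an admissible remaining-length heuristic. The SMT call that would check data conditions at each expansion is elided; the cost model and heuristic below are simplifying assumptions.

```python
import heapq

# A* alignment of an observed trace against a reference behavior, with
# unit-cost "log move" / "model move" edits and free synchronous moves.
# In the paper's setting each expansion would also consult an SMT solver
# for data conditions; that call is omitted in this sketch.

def astar_alignment(trace, model):
    h = lambda i, j: abs((len(trace) - i) - (len(model) - j))  # admissible lower bound
    frontier = [(h(0, 0), 0, 0, 0)]                            # (f, cost, i, j)
    seen = {}
    while frontier:
        f, g, i, j = heapq.heappop(frontier)
        if (i, j) == (len(trace), len(model)):
            return g
        if seen.get((i, j), float("inf")) <= g:
            continue
        seen[(i, j)] = g
        if i < len(trace) and j < len(model) and trace[i] == model[j]:
            heapq.heappush(frontier, (g + h(i + 1, j + 1), g, i + 1, j + 1))  # sync move
        if i < len(trace):
            heapq.heappush(frontier, (g + 1 + h(i + 1, j), g + 1, i + 1, j))  # log move
        if j < len(model):
            heapq.heappush(frontier, (g + 1 + h(i, j + 1), g + 1, i, j + 1))  # model move
    return None

print(astar_alignment(list("acbd"), list("abcd")))  # optimal alignment cost: 2
```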

[AI-55] σ-Maximal Ancestral Graphs UAI

Quick read: This paper addresses the inherent limitation that Maximal Ancestral Graphs (MAGs) cannot represent cyclic causal relationships, which restricts their use in complex causal modeling. The key is to introduce and study a class of graphical objects called "σ-Maximal Ancestral Graphs" (σ-MAGs), which abstractly represent (possibly cyclic) directed graphs (DGs) with latent (selection) variables, analogously to how MAGs represent DAGs, and to characterize their Markov equivalence classes.

Link: https://arxiv.org/abs/2507.00093
Authors: Binghua Yao, Joris M. Mooij
Institutions: Unknown
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)
Comments: Accepted by the 41st Conference on Uncertainty in Artificial Intelligence (UAI)

Abstract:Maximal Ancestral Graphs (MAGs) provide an abstract representation of Directed Acyclic Graphs (DAGs) with latent (selection) variables. These graphical objects encode information about ancestral relations and d-separations of the DAGs they represent. This abstract representation has been used amongst others to prove the soundness and completeness of the FCI algorithm for causal discovery, and to derive a do-calculus for its output. One significant inherent limitation of MAGs is that they rule out the possibility of cyclic causal relationships. In this work, we address that limitation. We introduce and study a class of graphical objects that we coin “σ-Maximal Ancestral Graphs” (“σ-MAGs”). We show how these graphs provide an abstract representation of (possibly cyclic) Directed Graphs (DGs) with latent (selection) variables, analogously to how MAGs represent DAGs. We study the properties of these objects and provide a characterization of their Markov equivalence classes.

[AI-56] Generating Heterogeneous Multi-dimensional Data : A Comparative Study

Quick read: This paper addresses the optimization of personnel and material resource allocation in firefighter interventions, which relies on simulating diverse scenarios and therefore on data generation for a globally optimized response. The key is a comparison of several data generation methods (random sampling, tabular variational autoencoders, standard generative adversarial networks, conditional tabular GANs, and diffusion probabilistic models), evaluated with a combination of domain-specific metrics and standard measures such as the Wasserstein distance, so as to assess synthetic data quality more accurately for highly imbalanced, non-Gaussian data.

Link: https://arxiv.org/abs/2507.00090
Authors: Corbeau Michael, Claeys Emmanuelle, Serrurier Mathieu, Zaraté Pascale
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: accepted at IEEE SMC 2025 Vienna

Abstract:Allocation of personnel and material resources is highly sensitive in the case of firefighter interventions. This allocation relies on simulations to experiment with various scenarios, with the main objective being the global optimization of the firefighter response; data generation is then mandatory to study such scenarios. In this study, we propose to compare different data generation methods. Methods such as Random Sampling, Tabular Variational Autoencoders, standard Generative Adversarial Networks, Conditional Tabular Generative Adversarial Networks, and Diffusion Probabilistic Models are examined to ascertain their efficacy in capturing the intricacies of firefighter interventions. Traditional evaluation metrics often fall short in capturing the nuanced requirements of synthetic datasets for real-world scenarios. To address this gap, an evaluation of synthetic data quality is conducted using a combination of domain-specific metrics tailored to the firefighting domain and standard measures such as the Wasserstein distance. Domain-specific metrics include response time distribution, spatial-temporal distribution of interventions, and accident representation. These metrics are designed to assess data variability, the preservation of fine and complex correlations, anomalies such as events with very low occurrence, conformity with the initial statistical distribution, and the operational relevance of the synthetic data. The distribution has the particularity of being highly unbalanced, with none of the variables following a Gaussian distribution, adding complexity to the data generation process.
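Among the standard measures mentioned, the one-dimensional Wasserstein distance is directly available in SciPy. The snippet below compares a skewed "real" response-time distribution against two synthetic ones; the lognormal/normal setup is an invented illustration of why distribution-aware metrics matter for non-Gaussian data.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Compare real vs. synthetic response times with the 1-D Wasserstein
# distance. The "poor" generator matches mean and std but not the shape,
# which the Wasserstein distance penalizes.

rng = np.random.default_rng(0)
real = rng.lognormal(mean=2.0, sigma=0.6, size=5000)             # skewed "real" data
synthetic_good = rng.lognormal(mean=2.0, sigma=0.65, size=5000)  # close in shape
synthetic_poor = rng.normal(loc=real.mean(), scale=real.std(), size=5000)

print("good generator:", round(wasserstein_distance(real, synthetic_good), 3))
print("poor generator:", round(wasserstein_distance(real, synthetic_poor), 3))
```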

[AI-57] pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation

Quick read: This paper addresses the limitation that traditional deep learning models for mass spectrometry act as feature extractors rather than unified scoring frameworks. The key is pUniFind, the first large-scale multimodal pre-trained model in proteomics, which integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing; by aligning spectral and peptide modalities through cross-modality prediction, it outperforms traditional engines across diverse datasets, notably identifying 42.6 percent more peptides in immunopeptidomics.

Link: https://arxiv.org/abs/2507.00087
Authors: Jiale Zhao, Pengzhi Mao, Kaifei Wang, Yiming Li, Yaping Peng, Ranfei Chen, Shuqi Lu, Xiaohong Ji, Jiaxiang Ding, Xin Zhang, Yucheng Liao, Weinan E, Weijie Zhang, Han Wen, Hao Chi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million open search-derived spectra, pUniFind aligns spectral and peptide modalities via cross modality prediction and outperforms traditional engines across diverse datasets, particularly achieving a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep learning based quality control module further recovers 38.5 percent additional peptides including 1,891 mapped to the genome but absent from reference proteomes while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.

[AI-58] A Joint Topology-Data Fusion Graph Network for Robust Traffic Speed Prediction with Data Anomalism

Quick read: This paper addresses two difficulties in traffic speed prediction: fusing spatial and temporal features under the inherent complexity and non-linearity of traffic dynamics, and the poor adaptability and smoothing of existing methods that handle non-stationary and anomalous historical data with static techniques. The key is the Graph Fusion Enhanced Network (GFEN), whose core is a novel topological spatiotemporal graph fusion technique that extracts and merges spatial and temporal correlations from both the data distribution and the network topology using trainable methods, enabling multi-scale spatiotemporal feature modeling; it further combines a k-th order difference-based mathematical framework with an attention-based deep learning structure to adaptively smooth historical observations and dynamically mitigate data anomalies and non-stationarity.

Link: https://arxiv.org/abs/2507.00085
Authors: Ruiyuan Jiang, Dongyao Jia, Eng Gee Lim, Pengfei Fan, Yuli Zhang, Shangbo Wang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate traffic prediction is essential for Intelligent Transportation Systems (ITS), yet current methods struggle with the inherent complexity and non-linearity of traffic dynamics, making it difficult to integrate spatial and temporal characteristics. Furthermore, existing approaches use static techniques to address non-stationary and anomalous historical data, which limits adaptability and undermines data smoothing. To overcome these challenges, we propose the Graph Fusion Enhanced Network (GFEN), an innovative framework for network-level traffic speed prediction. GFEN introduces a novel topological spatiotemporal graph fusion technique that meticulously extracts and merges spatial and temporal correlations from both data distribution and network topology using trainable methods, enabling the modeling of multi-scale spatiotemporal features. Additionally, GFEN employs a hybrid methodology combining a k-th order difference-based mathematical framework with an attention-based deep learning structure to adaptively smooth historical observations and dynamically mitigate data anomalies and non-stationarity. Extensive experiments demonstrate that GFEN surpasses state-of-the-art methods by approximately 6.3% in prediction accuracy and exhibits convergence rates nearly twice as fast as recent hybrid models, confirming its superior performance and potential to significantly enhance traffic prediction system efficiency.
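The k-th order differencing component can be illustrated in isolation: differencing a trending series k times strips the polynomial trend and leaves a closer-to-stationary residual. The quadratic-trend series below is an invented example; GFEN's contribution is learning when and how strongly to apply such corrections.

```python
import numpy as np

# k-th order differencing removes polynomial trends: a quadratic trend
# vanishes after two differences, leaving a near-stationary noise residual.

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
speed = 60 - 0.01 * t**2 + rng.normal(0, 0.5, t.size)  # quadratic trend + noise

for k in range(3):
    dk = np.diff(speed, n=k)  # n=0 returns the series unchanged
    print(f"order {k}: mean={dk.mean():+.3f}, std={dk.std():.3f}")
```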

[AI-59] Strategic Counterfactual Modeling of Deep-Target Airstrike Systems via Intervention-Aware Spatio-Causal Graph Networks

Quick read: This paper addresses the lack of structured causal modeling between tactical strike behavior and strategic delay in current strategic-level simulations, in particular the structural bottleneck in capturing intermediate variables within the "resilience - nodal suppression - negotiation window" chain. The key is the Intervention-Aware Spatio-Temporal Graph Neural Network (IA-STGNN), which integrates graph attention mechanisms, counterfactual simulation units, and spatial intervention node reconstruction to close the causal loop from tactical inputs to strategic delay outputs, supporting dynamic simulation of strike configurations and synchronization strategies.

Link: https://arxiv.org/abs/2507.00083
Authors: Wei Meng
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper proposes the first closed-loop causal modeling framework (IA-STGNN) that links tactical strike variables to strategic delay outcomes via graph neural networks with counterfactual reasoning

Abstract:This study addresses the lack of structured causal modeling between tactical strike behavior and strategic delay in current strategic-level simulations, particularly the structural bottlenecks in capturing intermediate variables within the “resilience - nodal suppression - negotiation window” chain. We propose the Intervention-Aware Spatio-Temporal Graph Neural Network (IA-STGNN), a novel framework that closes the causal loop from tactical input to strategic delay output. The model integrates graph attention mechanisms, counterfactual simulation units, and spatial intervention node reconstruction to enable dynamic simulations of strike configurations and synchronization strategies. Training data are generated from a multi-physics simulation platform (GEANT4 + COMSOL) under NIST SP 800-160 standards, ensuring structural traceability and policy-level validation. Experimental results demonstrate that IA-STGNN significantly outperforms baseline models (ST-GNN, GCN-LSTM, XGBoost), achieving a 12.8 percent reduction in MAE and 18.4 percent increase in Top-5 percent accuracy, while improving causal path consistency and intervention stability. IA-STGNN enables interpretable prediction of strategic delay and supports applications such as nuclear deterrence simulation, diplomatic window assessment, and multi-strategy optimization, providing a structured and transparent AI decision-support mechanism for high-level policy modeling.

[AI-60] VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Quick read: This paper investigates how visual inputs can improve a model's understanding of spatial environments and thereby extend its open-ended task-completion potential. The key is VoyagerVision, a multimodal model that uses screenshots as visual feedback to build structures in Minecraft; compared with the text-only Voyager it builds on, VoyagerVision substantially improves performance on complex tasks by incorporating visual information.

Link: https://arxiv.org/abs/2507.00079
Authors: Ethan Smyth, Alessandro Suglia
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: website: this https URL

Abstract:Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have allowed such models to be capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent’s POV to parse the environment and allow it to solve tasks. This paper proposes that providing these visual inputs to a model gives it greater ability to interpret spatial environments, and as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this aim, this paper proposes VoyagerVision – a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision was capable of creating an average of 2.75 unique structures within fifty iterations of the system, as Voyager was incapable of this, it is an extension in an entirely new direction. Additionally, in a set of building unit tests VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. Project website is available at this https URL

[AI-61] Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Quick read: This paper addresses the poorly understood mechanism by which large language model (LLM) performance evolves during self-improvement. The key is to theoretically model the training dynamics of self-improvement via the solver-verifier gap and, based on this framework, to predict the ultimate power of self-improvement using only information from the first few training epochs, providing both theory and a practical tool for understanding LLM self-optimization without external data.

Link: https://arxiv.org/abs/2507.00075
Authors: Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages

Abstract:Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM’s solver capability and verifier capability. Based on the theoretical framework, we further introduce how to predict the ultimate power of self-improvement using only information from the first few training epochs. We empirically validate the effectiveness of the theoretical model on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.
zh

[AI-62] InSight-R: A Framework for Risk-informed Human Failure Event Identification and Interface-Induced Risk Assessment Driven by AutoGraph

【速读】:该论文旨在解决传统人类可靠性分析(Human Reliability Analysis, HRA)方法在安全关键领域中对专家判断的高度依赖所带来的可重复性差、主观性强以及界面级数据整合不足的问题。其解决方案的关键在于提出一种基于AutoGraph(InSight-R)的框架,通过将实证行为数据与由自动化图执行框架构建的界面嵌入知识图(Interface-Embedded Knowledge Graph, IE-KG)相连接,实现基于易出错和时间偏差操作路径的自动化人类故障事件(Human Failure Event, HFE)识别,并评估界面设计对操作员性能变异性和错误易感性的影响。

链接: https://arxiv.org/abs/2507.00066
作者: Xingyu Xiao,Jiejuan Tong,Peng Chen,Jun Sun,Zhe Sui,Jingang Liang,Hongru Zhao,Jun Zhao,Haitao Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces challenges related to reproducibility, subjectivity, and limited integration of interface-level data. In particular, current approaches lack the capacity to rigorously assess how human-machine interface design contributes to operator performance variability and error susceptibility. To address these limitations, this study proposes a framework for risk-informed human failure event identification and interface-induced risk assessment driven by AutoGraph (InSight-R). By linking empirical behavioral data to the interface-embedded knowledge graph (IE-KG) constructed by the automated graph-based execution framework (AutoGraph), the InSight-R framework enables automated HFE identification based on both error-prone and time-deviated operational paths. Furthermore, we discuss the relationship between designer-user conflicts and human error. The results demonstrate that InSight-R not only enhances the objectivity and interpretability of HFE identification but also provides a scalable pathway toward dynamic, real-time human reliability assessment in digitalized control environments. This framework offers actionable insights for interface design optimization and contributes to the advancement of mechanism-driven HRA methodologies.
zh

[AI-63] Smooth-Distill: A Self-distillation Framework for Multitask Learning with Wearable Sensor Data

【速读】:该论文旨在解决人体活动识别(Human Activity Recognition, HAR)与传感器放置检测的联合任务问题,同时降低模型训练的计算成本。其解决方案的关键在于提出了一种名为Smooth-Distill的自蒸馏框架,该框架采用统一的卷积神经网络(CNN)架构MTL-net,通过使用自身平滑的历史版本作为教师模型,而非传统的独立教师和学生模型结构,从而在保持性能优势的同时显著减少了训练计算开销。

链接: https://arxiv.org/abs/2507.00061
作者: Hoang-Dieu Vu,Duc-Nghia Tran,Quang-Tu Pham,Hieu H. Pham,Nicolas Vuillerme,Duc-Tan Tran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:This paper introduces Smooth-Distill, a novel self-distillation framework designed to simultaneously perform human activity recognition (HAR) and sensor placement detection using wearable sensor data. The proposed approach utilizes a unified CNN-based architecture, MTL-net, which processes accelerometer data and branches into two outputs for each respective task. Unlike conventional distillation methods that require separate teacher and student models, the proposed framework utilizes a smoothed, historical version of the model itself as the teacher, significantly reducing training computational overhead while maintaining performance benefits. To support this research, we developed a comprehensive accelerometer-based dataset capturing 12 distinct sleep postures across three different wearing positions, complementing two existing public datasets (MHealth and WISDM). Experimental results show that Smooth-Distill consistently outperforms alternative approaches across different evaluation scenarios, achieving notable improvements in both human activity recognition and device placement detection tasks. This method demonstrates enhanced stability in convergence patterns during training and exhibits reduced overfitting compared to traditional multitask learning baselines. This framework contributes to the practical implementation of knowledge distillation in human activity recognition systems, offering an effective solution for multitask learning with accelerometer data that balances accuracy and training efficiency. More broadly, it reduces the computational cost of model training, which is critical for scenarios requiring frequent model updates or training on resource-constrained platforms. The code and model are available at this https URL_distill.
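
The teacher-free trick can be sketched in a few lines of PyTorch. Below, the teacher is an exponential moving average (EMA) of the student's own weights; the decay rate, temperature, loss mix, and single-head setup are illustrative assumptions rather than the paper's exact configuration (which uses the two-branch MTL-net):

```python
import copy
import torch
import torch.nn.functional as F

# Minimal sketch of self-distillation with a "smoothed historical" teacher:
# the teacher is an EMA copy of the student, so no separate teacher network
# has to be trained or stored beforehand.
student = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 12))   # e.g. 12 sleep postures
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
tau, T, alpha = 0.99, 2.0, 0.5   # EMA rate, distillation temperature, loss mix

def train_step(x, y):
    logits = student(x)
    with torch.no_grad():
        t_logits = teacher(x)
    # Soft-target distillation loss against the EMA teacher.
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    loss = alpha * F.cross_entropy(logits, y) + (1 - alpha) * kd
    opt.zero_grad(); loss.backward(); opt.step()
    # Smooth the teacher toward the student's updated weights.
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(tau).add_(sp, alpha=1 - tau)
    return loss.item()

x = torch.randn(8, 64); y = torch.randint(0, 12, (8,))
print(train_step(x, y))
```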
zh

[AI-64] Estimating Correctness Without Oracles in LLM -Based Code Generation

【速读】:该论文试图解决在缺乏正确实现(即“oracle”)的情况下,如何量化大型语言模型(Large Language Models, LLMs)生成的程序的正确性问题。其解决方案的关键是提出了一种称为“incoherence”(不一致度)的错误度量方法,该方法能够在没有oracle的情况下高效估计程序的错误概率,并提供一个错误的下限,即LLM生成的程序不符合规范的概率。实验表明,基于incoherence的方法能够有效识别大部分错误程序,且与基于oracle的评估具有高度一致性。

链接: https://arxiv.org/abs/2507.00057
作者: Thomas Valentin,Ardi Madadi,Gaetano Sapia,Marcel Böhme
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 8 pages + refs and appendix

点击查看摘要

Abstract:Generating code from natural language specifications is one of the most successful applications of Large Language Models (LLMs). Yet, they hallucinate: LLMs produce outputs that may be grammatically correct but are factually incorrect. Without an existing, correct implementation (i.e., an oracle), can we quantify how likely the generated program is correct? In this paper, we propose a measure of incorrectness, called incoherence, that can be estimated efficiently in the absence of an oracle and provides a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. Our experiments demonstrate an extraordinary effectiveness. For the average code generation task, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives. In fact, an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via our incoherence.
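
A rough sketch of how such an oracle-free signal can be computed (this is our reading of the general idea, not the paper's exact estimator): sample several candidate programs for the same specification, execute them on random inputs, and measure how often two independently drawn candidates disagree:

```python
import random
import itertools

# Hedged sketch: the three "candidates" below stand in for LLM samples for the
# spec "sort a list in descending order"; one of them is a hallucinated variant.
def candidate_v1(xs): return sorted(xs, reverse=True)
def candidate_v2(xs): return sorted(xs)[::-1]
def candidate_buggy(xs): return sorted(xs)      # wrong: ascending order
candidates = [candidate_v1, candidate_v2, candidate_buggy]

def incoherence(programs, n_inputs=200, seed=0):
    """Estimated probability that two sampled programs disagree on a random input."""
    rng = random.Random(seed)
    disagree, total = 0, 0
    for _ in range(n_inputs):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        outs = [p(list(xs)) for p in programs]
        for a, b in itertools.combinations(outs, 2):
            total += 1
            disagree += (a != b)
    return disagree / total

# A nonzero value lower-bounds the chance that a sampled program is incorrect.
print(f"incoherence = {incoherence(candidates):.3f}")
```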
zh

[AI-65] SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network

【速读】:该论文试图解决传统基于惯性测量单元(Inertial Measurement Unit, IMU)的人类活动识别(Human Activity Recognition, HAR)在实际应用场景中因缺乏全面的数据集和模型透明度不足而受到的限制。其解决方案的关键在于提出一种新型的零样本HAR模型——自解释零样本人类活动识别网络(Self-Explainable Zero-shot Human Activity Recognition Network, SEZ-HARN),该模型不仅能够识别训练过程中未见过的活动,还能通过生成骨架视频来解释其决策过程,从而提升模型的可解释性和实用性。

链接: https://arxiv.org/abs/2507.00050
作者: Devin Y. De Silva,Sandareka Wickramanayake,Dulani Meedeniya,Sanka Rasnayaka
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR), which uses data from Inertial Measurement Unit (IMU) sensors, has many practical applications in healthcare and assisted living environments. However, its use in real-world scenarios has been limited by the lack of comprehensive IMU-based HAR datasets that cover a wide range of activities and the lack of transparency in existing HAR models. Zero-shot HAR (ZS-HAR) overcomes the data limitations, but current models struggle to explain their decisions, making them less transparent. This paper introduces a novel IMU-based ZS-HAR model called the Self-Explainable Zero-shot Human Activity Recognition Network (SEZ-HARN). It can recognize activities not encountered during training and provide skeleton videos to explain its decision-making process. We evaluate the effectiveness of the proposed SEZ-HARN on four benchmark datasets (PAMAP2, DaLiAc, UTD-MHAD, and MHealth) and compare its performance against three state-of-the-art black-box ZS-HAR models. The experimental results demonstrate that SEZ-HARN produces realistic and understandable explanations while achieving competitive zero-shot recognition accuracy. SEZ-HARN achieves a zero-shot prediction accuracy within 3% of the best-performing black-box model on PAMAP2 while maintaining comparable performance on the other three datasets.
zh

[AI-66] A collaborative digital twin built on FAIR data and compute infrastructure

【速读】:该论文试图解决科学与工程应用中发现和优化任务的效率问题,通过将机器学习与自动化实验结合,构建自驱动实验室(Self-Driving Laboratory, SDL),以加速研究进程。其解决方案的关键在于基于nanoHUB服务的分布式SDL实现,结合了可发现、可访问、可互操作和可重用(FAIR)的数据基础设施,使得地理位置分散的研究者能够共享实验数据,并利用自动更新的分析工具和机器学习模型进行协同优化。此外,通过引入Sim2L工具和主动学习策略,实现了数据的自动处理与序列化优化,从而有效提升实验效率和结果准确性。

链接: https://arxiv.org/abs/2507.00048
作者: Thomas M. Deucher,Juan C. Verduzco,Michael Titus,Alejandro Strachan
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:The integration of machine learning with automated experimentation in self-driving laboratories (SDL) offers a powerful approach to accelerate discovery and optimization tasks in science and engineering applications. When supported by findable, accessible, interoperable, and reusable (FAIR) data infrastructure, SDLs with overlapping interests can collaborate more effectively. This work presents a distributed SDL implementation built on nanoHUB services for online simulation and FAIR data management. In this framework, geographically dispersed collaborators conducting independent optimization tasks contribute raw experimental data to a shared central database. These researchers can then benefit from analysis tools and machine learning models that automatically update as additional data become available. New data points are submitted through a simple web interface and automatically processed using a nanoHUB Sim2L, which extracts derived quantities and indexes all inputs and outputs in a FAIR data repository called ResultsDB. A separate nanoHUB workflow enables sequential optimization using active learning, where researchers define the optimization objective, and machine learning models are trained on-the-fly with all existing data, guiding the selection of future experiments. Inspired by the concept of the “frugal twin”, the optimization task seeks to find the optimal recipe to combine food dyes to achieve the desired target color. With easily accessible and inexpensive materials, researchers and students can set up their own experiments, share data with collaborators, and explore the combination of FAIR data, predictive ML models, and sequential optimization. The tools introduced are generally applicable and can easily be extended to other optimization problems.
zh

[AI-67] Pattern-Based Graph Classification: Comparison of Quality Measures and Importance of Preprocessing

【速读】:该论文旨在解决图分类任务中质量度量选择困难的问题,即现有文献中存在大量用于评估模式判别能力的质量度量,但缺乏针对图数据的系统性比较与分析,导致研究者通常仅依赖广泛使用的度量而未进行充分评估。论文的关键解决方案是通过理论分析和实证比较对38种质量度量进行系统评估,并提出基于聚类的预处理步骤,以提升分类性能并减少需处理的模式数量。

链接: https://arxiv.org/abs/2507.00039
作者: Lucas Potin,Rosa Figueiredo,Vincent Labatut,Christine Largeron
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph classification aims to categorize graphs based on their structural and attribute features, with applications in diverse fields such as social network analysis and bioinformatics. Among the methods proposed to solve this task, those relying on patterns (i.e. subgraphs) provide good explainability, as the patterns used for classification can be directly interpreted. To identify meaningful patterns, a standard approach is to use a quality measure, i.e. a function that evaluates the discriminative power of each pattern. However, the literature provides tens of such measures, making it difficult to select the most appropriate for a given application. Only a handful of surveys try to provide some insight by comparing these measures, and none of them specifically focuses on graphs. This typically results in the systematic use of the most widespread measures, without thorough evaluation. To address this issue, we present a comparative analysis of 38 quality measures from the literature. We characterize them theoretically, based on four mathematical properties. We leverage publicly available datasets to constitute a benchmark, and propose a method to elaborate a gold standard ranking of the patterns. We exploit these resources to perform an empirical comparison of the measures, both in terms of pattern ranking and classification performance. Moreover, we propose a clustering-based preprocessing step, which groups patterns appearing in the same graphs to enhance classification performance. Our experimental results demonstrate the effectiveness of this step, reducing the number of patterns to be processed while achieving comparable performance. Additionally, we show that some popular measures widely used in the literature are not associated with the best results.
zh

[AI-68] Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

【速读】:该论文旨在解决数据驱动型人工智能中数据约简的问题,即如何从大规模数据集中选择最具信息量的实例以提升模型训练效率和数据质量。其解决方案的关键在于提出一种基于点对点V信息(Pointwise V-information, PVI)的数据约简策略,通过量化实例难度并过滤低难度实例,实现静态数据约简;同时采用渐进式学习方法,按PVI升序排序的实例进行训练,从而加速收敛并提升模型性能。

链接: https://arxiv.org/abs/2507.00038
作者: Fei Chen,Wenchi Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data reduction plays a vital role in data-centric AI by identifying the most informative instances within large-scale datasets to enhance model training efficiency. The core challenge lies in how to select the optimal instances, rather than the entire dataset, to improve data quality and training efficiency. In this paper, we propose an effective data reduction strategy based on Pointwise V-information (PVI). First, we quantify instance difficulty using PVI and filter out low-difficulty instances, enabling a static approach. Experiments demonstrate that removing 10%-30% of the data preserves the classifier performance with only a 0.0001% to 0.76% loss in accuracy. Second, we use a progressive learning approach to train the classifiers on instances sorted by ascending PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our results suggest that with the effective data reduction strategy, training a classifier on the selected optimal subset could enhance the model performance and boost training efficiency. Moreover, we have transferred the PVI framework, which previously applied only to English datasets, to diverse Chinese NLP tasks and base models, leading to valuable insights for cross-lingual data reduction and faster training. The code is released at this https URL.
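
A minimal sketch of PVI-based filtering, following the standard pointwise V-information definition (the models, synthetic data, and cutoff below are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# PVI(x -> y) = log2 p_g(y|x) - log2 p_null(y), where p_g is a model fit on
# the inputs and p_null is a label-prior "empty input" model. High-PVI
# instances are the easy (low-difficulty) ones, which the static filter drops.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)); w = rng.normal(size=5)
y = (X @ w + 0.5 * rng.normal(size=500) > 0).astype(int)

g = LogisticRegression().fit(X, y)
p_cond = g.predict_proba(X)[np.arange(len(y)), y]   # p(y_i | x_i)
prior = np.bincount(y) / len(y)
pvi = np.log2(p_cond) - np.log2(prior[y])           # per-instance PVI

keep = pvi < np.quantile(pvi, 0.8)   # drop the 20% easiest (highest-PVI) instances
print(f"kept {keep.sum()}/{len(y)} instances; "
      f"mean PVI removed {pvi[~keep].mean():.2f}, kept {pvi[keep].mean():.2f}")
```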
zh

[AI-69] Model Fusion via Neuron Interpolation

【速读】:该论文试图解决多模型融合过程中由于内部表示差异导致的非平凡问题,这些差异可能源于排列不变性、随机初始化或不同分布的训练数据。解决方案的关键在于提出一种以神经元为中心的模型融合算法家族,通过将父模型的中间神经元分组以创建目标表示,并利用对应的子网络近似这些表示,同时在融合过程中引入神经元归因分数,从而有效整合多个训练好的神经网络,且不依赖于训练数据分布。

链接: https://arxiv.org/abs/2507.00037
作者: Phoomraphee Luenam,Andreas Spanopoulos,Amit Sant,Thomas Hofmann,Sotiris Anagnostidis,Sidak Pal Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 figures, 15 tables, 23 pages

点击查看摘要

Abstract:Model fusion aims to combine the knowledge of multiple models by creating one representative model that captures the strengths of all of its parents. However, this process is non-trivial due to differences in internal representations, which can stem from permutation invariance, random initialization, or differently distributed training data. We present a novel, neuron-centric family of model fusion algorithms designed to integrate multiple trained neural networks into a single network effectively regardless of training data distribution. Our algorithms group intermediate neurons of parent models to create target representations that the fused model approximates with its corresponding sub-network. Unlike prior approaches, our approach incorporates neuron attribution scores into the fusion process. Furthermore, our algorithms can generalize to arbitrary layer types. Experimental results on various benchmark datasets demonstrate that our algorithms consistently outperform previous fusion techniques, particularly in zero-shot and non-IID fusion scenarios. The code is available at this https URL.
zh

[AI-70] Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Knowledge Tracing

【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)中的个性化学习问题,特别是在大规模场景下实现高效、灵活且具备持续适应能力的学生建模。其解决方案的关键在于引入KUL-KT架构,该架构结合了赫布记忆编码(Hebbian memory encoding)与基于梯度的巩固机制,在一个可扩展、输入无关的框架中实现了自然遗忘与无反向传播的时间连续学习。核心创新包括时间衰减的赫布记忆更新机制和损失对齐内部目标(Loss-aligned Internal Target, LIT)方法,从而支持少样本个性化和无需存储原始数据的学习过程。

链接: https://arxiv.org/abs/2507.00032
作者: Grey Kuling,Marinka Zitnik
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We introduce KUL-KT, a biologically inspired architecture for knowledge tracing (KT), combining Hebbian memory encoding with gradient-based consolidation in a scalable, input-agnostic framework. KUL-KT adapts the principle of memory consolidation in neural systems to student modeling by introducing two key innovations: (i) a time-decaying Hebbian memory update that enables graceful forgetting, and (ii) a novel Loss-aligned Internal Target (LIT) method to compute an ideal internal state, allowing continual learning without backpropagation through time. The architecture consists of a fast Hebbian memory that captures each learner interaction via a single associative update, and a slower linear network that consolidates recalled samples through gradient descent. This design enables few-shot personalization and natural forgetting without storing raw data or relying on large cohort training. Operating entirely in embedding space, KUL-KT supports both structured (tabular) and unstructured (short-answer) inputs. Empirically, KUL-KT outperforms strong baselines on ten public KT benchmarks in rank-sensitive metrics such as nDCG and Recall@10. In a classroom deployment, KUL-KT personalized quizzes from short-answer data, leading to improved learner-perceived helpfulness and reduced difficulty (p < 0.05). Ablation studies confirm that Hebbian decay and LIT are critical for continual adaptation. Compared to a strong graph-based KT model, KUL-KT trains 1.75x faster and uses 99.01% less memory. These results position KUL-KT as a biologically grounded, memory-efficient, and input-flexible framework for personalized learning at scale.
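
The time-decaying Hebbian update at the heart of the fast memory can be illustrated with a toy associative store (the dimensions, decay rate, and linear readout are our assumptions for illustration):

```python
import numpy as np

# Each learner interaction writes one outer-product association into M;
# older traces fade geometrically, giving graceful forgetting without
# storing any raw interaction data.
d_in, d_out, decay = 16, 8, 0.9
M = np.zeros((d_out, d_in))
rng = np.random.default_rng(0)

def write(M, x, y, eta=1.0):
    """One associative update per interaction: M <- decay*M + eta * y x^T."""
    return decay * M + eta * np.outer(y, x)

def recall(M, x):
    """Linear readout of the consolidated trace for cue x."""
    return M @ x

for _ in range(50):                      # stream of interactions
    x = rng.normal(size=d_in); x /= np.linalg.norm(x)
    y = rng.normal(size=d_out)
    M = write(M, x, y)

# The most recent association is recalled almost exactly; older ones decay.
print("recall:", np.round(recall(M, x), 2))
print("target:", np.round(y, 2))
```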
zh

[AI-71] Adaptive Action Duration with Contextual Bandits for Deep Reinforcement Learning in Dynamic Environments

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中动作执行时间尺度(temporal scale of action execution)这一关键但尚未得到充分研究的问题。其解决方案的关键在于将上下文关联的多臂老虎机(contextual bandits)与DRL相结合,以自适应地选择动作持续时间,从而提升策略的灵活性和计算效率。具体而言,该方法通过在深度Q网络(Deep Q-Network, DQN)中引入一个上下文关联模块,学习根据状态上下文选择最优的动作重复率,实验结果表明该方法在Atari 2600游戏上显著优于静态持续时间基线,验证了自适应时间抽象在DRL中的有效性。

链接: https://arxiv.org/abs/2507.00030
作者: Abhishek Verma,Nallarasan V,Balaraman Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) has achieved remarkable success in complex sequential decision-making tasks, such as playing Atari 2600 games and mastering board games. A critical yet underexplored aspect of DRL is the temporal scale of action execution. We propose a novel paradigm that integrates contextual bandits with DRL to adaptively select action durations, enhancing policy flexibility and computational efficiency. Our approach augments a Deep Q-Network (DQN) with a contextual bandit module that learns to choose optimal action repetition rates based on state contexts. Experiments on Atari 2600 games demonstrate significant performance improvements over static duration baselines, highlighting the efficacy of adaptive temporal abstractions in DRL. This paradigm offers a scalable solution for real-time applications like gaming and robotics, where dynamic action durations are critical.
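
A minimal sketch of the duration-selection idea, using a linear epsilon-greedy contextual bandit over candidate action-repeat rates (the paper's exact bandit algorithm, features, and reward shaping are not specified here and are assumptions):

```python
import numpy as np

# Alongside the DQN's action choice, a contextual bandit picks how many
# frames to repeat that action, conditioned on state features.
durations = [1, 2, 4, 8]                 # candidate action-repeat rates (arms)
d = 6                                    # dimensionality of the state context
W = np.zeros((len(durations), d))        # one linear value model per arm
lr, eps = 0.1, 0.1
rng = np.random.default_rng(0)

def select_duration(context):
    if rng.random() < eps:               # explore
        return int(rng.integers(len(durations)))
    return int(np.argmax(W @ context))   # exploit learned arm values

def update(arm, context, reward):
    # SGD step on squared error between predicted and observed return.
    pred = W[arm] @ context
    W[arm] += lr * (reward - pred) * context

for _ in range(1000):                    # dummy environment loop
    ctx = rng.normal(size=d)
    arm = select_duration(ctx)
    # Pretend longer repeats help when ctx[0] > 0 (stable scenes), else hurt.
    reward = np.sign(ctx[0]) * np.log2(durations[arm]) + rng.normal(0, 0.1)
    update(arm, ctx, reward)

print("learned values (feature 0) per arm:", np.round(W[:, 0], 2))
```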
zh

[AI-72] LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

【速读】:该论文试图解决将低秩适配(LoRA)与专家混合(MoE)结合以适应大型语言模型(LLMs)到多任务时存在的参数效率低下和任务保真度不足的问题。其解决方案的关键在于提出LoRA-Mixer框架,该框架通过动态路由的、任务特定的LoRA专家替换注意力模块输入/输出线性层的投影矩阵,从而实现与多种基础模型的无缝兼容,并通过引入自适应专业化平衡损失(SBL)提升路由训练的稳定性和专家复用率。

链接: https://arxiv.org/abs/2507.00029
作者: Wenbing Li,Zikai Song,Hang Zhou,Yunyao Zhang,Junqing Yu,Wei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent efforts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for adapting large language models (LLMs) to multiple tasks still exhibit prevailing limitations: they either swap entire attention/feed-forward layers for switch experts or bolt on parallel expert branches, diluting parameter efficiency and task fidelity. We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts. Our core innovation lies in replacing the projection matrices of the attention module’s input/output linear layers with dynamically routed, task-specific LoRA experts. This design ensures seamless compatibility with diverse foundation models, including transformers and state space models (SSMs), by leveraging their inherent linear projection structures. The framework supports two operational paradigms: (1) joint optimization of LoRA experts and routing mechanisms via a novel hard-soft routing strategy, or (2) direct deployment of pre-trained, frozen LoRA modules sourced from external repositories. To enable robust router training with limited data while ensuring stable routing decisions and maximizing expert reuse, we introduce an adaptive Specialization Balance Loss (SBL) that jointly optimizes expert balance and task-specific alignment. Extensive experiments on seven benchmark datasets, including MedQA, CoLA, SST-2, GSM8K, ARC-E, ARC-C, and HumanEval, demonstrate the effectiveness of LoRA-Mixer. On datasets such as GSM8K, HumanEval, and MedQA, LoRA-Mixer achieves significant improvements of 7.61%, 4.88%, and 3.08% over the base models, respectively. Compared with state-of-the-art methods, LoRA-Mixer achieves additional improvements of 1.09%, 1.45%, and 1.68%, respectively, using only 48% of the parameters, demonstrating its efficiency and strong performance.
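
The core layer design can be sketched as a frozen linear projection augmented with several LoRA experts whose low-rank updates are mixed by a per-token router (the ranks, routing input, and hard-soft schedule below are simplified assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMixerLinear(nn.Module):
    """Frozen base projection + softmax-routed mixture of LoRA experts."""
    def __init__(self, d_model=64, r=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained W
        self.A = nn.Parameter(torch.randn(n_experts, r, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_model, r))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                  # x: (batch, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (batch, n_experts)
        # Per-expert LoRA update B_e A_e x, then mix by routing weights.
        low = torch.einsum("erd,bd->ber", self.A, x)       # (batch, e, r)
        upd = torch.einsum("edr,ber->bed", self.B, low)    # (batch, e, d)
        mixed = (gates.unsqueeze(-1) * upd).sum(dim=1)
        return self.base(x) + mixed

layer = LoRAMixerLinear()
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```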
zh

[AI-73] Generalizing to New Dynamical Systems via Frequency Domain Adaptation

【速读】:该论文试图解决深度神经网络在特定领域内进行可靠预测以及在由相同一般动力学但环境特征不同的未见过系统中泛化能力不足的问题。解决方案的关键在于提出一种参数高效的傅里叶神经模拟器用于动态适应(Fourier Neural Simulator for Dynamical Adaptation, FNSDA),该方法通过在傅里叶空间中进行适应,识别可共享的动力学,并基于已知环境的傅里叶模式自动划分,利用低维潜在系统参数对每个新环境特有的模式进行调整,从而实现高效泛化。

链接: https://arxiv.org/abs/2507.00025
作者: Tiexin Qin,Hong Yan,Haoliang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted by TPAMI 2025

点击查看摘要

Abstract:Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at this https URL.
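
A toy version of adaptation in Fourier space (the fixed mode partition and latent conditioning below are simplified stand-ins for FNSDA's learned components): low-frequency modes are treated as dynamics shared across environments, while a small per-environment latent vector rescales the environment-specific modes.

```python
import numpy as np

n, n_shared = 128, 8
x = np.sin(np.linspace(0, 4 * np.pi, n))          # a state snapshot
X = np.fft.rfft(x)                                # complex spectrum

def adapt(X, z):
    """Keep the shared low modes unchanged; modulate the rest with latents z."""
    out = X.copy()
    k = len(X) - n_shared
    gains = 1.0 + np.resize(z, k)                 # tile the low-dim latent
    out[n_shared:] *= gains                       # environment-specific scaling
    return out

z_env = 0.3 * np.random.default_rng(0).normal(size=4)  # latent params per env
x_adapted = np.fft.irfft(adapt(X, z_env), n=n)
print(np.round(x_adapted[:5], 3))
```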
zh

[AI-74] AIMatDesign: Knowledge-Augmented Reinforcement Learning for Inverse Materials Design under Data Scarcity

【速读】:该论文旨在解决机器学习驱动的逆向设计方法在高维材料组成空间与有限实验数据之间难以协调的问题,具体表现为模型在高维空间中的不可靠性导致预测偏差,以及无法有效融合领域专家知识,限制了其在知识引导的逆向设计中的应用。解决方案的关键在于提出AIMatDesign框架,该框架通过基于差异的算法增强实验数据以构建可信经验池,加速模型收敛;同时引入由大语言模型(Large Language Models, LLMs)指导的自动化精炼策略,动态修正预测不一致,强化奖励信号与状态价值函数的一致性,并结合基于知识的奖励函数,利用专家领域规则提升训练过程的稳定性和效率。

链接: https://arxiv.org/abs/2507.00024
作者: Yeyong Yu,Xilei Bian,Jie Xiong,Xing Wu,Quan Qian
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the growing demand for novel materials, machine learning-driven inverse design methods face significant challenges in reconciling the high-dimensional materials composition space with limited experimental data. Existing approaches suffer from two major limitations: (I) machine learning models often lack reliability in high-dimensional spaces, leading to prediction biases during the design process; (II) these models fail to effectively incorporate domain expert knowledge, limiting their capacity to support knowledge-guided inverse design. To address these challenges, we introduce AIMatDesign, a reinforcement learning framework that augments experimental data using difference-based algorithms to build a trusted experience pool, accelerating model convergence. To enhance model reliability, an automated refinement strategy guided by large language models (LLMs) dynamically corrects prediction inconsistencies, reinforcing alignment between reward signals and state value functions. Additionally, a knowledge-based reward function leverages expert domain rules to improve stability and efficiency during training. Our experiments demonstrate that AIMatDesign significantly surpasses traditional machine learning and reinforcement learning methods in discovery efficiency, convergence speed, and success rates. Among the numerous candidates proposed by AIMatDesign, experimental synthesis of representative Zr-based alloys yielded a top-performing BMG with 1.7 GPa yield strength and 10.2% elongation, closely matching predictions. Moreover, the framework accurately captured the trend of yield strength variation with composition, demonstrating its reliability and potential for closed-loop materials discovery.
zh

[AI-75] Quantum Inspired Encoding Strategies for Machine Learning Models: Proposing and Evaluating Instance Level, Global Discrete, and Class Conditional Representations

【速读】:该论文旨在解决将经典数据转换为量子数据以用于纯经典机器学习模型时的高编码时间问题,同时确保编码值的正确性并分析其对分类性能的影响。论文提出的解决方案关键在于设计三种受量子启发的数据编码策略:实例级策略(Instance Level Strategy, ILS)通过独立处理数据集中的每一行来模拟局部量子态;全局离散值策略(Global Discrete Strategy, GDS)将整个数据集中的所有唯一特征值统一映射到量子态;类别条件值策略(Class Conditional Value Strategy, CCVS)则针对每个类别分别编码唯一值,以保留类别相关的信息。这些策略在编码效率、正确性、模型准确性和计算成本方面进行了评估,从而为优化量子启发的数据转换提供了见解。

链接: https://arxiv.org/abs/2507.00019
作者: Minati Rath,Hema Date
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:In this study, we propose, evaluate and compare three quantum inspired data encoding strategies, Instance Level Strategy (ILS), Global Discrete Strategy (GDS) and Class Conditional Value Strategy (CCVS), for transforming classical data into quantum data for use in pure classical machine learning models. The primary objective is to reduce high encoding time while ensuring correct encoding values and analyzing their impact on classification performance. The Instance Level Strategy treats each row of the dataset independently, mimicking local quantum states. The Global Discrete Strategy maps all unique feature values across the full dataset to quantum states uniformly. In contrast, the Class Conditional Value Strategy encodes unique values separately for each class, preserving class-dependent information. We apply these encoding strategies to a classification task and assess their impact on encoding efficiency, correctness, model accuracy, and computational cost. By analyzing the trade-offs between encoding time, precision, and predictive performance, this study provides insights into optimizing quantum inspired data transformations for classical machine learning workflows.
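
The three strategies differ only in the scope over which values are normalized before being mapped to quantum-state-like representations. The sketch below assumes a min-max scaling to rotation angles in [0, pi/2], encoded as amplitude pairs (cos θ, sin θ); the paper's exact state mapping may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(6, 3)); y = np.array([0, 0, 1, 1, 0, 1])

def to_state(theta):
    """Value -> single-qubit-like amplitude pair (assumed mapping)."""
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def encode_ils(X):                         # Instance Level: scale each row alone
    lo = X.min(axis=1, keepdims=True); hi = X.max(axis=1, keepdims=True)
    return to_state((X - lo) / (hi - lo + 1e-12) * np.pi / 2)

def encode_gds(X):                         # Global Discrete: one dataset-wide map
    return to_state((X - X.min()) / (X.max() - X.min() + 1e-12) * np.pi / 2)

def encode_ccvs(X, y):                     # Class Conditional: per-class scaling
    out = np.empty(X.shape + (2,))
    for c in np.unique(y):
        m = y == c
        lo, hi = X[m].min(), X[m].max()
        out[m] = to_state((X[m] - lo) / (hi - lo + 1e-12) * np.pi / 2)
    return out

print(encode_ils(X).shape, encode_gds(X).shape, encode_ccvs(X, y).shape)
```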
zh

[AI-76] Vision Transformer with Adversarial Indicator Token against Adversarial Attacks in Radio Signal Classifications

【速读】:该论文试图解决基于Transformer的调制分类系统在面对复杂对抗攻击时的脆弱性问题,特别是针对无线信号分类的场景。其解决方案的关键在于提出了一种新的视觉Transformer(ViT)架构,通过引入对抗指示符(AdvI)标记来检测对抗攻击,从而在训练阶段和运行阶段实现统一的防御机制,相较于使用独立模型检测对抗扰动,有效降低了系统的架构复杂度。

链接: https://arxiv.org/abs/2507.00015
作者: Lu Zhang,Sangarapillai Lambotharan,Gan Zheng,Guisheng Liao,Xuekang Liu,Fabio Roli,Carsten Maple
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The remarkable success of transformers across various fields such as natural language processing and computer vision has paved the way for their applications in automatic modulation classification, a critical component in the communication systems of Internet of Things (IoT) devices. However, it has been observed that transformer-based classification of radio signals is susceptible to subtle yet sophisticated adversarial attacks. To address this issue, we have developed a defensive strategy for transformer-based modulation classification systems to counter such adversarial attacks. In this paper, we propose a novel vision transformer (ViT) architecture by introducing a new concept known as the adversarial indicator (AdvI) token to detect adversarial attacks. To the best of our knowledge, this is the first work to propose an AdvI token in ViT to defend against adversarial attacks. Integrating an adversarial training method with a detection mechanism using the AdvI token, we combine a training-time defense and a running-time defense in a unified neural network model, which reduces the architectural complexity of the system compared to detecting adversarial perturbations using separate models. We investigate the operational principles of our method by examining the attention mechanism. We show the proposed AdvI token acts as a crucial element within the ViT, influencing attention weights and thereby highlighting regions or features in the input data that are potentially suspicious or anomalous. Through experimental results, we demonstrate that our approach surpasses several competitive methods in handling white-box attack scenarios, including those utilizing the fast gradient method, projected gradient descent attacks and basic iterative method.
zh

[AI-77] SWE-Bench-CL: Continual Learning for Coding Agents

【速读】:该论文旨在解决软件工程中人工智能代理在持续学习场景下的适应性与鲁棒性问题,即如何使AI代理在面对不断变化的代码生成任务时,能够积累经验、迁移知识并抵抗灾难性遗忘。其解决方案的关键在于构建了一个基于时间序列的持续学习基准SWE-Bench-CL,该基准通过组织GitHub问题以反映仓库的自然演化过程,支持对AI代理的持续学习能力进行直接评估,并结合了交互式LangGraph评估框架、语义记忆模块以及多种专门设计的持续学习指标,以全面衡量稳定性与可塑性的权衡。

链接: https://arxiv.org/abs/2507.00014
作者: Thomas Joshi,Shayan Chowdhury,Fatih Uysal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce SWE-Bench-CL, a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset introduced by OpenAI and Princeton-NLP in 2024. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent’s ability to accumulate experience, transfer knowledge across tasks, and resist catastrophic forgetting. We complement the dataset with (i) a preliminary analysis of inter-task structural similarity and contextual sensitivity, (ii) an interactive LangGraph-based evaluation framework augmented with a FAISS-backed semantic memory module, and (iii) a suite of specialized continual learning metrics – including average accuracy, forgetting, forward/backward transfer, tool-use efficiency, and a generalized Composite Continual Learning Score and CL-F-beta score – to capture the stability-plasticity trade-off. We outline a rigorous experimental protocol comparing memory-enabled and memory-disabled agents across diverse Python repositories. All code and data are publicly available at this https URL, providing the community with a reproducible platform for developing more adaptive and robust AI agents in software engineering.
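
The core continual-learning metrics mentioned can be computed from an accuracy matrix R, where R[i][j] is the accuracy on task j after training through task i (the benchmark's composite scores and tool-use metrics are omitted here; this is only the standard core):

```python
import numpy as np

R = np.array([[0.62, 0.10, 0.05],
              [0.55, 0.58, 0.12],
              [0.50, 0.51, 0.60]])        # toy numbers, not benchmark results
T = R.shape[0]

avg_acc = R[-1].mean()                    # final average accuracy over all tasks
# Forgetting: best past accuracy on a task minus final accuracy on it.
forgetting = np.mean([R[:T - 1, j].max() - R[-1, j] for j in range(T - 1)])
# Backward transfer: how later training changed earlier-task accuracy.
bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
print(f"ACC={avg_acc:.3f}  forgetting={forgetting:.3f}  BWT={bwt:.3f}")
```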
zh

[AI-78] ST-MTM: Masked Time Series Modeling with Seasonal-Trend Decomposition for Time Series Forecasting KDD2025

【速读】:该论文旨在解决复杂时间序列预测中由于原始时间序列被简单掩码而忽略其内在语义结构,导致模型学习到虚假时间模式的问题。其解决方案的关键在于引入ST-MTM框架,该框架通过季节-趋势分解(seasonal-trend decomposition)对时间序列进行分量处理,并设计了针对季节成分的周期掩码策略和针对趋势成分的子序列掩码策略,以有效捕捉不同时间变化模式,从而提升时间序列的建模能力和预测性能。

链接: https://arxiv.org/abs/2507.00013
作者: Hyunwoo Seo,Chiehyeon Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted by KDD 2025 research track

点击查看摘要

Abstract:Forecasting complex time series is an important yet challenging problem that involves various industrial applications. Recently, masked time-series modeling (MTM) has been proposed to effectively model temporal dependencies for forecasting by reconstructing masked segments from unmasked ones. However, since the semantic information in time series is involved in intricate temporal variations generated by multiple time series components, simply masking a raw time series ignores the inherent semantic structure, which may cause MTM to learn spurious temporal patterns present in the raw data. To capture distinct temporal semantics, we show that masked modeling techniques should address entangled patterns through a decomposition approach. Specifically, we propose ST-MTM, a masked time-series modeling framework with seasonal-trend decomposition, which includes a novel masking method for the seasonal-trend components that incorporates different temporal variations from each component. ST-MTM uses a period masking strategy for seasonal components to produce multiple masked seasonal series based on inherent multi-periodicity and a sub-series masking strategy for trend components to mask temporal regions that share similar variations. The proposed masking method presents an effective pre-training task for learning intricate temporal variations and dependencies. Additionally, ST-MTM introduces a contrastive learning task to support masked modeling by enhancing contextual consistency among multiple masked seasonal representations. Experimental results show that our proposed ST-MTM achieves consistently superior forecasting performance compared to existing masked modeling, contrastive learning, and supervised forecasting methods.
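
A toy rendering of the two masking strategies (the period detection, mask ratios, and contrastive task are simplified assumptions): decompose a series into a trend (moving average) and a seasonal residual, then mask the seasonal part by whole periods and the trend part by a contiguous sub-series:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.02 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=200)

win = 25
trend = np.convolve(series, np.ones(win) / win, mode="same")
seasonal = series - trend

def period_mask(x, period=24, drop_periods=(1, 3)):
    """Hide entire cycles of the seasonal component."""
    m = x.copy()
    for p in drop_periods:
        m[p * period:(p + 1) * period] = np.nan
    return m

def subseries_mask(x, start=60, length=30):
    """Hide one contiguous span of the trend component."""
    m = x.copy(); m[start:start + length] = np.nan
    return m

masked_seasonal = period_mask(seasonal)
masked_trend = subseries_mask(trend)
print(int(np.isnan(masked_seasonal).sum()), int(np.isnan(masked_trend).sum()))
```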
zh

[AI-79] Towards Undistillable Models by Minimizing Conditional Mutual Information

【速读】:该论文试图解决如何构建不可蒸馏的深度神经网络(undistillable DNN),以保护其知识产权。解决方案的关键在于通过最小化所有温度缩放聚类的条件互信息(CMI)来训练模型,即提出了一种称为CMI最小化(CMIM)的方法,该方法在传统交叉熵(CE)损失的基础上,联合优化CMI值,从而使得模型输出的概率分布具有高度集中性,使其无法通过知识蒸馏(KD)方法有效复制。

链接: https://arxiv.org/abs/2507.00012
作者: Linfeng Ye,Shayan Mohajer Hamidi,En-hui Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 6 figures, Transactions on Machine Learning Research

点击查看摘要

Abstract:A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). In this case, the distilled student (referred to as the knockoff student) does not outperform a student trained independently with label smoothing (LS student) in terms of prediction accuracy. To protect the intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions in response to all sample instances with the same label should be highly concentrated to the extent that each cluster corresponding to each label should ideally collapse into one probability distribution. Based on this observation and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called the CMI-minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is also shown to perform better than the model trained with the CE loss alone in terms of its own prediction accuracy.
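
One plausible PyTorch reading of the training objective (the temperature grid, weighting, and KL direction are our assumptions): add to cross-entropy a term that pulls each temperature-scaled output distribution toward its label-conditional cluster centroid.

```python
import torch
import torch.nn.functional as F

def cmim_loss(logits, labels, temps=(1.0, 2.0, 4.0), lam=0.1):
    """CE plus a CMI-style concentration penalty over temperature-scaled clusters."""
    ce = F.cross_entropy(logits, labels)
    cmi = logits.new_tensor(0.0)
    for T in temps:
        p = F.softmax(logits / T, dim=-1)                 # (batch, classes)
        for c in labels.unique():
            pc = p[labels == c]
            centroid = pc.mean(dim=0, keepdim=True)
            # Mean KL(p_i || centroid) over the cluster for label c.
            cmi = cmi + (pc * (pc.clamp_min(1e-9).log()
                               - centroid.clamp_min(1e-9).log())).sum(dim=1).mean()
    return ce + lam * cmi / len(temps)

logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
loss = cmim_loss(logits, labels)
loss.backward()
print(float(loss))
```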
zh

[AI-80] Novel RL approach for efficient Elevator Group Control Systems

【速读】:该论文旨在解决大型建筑中电梯调度的高效管理问题,以减少乘客出行时间和能耗。传统基于启发式或模式检测的控制器难以应对调度中的随机性和组合复杂性,因此本文将阿姆斯特丹自由大学的六部电梯、十五层系统建模为马尔可夫决策过程,并训练了一个端到端的强化学习(Reinforcement Learning, RL)电梯群控系统(Elevator Group Control System, EGCS)。解决方案的关键在于提出一种新颖的动作空间编码方法以处理调度的组合复杂性,引入基础设施步骤(infra-steps)以模拟连续的乘客到达,并设计了定制的奖励信号以提高学习效率。此外,还探索了适应基础设施步骤形式的折扣因子调整方法,最终验证了所提出的RL-based EGCS在动态交通模式下的适应能力与学习性能优于传统规则基算法。

链接: https://arxiv.org/abs/2507.00011
作者: Nathan Vaartjes,Vincent Francois-Lavet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Efficient elevator traffic management in large buildings is critical for minimizing passenger travel times and energy consumption. Because heuristic- or pattern-detection-based controllers struggle with the stochastic and combinatorial nature of dispatching, we model the six-elevator, fifteen-floor system at Vrije Universiteit Amsterdam as a Markov Decision Process and train an end-to-end Reinforcement Learning (RL) Elevator Group Control System (EGCS). Key innovations include a novel action space encoding to handle the combinatorial complexity of elevator dispatching, the introduction of infra-steps to model continuous passenger arrivals, and a tailored reward signal to improve learning efficiency. In addition, we explore various ways to adapt the discounting factor to the infra-step formulation. We investigate RL architectures based on Dueling Double Deep Q-learning, showing that the proposed RL-based EGCS adapts to fluctuating traffic patterns, learns from a highly stochastic environment, and thereby outperforms a traditional rule-based algorithm.
zh

[AI-81] Integrating Universal Generative AI Platforms in Educational Labs to Foster Critical Thinking and Digital Literacy

【速读】:该论文试图解决传统教育中对生成式AI(Generative AI)平台的非批判性依赖问题,以及如何有效将这些工具融入实验教学以培养本科生的批判性思维和数字素养。解决方案的关键在于将生成式AI重新定位为研究对象和认知工具,通过让学生制定学科特定的提示并评估AI生成的文本、图像和视频响应,从而促进结构化的AI互动和反思性评估。这种教学模式在非科学专业学生的通识天文学课程中进行了试点,显示出高参与度和深度反思,证明了生成式AI与反思性评估方法结合可提升学习效果。

链接: https://arxiv.org/abs/2507.00007
作者: Vasiliy Znamenskiy,Rafael Niyazov,Joel Hernandez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this http URL

点击查看摘要

Abstract:This paper presents a new educational framework for integrating generative artificial intelligence (GenAI) platforms such as ChatGPT, Claude, and Gemini into laboratory activities aimed at developing critical thinking and digital literacy among undergraduate students. Recognizing the limitations and risks of uncritical reliance on large language models (LLMs), the proposed pedagogical model reframes GenAI as a research subject and cognitive tool. Students formulate discipline-specific prompts and evaluate GenAI-generated responses in text, image, and video modalities. A pilot implementation in a general astronomy course for non-science majors demonstrated high levels of engagement and critical reflection, with many students continuing the activity after class and presenting results at a research symposium. The results highlight the importance of structured AI interactions in education and suggest that GenAI can improve learning outcomes when combined with reflective assessment methods. The study proposes a replicable model for interdisciplinary AI-integrated lab work, adaptable to scientific disciplines. See the guide to learning activities based on generative AI platforms: this https URL
zh

[AI-82] A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理(inference)阶段的计算成本过高问题,尤其是在以推理为核心任务的模型中,推理成本已成为资源消耗的主要部分。传统方法在分析计算最优性时,往往孤立或以固定组合方式考虑模型规模、数据集规模和推理标记数,容易忽略更优的操作点。论文提出的解决方案是引入定向随机技能搜索(Directed Stochastic Skill Search, DS3),该框架将推理建模为在学习到的技能图上的随机遍历过程,并通过简化但表达性强的实例推导出任务成功率和计算成本的闭式表达式,从而实现对不同推理策略(如链式思维CoT和树状思维ToT)的比较分析。DS3的关键在于通过理论建模揭示训练与推理之间的依赖关系,为算法设计和资源分配提供理论依据。

链接: https://arxiv.org/abs/2507.00004
作者: Austin R. Ellis-Mohr,Anuj K. Nayak,Lav R. Varshney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. While scaling laws for training have guided much of the field’s recent progress, inference costs now represent a significant and growing component of the overall resource burden, particularly for reasoning-focused models. Existing characterizations of compute-optimality that consider model size, dataset size, and inference tokens in isolation or in fixed combinations risk overlooking more efficient operating points. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learned skill graph. From a simplified yet expressive instantiation, we derive closed-form expressions for task success and compute cost across a wide range of inference strategies – including chain-of-thought (CoT) and tree-of-thought (ToT) – enabling comparative analysis as a function of task difficulty and model capability. To that end, we extend a prior first-principles tripartite graph framework of LLM training to incorporate inference, and separately bridge DS3 with empirical methods that characterize LLM scaling behavior. We theoretically recover empirically observed patterns, including: linear accuracy scaling with logarithmic compute; variation in preferred inference strategies as a function of task difficulty and model capability; emergent behavior elicited by reasoning even when performance plateaus under parameter scaling; and both best-of-N (BoN) and majority voting behavior captured within a unified analytical framework. By explicitly characterizing training-inference interdependencies, our framework deepens theoretical understanding and supports principled algorithmic design and resource allocation.
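
For intuition about the kind of closed-form trade-offs the framework analyzes, the textbook success/compute expressions for best-of-N and majority voting are easy to write down (these illustrative formulas are ours, not DS3's derivations):

```python
from math import comb

def best_of_n(p, N):
    """Succeeds if any of N samples is correct (assumes a perfect verifier)."""
    return 1 - (1 - p) ** N

def majority_vote(p, N):
    """Succeeds if correct samples form a strict majority (a pessimistic
    simplification that treats all wrong answers as agreeing)."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k)
               for k in range(N // 2 + 1, N + 1))

p, tokens_per_sample = 0.3, 500          # per-sample success rate, cost proxy
for N in (1, 3, 9, 27):
    print(f"N={N:2d}  compute={N * tokens_per_sample:6d} tokens  "
          f"BoN={best_of_n(p, N):.3f}  majority={majority_vote(p, N):.3f}")
```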
zh

[AI-83] Deciding When Not to Decide: Indeterminacy-Aware Intrusion Detection with NeutroSENSE

【速读】:该论文旨在解决物联网(IoT)环境中可解释的入侵检测问题,特别是在面对不确定性和需要可靠决策的情况下。解决方案的关键在于提出NeutroSENSE框架,该框架结合了随机森林、XGBoost和逻辑回归,并引入了中性逻辑(Neutrosophic Logic),将预测置信度分解为真值(T)、假值(F)和不确定性(I)三个组成部分,从而实现对不确定性的量化和主动回避。通过设置全局和自适应的类别特定阈值,系统能够识别高不确定性预测并进行人工审查,提升了边缘计算环境下的可信度与可解释性。

链接: https://arxiv.org/abs/2507.00003
作者: Eyhab Al-Masri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:This paper presents NeutroSENSE, a neutrosophic-enhanced ensemble framework for interpretable intrusion detection in IoT environments. By integrating Random Forest, XGBoost, and Logistic Regression with neutrosophic logic, the system decomposes prediction confidence into truth (T), falsity (F), and indeterminacy (I) components, enabling uncertainty quantification and abstention. Predictions with high indeterminacy are flagged for review using both global and adaptive, class-specific thresholds. Evaluated on the IoT-CAD dataset, NeutroSENSE achieved 97% accuracy, while demonstrating that misclassified samples exhibit significantly higher indeterminacy (I = 0.62) than correct ones (I = 0.24). The use of indeterminacy as a proxy for uncertainty enables informed abstention and targeted review, which is particularly valuable in edge deployments. Figures and tables validate the correlation between I-scores and error likelihood, supporting more trustworthy, human-in-the-loop AI decisions. This work shows that neutrosophic logic enhances both accuracy and explainability, providing a practical foundation for trust-aware AI in edge and fog-based IoT security systems.
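
A sketch of the T/F/I decomposition and abstention rule (the exact neutrosophic formulas and thresholds are assumptions; here truth is the mean ensemble support for the prediction and indeterminacy is cross-model disagreement):

```python
import numpy as np

# Class probabilities from the three ensemble members for one sample.
probs = np.array([[0.90, 0.10],    # Random Forest
                  [0.55, 0.45],    # XGBoost
                  [0.60, 0.40]])   # Logistic Regression
pred = int(probs.mean(axis=0).argmax())

T = probs[:, pred].mean()          # truth: average support for the prediction
F = 1.0 - T                        # falsity: support for the alternatives
I = probs[:, pred].std()           # indeterminacy: inter-model disagreement

threshold = 0.15                   # the paper uses class-specific thresholds
decision = "abstain / flag for review" if I > threshold else f"class {pred}"
print(f"T={T:.2f} F={F:.2f} I={I:.2f} -> {decision}")
```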
zh

[AI-84] From Sentences to Sequences: Rethinking Languages in Biological System

【速读】:该论文试图解决如何将自然语言处理(NLP)中的成功方法有效迁移至生物语言建模的问题,特别是针对蛋白质、RNA和DNA等生物序列的建模。其关键在于重新审视生物系统中的“语言”概念,强调生物语言与自然语言在结构关联上的根本差异,并通过将生物分子的三维结构视为句子的语义内容,结合残基或碱基间的强相关性,提出结构评估的重要性,从而验证自回归生成范式在生物语言建模中的适用性。

链接: https://arxiv.org/abs/2507.00953
作者: Ke Liu,Shuanke Shen,Hao Chen
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at this https URL
zh

[AI-85] LearnAFE: Circuit-Algorithm Co-design Framework for Learnable Audio Analog Front-End

【速读】:该论文试图解决传统音频信号分类系统中模拟前端(AFE)与后端分类器独立设计导致的性能非最优问题。解决方案的关键在于提出一种电路-算法协同设计框架,通过联合优化后端分类器与AFE的传递函数,实现系统级最优。具体而言,利用信噪比(SNR)感知的训练循环对模拟带通滤波器(BPF)组的传递函数参数进行调优,并采用协同设计损失函数LBPF,实现了滤波器组与分类器的共同优化。

链接: https://arxiv.org/abs/2507.00755
作者: Jinhai Hu,Zhongyi Zhang,Cong Sheng Leow,Wang Ling Goh,Yuan Gao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 11 pages, 15 figures, accepted for publication on IEEE Transactions on Circuits and Systems I: Regular Papers

点击查看摘要

Abstract:This paper presents a circuit-algorithm co-design framework for a learnable analog front-end (AFE) in audio signal classification. Designing the AFE and the backend classifier separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE’s transfer function to achieve a system-level optimum. More specifically, the transfer function parameters of an analog bandpass filter (BPF) bank are tuned in a signal-to-noise ratio (SNR)-aware training loop for the classifier. Using a co-design loss function L_BPF, this work shows superior optimization of both the filter bank and the classifier. Implemented in the open-source SKY130 130 nm CMOS process, the optimized design achieved 90.5%-94.2% accuracy for a 10-keyword classification task across a wide range of input signal SNR from 5 dB to 20 dB, with only 22k classifier parameters. Compared to the conventional approach, the proposed audio AFE achieves 8.7% and 12.9% reductions in power and capacitor area, respectively.
zh

[AI-86] Physics-Informed Neural ODEs for Temporal Dynamics Modeling in Cardiac T1 Mapping MICCAI2025

【速读】:该论文旨在解决传统Modified Look-Locker Inversion Recovery (MOLLI)在心脏参数映射中因长时间扫描导致患者呼吸困难、运动伪影以及需要逐体素非线性拟合带来的计算复杂性问题。其解决方案的关键在于提出一种基于物理信息的神经微分方程(Physics-Informed Neural Ordinary Differential Equations, PINO)的端到端T₁映射框架,通过建模时间动态过程实现从稀疏基线图像中高精度估计T₁值,并在测试阶段确保高效的零点索引估计。该方法引入连续时间LSTM-ODE模型,支持任意时间间隔的Look-Locker(LL)数据采集,从而提升成像效率与准确性。

链接: https://arxiv.org/abs/2507.00613
作者: Nuno Capitão,Yi Zhang,Yidong Zhao,Qian Tao
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: Submitted version. Accepted at MICCAI 2025

点击查看摘要

Abstract:Spin-lattice relaxation time (T1) is an important biomarker in cardiac parametric mapping for characterizing myocardial tissue and diagnosing cardiomyopathies. Conventional Modified Look-Locker Inversion Recovery (MOLLI) acquires 11 breath-hold baseline images with interleaved rest periods to ensure mapping accuracy. However, prolonged scanning can be challenging for patients with poor breath-holds, often leading to motion artifacts that degrade image quality. In addition, T1 mapping requires voxel-wise nonlinear fitting to a signal recovery model involving an iterative estimation process. Recent studies have proposed deep-learning approaches for rapid T1 mapping using shortened sequences to reduce acquisition time for patient comfort. Nevertheless, existing methods overlook important physics constraints, limiting interpretability and generalization. In this work, we present an accelerated, end-to-end T1 mapping framework leveraging Physics-Informed Neural Ordinary Differential Equations (ODEs) to model temporal dynamics and address these challenges. Our method achieves high-accuracy T1 estimation from a sparse subset of baseline images and ensures efficient null index estimation at test time. Specifically, we develop a continuous-time LSTM-ODE model to enable selective Look-Locker (LL) data acquisition with arbitrary time lags. Experimental results show superior performance in T1 estimation for both native and post-contrast sequences and demonstrate the strong benefit of our physics-based formulation over direct data-driven T1 priors.
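
For context, the conventional per-voxel fit that such learning-based methods accelerate is the standard three-parameter Look-Locker model S(TI) = A - B*exp(-TI/T1*), with the Look-Locker correction T1 = T1*(B/A - 1). A sketch on synthetic data (inversion times, noise level, and polarity handling are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def molli(ti, A, B, t1_star):
    """Three-parameter Look-Locker signal recovery model."""
    return A - B * np.exp(-ti / t1_star)

A_true, B_true, t1_star_true = 1.0, 1.9, 700.0            # ms
ti = np.array([100, 180, 260, 1100, 1180, 2100, 2180, 3100, 4100, 4180, 5100.])
rng = np.random.default_rng(0)
signal = np.abs(molli(ti, A_true, B_true, t1_star_true)) + 0.01 * rng.normal(size=ti.size)

# Magnitude data: restore polarity for the early (inverted) samples first.
polarity = np.where(ti < 400, -1.0, 1.0)
(A, B, t1_star), _ = curve_fit(molli, ti, polarity * signal,
                               p0=(1.0, 2.0, 1000.0), maxfev=5000)
t1 = t1_star * (B / A - 1.0)
print(f"fitted T1 = {t1:.0f} ms (true {t1_star_true * (B_true / A_true - 1):.0f} ms)")
```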
zh

[AI-87] Inverse Design in Nanophotonics via Representation Learning

【速读】:该论文旨在解决纳米光子学中逆向设计的挑战,即在高维、非凸的设计空间中高效计算出实现特定电磁响应的结构,传统方法因计算成本高和优化困难而受限。其解决方案的关键在于利用机器学习(ML)增强逆向设计方法,通过表征学习将问题分为输出端和输入端两类策略:输出端方法通过学习解空间的表征构建可微分求解器以加速优化,输入端方法则通过学习可行器件几何的紧凑潜在表征,借助生成模型实现高效的全局探索。这些方法在数据需求、泛化能力和新设计发现潜力方面各有权衡,而结合物理驱动优化与数据驱动表征的混合框架有助于克服局部最优、提升可扩展性并促进知识迁移。

链接: https://arxiv.org/abs/2507.00546
作者: Reza Marzban,Ali Adibi,Raphael Pestourie
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Inverse design in nanophotonics, the computational discovery of structures achieving targeted electromagnetic (EM) responses, has become a key tool for recent optical advances. Traditional intuition-driven or iterative optimization methods struggle with the inherently high-dimensional, non-convex design spaces and the substantial computational demands of EM simulations. Recently, machine learning (ML) has emerged to address these bottlenecks effectively. This review frames ML-enhanced inverse design methodologies through the lens of representation learning, classifying them into two categories: output-side and input-side approaches. Output-side methods use ML to learn a representation in the solution space to create a differentiable solver that accelerates optimization. Conversely, input-side techniques employ ML to learn compact, latent-space representations of feasible device geometries, enabling efficient global exploration through generative models. Each strategy presents unique trade-offs in data requirements, generalization capacity, and novel design discovery potentials. Hybrid frameworks that combine physics-based optimization with data-driven representations help escape poor local optima, improve scalability, and facilitate knowledge transfer. We conclude by highlighting open challenges and opportunities, emphasizing complexity management, geometry-independent representations, integration of fabrication constraints, and advancements in multiphysics co-designs.
zh

[AI-88] Physics-Aware Style Transfer for Adaptive Holographic Reconstruction

【速读】:该论文试图解决在线全息成像中从记录的衍射图样重建物体复振幅的不适定逆问题。传统相位恢复算法和现有深度学习方法通常依赖高质量的复振幅图集作为真实值数据集,而该研究提出了一种物理感知的风格迁移方法,其关键在于将物-传感器距离视为衍射图样中的隐式风格,并利用风格域作为中间域构建循环图像翻译,从而仅使用强度测量数据集即可自适应地学习逆映射操作。

链接: https://arxiv.org/abs/2507.00482
作者: Chanseok Lee,Fakhriyya Mammadova,Jiseong Barg,Mooseok Jang
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Keywords: holographic imaging, style transfer, phase retrieval, deep learning

点击查看摘要

Abstract:Inline holographic imaging presents an ill-posed inverse problem of reconstructing objects’ complex amplitude from recorded diffraction patterns. Although recent deep learning approaches have shown promise over classical phase retrieval algorithms, they often require high-quality ground truth datasets of complex amplitude maps to achieve a statistical inverse mapping operation between the two domains. Here, we present a physics-aware style transfer approach that interprets the object-to-sensor distance as an implicit style within diffraction patterns. Using the style domain as the intermediate domain to construct cyclic image translation, we show that the inverse mapping operation can be learned in an adaptive manner only with datasets composed of intensity measurements. We further demonstrate its biomedical applicability by reconstructing the morphology of dynamically flowing red blood cells, highlighting its potential for real-time, label-free imaging. As a framework that leverages physical cues inherently embedded in measurements, the presented method offers a practical learning strategy for imaging applications where ground truth is difficult or impossible to obtain.
zh

[AI-89] Process-aware and high-fidelity microstructure generation using stable diffusion

【速读】:该论文旨在解决基于加工参数生成逼真微观结构图像的问题,以理解材料设计中的工艺-结构关系。由于训练用的显微图像有限且加工变量具有连续性,该任务面临较大挑战。其解决方案的关键在于提出一种基于Stable Diffusion 3.5 Large(SD3.5-Large)的工艺感知生成建模方法,通过数值感知嵌入将连续变量(如退火温度、时间及放大倍数)直接编码至模型条件中,从而实现受控的图像生成并捕捉工艺驱动的微观结构变化。此外,通过DreamBooth和低秩适应(LoRA)对模型进行微调,仅调整少量权重以应对数据稀缺和计算限制,有效实现了预训练模型向材料领域的迁移。

链接: https://arxiv.org/abs/2507.00459
作者: Hoang Cuong Phan,Minh Tien Tran,Chihun Lee,Hoheok Kim,Sehyok Oh,Dong-Kyu Kim,Ho Won Lee
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 46 pages, 13 figures, 5 tables, 3rd Word Congress on Artificial Intelligence in Materials Manufacturing 2025

点击查看摘要

Abstract:Synthesizing realistic microstructure images conditioned on processing parameters is crucial for understanding process-structure relationships in materials design. However, this task remains challenging due to limited training micrographs and the continuous nature of processing variables. To overcome these challenges, we present a novel process-aware generative modeling approach based on Stable Diffusion 3.5 Large (SD3.5-Large), a state-of-the-art text-to-image diffusion model adapted for microstructure generation. Our method introduces numeric-aware embeddings that encode continuous variables (annealing temperature, time, and magnification) directly into the model’s conditioning, enabling controlled image generation under specified process conditions and capturing process-driven microstructural variations. To address data scarcity and computational constraints, we fine-tune only a small fraction of the model’s weights via DreamBooth and Low-Rank Adaptation (LoRA), efficiently transferring the pre-trained model to the materials domain. We validate realism using a semantic segmentation model based on a fine-tuned U-Net with a VGG16 encoder on 24 labeled micrographs. It achieves 97.1% accuracy and 85.7% mean IoU, outperforming previous methods. Quantitative analyses using physical descriptors and spatial statistics show strong agreement between synthetic and real microstructures. Specifically, two-point correlation and lineal-path errors remain below 2.1% and 0.6%, respectively. Our method represents the first adaptation of SD3.5-Large for process-aware microstructure generation, offering a scalable approach for data-driven materials design.

[AI-90] Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding

【Quick Read】: This paper aims to address the fragmentation of subsurface analysis in geophysics, where traditional approaches require separate models for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling, each tightly coupled to a specific data distribution and task formulation. The key to the solution is a unified generative architecture, the Geological Everything Model 3D (GEM), which reformulates all of these tasks as prompt-conditioned inference along latent structural frameworks: human-provided prompts (such as well logs, masks, or structural sketches) are propagated along the inferred structural frameworks to produce geologically coherent outputs, enabling zero-shot generalization across tasks and heterogeneous prompt types.

Link: https://arxiv.org/abs/2507.00419
Authors: Yimin Dou, Xinming Wu, Nathan L Bangs, Harpreet Singh Sethi, Jintao Li, Hang Gao, Zhixiang Guo
Institution: Unknown
Subjects: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding Earth’s subsurface is critical for energy transition, natural hazard mitigation, and planetary science. Yet subsurface analysis remains fragmented, with separate models required for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling-each tightly coupled to specific data distributions and task formulations. We introduce the Geological Everything Model 3D (GEM), a unified generative architecture that reformulates all these tasks as prompt-conditioned inference along latent structural frameworks derived from subsurface imaging. This formulation moves beyond task-specific models by enabling a shared inference mechanism, where GEM propagates human-provided prompts-such as well logs, masks, or structural sketches-along inferred structural frameworks to produce geologically coherent outputs. Through this mechanism, GEM achieves zero-shot generalization across tasks with heterogeneous prompt types, without retraining for new tasks or data sources. This capability emerges from a two-stage training process that combines self-supervised representation learning on large-scale field seismic data with adversarial fine-tuning using mixed prompts and labels across diverse subsurface tasks. GEM demonstrates broad applicability across surveys and tasks, including Martian radar stratigraphy analysis, structural interpretation in subduction zones, full seismic stratigraphic interpretation, geobody delineation, and property modeling. By bridging expert knowledge with generative reasoning in a structurally aware manner, GEM lays the foundation for scalable, human-in-the-loop geophysical AI-transitioning from fragmented pipelines to a vertically integrated, promptable reasoning system. Project page: this https URL

[AI-91] Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

【Quick Read】: This paper tackles the high cost of obtaining 3D molecular geometries with traditional approaches, which typically rely on computationally intensive techniques such as density functional theory (DFT). The key to the solution is to use machine learning interatomic potential (MLIP) models to predict molecular geometries. The authors curate a large-scale molecular relaxation dataset and train MLIP foundation models with supervised learning to predict energies and forces, making it possible to obtain high-quality molecular geometries without DFT.

Link: https://arxiv.org/abs/2507.00407
Authors: Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji
Institution: Unknown
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP foundation models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the foundation models can be used in different ways to obtain geometries either explicitly or implicitly. First, it can be used to obtain low-energy 3D geometries via geometry optimization, providing relaxed 3D geometries for downstream molecular property predictions. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the foundation models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP foundation models trained on relaxation data can provide valuable molecular geometries that benefit property predictions.
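As a rough illustration of how a force-predicting MLIP can stand in for DFT during relaxation, here is a minimal steepest-descent loop; `mlip` is a hypothetical callable returning predicted energy and forces, and a real pipeline would use a proper optimizer such as L-BFGS or FIRE.

```python
import torch

def relax_geometry(positions, atomic_numbers, mlip, steps=200, lr=0.01, fmax=0.05):
    """Gradient-descent relaxation: move atoms along predicted forces until
    the largest force component falls below fmax (eV/A). `mlip` is a
    hypothetical callable returning (energy, forces) for a 3D structure."""
    pos = positions.clone()
    for _ in range(steps):
        energy, forces = mlip(pos, atomic_numbers)   # forces = -dE/dpos
        if forces.abs().max() < fmax:
            break                                    # converged to a low-energy geometry
        pos = pos + lr * forces                      # steepest-descent step downhill in energy
    return pos
```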

[AI-92] Reconfiguring Digital Accountability: AI-Powered Innovations and Transnational Governance in a Postnational Accounting Context

【Quick Read】: This paper examines how AI-powered digital innovations reshape organisational accountability in a transnational governance context. As AI systems increasingly mediate decision-making in domains such as auditing and financial reporting, traditional accountability mechanisms based on control, transparency, and auditability are being destabilised. The key to the solution is to integrate the Technology Acceptance Model (TAM), Actor-Network Theory (ANT), and institutional theory to analyze how organisations adopt AI under global regulatory, ethical, and cultural pressures. The study argues that accountability is co-constructed within global socio-technical networks, shaped not only by user perceptions but also by governance logics and normative expectations. It extends TAM by incorporating compliance and legitimacy as factors in perceived usefulness and ease of use, and draws on ANT to reconceptualise accountability as a relational and emergent property of networked assemblages. Finally, it proposes two organisational strategies, internal governance reconfiguration and external actor-network engagement, to foster responsible, legitimate, and globally accepted AI adoption in accounting.

Link: https://arxiv.org/abs/2507.00288
Authors: Claire Li, David Freeborn
Institution: Unknown
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 22 pages

Click to view abstract

Abstract:This study explores how AI-powered digital innovations are reshaping organisational accountability in a transnational governance context. As AI systems increasingly mediate decision-making in domains such as auditing and financial reporting, traditional mechanisms of accountability, based on control, transparency, and auditability, are being destabilised. We integrate the Technology Acceptance Model (TAM), Actor-Network Theory (ANT), and institutional theory to examine how organisations adopt AI technologies in response to regulatory, ethical, and cultural pressures that transcend national boundaries. We argue that accountability is co-constructed within global socio-technical networks, shaped not only by user perceptions but also by governance logics and normative expectations. Extending TAM, we incorporate compliance and legitimacy as key factors in perceived usefulness and usability. Drawing on ANT, we reconceptualise accountability as a relational and emergent property of networked assemblages. We propose two organisational strategies including internal governance reconfiguration and external actor-network engagement to foster responsible, legitimate, and globally accepted AI adoption in the accounting domain.

[AI-93] Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations

【Quick Read】: This paper addresses the failure of current sparse autoencoder (SAE) approaches to neural network interpretability to eliminate polysemanticity, together with their pathological behavioral errors. The key to the solution is a dual encoding hypothesis, namely that neural networks encode information in two complementary spaces within the same substrate, feature identity and feature integration, and the use of sequential and joint-training architectures to capture both patterns simultaneously. Joint training achieves a 41.3% reconstruction improvement and a 51.6% reduction in KL divergence errors, indicating that the approach captures computational relationships crucial for behavior more effectively.

Link: https://arxiv.org/abs/2507.00269
Authors: Omar Claflin
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves 41.3% reconstruction improvement and 51.6% reduction in KL divergence errors. This architecture spontaneously develops bimodal feature organization: low squared norm features contributing to integration pathways and the rest contributing directly to the residual. Small nonlinear components (3% of parameters) achieve 16.5% standalone improvements, demonstrating parameter-efficient capture of computational relationships crucial for behavior. Additionally, intervention experiments using 2x2 factorial stimulus designs demonstrated that integration features exhibit selective sensitivity to experimental manipulations and produce systematic behavioral effects on model outputs, including significant interaction effects across semantic dimensions. This work provides systematic evidence for (1) dual encoding in neural representations, (2) meaningful nonlinearly encoded feature interactions, and (3) introduces an architectural paradigm shift from post-hoc feature analysis to integrated computational design, establishing foundations for next-generation SAEs.
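The exact architecture is not reproduced here; the sketch below illustrates the general shape of a jointly trained autoencoder with a sparse linear "identity" dictionary plus a small nonlinear "integration" pathway reconstructing the same activations. All sizes and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class DualPathwaySAE(nn.Module):
    """Sparse linear 'identity' dictionary plus a small nonlinear
    'integration' pathway, jointly reconstructing the activation vector."""
    def __init__(self, d_model=512, n_features=4096, d_integration=64):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)
        self.integration = nn.Sequential(              # small nonlinear component
            nn.Linear(n_features, d_integration), nn.GELU(),
            nn.Linear(d_integration, d_model))

    def forward(self, x):
        f = torch.relu(self.encoder(x))                # sparse feature codes
        return self.decoder(f) + self.integration(f), f

model = DualPathwaySAE()
x = torch.randn(8, 512)                                # stand-in for model activations
x_hat, f = model(x)
loss = (x_hat - x).pow(2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
```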

[AI-94] Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis INTERSPEECH2025

【Quick Read】: This paper addresses the challenging problem of generating expressive prosody in text-to-speech synthesis, especially in systems that model prosody explicitly through parameters such as pitch, energy, and duration, a design usually chosen for interpretability and controllability. The key to the solution is to evaluate the effectiveness of stochastic methods, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows, and to show that they produce prosody as natural as that of human speakers by capturing the variability inherent in human speech, while offering additional controllability through a tunable sampling temperature.

Link: https://arxiv.org/abs/2507.00227
Authors: Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu
Institution: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2025

Click to view abstract

Abstract:While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.
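Among the stochastic methods compared, conditional flow matching has a particularly compact training objective. A minimal sketch for a low-dimensional prosody vector (e.g., pitch, energy, duration) conditioned on text features follows; the network and conditioning format are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny velocity field v(x_t, t | text condition); sizes are illustrative."""
    def __init__(self, d_prosody=3, d_cond=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_prosody + d_cond + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, d_prosody))

    def forward(self, x_and_cond, t):
        return self.net(torch.cat([x_and_cond, t], dim=-1))

def cfm_loss(velocity_net, prosody, text_cond):
    """Conditional flow matching: regress the velocity that transports
    Gaussian noise x0 to prosody targets x1 along straight paths."""
    x0 = torch.randn_like(prosody)                 # noise endpoint
    t = torch.rand(prosody.size(0), 1)             # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * prosody                # linear interpolant
    target_v = prosody - x0                        # constant path velocity
    pred_v = velocity_net(torch.cat([xt, text_cond], dim=-1), t)
    return (pred_v - target_v).pow(2).mean()

net = VelocityNet()
loss = cfm_loss(net, torch.randn(16, 3), torch.randn(16, 256))
loss.backward()
```

At inference, prosody would be sampled by integrating the learned velocity field from Gaussian noise; scaling the initial noise provides the sampling-temperature control mentioned in the abstract.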

[AI-95] Discovering the underlying analytic structure within Standard Model constants using artificial intelligence

【Quick Read】: This paper addresses the problem of identifying a latent analytic structure among the fundamental parameters of the Standard Model (SM), aiming to discover simple analytic relationships between these constants. The key to the solution is the use of symbolic regression and genetic programming: by searching roughly a thousand expressions with relative precision better than 1%, the authors look for mathematical relationships that may reveal hidden patterns among the SM constants.

Link: https://arxiv.org/abs/2507.00225
Authors: S. V. Chekanov, H. Kjellerstrand
Institution: Unknown
Subjects: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 42 pages, 10 tables

Click to view abstract

Abstract:This paper presents a search for underlying analytic structures among the fundamental parameters of the Standard Model (SM) using symbolic regression and genetic programming. We identify the simplest analytic relationships connecting pairs of these constants and report several notable observations based on about a thousand expressions with relative precision better than 1%. These results may serve as valuable inputs for model builders and artificial intelligence methods aimed at uncovering hidden patterns among the SM constants, or potentially used as building blocks for a deeper underlying law that connects all parameters of the SM through a small set of fundamental constants.
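A toy version of the expression search: randomly compose unary operations on one constant and keep any expression that matches a target to better than 1% relative precision. The input constant and target below are illustrative placeholders (the target is chosen so that sqrt(x) is a solution); the paper's actual search uses genetic programming over far richer expression trees.

```python
import random, math

OPS = [("sqrt", math.sqrt), ("log", math.log), ("square", lambda v: v * v),
       ("inv", lambda v: 1.0 / v), ("cube", lambda v: v ** 3)]

def random_expression(max_depth=3):
    """Compose a random chain of unary operations applied to x."""
    chain = [random.choice(OPS) for _ in range(random.randint(1, max_depth))]
    name = "x"
    for op_name, _ in chain:
        name = f"{op_name}({name})"
    def f(v):
        for _, op in chain:
            v = op(v)
        return v
    return name, f

x = 206.7682830        # illustrative input (the muon/electron mass ratio)
target = 14.3794       # synthetic target chosen so that sqrt(x) matches it
for _ in range(10_000):
    name, f = random_expression()
    try:
        y = f(x)
    except (ValueError, ZeroDivisionError, OverflowError):
        continue       # skip expressions outside their domain
    if abs(y - target) / abs(target) < 1e-2:    # relative precision better than 1%
        print(f"{name} = {y:.6f}")
        break
```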

[AI-96] How large language models judge and influence human cooperation

【Quick Read】: This paper investigates the long-term impact of large language models (LLMs) on social decision-making, in particular their potential influence on human cooperation. The key to the study is to assess how state-of-the-art LLMs judge cooperative behavior and, using an evolutionary game-theoretical model, to analyze the long-term effect on prosocial behavior in populations where these LLM-driven judgements prevail, thereby revealing the role of LLMs in social dynamics and how they may alter human cooperation.

Link: https://arxiv.org/abs/2507.00088
Authors: Alexandre S. Pires, Laurens Samson, Sennay Ghebreab, Fernando P. Santos
Institution: Unknown
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Humans increasingly rely on large language models (LLMs) to support decisions in social settings. Previous work suggests that such tools shape people’s moral and political judgements. However, the long-term implications of LLM-based social decision-making remain unknown. How will human cooperation be affected when the assessment of social interactions relies on language models? This is a pressing question, as human cooperation is often driven by indirect reciprocity, reputations, and the capacity to judge interactions of others. Here, we assess how state-of-the-art LLMs judge cooperative actions. We provide 21 different LLMs with an extensive set of examples where individuals cooperate, or refuse to cooperate, in a range of social contexts, and ask how these interactions should be judged. Furthermore, through an evolutionary game-theoretical model, we evaluate cooperation dynamics in populations where the extracted LLM-driven judgements prevail, assessing the long-term impact of LLMs on human prosociality. We observe a remarkable agreement in evaluating cooperation against good opponents. On the other hand, we notice within- and between-model variance when judging cooperation with ill-reputed individuals. We show that the differences revealed between models can significantly impact the prevalence of cooperation. Finally, we test prompts to steer LLM norms, showing that such interventions can shape LLM judgements, particularly through goal-oriented prompts. Our research connects LLM-based advice with long-term social dynamics, and highlights the need to carefully align LLM norms in order to preserve human cooperation.
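To make the game-theoretic component concrete, here is a minimal indirect-reciprocity simulation of the donation game, in which a norm table (as might be distilled from an LLM's judgements) assigns reputations after each interaction. The norm, parameters, and the omission of selection dynamics are all simplifications, not the paper's model.

```python
import random

N, ROUNDS, b, c = 200, 50_000, 2.0, 1.0
# Judgement rule ("norm"): new donor reputation given (action, recipient reputation).
# 1 = good, 0 = bad; this example encodes "stern judging", one rule an
# LLM-elicited norm could resemble.
norm = {("C", 1): 1, ("D", 1): 0, ("C", 0): 0, ("D", 0): 1}
reputation = [1] * N          # everyone starts with a good reputation
payoff = [0.0] * N            # would drive selection in a full evolutionary model
coop = 0
for _ in range(ROUNDS):
    donor, recipient = random.sample(range(N), 2)
    action = "C" if reputation[recipient] == 1 else "D"   # discriminator strategy
    if action == "C":
        payoff[donor] -= c
        payoff[recipient] += b
        coop += 1
    reputation[donor] = norm[(action, reputation[recipient])]
print(f"cooperation rate: {coop / ROUNDS:.3f}")
```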

Machine Learning

[LG-0] ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Link: https://arxiv.org/abs/2507.01004
Authors: Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
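The central object in the abstract is the per-rank initial operator state. The single-device reference below shows the chunked linear-attention recurrence whose carried state S is precisely what All-Scan would deliver to each sequence-parallel rank; feature maps and normalization are omitted for brevity.

```python
import torch

def chunked_linear_attention(q, k, v, chunk=256):
    """Single-device reference: process the sequence in chunks, carrying a
    (d_k, d_v) state S. Under sequence parallelism each rank owns some chunks
    and only needs the incoming S, the quantity a collective like All-Scan
    must distribute across devices."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    out = torch.empty(T, d_v)
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        qc, kc, vc = q[s:e], k[s:e], v[s:e]
        attn = (qc @ kc.T).tril()            # causal mask within the chunk
        out[s:e] = attn @ vc + qc @ S        # intra-chunk term + carried state
        S = S + kc.T @ vc                    # state handed to the next chunk/rank
    return out

out = chunked_linear_attention(torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64))
```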

[LG-1] Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning

Link: https://arxiv.org/abs/2507.00965
Authors: Félix Lefebvre, Gaël Varoquaux
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Many machine learning tasks can benefit from external knowledge. Large knowledge graphs store such knowledge, and embedding methods can be used to distill it into ready-to-use vector representations for downstream applications. For this purpose, current models have however two limitations: they are primarily optimized for link prediction, via local contrastive learning, and they struggle to scale to the largest graphs due to GPU memory limits. To address these, we introduce SEPAL: a Scalable Embedding Propagation ALgorithm for large knowledge graphs designed to produce high-quality embeddings for downstream tasks at scale. The key idea of SEPAL is to enforce global embedding alignment by optimizing embeddings only on a small core of entities, and then propagating them to the rest of the graph via message passing. We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, enabling fitting huge knowledge graphs on commodity hardware.
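A stripped-down sketch of the core-then-propagate idea: embeddings are optimized only on a small core and then spread to the remaining nodes by averaging already-embedded neighbors. This ignores relation types and the specifics of SEPAL's message passing, so treat it as an illustration of the principle only.

```python
import numpy as np
import networkx as nx

def propagate_embeddings(graph: nx.Graph, core_emb: dict, n_rounds: int = 3) -> dict:
    """Spread embeddings from a trained core to the full graph by iteratively
    averaging already-embedded neighbors (a simplified message-passing pass)."""
    emb = dict(core_emb)                           # core entities: fixed, pre-trained
    for _ in range(n_rounds):
        for node in graph.nodes:
            if node in core_emb:
                continue                           # keep core embeddings frozen
            msgs = [emb[nb] for nb in graph.neighbors(node) if nb in emb]
            if msgs:
                emb[node] = np.mean(msgs, axis=0)
    return emb

g = nx.karate_club_graph()
core = {n: np.random.randn(64) for n in list(g.nodes)[:5]}   # pretend these were trained
full = propagate_embeddings(g, core)
print(len(full), "of", g.number_of_nodes(), "nodes embedded")
```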

[LG-2] Benchmarking the Discovery Engine

Link: https://arxiv.org/abs/2507.00964
Authors: Jack Foxabbott, Arush Tagade, Andrew Cusick, Robbie McCorkell, Leo McKee-Reid, Jugal Patel, Jamie Rumbelow, Jessica Rumbelow, Zohreh Shams
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 8 figures, benchmarks Discovery Engine on five scientific datasets (medicine, materials science, climate, air quality, social science)

Click to view abstract

Abstract:The Discovery Engine is a general purpose automated system for scientific discovery, which combines machine learning with state-of-the-art ML interpretability to enable rapid and robust scientific insight across diverse datasets. In this paper, we benchmark the Discovery Engine against five recent peer-reviewed scientific publications applying machine learning across medicine, materials science, social science, and environmental science. In each case, the Discovery Engine matches or exceeds prior predictive performance while also generating deeper, more actionable insights through rich interpretability artefacts. These results demonstrate its potential as a new standard for automated, interpretable scientific modelling that enables complex knowledge discovery from data.

[LG-3] Time Series Foundation Models are Flow Predictors

Link: https://arxiv.org/abs/2507.00945
Authors: Massimiliano Luca, Ciro Beneduce, Bruno Lepri
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: arXiv admin note: text overlap with arXiv:2203.07372

Click to view abstract

Abstract:We investigate the effectiveness of time series foundation models (TSFMs) for crowd flow prediction, focusing on Moirai and TimesFM. Evaluated on three real-world mobility datasets (Bike NYC, Taxi Beijing, and Spanish national OD flows), these models are deployed in a strict zero-shot setting, using only the temporal evolution of each OD flow and no explicit spatial information. Moirai and TimesFM outperform both statistical and deep learning baselines, achieving up to 33% lower RMSE, 39% lower MAE and up to 49% higher CPC compared to state-of-the-art competitors. Our results highlight the practical value of TSFMs for accurate, scalable flow prediction, even in scenarios with limited annotated data or missing spatial context.

[LG-4] Understanding Generalization in Node and Link Prediction

Link: https://arxiv.org/abs/2507.00927
Authors: Antonis Vasileiou, Timo Stoll, Christopher Morris
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2412.07106

Click to view abstract

Abstract:Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs’ generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d. assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs’ generalization capabilities in these tasks.

[LG-5] HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction

Link: https://arxiv.org/abs/2507.00926
Authors: Liliang Ye (1), Yunyao Zhang (1), Yafeng Wu (1), Yi-Ping Phoebe Chen (2), Junqing Yu (1), Wei Yang (1), Zikai Song (1) ((1) Huazhong University of Science and Technology, Wuhan, China, (2) La Trobe University, Melbourne, Australia)
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. However, predicting post popularity remains challenging due to the complex interplay between visual, textual, temporal, and user behavioral factors. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction. Our approach employs a three-tier fusion architecture that progressively integrates features across abstraction levels: visual representations from CLIP encoders, textual embeddings from transformer models, and temporal-spatial metadata with user characteristics. The framework implements a hierarchical ensemble strategy combining CatBoost, TabNet, and custom multi-layer perceptrons. To address limited labeled data, we propose a two-stage training methodology with pseudo-labeling and iterative refinement. We introduce novel cross-modal similarity measures and hierarchical clustering features that capture inter-modal dependencies. Experimental results demonstrate that HyperFusion achieves competitive performance on the SMP challenge dataset. Our team achieved third place in the SMP Challenge 2025 (Image Track). The source code is available at this https URL.

[LG-6] Privacy-Preserving Quantized Federated Learning with Diverse Precision

Link: https://arxiv.org/abs/2507.00920
Authors: Dang Qua Nguyen, Morteza Hashemi, Erik Perrins, Sergiy A. Vorobyov, David J. Love, Taejoon Kim
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Federated learning (FL) has emerged as a promising paradigm for distributed machine learning, enabling collaborative training of a global model across multiple local devices without requiring them to share raw data. Despite its advancements, FL is limited by factors such as: (i) privacy risks arising from the unprotected transmission of local model updates to the fusion center (FC) and (ii) decreased learning utility caused by heterogeneity in model quantization resolution across participating devices. Prior work typically addresses only one of these challenges because maintaining learning utility under both privacy risks and quantization heterogeneity is a non-trivial task. In this paper, our aim is therefore to improve the learning utility of a privacy-preserving FL that allows clusters of devices with different quantization resolutions to participate in each FL round. Specifically, we introduce a novel stochastic quantizer (SQ) that is designed to simultaneously achieve differential privacy (DP) and minimum quantization error. Notably, the proposed SQ guarantees bounded distortion, unlike other DP approaches. To address quantization heterogeneity, we introduce a cluster size optimization technique combined with a linear fusion approach to enhance model aggregation accuracy. Numerical simulations validate the benefits of our approach in terms of privacy protection and learning utility compared to the conventional LaplaceSQ-FL algorithm.
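The building block here is a stochastic quantizer that is unbiased with bounded distortion. A standard randomized-rounding construction with both properties is sketched below; the paper's DP noise calibration and cluster-size optimization are not reproduced.

```python
import torch

def stochastic_quantize(x: torch.Tensor, bits: int, x_max: float = 1.0) -> torch.Tensor:
    """Unbiased randomized rounding onto a uniform grid in [-x_max, x_max].
    E[q] = x and |q - x| is at most one grid step, so distortion is bounded."""
    levels = 2 ** bits - 1
    step = 2 * x_max / levels
    x = x.clamp(-x_max, x_max)
    scaled = (x + x_max) / step                 # continuous position on the grid
    low = scaled.floor()
    prob_up = scaled - low                      # round up with this probability
    q = low + (torch.rand_like(x) < prob_up).float()
    return q * step - x_max

w = torch.randn(5).clamp(-1, 1)
print(w)
print(stochastic_quantize(w, bits=4))           # 16-level quantized model update
```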

[LG-7] TABASCO: A Fast Simplified Model for Molecular Generation with Improved Physical Quality

Link: https://arxiv.org/abs/2507.00899
Authors: Carlos Vonessen, Charles Harris, Miruna Cretu, Pietro Liò
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:State-of-the-art models for 3D molecular generation are based on significant inductive biases, SE(3), permutation equivariance to respect symmetry and graph message-passing networks to capture local chemistry, yet the generated molecules still struggle with physical plausibility. We introduce TABASCO which relaxes these assumptions: The model has a standard non-equivariant transformer architecture, treats atoms in a molecule as sequences and reconstructs bonds deterministically after generation. The absence of equivariant layers and message passing allows us to significantly simplify the model architecture and scale data throughput. On the GEOM-Drugs benchmark TABASCO achieves state-of-the-art PoseBusters validity and delivers inference roughly 10x faster than the strongest baseline, while exhibiting emergent rotational equivariance despite symmetry not being hard-coded. Our work offers a blueprint for training minimalist, high-throughput generative models suited to specialised tasks such as structure- and pharmacophore-based drug design. We provide a link to our implementation at this http URL.

[LG-8] Machine Learning-based Early Detection of Potato Sprouting Using Electrophysiological Signals

Link: https://arxiv.org/abs/2507.00862
Authors: Davide Andreoletti, Aris Marcolongo, Natasa Sarafijanovic Djukic, Julien Roulet, Stefano Billeter, Andrzej Kurenda, Margot Visse-Mansiaux, Brice Dupuis, Carrol Annette Plummer, Beatrice Paoli, Omran Ayoub
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 7 figures

Click to view abstract

Abstract:Accurately predicting potato sprouting before the emergence of any visual signs is critical for effective storage management, as sprouting degrades both the commercial and nutritional value of tubers. Effective forecasting allows for the precise application of anti-sprouting chemicals (ASCs), minimizing waste and reducing costs. This need has become even more pressing following the ban on Isopropyl N-(3-chlorophenyl) carbamate (CIPC) or Chlorpropham due to health and environmental concerns, which has led to the adoption of significantly more expensive alternative ASCs. Existing approaches primarily rely on visual identification, which only detects sprouting after morphological changes have occurred, limiting their effectiveness for proactive management. A reliable early prediction method is therefore essential to enable timely intervention and improve the efficiency of post-harvest storage strategies, where early refers to detecting sprouting before any visible signs appear. In this work, we address the problem of early prediction of potato sprouting. To this end, we propose a novel machine learning (ML)-based approach that enables early prediction of potato sprouting using electrophysiological signals recorded from tubers using proprietary sensors. Our approach preprocesses the recorded signals, extracts relevant features from the wavelet domain, and trains supervised ML models for early sprouting detection. Additionally, we incorporate uncertainty quantification techniques to enhance predictions. Experimental results demonstrate promising performance in the early detection of potato sprouting by accurately predicting the exact day of sprouting for a subset of potatoes while showing acceptable average error across all potatoes. Despite promising results, further refinements are necessary to minimize prediction errors, particularly in reducing the maximum observed deviations.
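A hedged sketch of the feature step: wavelet-domain summary statistics extracted from a raw electrophysiological signal and fed to a supervised classifier. The wavelet family, statistics, and classifier choice are assumptions, and the data below is synthetic.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_features(signal: np.ndarray, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    """Per-band summary statistics of a discrete wavelet decomposition."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for band in coeffs:                    # approximation + detail bands
        feats += [np.mean(band), np.std(band), np.sum(band ** 2)]  # incl. band energy
    return np.array(feats)

# Synthetic stand-in: rows = recordings, labels = sprouted within k days (hypothetical)
X = np.stack([wavelet_features(np.random.randn(4096)) for _ in range(100)])
y = np.random.randint(0, 2, size=100)
clf = RandomForestClassifier(n_estimators=200).fit(X, y)
print(clf.predict(X[:5]))
```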

[LG-9] Aligning Learning and Endogenous Decision-Making

Link: https://arxiv.org/abs/2507.00851
Authors: Rares Cristian, Pavithra Harsha, Georgia Perakis, Brian Quanz
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Many of the observations we make are biased by our decisions. For instance, the demand of items is impacted by the prices set, and online checkout choices are influenced by the assortments presented. The challenge in decision-making under this setting is the lack of counterfactual information, and the need to learn it instead. We introduce an end-to-end method under endogenous uncertainty to train ML models to be aware of their downstream use, enabling their effective deployment in the decision-making stage. We further introduce a robust optimization variant that accounts for uncertainty in ML models, specifically by constructing uncertainty sets over the space of ML models and optimizing actions to protect against worst-case predictions. We prove guarantees that this robust approach can capture near-optimal decisions with high probability as a function of data. Besides this, we also introduce a new class of two-stage stochastic optimization problems to the end-to-end learning framework that can now be addressed through our framework. Here, the first stage is an information-gathering problem to decide which random variable to poll and gain information about before making a second-stage decision based on it. We present several computational experiments for pricing and inventory assortment/recommendation problems. We compare against existing methods in online learning/bandits/offline reinforcement learning and show that our approach consistently outperforms them. Just as in the endogenous setting, the model’s prediction also depends on the first-stage decision made. While this decision does not affect the random variable in this setting, it does affect the correct point forecast that should be made.

[LG-10] Quantum Approximate Optimization Algorithm for Spatiotemporal Forecasting of HIV Clusters WWW

Link: https://arxiv.org/abs/2507.00848
Authors: Don Roosan, Saif Nirzhor, Rubayat Khan, Fahmida Hai, Mohammad Rifat Haidar
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments: Conference details can be found here: this https URL

Click to view abstract

Abstract:HIV epidemiological data is increasingly complex, requiring advanced computation for accurate cluster detection and forecasting. We employed quantum-accelerated machine learning to analyze HIV prevalence at the ZIP-code level using AIDSVu and synthetic SDoH data for 2022. Our approach compared classical clustering (DBSCAN, HDBSCAN) with a quantum approximate optimization algorithm (QAOA), developed a hybrid quantum-classical neural network for HIV prevalence forecasting, and used quantum Bayesian networks to explore causal links between SDoH factors and HIV incidence. The QAOA-based method achieved 92% accuracy in cluster detection within 1.6 seconds, outperforming classical algorithms. Meanwhile, the hybrid quantum-classical neural network predicted HIV prevalence with 94% accuracy, surpassing a purely classical counterpart. Quantum Bayesian analysis identified housing instability as a key driver of HIV cluster emergence and expansion, with stigma exerting a geographically variable influence. These quantum-enhanced methods deliver greater precision and efficiency in HIV surveillance while illuminating critical causal pathways. This work can guide targeted interventions, optimize resource allocation for PrEP, and address structural inequities fueling HIV transmission.

[LG-11] BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation NEURIPS2025

Link: https://arxiv.org/abs/2507.00846
Authors: Rishal Aggrwal, Jacky Chen, Nicholas M. Boffi, David Ryan Koes
Subjects: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: 19 pages, 25 figures, submitted to NeurIPS 2025

Click to view abstract

Abstract:Efficient sampling from the Boltzmann distribution defined by an energy function is a key challenge in modeling physical systems such as molecules. Boltzmann Generators tackle this by leveraging Continuous Normalizing Flows that transform a simple prior into a distribution that can be reweighted to match the Boltzmann distribution using sample likelihoods. However, obtaining likelihoods requires computing costly Jacobians during integration, making it impractical for large molecular systems. To overcome this, we propose learning the likelihood of the generated distribution via an energy-based model trained with noise contrastive estimation and score matching. By using stochastic interpolants to anneal between the prior and generated distributions, we combine both the objective functions to efficiently learn the density function. On the alanine dipeptide system, we demonstrate that our method yields free energy profiles and energy distributions comparable to those obtained with exact likelihoods. Additionally, we show that free energy differences between metastable states can be estimated accurately with orders-of-magnitude speedup.

[LG-12] Leveraging Genetic Algorithms for Efficient Demonstration Generation in Real-World Reinforcement Learning Environments

Link: https://arxiv.org/abs/2507.00762
Authors: Tom Maus, Asma Atamna, Tobias Glasmachers
Subjects: Machine Learning (cs.LG)
Comments: This article has been submitted to and accepted for presentation at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD 2025). After publication, it will appear in the official LOD 2025 proceedings

Click to view abstract

Abstract:Reinforcement Learning (RL) has demonstrated significant potential in certain real-world industrial applications, yet its broader deployment remains limited by inherent challenges such as sample inefficiency and unstable learning dynamics. This study investigates the utilization of Genetic Algorithms (GAs) as a mechanism for improving RL performance in an industrially inspired sorting environment. We propose a novel approach in which GA-generated expert demonstrations are used to enhance policy learning. These demonstrations are incorporated into a Deep Q-Network (DQN) replay buffer for experience-based learning and utilized as warm-start trajectories for Proximal Policy Optimization (PPO) agents to accelerate training convergence. Our experiments compare standard RL training with rule-based heuristics, brute-force optimization, and demonstration data, revealing that GA-derived demonstrations significantly improve RL performance. Notably, PPO agents initialized with GA-generated data achieved superior cumulative rewards, highlighting the potential of hybrid learning paradigms, where heuristic search methods complement data-driven RL. The utilized framework is publicly available and enables further research into adaptive RL strategies for real-world applications.
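To illustrate one of the two uses of GA demonstrations, the sketch below evolves fixed-length action sequences on a gym-style environment (`env` is assumed to exist) and pushes the best individual's transitions into a DQN-style replay buffer. The GA operators are deliberately minimal, and the fixed reset seed makes episodes deterministic for scoring.

```python
import random
from collections import deque

def evaluate(env, actions):
    """Roll out a fixed action sequence; return (total_reward, transitions)."""
    obs, _ = env.reset(seed=0)                 # deterministic episode for GA scoring
    total, transitions = 0.0, []
    for a in actions:
        nxt, r, term, trunc, _ = env.step(a)
        transitions.append((obs, a, r, nxt, term or trunc))
        total, obs = total + r, nxt
        if term or trunc:
            break
    return total, transitions

def ga_demonstrations(env, n_actions, horizon=50, pop=30, gens=40):
    """Tiny GA over action sequences; returns transitions of the best individual."""
    population = [[random.randrange(n_actions) for _ in range(horizon)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda ind: evaluate(env, ind)[0], reverse=True)
        parents = scored[: pop // 2]           # truncation selection
        children = []
        for p in parents:
            child = p[:]
            child[random.randrange(horizon)] = random.randrange(n_actions)  # point mutation
            children.append(child)
        population = parents + children        # parents stay sorted best-first
    return evaluate(env, population[0])[1]

replay_buffer = deque(maxlen=100_000)
# replay_buffer.extend(ga_demonstrations(env, n_actions=4))  # seed before DQN training
```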

[LG-13] A Probabilistic Approach to Wildfire Spread Prediction Using a Denoising Diffusion Surrogate Model

Link: https://arxiv.org/abs/2507.00761
Authors: Wenbo Yu, Anirbit Ghosh, Tobias Sebastian Finn, Rossella Arcucci, Marc Bocquet, Sibo Cheng
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Thanks to recent advances in generative AI, computers can now simulate realistic and complex natural processes. We apply this capability to predict how wildfires spread, a task made difficult by the unpredictable nature of fire and the variety of environmental conditions it depends on. In this study, we present the first denoising diffusion model for predicting wildfire spread, a new kind of AI framework that learns to simulate fires not just as one fixed outcome, but as a range of possible scenarios. By doing so, it accounts for the inherent uncertainty of wildfire dynamics, a feature that traditional models typically fail to represent. Unlike deterministic approaches that generate a single prediction, our model produces ensembles of forecasts that reflect physically meaningful distributions of where fire might go next. This technology could help us develop smarter, faster, and more reliable tools for anticipating wildfire behavior, aiding decision-makers in fire risk assessment and response planning.

[LG-14] Evaluating LLM s and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports

Link: https://arxiv.org/abs/2507.00742
Authors: Carlos Caminha, Maria de Lourdes M. Silva, Iago C. Chaves, Felipe T. Brito, Victor A. E. Farias, Javam C. Machado
Subjects: Machine Learning (cs.LG)
Comments: To be published in the Proceedings of the Brazilian Integrated Software and Hardware Seminar 2025 (SEMISH 2025)

Click to view abstract

Abstract:Computer manufacturers offer platforms for users to describe device faults using textual reports such as “My screen is flickering”. Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve an F1-score of up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it, that offer competitive performance with lower VRAM usage, enabling efficient inference on end-user devices such as modern laptops or smartphones with NPUs.

[LG-15] Ordinality in Discrete-level Question Difficulty Estimation: Introducing Balanced DRPS and OrderedLogitNN

Link: https://arxiv.org/abs/2507.00736
Authors: Arthur Thuy, Ekaterina Loginova, Dries F. Benoit
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published in the EvalLAC’25 workshop at AIED 2025

Click to view abstract

Abstract:Recent years have seen growing interest in Question Difficulty Estimation (QDE) using natural language processing techniques. Question difficulty is often represented using discrete levels, framing the task as ordinal regression due to the inherent ordering from easiest to hardest. However, the literature has neglected the ordinal nature of the task, relying on classification or discretized regression models, with specialized ordinal regression methods remaining unexplored. Furthermore, evaluation metrics are tightly coupled to the modeling paradigm, hindering cross-study comparability. While some metrics fail to account for the ordinal structure of difficulty levels, none adequately address class imbalance, resulting in biased performance assessments. This study addresses these limitations by benchmarking three types of model outputs – discretized regression, classification, and ordinal regression – using the balanced Discrete Ranked Probability Score (DRPS), a novel metric that jointly captures ordinality and class imbalance. In addition to using popular ordinal regression methods, we propose OrderedLogitNN, extending the ordered logit model from econometrics to neural networks. We fine-tune BERT on the RACE++ and ARC datasets and find that OrderedLogitNN performs considerably better on complex tasks. The balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE, providing a principled foundation for future research.
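The discrete ranked probability score itself is easy to state: the squared distance between the predicted and target CDFs over the ordered difficulty levels. Below, balancing is implemented as a macro-average over true classes, which is one plausible reading of the class-imbalance correction rather than necessarily the paper's exact formula.

```python
import numpy as np

def drps(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Discrete RPS per sample: squared distance between the predicted CDF
    and the one-hot target CDF over ordered difficulty levels."""
    n, k = probs.shape
    pred_cdf = np.cumsum(probs, axis=1)
    true_cdf = (np.arange(k)[None, :] >= labels[:, None]).astype(float)
    return ((pred_cdf - true_cdf) ** 2).sum(axis=1)

def balanced_drps(probs: np.ndarray, labels: np.ndarray) -> float:
    """Macro-average: mean DRPS per true class, averaged over classes, so
    rare difficulty levels weigh equally (one plausible balancing scheme)."""
    scores = drps(probs, labels)
    return float(np.mean([scores[labels == c].mean() for c in np.unique(labels)]))

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])   # easy/medium/hard predictions
labels = np.array([0, 2])
print(balanced_drps(probs, labels))
```

Because the score compares CDFs, placing probability mass on a level adjacent to the truth is penalized less than mass on a distant level, which is exactly the ordinal behavior the abstract argues plain accuracy-style metrics miss.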

[LG-16] Aleatoric and Epistemic Uncertainty Measures for Ordinal Classification through Binary Reduction

Link: https://arxiv.org/abs/2507.00733
Authors: Stefan Haas, Eyke Hüllermeier
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Ordinal classification problems, where labels exhibit a natural order, are prevalent in high-stakes fields such as medicine and finance. Accurate uncertainty quantification, including the decomposition into aleatoric (inherent variability) and epistemic (lack of knowledge) components, is crucial for reliable decision-making. However, existing research has primarily focused on nominal classification and regression. In this paper, we introduce a novel class of measures of aleatoric and epistemic uncertainty in ordinal classification, which is based on a suitable reduction to (entropy- and variance-based) measures for the binary case. These measures effectively capture the trade-off in ordinal classification between exact hit-rate and minimal error distances. We demonstrate the effectiveness of our approach on various tabular ordinal benchmark datasets using ensembles of gradient-boosted trees and multi-layer perceptrons for approximate Bayesian inference. Our method significantly outperforms standard and label-wise entropy and variance-based measures in error detection, as indicated by misclassification rates and mean absolute error. Additionally, the ordinal measures show competitive performance in out-of-distribution (OOD) detection. Our findings highlight the importance of considering the ordinal nature of classification problems when assessing uncertainty.

[LG-17] Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories

Link: https://arxiv.org/abs/2507.00711
Authors: Jhouben Cuesta-Ramirez, Samuel Beaussant, Mehdi Mounsif
Subjects: Machine Learning (cs.LG)
Comments: Accepted to KONVENS 2025

Click to view abstract

Abstract:Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs), calling into question whether benchmark gains reflect real reasoning improvements. We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps that often lead to incorrect conclusions. Experiments on three state-of-the-art models using the AIME2024 math benchmark reveal critical limitations in these models’ ability to integrate corrective information, posing new challenges for achieving robust and interpretable reasoning.

[LG-18] SCAWaveNet: A Spatial-Channel Attention-based Network for Global Significant Wave Height Retrieval

Link: https://arxiv.org/abs/2507.00701
Authors: Chong Zhang, Xichao Liu, Yibing Zhan, Dapeng Tao, Jun Ni
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 6 tables, 11 figures

Click to view abstract

Abstract:Recent advancements in spaceborne GNSS missions have produced extensive global datasets, providing a robust basis for deep learning-based significant wave height (SWH) retrieval. While existing deep learning models predominantly utilize CYGNSS data with four-channel information, they often adopt single-channel inputs or simple channel concatenation without leveraging the benefits of cross-channel information interaction during training. To address this limitation, a novel spatial-channel attention-based network, namely SCAWaveNet, is proposed for SWH retrieval. Specifically, features from each channel of the DDMs are modeled as independent attention heads, enabling the fusion of spatial and channel-wise information. For auxiliary parameters, a lightweight attention mechanism is designed to assign weights along the spatial and channel dimensions. The final feature integrates both spatial and channel-level characteristics. Model performance is evaluated using four-channel CYGNSS data. When ERA5 is used as a reference, SCAWaveNet achieves an average RMSE of 0.438 m. When using buoy data from NDBC, the average RMSE reaches 0.432 m. Compared to state-of-the-art models, SCAWaveNet reduces the average RMSE by at least 3.52% on the ERA5 dataset and by 5.47% on the NDBC buoy observations. The code is available at this https URL.

[LG-19] A Test-Function Approach to Incremental Stability

Link: https://arxiv.org/abs/2507.00695
Authors: Daniel Pfrommer, Max Simchowitz, Ali Jadbabaie
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages

Click to view abstract

Abstract:This paper presents a novel framework for analyzing incremental input-to-state stability (δISS) based on the idea of using rewards as “test functions.” Whereas control theory traditionally deals with Lyapunov functions that satisfy a time-decrease condition, reinforcement learning (RL) value functions are constructed by exponentially decaying a Lipschitz reward function that may be non-smooth and unbounded on both sides. Thus, these RL-style value functions cannot be directly understood as Lyapunov certificates. We develop a new equivalence between a variant of incremental input-to-state stability of a closed-loop system under a given policy, and the regularity of RL-style value functions under adversarial selection of a Hölder-continuous reward function. This result highlights that the regularity of value functions, and their connection to incremental stability, can be understood in a way that is distinct from the traditional Lyapunov-based approach to certifying stability in control theory.

[LG-20] Neural Augmented Kalman Filters for Road Network assisted GNSS positioning ICML2025

Link: https://arxiv.org/abs/2507.00654
Authors: Hans van Gorp, Davide Belli, Amir Jalalirad, Bence Major
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: Accepted to ICML 2025 workshop ML4Wireless

Click to view abstract

Abstract:The Global Navigation Satellite System (GNSS) provides critical positioning information globally, but its accuracy in dense urban environments is often compromised by multipath and non-line-of-sight errors. Road network data can be used to reduce the impact of these errors and enhance the accuracy of a positioning system. Previous works employing road network data are either limited to offline applications, or rely on Kalman Filter (KF) heuristics with little flexibility and robustness. We instead propose training a Temporal Graph Neural Network (TGNN) to integrate road network information into a KF. The TGNN is designed to predict the correct road segment and its associated uncertainty to be used in the measurement update step of the KF. We validate our approach with real-world GNSS data and open-source road networks, observing a 29% decrease in positioning error for challenging scenarios compared to a GNSS-only KF. To the best of our knowledge, ours is the first deep learning-based approach jointly employing road network data and GNSS measurements to determine the user position on Earth.
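A sketch of how a learned road-segment prediction can enter the filter: it is treated as an extra position measurement whose covariance comes from the network's predicted uncertainty. The state layout and all numbers are illustrative, not from the paper.

```python
import numpy as np

def kf_update(x, P, z, R, H):
    """Standard Kalman measurement update; z may be a GNSS fix or the road-
    segment position predicted by a TGNN, with R from its uncertainty."""
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# State: [east, north, v_east, v_north]; measurements observe position only.
H = np.hstack([np.eye(2), np.zeros((2, 2))])
x, P = np.zeros(4), np.eye(4) * 10.0
gnss_fix, R_gnss = np.array([5.0, -3.0]), np.eye(2) * 4.0
road_point, sigma = np.array([5.5, -2.5]), 1.5     # network output + predicted std

x, P = kf_update(x, P, gnss_fix, R_gnss, H)        # noisy urban GNSS measurement
x, P = kf_update(x, P, road_point, np.eye(2) * sigma**2, H)  # road-network correction
print(x[:2])
```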

[LG-21] Cooperative Sheaf Neural Networks

Link: https://arxiv.org/abs/2507.00647
Authors: André Ribeiro, Ana Luiza Tenório, Juan Belieni, Amauri H. Souza, Diego Mesquita
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Sheaf diffusion has recently emerged as a promising design pattern for graph representation learning due to its inherent ability to handle heterophilic data and avoid oversmoothing. Meanwhile, cooperative message passing has also been proposed as a way to enhance the flexibility of information diffusion by allowing nodes to independently choose whether to propagate/gather information from/to neighbors. A natural question ensues: is sheaf diffusion capable of exhibiting this cooperative behavior? Here, we provide a negative answer to this question. In particular, we show that existing sheaf diffusion methods fail to achieve cooperative behavior due to the lack of message directionality. To circumvent this limitation, we introduce the notion of cellular sheaves over directed graphs and characterize their in- and out-degree Laplacians. We leverage our construction to propose Cooperative Sheaf Neural Networks (CSNNs). Theoretically, we characterize the receptive field of CSNN and show it allows nodes to selectively attend (listen) to arbitrarily far nodes while ignoring all others in their path, potentially mitigating oversquashing. Our experiments show that CSNN presents overall better performance compared to prior art on sheaf diffusion as well as cooperative graph neural networks.

[LG-22] A Practical Guide to Interpretable Role-Based Clustering in Multi-Layer Financial Networks

Link: https://arxiv.org/abs/2507.00600
Authors: Christian Franssen, Iman van Lelyveld, Bernd Heidergott
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Understanding the functional roles of financial institutions within interconnected markets is critical for effective supervision, systemic risk assessment, and resolution planning. We propose an interpretable role-based clustering approach for multi-layer financial networks, designed to identify the functional positions of institutions across different market segments. Our method follows a general clustering framework defined by proximity measures, cluster evaluation criteria, and algorithm selection. We construct explainable node embeddings based on egonet features that capture both direct and indirect trading relationships within and across market layers. Using transaction-level data from the ECB’s Money Market Statistical Reporting (MMSR), we demonstrate how the approach uncovers heterogeneous institutional roles such as market intermediaries, cross-segment connectors, and peripheral lenders or borrowers. The results highlight the flexibility and practical value of role-based clustering in analyzing financial networks and understanding institutional behavior in complex market structures.

[LG-23] Foundation Models for Clinical Records at Health System Scale ICML2025

Link: https://arxiv.org/abs/2507.00574
Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Kyunghyun Cho, Cem M. Deniz, Narges Razavian
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICML 2025 Workshop on Foundation Models for Structured Data

Click to view abstract

Abstract:Large-scale pretraining has transformed modeling of language and other data types, but its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present a novel generative pretraining strategy for sequential EHR data using next-visit event prediction. Our model learns to autoregressively generate various tokenized clinical events for the next visit based on patient history and inherently handles the joint prediction of heterogeneous data types. Additionally, we introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Our model is evaluated via zero-shot prediction for forecasting dementia and knee osteoarthritis incidence within 2 and 5 years, and the model performance rivals a fully fine-tuned masked pretrained Transformer baseline, demonstrating that our approach captures complex clinical dependencies without requiring costly task-specific fine-tuning.

[LG-24] Exploring Large Action Sets with Hyperspherical Embeddings using von Mises-Fisher Sampling ICML2025

Link: https://arxiv.org/abs/2507.00518
Authors: Walid Bendada, Guillaume Salha-Galvan, Romain Hennequin, Théo Bontempelli, Thomas Bouabça, Tristan Cazenave
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 42nd International Conference on Machine Learning (ICML 2025)

Click to view abstract

Abstract:This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent these actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation’s nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and the successful large-scale deployment of vMF-exp on the recommender system of a global music streaming service empirically validate the key properties of the proposed method.
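The two steps of vMF-exp are straightforward to sketch: draw a probe direction from a von Mises-Fisher distribution centered on the (normalized) state embedding, then explore its nearest action embeddings. The sampler below follows the standard Wood (1994) rejection scheme; the concentration kappa and neighborhood size are illustrative choices, not the paper's settings.

```python
import numpy as np

def sample_vmf(mu: np.ndarray, kappa: float) -> np.ndarray:
    """One draw from a von Mises-Fisher distribution on the unit sphere
    (Wood, 1994 rejection sampler); mu must be unit-norm."""
    d = mu.shape[0]
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0**2)
    while True:
        z = np.random.beta((d - 1) / 2, (d - 1) / 2)
        w = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        if kappa * w + (d - 1) * np.log(1 - x0 * w) - c >= np.log(np.random.rand()):
            break
    v = np.random.randn(d)                 # random direction orthogonal to mu
    v -= v.dot(mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(1 - w**2) * v

def vmf_explore(state_emb, action_embs, kappa=50.0, k=10):
    """vMF-exp sketch: perturb the state embedding with a vMF draw, then
    return the indices of its k nearest actions by inner product."""
    probe = sample_vmf(state_emb / np.linalg.norm(state_emb), kappa)
    return np.argsort(-(action_embs @ probe))[:k]

actions = np.random.randn(10_000, 32)
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
print(vmf_explore(np.random.randn(32), actions))
```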

[LG-25] Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization

Link: https://arxiv.org/abs/2507.00480
Authors: Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 11 figures, 5 tables. Equal contribution by Kiyoung Om, Kyuil Sim, and Taeyoung Yun

Click to view abstract

Abstract:Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recently, generative model-based approaches have emerged as a promising alternative for constrained optimization. However, they suffer from poor scalability and are vulnerable to mode collapse, particularly when the target distribution is highly multi-modal. In this paper, we propose a new framework to overcome these challenges. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations with uncertainty quantification. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. During posterior inference, we find that the posterior distribution is highly multi-modal and has a large plateau due to constraints, especially when constraint feedback is given as binary indicators of feasibility. To mitigate this issue, we amortize the sampling from the posterior distribution in the latent space of flow-based models, which is much smoother than that in the data space. We empirically demonstrate that our method achieves superior performance on various synthetic and real-world constrained black-box optimization tasks. Our code is publicly available at this https URL.

[LG-26] Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling ATC

Link: https://arxiv.org/abs/2507.00453
Authors: Ankit Kashyap
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 9 figures, 1 table; implemented entirely from scratch in PyTorch

Click to view abstract

Abstract:We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies without increasing attention cost quadratically. The memory module persistently stores past token representations using a gated update mechanism inspired by recurrent networks. Rotary positional encoding is applied per attention head to enable directionally disentangled, scale-invariant positional signals. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries, enabling transparent and modular experimentation. Our model offers a lightweight and extensible design for tasks such as dialogue modeling, code completion, and document understanding.
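A minimal single-head sketch of the described block: causal attention within each chunk, with keys and values extended by a gated FIFO memory of past-chunk summaries. The pooling, gating, and slot count are assumptions rather than the paper's exact design, which also applies rotary positional encoding per head.

```python
import torch
import torch.nn.functional as F

def chunked_attention_with_memory(x, chunk=128, mem_slots=16):
    """Single-head sketch: causal attention inside each chunk, with keys and
    values extended by a FIFO of gated mean-pooled summaries of past chunks."""
    T, d = x.shape
    memory = torch.zeros(0, d)                       # FIFO of past-chunk summaries
    gate = torch.nn.Linear(d, d)                     # in a real model: a learned module
    outs = []
    for s in range(0, T, chunk):
        c = x[s:s + chunk]
        k = torch.cat([memory, c])                   # memory tokens + local tokens
        v = k
        scores = c @ k.T / d**0.5
        causal = torch.ones(len(c), len(c)).tril()   # local causal mask
        mask = torch.cat([torch.ones(len(c), len(memory)), causal], dim=1)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        outs.append(F.softmax(scores, dim=-1) @ v)
        pooled = c.mean(0, keepdim=True)
        memory = torch.cat([memory, torch.sigmoid(gate(pooled)) * pooled])
        memory = memory[-mem_slots:]                 # FIFO eviction of oldest summaries
    return torch.cat(outs)

y = chunked_attention_with_memory(torch.randn(512, 64))
print(y.shape)
```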

[LG-27] Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning ICCV2025

Link: https://arxiv.org/abs/2507.00423
Authors: Wenjin Mo, Zhiyuan Li, Minghong Fang, Mingwei Fang
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: To appear in ICCV 2025

Click to view abstract

Abstract:Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL’s distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model’s integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack’s effectiveness, while our defense approach reduces its impact to a degree.

[LG-28] Diffusion Disambiguation Models for Partial Label Learning

Link: https://arxiv.org/abs/2507.00411
Authors: Jinfu Fan, Xiaohui Zhong, Kangrui Ren, Jiangnan Li, Linqing Huang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Learning from ambiguous labels is a long-standing problem in practical machine learning applications. The purpose of partial label learning (PLL) is to identify the ground-truth label from a set of candidate labels associated with a given instance. Inspired by the remarkable performance of diffusion models in various generation tasks, this paper explores their potential to denoise ambiguous labels through the reverse denoising process. Therefore, this paper reformulates the label disambiguation problem from the perspective of generative models, where labels are generated by iteratively refining initial random guesses. This perspective enables the diffusion model to learn how label information is generated stochastically. By modeling the generation uncertainty, we can use the maximum likelihood estimate of the label for classification inference. However, such ambiguous labels lead to a mismatch between instance and label, which reduces the quality of generated data. To address this issue, this paper proposes a diffusion disambiguation model for PLL (DDMP), which first uses the potential complementary information between instances and labels to construct pseudo-clean labels for initial diffusion training. Furthermore, a transition-aware matrix is introduced to estimate the potential ground-truth labels, which are dynamically updated during the diffusion generation. During training, the ground-truth label is progressively refined, improving the classifier. Experiments show the advantage of the DDMP and its suitability for PLL.

[LG-29] HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

链接: https://arxiv.org/abs/2507.00394
作者: Geng Zhang,Shenggan Cheng,Xuanlei Zhao,Ziming Liu,Yang You
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at this https URL.

[LG-30] MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

链接: https://arxiv.org/abs/2507.00390
作者: Geng Zhang,Yuxuan Han,Yuxuan Lou,Wangbo Zhao,Yiqi Zhang,Yang You
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices-unbiased estimations of their original outputs-minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it improves the average zero shot accuracy across nine downstream tasks by up to 2.71 under 25% pruning ratio and 3.61 under 50% pruning. The code is available at this https URL.
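
The two redundancy signals (access frequency and output variance) and the "novice as unbiased estimate" idea translate into a short sketch. This is a guess at the mechanism on a calibration batch, with hypothetical tensor shapes; the paper's actual scoring and replacement rules may differ.

```python
import torch

def redundancy_scores(router_probs, expert_outputs):
    # router_probs: (tokens, E) routing weights on a calibration batch
    # expert_outputs: (E, tokens, d) each expert's output on the same batch
    E = expert_outputs.shape[0]
    freq = router_probs.argmax(-1).bincount(minlength=E).float()
    freq = freq / freq.sum()                          # access frequency
    var = expert_outputs.var(dim=1).mean(-1)          # per-expert output variance
    var = var / var.max().clamp_min(1e-8)
    return freq + var                                 # low score -> prune first

def novice(expert_outputs, k):
    # A "novice" stand-in for pruned expert k: the unbiased constant
    # estimate of its output, i.e. its mean over the calibration batch.
    return expert_outputs[k].mean(dim=0)              # (d,)
```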

[LG-31] Augmented Physics-Based Li-ion Battery Model via Adaptive Ensemble Sparse Learning and Conformal Prediction

链接: https://arxiv.org/abs/2507.00353
作者: Samuel Filgueira da Silva,Mehmet Fatih Ozkan,Faissal El Idrissi,Marcello Canova
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate electrochemical models are essential for the safe and efficient operation of lithium-ion batteries in real-world applications such as electrified vehicles and grid storage. Reduced-order models (ROM) offer a balance between fidelity and computational efficiency but often struggle to capture complex and nonlinear behaviors, such as the dynamics in the cell voltage response under high C-rate conditions. To address these limitations, this study proposes an Adaptive Ensemble Sparse Identification (AESI) framework that enhances the accuracy of reduced-order li-ion battery models by compensating for unpredictable dynamics. The approach integrates an Extended Single Particle Model (ESPM) with an evolutionary ensemble sparse learning strategy to construct a robust hybrid model. In addition, the AESI framework incorporates a conformal prediction method to provide theoretically guaranteed uncertainty quantification for voltage error dynamics, thereby improving the reliability of the model’s predictions. Evaluation across diverse operating conditions shows that the hybrid model (ESPM + AESI) improves the voltage prediction accuracy, achieving mean squared error reductions of up to 46% on unseen data. Prediction reliability is further supported by conformal prediction, yielding statistically valid prediction intervals with coverage ratios of 96.85% and 97.41% for the ensemble models based on bagging and stability selection, respectively.
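
The conformal-prediction layer is the most self-contained piece of the framework. Here is a split-conformal sketch over voltage-error residuals, assuming a held-out calibration set; the paper's exact conformal variant and its coupling to the ensemble are not reproduced.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.05):
    """residuals_cal: |y - y_hat| on a held-out calibration set.
    Returns intervals with marginal coverage >= 1 - alpha."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n       # finite-sample correction
    q = np.quantile(residuals_cal, min(q_level, 1.0), method="higher")
    return y_pred_test - q, y_pred_test + q
```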

[LG-32] MamNet: A Novel Hybrid Model for Time-Series Forecasting and Frequency Pattern Analysis in Network Traffic

链接: https://arxiv.org/abs/2507.00304
作者: Yujun Zhang,Runlong Li,Xiaoxiang Liang,Xinhao Yang,Tian Su,Bo Liu,Yan Zhou
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 16 pages

点击查看摘要

Abstract:The abnormal fluctuations in network traffic may indicate potential security threats or system failures. Therefore, efficient network traffic prediction and anomaly detection methods are crucial for network security and traffic management. This paper proposes a novel network traffic prediction and anomaly detection model, MamNet, which integrates time-domain modeling and frequency-domain feature extraction. The model first captures the long-term dependencies of network traffic through the Mamba module (time-domain modeling), and then identifies periodic fluctuations in the traffic using Fourier Transform (frequency-domain feature extraction). In the feature fusion layer, multi-scale information is integrated to enhance the model’s ability to detect network traffic anomalies. Experiments conducted on the UNSW-NB15 and CAIDA datasets demonstrate that MamNet outperforms several recent mainstream models in terms of accuracy, recall, and F1-Score. Specifically, it achieves an improvement of approximately 2% to 4% in detection performance for complex traffic patterns and long-term trend detection. The results indicate that MamNet effectively captures anomalies in network traffic across different time scales and is suitable for anomaly detection tasks in network security and traffic management. Future work could further optimize the model structure by incorporating external network event information, thereby improving the model’s adaptability and stability in complex network environments.
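
The frequency branch and fusion layer are easy to illustrate. A minimal sketch, assuming the time-domain branch (Mamba states in the paper) is already computed; shapes and the top-k descriptor are my simplifications.

```python
import torch

def frequency_branch(x, k=8):
    # x: (batch, seq) traffic series. Keep the k largest rFFT magnitudes
    # as a fixed-size descriptor of periodic fluctuation strength.
    spec = torch.fft.rfft(x, dim=-1).abs()
    return spec.topk(k, dim=-1).values               # (batch, k)

def fuse(time_feats, freq_feats):
    # Feature-fusion layer: concatenate time-domain and frequency-domain
    # features ahead of the anomaly-detection head.
    return torch.cat([time_feats, freq_feats], dim=-1)
```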

[LG-33] Structure-preserving Lift & Learn: Scientific machine learning for nonlinear conservative partial differential equations

链接: https://arxiv.org/abs/2507.00301
作者: Harsh Sharma,Juan Diego Draxl Giannoni,Boris Kramer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: arXiv admin note: substantial text overlap with arXiv:2503.02273

点击查看摘要

Abstract:This work presents structure-preserving Lift & Learn, a scientific machine learning method that employs lifting variable transformations to learn structure-preserving reduced-order models for nonlinear partial differential equations (PDEs) with conservation laws. We propose a hybrid learning approach based on a recently developed energy-quadratization strategy that uses knowledge of the nonlinearity at the PDE level to derive an equivalent quadratic lifted system with quadratic system energy. The lifted dynamics obtained via energy quadratization are linear in the old variables, making model learning very effective in the lifted setting. Based on the lifted quadratic PDE model form, the proposed method derives quadratic reduced terms analytically and then uses those derived terms to formulate a constrained optimization problem to learn the remaining linear reduced operators in a structure-preserving way. The proposed hybrid learning approach yields computationally efficient quadratic reduced-order models that respect the underlying physics of the high-dimensional problem. We demonstrate the generalizability of quadratic models learned via the proposed structure-preserving Lift & Learn method through three numerical examples: the one-dimensional wave equation with exponential nonlinearity, the two-dimensional sine-Gordon equation, and the two-dimensional Klein-Gordon-Zakharov equations. The numerical results show that the proposed learning approach is competitive with the state-of-the-art structure-preserving data-driven model reduction method in terms of both accuracy and computational efficiency.

[LG-34] Examining Reject Relations in Stimulus Equivalence Simulations

链接: https://arxiv.org/abs/2507.00265
作者: Alexis Carrillo,Asieh Abolpour Mofrad,Anis Yazidi,Moises Betancort
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Simulations offer a valuable tool for exploring stimulus equivalence (SE), yet the potential of reject relations to disrupt the assessment of equivalence class formation is contentious. This study investigates the role of reject relations in the acquisition of stimulus equivalence using computational models. We examined feedforward neural networks (FFNs), bidirectional encoder representations from transformers (BERT), and generative pre-trained transformers (GPT) across 18 conditions in matching-to-sample (MTS) simulations. Conditions varied in training structure (linear series, one-to-many, and many-to-one), relation type (select-only, reject-only, and select-reject), and negative comparison selection (standard and biased). A probabilistic agent served as a benchmark, embodying purely associative learning. The primary goal was to determine whether artificial neural networks could demonstrate equivalence class formation or whether their performance reflected associative learning. Results showed that reject relations influenced agent performance. While some agents achieved high accuracy on equivalence tests, particularly with reject relations and biased negative comparisons, this performance was comparable to the probabilistic agent. These findings suggest that artificial neural networks, including transformer models, may rely on associative strategies rather than SE. This underscores the need for careful consideration of reject relations and more stringent criteria in computational models of equivalence.

[LG-35] Who Should I Listen To? Adaptive Collaboration in Personalized Federated Learning

链接: https://arxiv.org/abs/2507.00259
作者: Amr Abourayya,Jens Kleesiek,Bharat Rao,Michael Kamp
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data heterogeneity is a central challenge in federated learning, and personalized federated learning (PFL) aims to address it by tailoring models to each client’s distribution. Yet many PFL methods fail to outperform local or centralized baselines, suggesting a mismatch between the collaboration they enforce and the structure of the data. We propose an approach based on adaptive collaboration, where clients decide adaptively not only how much to rely on others, but also whom to trust at the level of individual examples. We instantiate this principle in FEDMOSAIC, a federated co-training method in which clients exchange predictions over a shared unlabeled dataset. This enables fine-grained trust decisions that are difficult to achieve with parameter sharing alone. Each client adjusts its loss weighting based on the agreement between private and public data, and contributes to global pseudo-labels in proportion to its estimated per-example confidence. Empirically, FEDMOSAIC improves upon state-of-the-art PFL methods across diverse non-IID settings, and we provide convergence guarantees under standard assumptions. Our results demonstrate the potential of data-aware collaboration for robust and effective personalization.
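
A toy sketch of the per-example trust idea as described: confidence-weighted contributions to global pseudo-labels and an agreement-based loss weight. All shapes and the exact weighting forms are illustrative assumptions, not FEDMOSAIC's actual rules.

```python
import numpy as np

def client_contribution(p_private, p_public):
    """p_private / p_public: (n, classes) softmax outputs of one client's
    model under its private-data view and on the shared unlabeled set."""
    conf = p_public.max(axis=1, keepdims=True)        # per-example confidence
    votes = conf * p_public                           # confidence-scaled votes
    agreement = float((p_private.argmax(1) == p_public.argmax(1)).mean())
    return votes, agreement

# Server side (sketch): pseudo-labels = normalized sum of `votes` over clients;
# each client then scales its distillation loss by its own `agreement` score.
```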

[LG-36] PPFL-RDSN: Privacy-Preserving Federated Learning-based Residual Dense Spatial Networks for Encrypted Lossy Image Reconstruction

链接: https://arxiv.org/abs/2507.00230
作者: Peilin He,James Joshi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: This paper is under review; do not distribute

点击查看摘要

Abstract:Reconstructing high-quality images from low-resolution inputs using Residual Dense Spatial Networks (RDSNs) is crucial yet challenging, particularly in collaborative scenarios where centralized training poses significant privacy risks, including data leakage and inference attacks, as well as high computational costs. We propose a novel Privacy-Preserving Federated Learning-based RDSN (PPFL-RDSN) framework specifically tailored for lossy image reconstruction. PPFL-RDSN integrates Federated Learning (FL), local differential privacy, and robust model watermarking techniques, ensuring data remains secure on local devices, safeguarding sensitive information, and maintaining model authenticity without revealing underlying data. Empirical evaluations show that PPFL-RDSN achieves comparable performance to the state-of-the-art centralized methods while reducing computational burdens, and effectively mitigates security and privacy vulnerabilities, making it a practical solution for secure and privacy-preserving collaborative computer vision applications.

[LG-37] Graph Neural Networks in Wind Power Forecasting

链接: https://arxiv.org/abs/2507.00105
作者: Javier Castellano,Ignacio Villanueva
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We study the applicability of GNNs to the problem of wind energy forecasting. We find that certain architectures achieve performance comparable to our best CNN-based benchmark. The study is conducted on three wind power facilities using five years of historical data. Numerical Weather Prediction (NWP) variables were used as predictors, and models were evaluated on a 24 to 36 hour ahead test horizon.

[LG-38] DFReg: A Physics-Inspired Framework for Global Weight Distribution Regularization in Neural Networks

链接: https://arxiv.org/abs/2507.00101
作者: Giovanni Ruggieri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce DFReg, a physics-inspired regularization method for deep neural networks that operates on the global distribution of weights. Drawing from Density Functional Theory (DFT), DFReg applies a functional penalty to encourage smooth, diverse, and well-distributed weight configurations. Unlike traditional techniques such as Dropout or L2 decay, DFReg imposes global structural regularity without architectural changes or stochastic perturbations.
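
The abstract only says "a functional penalty" on the global weight distribution; one plausible reading is an entropy-style penalty on a differentiable density estimate of all weights. A sketch under that assumption; the actual DFT-inspired functional may differ.

```python
import torch

def dfreg_penalty(model, bins=50, eps=1e-8):
    # Pool all trainable weights (subsample for very large models).
    w = torch.cat([p.flatten() for p in model.parameters() if p.requires_grad])
    centers = torch.linspace(float(w.min()), float(w.max()), bins, device=w.device)
    width = (centers[1] - centers[0]).clamp_min(eps)
    # Differentiable soft histogram via Gaussian kernel assignments.
    logits = -((w[:, None] - centers[None, :]) / width) ** 2
    density = torch.softmax(logits, dim=1).mean(0) + eps
    entropy = -(density * density.log()).sum()
    return -entropy      # low entropy = weights clumped in few modes -> penalized

# Usage sketch: loss = task_loss + lam * dfreg_penalty(model)
```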

[LG-39] A new machine learning framework for occupational accidents forecasting with safety inspections integration

链接: https://arxiv.org/abs/2507.00089
作者: Aho Yapi,Pierre Latouche,Arnaud Guillin,Yan Bailly
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a generic framework for short-term occupational accident forecasting that leverages safety inspections and models accident occurrences as binary time series. The approach generates daily predictions, which are then aggregated into weekly safety assessments to better inform decision making. To ensure the reliability and operational applicability of the forecasts, we apply a sliding-window cross-validation procedure specifically designed for time series data, combined with an evaluation based on aggregated period-level metrics. Several machine learning algorithms, including logistic regression, tree-based models, and neural networks, are trained and systematically compared within this framework. Among these, the long short-term memory (LSTM) network outperforms the other approaches and detects the upcoming high-risk periods with a balanced accuracy of 0.86, confirming the robustness of our methodology and demonstrating that a binary time series model can anticipate these critical periods based on safety inspections. The proposed methodology converts routine safety inspection data into clear weekly risk scores, detecting the periods when accidents are most likely. Decision-makers can integrate these scores into their planning tools to classify inspection priorities, schedule targeted interventions, and funnel resources to the sites or shifts classified as highest risk, stepping in before incidents occur and getting the greatest return on safety investments.

[LG-40] Online Meal Detection Based on CGM Data Dynamics

链接: https://arxiv.org/abs/2507.00080
作者: Ali Tavasoli,Heman Shakeri
类目: Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:We utilize dynamical modes as features derived from Continuous Glucose Monitoring (CGM) data to detect meal events. By leveraging the inherent properties of underlying dynamics, these modes capture key aspects of glucose variability, enabling the identification of patterns and anomalies associated with meal consumption. This approach not only improves the accuracy of meal detection but also enhances the interpretability of the underlying glucose dynamics. By focusing on dynamical features, our method provides a robust framework for feature extraction, facilitating generalization across diverse datasets and ensuring reliable performance in real-world applications. The proposed technique offers significant advantages over traditional approaches, improving detection accuracy.

[LG-41] Fractional Policy Gradients: Reinforcement Learning with Long-Term Memory

链接: https://arxiv.org/abs/2507.00073
作者: Urvi Pawar,Kunal Telangi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to Journal of Machine Learning Research (JMLR), June 2025. 24 pages, 3 figures. Under review

点击查看摘要

Abstract:We propose Fractional Policy Gradients (FPG), a reinforcement learning framework incorporating fractional calculus for long-term temporal modeling in policy optimization. Standard policy gradient approaches face limitations from Markovian assumptions, exhibiting high variance and inefficient sampling. By reformulating gradients using Caputo fractional derivatives, FPG establishes power-law temporal correlations between state transitions. We develop an efficient recursive computation technique for fractional temporal-difference errors with constant time and memory requirements. Theoretical analysis shows FPG achieves asymptotic variance reduction of order O(t^(-alpha)) versus standard policy gradients while preserving convergence. Empirical validation demonstrates 35-68% sample efficiency gains and 24-52% variance reduction versus state-of-the-art baselines. This framework provides a mathematically grounded approach for leveraging long-range dependencies without computational overhead.
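
The power-law memory comes from Caputo/Grünwald-Letnikov-style binomial weights on past TD errors. Below is a naive O(T²) reference for the fractionally weighted TD errors that the paper's constant-time recursion approximates; all names are mine, not the paper's.

```python
import numpy as np

def fractional_td_errors(deltas, alpha=0.5):
    """Grunwald-Letnikov weights obey c_0 = 1, c_k = c_{k-1} * (k - 1 - alpha) / k,
    giving a long-memory (power-law) kernel over past TD errors `deltas`."""
    T = len(deltas)
    c = np.empty(T)
    c[0] = 1.0
    for k in range(1, T):
        c[k] = c[k - 1] * (k - 1 - alpha) / k
    # frac[t] = sum_k c_k * delta_{t-k}: the fractionally weighted error at t.
    return np.array([np.dot(c[: t + 1][::-1], deltas[: t + 1]) for t in range(T)])
```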

[LG-42] Leveraging Unlabeled Audio-Visual Data in Speech Emotion Recognition using Knowledge Distillation INTERSPEECH2025

链接: https://arxiv.org/abs/2507.00055
作者: Varsha Pendyala,Pedro Morgado,William Sethares
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Voice interfaces integral to the human-computer interaction systems can benefit from speech emotion recognition (SER) to customize responses based on user emotions. Since humans convey emotions through multi-modal audio-visual cues, developing SER systems using both the modalities is beneficial. However, collecting a vast amount of labeled data for their development is expensive. This paper proposes a knowledge distillation framework called LightweightSER (LiSER) that leverages unlabeled audio-visual data for SER, using large teacher models built on advanced speech and face representation models. LiSER transfers knowledge regarding speech emotions and facial expressions from the teacher models to lightweight student models. Experiments conducted on two benchmark datasets, RAVDESS and CREMA-D, demonstrate that LiSER can reduce the dependence on extensive labeled datasets for SER tasks.

[LG-43] IDRIFTNET: Physics-Driven Spatiotemporal Deep Learning for Iceberg Drift Forecasting

链接: https://arxiv.org/abs/2507.00036
作者: Rohan Putatunda,Sanjay Purushotham,Ratnaksha Lele,Vandana P. Janeja
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Drifting icebergs in the polar oceans play a key role in the Earth’s climate system, impacting freshwater fluxes into the ocean and regional ecosystems while also posing a challenge to polar navigation. However, accurately forecasting iceberg trajectories remains a formidable challenge, primarily due to the scarcity of spatiotemporal data and the complex, nonlinear nature of iceberg motion, which is also impacted by environmental variables. The iceberg motion is influenced by multiple dynamic environmental factors, creating a highly variable system that makes trajectory identification complex. These limitations hinder the ability of deep learning models to effectively capture the underlying dynamics and provide reliable predictive outcomes. To address these challenges, we propose a hybrid IDRIFTNET model, a physics-driven deep learning model that combines an analytical formulation of iceberg drift physics, with an augmented residual learning model. The model learns the pattern of mismatch between the analytical solution and ground-truth observations, which is combined with a rotate-augmented spectral neural network that captures both global and local patterns from the data to forecast future iceberg drift positions. We compare IDRIFTNET model performance with state-of-the-art models on two Antarctic icebergs: A23A and B22A. Our findings demonstrate that IDRIFTNET outperforms other models by achieving a lower Final Displacement Error (FDE) and Average Displacement Error (ADE) across a variety of time points. These results highlight IDRIFTNET’s effectiveness in capturing the complex, nonlinear drift of icebergs for forecasting iceberg trajectories under limited data and dynamic environmental conditions.

[LG-44] Data Collection with Non-Uniform Axial Power for Phase II of the OECD/NEA AI/ML Critical Heat Flux Benchmark

链接: https://arxiv.org/abs/2507.00034
作者: Reece Bourisaw,Reid McCants,Jean-Marie Le Corre,Anna Iskhakova,Arsen S. Iskhakov
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Critical heat flux (CHF) marks the onset of boiling crisis in light-water reactors, defining safe thermal-hydraulic operating limits. To support Phase II of the OECD/NEA AI/ML CHF benchmark, which introduces spatially varying power profiles, this work compiles and digitizes a broad CHF dataset covering both uniform and non-uniform axial heating conditions. Heating profiles were extracted from technical reports, interpolated onto a consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats for benchmark compatibility. Classical CHF correlations exhibit substantial errors under uniform heating and degrade markedly when applied to non-uniform profiles, while modern tabular methods offer improved but still imperfect predictions. A neural network trained solely on uniform data performs well in that regime but fails to generalize to spatially varying scenarios, underscoring the need for models that explicitly incorporate axial power distributions. By providing these curated datasets and baseline modeling results, this study lays the groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in the next phase of the CHF benchmark.

[LG-45] Enhancing Spatio-Temporal Forecasting with Spatial Neighbourhood Fusion:A Case Study on COVID-19 Mobility in Peru

链接: https://arxiv.org/abs/2507.00031
作者: Chuan Li,Jiang You,Hassine Moungla,Vincent Gauthier,Miguel Nunez-del-Prado,Hugo Alatrista-Salas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate modeling of human mobility is critical for understanding epidemic spread and deploying timely interventions. In this work, we leverage a large-scale spatio-temporal dataset collected from Peru’s national Digital Contact Tracing (DCT) application during the COVID-19 pandemic to forecast mobility flows across urban regions. A key challenge lies in the spatial sparsity of hourly mobility counts across hexagonal grid cells, which limits the predictive power of conventional time series models. To address this, we propose a lightweight and model-agnostic Spatial Neighbourhood Fusion (SPN) technique that augments each cell’s features with aggregated signals from its immediate H3 neighbors. We evaluate this strategy on three forecasting backbones: NLinear, PatchTST, and K-U-Net, under various historical input lengths. Experimental results show that SPN consistently improves forecasting performance, achieving up to 9.85 percent reduction in test MSE. Our findings demonstrate that spatial smoothing of sparse mobility signals provides a simple yet effective path toward robust spatio-temporal forecasting during public health crises.
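
SPN is described as model-agnostic feature augmentation, which fits in a few lines. A sketch assuming a precomputed map from each cell to the row indices of its immediate H3 neighbours (e.g. built once with h3's grid-disk utilities):

```python
import numpy as np

def spn_features(counts, neighbours):
    """counts: (n_cells, T) hourly mobility counts.
    neighbours: dict {cell_row: list of neighbour row indices}.
    Returns (n_cells, T, 2): original series plus neighbour average."""
    agg = np.zeros_like(counts, dtype=float)
    for i, nbrs in neighbours.items():
        if nbrs:
            agg[i] = counts[nbrs].mean(axis=0)   # smooth sparse cells
    return np.stack([counts.astype(float), agg], axis=-1)
```

The stacked channels can then feed any forecasting backbone (NLinear, PatchTST, K-U-Net in the paper) unchanged.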

[LG-46] Variational Autoencoder for Generating Broader-Spectrum prior Proposals in Markov chain Monte Carlo Methods

链接: https://arxiv.org/abs/2507.00020
作者: Marcio Borges,Felipe Pereira,Michel Tosin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The main contribution of this work is to show the advantages of using deep generative models like VAE to provide more flexible and versatile prior distributions

点击查看摘要

Abstract:This study uses a Variational Autoencoder method to enhance the efficiency and applicability of Markov Chain Monte Carlo (McMC) methods by generating broader-spectrum prior proposals. Traditional approaches, such as the Karhunen-Loève Expansion (KLE), require previous knowledge of the covariance function, often unavailable in practical applications. The VAE framework enables a data-driven approach to flexibly capture a broader range of correlation structures in Bayesian inverse problems, particularly subsurface flow modeling. The methodology is tested on a synthetic groundwater flow inversion problem, where pressure data is used to estimate permeability fields. Numerical experiments demonstrate that the VAE-based parameterization achieves comparable accuracy to KLE when the correlation length is known and outperforms KLE when the assumed correlation length deviates from the true value. Moreover, the VAE approach significantly reduces stochastic dimensionality, improving computational efficiency. The results suggest that leveraging deep generative models in McMC methods can lead to more adaptable and efficient Bayesian inference in high-dimensional problems.

[LG-47] MVGBench: Comprehensive Benchmark for Multi-view Generation Models

链接: https://arxiv.org/abs/2507.00006
作者: Xianghui Xie,Chuhang Zou,Meher Gitika Karumuri,Jan Eric Lenssen,Gerard Pons-Moll
类目: Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 17 pages, 11 figures, 9 tables, project page: this https URL

点击查看摘要

Abstract:We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generative tasks where multiple solutions exist while differing from ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data and specific lightings – robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear what design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best setup performance, generalization to real data and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint generated multi-views. We systematically compare 12 existing MVGs on 4 different curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods specially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated MVGs on 3D consistency. Our code, model, and benchmark suite will be publicly released.

[LG-48] SwarmFusion: Revolutionizing Disaster Response with Swarm Intelligence and Deep Learning

链接: https://arxiv.org/abs/2507.00005
作者: Vasavi Lankipalle
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 6

点击查看摘要

Abstract:Disaster response requires rapid, adaptive decision-making in chaotic environments. SwarmFusion, a novel hybrid framework, integrates particle swarm optimization with convolutional neural networks to optimize real-time resource allocation and path planning. By processing live satellite, drone, and sensor data, SwarmFusion enhances situational awareness and operational efficiency in flood and wildfire scenarios. Simulations using the DisasterSim2025 dataset demonstrate up to 40 percent faster response times and 90 percent survivor coverage compared to baseline methods. This scalable, data-driven approach offers a transformative solution for time-critical disaster management, with potential applications across diverse crisis scenarios.

[LG-49] Atmospheric model-trained machine learning selection and classification of ultracool T&Y dwarfs

链接: https://arxiv.org/abs/2507.00957
作者: Ankit Biswas
类目: olar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 12 pages, 9 figures, to be published in Monthly Notices of the Royal Astronomical Society

点击查看摘要

Abstract:The T and Y spectral classes represent the coolest and lowest-mass population of brown dwarfs, yet their census remains incomplete due to limited statistics. Existing detection frameworks are often constrained to identifying M, L, and early T dwarfs, owing to the sparse observational sample of ultracool dwarfs (UCDs) at later types. This paper presents a novel machine learning framework capable of detecting and classifying late-T and Y dwarfs, trained entirely on synthetic photometry from atmospheric models. Utilizing grids from the ATMO 2020 and Sonora Bobcat models, I produce a training dataset over two orders of magnitude larger than any empirical set of T6 UCDs. Polynomial color relations fitted to the model photometry are used to assign spectral types to these synthetic models, which in turn train an ensemble of classifiers to identify and classify the spectral type of late UCDs. The model is highly performant when validating on both synthetic and empirical datasets, verifying catalogs of known UCDs with object classification metrics > 99% and an average spectral type precision within 0.35 +/- 0.37 subtypes. Application of the model to a 1.5 degree region around Pisces and the UKIDSS UDS field results in the discovery of one previously uncatalogued T8.2 candidate, demonstrating the ability of this model-trained approach in discovering faint, late-type UCDs from photometric catalogs.

[LG-50] An in depth look at the Procrustes-Wasserstein distance: properties and barycenters

链接: https://arxiv.org/abs/2507.00894
作者: Davide Adamo,Marco Corneli,Manon Vuillien,Emmanuelle Vila
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and more suited to tasks such as the alignment and comparison of point clouds. Having that application in mind, we carefully build a space of discrete probability measures and show that over that space PW actually is a distance. Algorithms to solve the PW problems already exist, however we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of PW barycenter and detail an algorithm to estimate it from the data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation. We finally show the usefulness of the PW barycenters in an archaeological context. Our results highlight the potential of PW in boosting 2D and 3D point cloud analysis for machine learning and computational geometry applications.
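
For equal-size, uniformly weighted point clouds, the PW objective can be attacked by alternating between an exact matching step and a closed-form orthogonal Procrustes step. A minimal sketch under those assumptions; the paper's solver and its initialization strategies are richer than this.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def procrustes_wasserstein(X, Y, iters=20):
    """X, Y: (n, d) point clouds. Alternate: fix the orthogonal map R and
    solve the matching (exact OT via the Hungarian algorithm for uniform
    weights), then fix the matching and solve Procrustes via SVD."""
    R = np.eye(X.shape[1])
    for _ in range(iters):
        cost = ((X[:, None, :] - Y[None, :, :] @ R.T) ** 2).sum(-1)
        row, col = linear_sum_assignment(cost)        # optimal matching
        U, _, Vt = np.linalg.svd(X[row].T @ Y[col])   # closed-form Procrustes
        R = U @ Vt                                    # allows rotations/reflections
    return R, (row, col)
```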

[LG-51] Template-Fitting Meets Deep Learning: Redshift Estimation Using Physics-Guided Neural Networks

链接: https://arxiv.org/abs/2507.00866
作者: Jonas Chris Ferrao,Dickson Dias,Pranav Naik,Glory D’Cruz,Anish Naik,Siya Khandeparkar,Manisha Gokuldas Fal Dessai
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate photometric redshift estimation is critical for observational cosmology, especially in large-scale surveys where spectroscopic measurements are impractical. Traditional approaches include template fitting and machine learning, each with distinct strengths and limitations. We present a hybrid method that integrates template fitting with deep learning using physics-guided neural networks. By embedding spectral energy distribution templates into the network architecture, our model encodes physical priors into the training process. The system employs a multimodal design, incorporating cross-attention mechanisms to fuse photometric and image data, along with Bayesian layers for uncertainty estimation. We evaluate our model on the publicly available PREML dataset, which includes approximately 400,000 galaxies from the Hyper Suprime-Cam PDR3 release, with 5-band photometry, multi-band imaging, and spectroscopic redshifts. Our approach achieves an RMS error of 0.0507, a 3-sigma catastrophic outlier rate of 0.13%, and a bias of 0.0028. The model satisfies two of the three LSST photometric redshift requirements for redshifts below 3. These results highlight the potential of combining physically motivated templates with data-driven models for robust redshift estimation in upcoming cosmological surveys.

[LG-52] SINDy on slow manifolds

链接: https://arxiv.org/abs/2507.00747
作者: Diemen Delgado-Cano,Erick Kracht,Urban Fasel,Benjamin Herrmann
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 18 pages, 6 figures, to be submitted to Nonlinear Dynamics (Springer)

点击查看摘要

Abstract:The sparse identification of nonlinear dynamics (SINDy) has been established as an effective method to learn interpretable models of dynamical systems from data. However, for high-dimensional slow-fast dynamical systems, the regression problem becomes simultaneously computationally intractable and ill-conditioned. Although, in principle, modeling only the dynamics evolving on the underlying slow manifold addresses both of these challenges, the truncated fast variables have to be compensated by including higher-order nonlinearities as candidate terms for the model, leading to an explosive growth in the size of the SINDy library. In this work, we develop a SINDy variant that is able to robustly and efficiently identify slow-fast dynamics in two steps: (i) identify the slow manifold, that is, an algebraic equation for the fast variables as functions of the slow ones, and (ii) learn a model for the dynamics of the slow variables restricted to the manifold. Critically, the equation learned in (i) is leveraged to build a manifold-informed function library for (ii) that contains only essential higher-order nonlinearites as candidate terms. Rather than containing all monomials of up to a certain degree, the resulting custom library is a sparse subset of the latter that is tailored to the specific problem at hand. The approach is demonstrated on numerical examples of a snap-through buckling beam and the flow over a NACA 0012 airfoil. We find that our method significantly reduces both the condition number and the size of the SINDy library, thus enabling accurate identification of the dynamics on slow manifolds.

[LG-53] Guided Unconditional and Conditional Generative Models for Super-Resolution and Inference of Quasi-Geostrophic Turbulence

链接: https://arxiv.org/abs/2507.00719
作者: Anantha Narayanan Suresh Babu,Akhil Sadam,Pierre F.J. Lermusiaux
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
*备注: 56 pages, 23 figures, 7 tables

点击查看摘要

Abstract:Typically, numerical simulations of the ocean, weather, and climate are coarse, and observations are sparse and gappy. In this work, we apply four generative diffusion modeling approaches to super-resolution and inference of forced two-dimensional quasi-geostrophic turbulence on the beta-plane from coarse, sparse, and gappy observations. Two guided approaches minimally adapt a pre-trained unconditional model: SDEdit modifies the initial condition, and Diffusion Posterior Sampling (DPS) modifies the reverse diffusion process score. The other two conditional approaches, a vanilla variant and classifier-free guidance, require training with paired high-resolution and observation data. We consider eight test cases spanning: two regimes, eddy and anisotropic-jet turbulence; two Reynolds numbers, 10^3 and 10^4; and two observation types, 4x coarse-resolution fields and coarse, sparse and gappy observations. Our comprehensive skill metrics include norms of the reconstructed vorticity fields, turbulence statistical quantities, and quantification of the super-resolved probabilistic ensembles and their errors. We also study the sensitivity to tuning parameters such as guidance strength. Results show that SDEdit generates unphysical fields, while DPS generates reasonable reconstructions at low computational cost but with smoothed fine-scale features. Both conditional approaches require re-training, but they reconstruct missing fine-scale features, are cycle-consistent with observations, and possess the correct statistics such as energy spectra. Further, their mean model errors are highly correlated with and predictable from their ensemble standard deviations. Results highlight the trade-offs between ease of implementation, fidelity (sharpness), and cycle-consistency of the diffusion models, and offer practical guidance for deployment in geophysical inverse problems.

[LG-54] Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer

链接: https://arxiv.org/abs/2507.00683
作者: Satadeep Bhattacharjee,Seung-Cheol Lee
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The recently proposed physics-based framework by Huo and Johnson (2024) models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians we obtain analytic phase boundaries (logit gap criteria) that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings (r ≈ -0.70, p < 10^-3). Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. This validation not only furnishes a tractable, physics-inspired lens for interpretability but also provides the groundwork for novel generative models, bridging the gap between theoretical condensed matter physics and AI.

[LG-55] Harnessing the Power of Reinforcement Learning for Adaptive MCMC

链接: https://arxiv.org/abs/2507.00671
作者: Congye Wang,Matthew A. Fisher,Heishiro Kanagawa,Wilson Chen,Chris J. Oates
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling algorithms drive probabilistic machine learning, and recent years have seen an explosion in the diversity of tools for this task. However, the increasing sophistication of sampling algorithms is correlated with an increase in the tuning burden. There is now a greater need than ever to treat the tuning of samplers as a learning task in its own right. In a conceptual breakthrough, Wang et al (2025) formulated Metropolis-Hastings as a Markov decision process, opening up the possibility for adaptive tuning using Reinforcement Learning (RL). Their emphasis was on theoretical foundations; realising the practical benefit of Reinforcement Learning Metropolis-Hastings (RLMH) was left for subsequent work. The purpose of this paper is twofold: First, we observe the surprising result that natural choices of reward, such as the acceptance rate, or the expected squared jump distance, provide insufficient signal for training RLMH. Instead, we propose a novel reward based on the contrastive divergence, whose superior performance in the context of RLMH is demonstrated. Second, we explore the potential of RLMH and present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task. A comprehensive simulation study using the posteriordb benchmark supports the practical effectiveness of RLMH.

[LG-56] Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws

链接: https://arxiv.org/abs/2507.00641
作者: Gunjan Auti,Hirofumi Daiguji,Gouhei Tanaka
类目: Adaptation and Self-Organizing Systems (nlin.AO); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 6 pages, 2 figures, 2 supplementary videos

点击查看摘要

Abstract:Traditional machine learning approaches in physics rely on global optimization, limiting interpretability and enforcing physical constraints externally. We introduce the Hebbian Physics Network (HPN), a self-organizing computational framework in which learning emerges from local Hebbian updates driven by violations of conservation laws. Grounded in non-equilibrium thermodynamics and inspired by Prigogine's theory of dissipative structures, HPNs eliminate the need for global loss functions by encoding physical laws directly into the system's local dynamics. Residuals - quantified imbalances in continuity, momentum, or energy - serve as thermodynamic signals that drive weight adaptation through generalized Hebbian plasticity. We demonstrate this approach on incompressible fluid flow and continuum diffusion, where physically consistent structures emerge from random initial conditions without supervision. HPNs reframe computation as a residual-driven thermodynamic process, offering an interpretable, scalable, and physically grounded alternative for modeling complex dynamical systems.
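
A toy illustration of the core loop as described: a residual measuring a conservation-law violation drives a local Hebbian weight update, with no global loss. Every name here is illustrative, not the paper's API, and the linear constraint stands in for a discretized physical law.

```python
import numpy as np

def hebbian_residual_step(W, x, lr=1e-3):
    # r measures how much state x violates the linear constraint W x = 0
    # (e.g. discrete continuity); weights adapt via the product of
    # residual (post) and activity (pre) -- a generalized Hebbian rule.
    r = W @ x
    W = W - lr * np.outer(r, x)
    return W, r
```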

[LG-57] Forward Reverse Kernel Regression for the Schrödinger bridge problem

链接: https://arxiv.org/abs/2507.00640
作者: Denis Belomestny,John Schoenmakers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this paper, we study the Schrödinger Bridge Problem (SBP), which is central to entropic optimal transport. For general reference processes and begin–endpoint distributions, we propose a forward-reverse iterative Monte Carlo procedure to approximate the Schrödinger potentials in a nonparametric way. In particular, we use kernel based Monte Carlo regression in the context of Picard iteration of a corresponding fixed point problem. By preserving in the iteration positivity and contractivity in a Hilbert metric sense, we develop a provably convergent algorithm. Furthermore, we provide convergence rates for the potential estimates and prove their optimality. Finally, as an application, we propose a non-nested Monte Carlo procedure for the finite-dimensional distributions of the Schrödinger Bridge process, based on the constructed potentials and the forward-reverse simulation method for conditional diffusions.

[LG-58] Generalization performance of narrow one-hidden layer networks in the teacher-student setting

链接: https://arxiv.org/abs/2507.00629
作者: Jean Barbier,Federica Gerace,Alessandro Ingrosso,Clarissa Lauditi,Enrico M. Malatesta,Gibbs Nwemadji,Rodrigo Pérez Ortiz
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 34 pages, figures

点击查看摘要

Abstract:Understanding the generalization abilities of neural networks for simple input-output distributions is crucial to account for their learning performance on real datasets. The classical teacher-student setting, where a network is trained from data obtained thanks to a label-generating teacher model, serves as a perfect theoretical test bed. In this context, a complete theoretical account of the performance of fully connected one-hidden layer networks in the presence of generic activation functions is lacking. In this work, we develop such a general theory for narrow networks, i.e. networks with a large number of hidden units, yet much smaller than the input dimension. Using methods from statistical physics, we provide closed-form expressions for the typical performance of both finite temperature (Bayesian) and empirical risk minimization estimators, in terms of a small number of weight statistics. In doing so, we highlight the presence of a transition where hidden neurons specialize when the number of samples is sufficiently large and proportional to the number of parameters of the network. Our theory accurately predicts the generalization error of neural networks trained on regression or classification tasks with either noisy full-batch gradient descent (Langevin dynamics) or full-batch gradient descent.

[LG-59] Geometric Gaussian Approximations of Probability Distributions

链接: https://arxiv.org/abs/2507.00616
作者: Nathaël Da Costa,Bálint Mucsányi,Philipp Hennig
类目: Differential Geometry (math.DG); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Approximating complex probability distributions, such as Bayesian posterior distributions, is of central interest in many applications. We study the expressivity of geometric Gaussian approximations. These consist of approximations by Gaussian pushforwards through diffeomorphisms or Riemannian exponential maps. We first review these two different kinds of geometric Gaussian approximations. Then we explore their relationship to one another. We further provide a constructive proof that such geometric Gaussian approximations are universal, in that they can capture any probability distribution. Finally, we discuss whether, given a family of probability distributions, a common diffeomorphism can be found to obtain uniformly high-quality geometric Gaussian approximations for that family.

[LG-60] Simulation-Efficient Cosmological Inference with Multi-Fidelity SBI ICML

链接: https://arxiv.org/abs/2507.00514
作者: Leander Thiele,Adrian E. Bayer,Naoya Takeishi
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures; accepted at ICML-colocated ML4Astro 2025 workshop

点击查看摘要

Abstract:The simulation cost for cosmological simulation-based inference can be decreased by combining simulation sets of varying fidelity. We propose an approach to such multi-fidelity inference based on feature matching and knowledge distillation. Our method results in improved posterior quality, particularly for small simulation budgets and difficult inference problems.

[LG-61] GRAND: Graph Release with Assured Node Differential Privacy

链接: https://arxiv.org/abs/2507.00402
作者: Suqing Liu,Xuan Bi,Tianxi Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Differential privacy is a well-established framework for safeguarding sensitive information in data. While extensively applied across various domains, its application to network data – particularly at the node level – remains underexplored. Existing methods for node-level privacy either focus exclusively on query-based approaches, which restrict output to pre-specified network statistics, or fail to preserve key structural properties of the network. In this work, we propose GRAND (Graph Release with Assured Node Differential privacy), which is, to the best of our knowledge, the first network release mechanism that releases entire networks while ensuring node-level differential privacy and preserving structural properties. Under a broad class of latent space models, we show that the released network asymptotically follows the same distribution as the original network. The effectiveness of the approach is evaluated through extensive experiments on both synthetic and real-world datasets.

[LG-62] Enhancing Interpretability in Generative Modeling: Statistically Disentangled Latent Spaces Guided by Generative Factors in Scientific Datasets

链接: https://arxiv.org/abs/2507.00298
作者: Arkaprabha Ganguli,Nesar Ramachandra,Julie Bessac,Emil Constantinescu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study addresses the challenge of statistically extracting generative factors from complex, high-dimensional datasets in unsupervised or semi-supervised settings. We investigate encoder-decoder-based generative models for nonlinear dimensionality reduction, focusing on disentangling low-dimensional latent variables corresponding to independent physical factors. Introducing Aux-VAE, a novel architecture within the classical Variational Autoencoder framework, we achieve disentanglement with minimal modifications to the standard VAE loss function by leveraging prior statistical knowledge through auxiliary variables. These variables guide the shaping of the latent space by aligning latent factors with learned auxiliary variables. We validate the efficacy of Aux-VAE through comparative assessments on multiple datasets, including astronomical simulations.

[LG-63] Disentangled Feature Importance

链接: https://arxiv.org/abs/2507.00260
作者: Jin-Hong Du,Kathryn Roeder,Larry Wasserman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 26 main and 29 supplementary pages

点击查看摘要

Abstract:Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias. To address this limitation, we introduce Disentangled Feature Importance (DFI), a nonparametric generalization of the classical R^2 decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. Importance is computed in this disentangled space and attributed back through the transport map's sensitivity. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models and to interaction-weighted functional ANOVA variances more generally, under arbitrary feature dependencies. We develop a comprehensive semiparametric theory for DFI. For general transport maps, we establish root-n consistency and asymptotic normality of importance estimators in the latent space, which extends to the original feature space for the Bures-Wasserstein map. Notably, our estimators achieve second-order estimation error, which vanishes if both regression function and transport map estimation errors are o_P(n^{-1/4}). By design, DFI avoids the computational burden of repeated submodel refitting and the challenges of conditional covariate distribution estimation, thereby achieving computational efficiency.
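
In the linear/Gaussian special case the transport map reduces to a whitening transform, which makes the recipe easy to sketch: map to decorrelated latents, score each latent coordinate, and attribute back through the inverse map. Everything below is that simplified case with a hypothetical fitted `model`, not the paper's general estimator.

```python
import numpy as np

def dfi_linear_sketch(X, y, model, n_perm=20, seed=0):
    # `model` is any fitted regressor exposing .predict (assumption).
    rng = np.random.default_rng(seed)
    mu = X.mean(0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    vals = np.clip(vals, 1e-12, None)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T       # symmetric whitening map
    W_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T
    Z = (X - mu) @ W                                # decorrelated latents
    base = np.mean((model.predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(Z.shape[1]):
        for _ in range(n_perm):
            Zp = Z.copy()
            Zp[:, j] = rng.permutation(Zp[:, j])    # break latent j only
            Xp = Zp @ W_inv + mu                    # map back to data space
            scores[j] += np.mean((model.predict(Xp) - y) ** 2) - base
    return scores / n_perm                          # latent-space importances
```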

信息检索

[IR-0] Digital Collections Explorer: An Open-Source Multimodal Viewer for Searching Digital Collections

链接: https://arxiv.org/abs/2507.00961
作者: Ying-Hsiang Huang,Benjamin Charles Germain Lee
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 14 pages, 8 figures, 2 tables

点击查看摘要

Abstract:We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital collections. Our Digital Collections Explorer can be installed locally and configured to run on a visual collection of interest on disk in just a few steps. Building upon recent advances in multimodal search techniques, our interface enables natural language queries and reverse image searches over digital collections with visual features. This paper describes the system’s architecture, implementation, and application to various cultural heritage collections, demonstrating its potential for democratizing access to digital archives, especially those with impoverished metadata. We present case studies with maps, photographs, and PDFs extracted from web archives in order to demonstrate the flexibility of the Digital Collections Explorer, as well as its ease of use. We demonstrate that the Digital Collections Explorer scales to hundreds of thousands of images on a MacBook Pro with an M4 chip. Lastly, we host a public demo of Digital Collections Explorer.
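
The retrieval loop behind such a CLIP-based explorer fits in a few lines with the public Hugging Face CLIP checkpoint. This is a generic sketch of CLIP text-to-image search, not the platform's actual code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def index_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)   # (N, 512)

@torch.no_grad()
def search(query, index, k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    q = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    return (index @ q.T).squeeze(-1).topk(k).indices       # cosine similarity
```

Reverse image search works the same way with `get_image_features` on the query image instead of the text encoder.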

[IR-1] EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens KDD2025

Link: https://arxiv.org/abs/2507.00715
Authors: Chaoqun Yang, Xinyu Lin, Wenjie Wang, Yongqi Li, Teng Sun, Xianjing Han, Tat-Seng Chua
Subjects: Information Retrieval (cs.IR)
*Note: Accepted by KDD 2025

Abstract:Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency due to massive computational overhead and memory pressure of KV Cache. Existing KV Cache reduction methods face critical limitations: cache compression offers marginal acceleration given recommendation tasks’ short decoding steps, while prompt compression risks discarding vital interaction history. Through systematic analysis of attention patterns in LLMRec, we uncover two pivotal insights: 1) layer-wise attention sparsity inversion where early layers retain dense informative patterns while later layers exhibit high redundancy, and 2) dual attention sinks phenomenon where attention scores concentrate on both head and tail tokens of input sequences. Motivated by these insights, we propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries, then focuses solely on these tokens in the subsequent layers. Extensive experiments on three datasets, two LLMRec methods and two LLM architectures demonstrate EARN’s superiority, achieving up to 3.79x speedup and 80.8% KV Cache reduction with better accuracy than the general finetuning approach. Our work bridges the efficiency-effectiveness gap in LLMRec, offering practical deployment advantages for industrial scenarios.
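
To make the mechanism concrete, here is a toy rendering of the attention restriction as I read the abstract (not the authors' implementation; the layer threshold and register count are made-up parameters): early layers attend densely, while later layers see only the boundary register tokens.

```python
import torch

def earn_attention_mask(seq_len: int, layer: int,
                        switch_layer: int = 8, n_reg: int = 4) -> torch.Tensor:
    """Boolean attention mask; True means attention is allowed."""
    if layer < switch_layer:
        # Early layers: dense attention, compressing info into registers.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Later layers: attend only to head/tail registers (plus self).
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, :n_reg] = True
    mask[:, -n_reg:] = True
    mask |= torch.eye(seq_len, dtype=torch.bool)
    return mask

m = earn_attention_mask(seq_len=16, layer=10)
print(m.int())  # later layers only need KV entries for register positions
```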

[IR-2] Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications

Link: https://arxiv.org/abs/2507.00543
Authors: Leila Tavakoli, Hamed Zamani
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*Note: 9 pages, 5 figures

Abstract:Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
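
The routing rule is simple enough to sketch. Below is a schematic version of the described workflow; the threshold value and the exact agreement criterion are illustrative assumptions, not the paper's calibration.

```python
from collections import Counter

def route(items, conf_threshold=0.8):
    """Auto-accept labels only when models agree and are confident."""
    auto, human = [], []
    for item in items:
        # item["labels"]: list of (model_name, label, confidence) triples.
        labels = [lab for _, lab, _ in item["labels"]]
        confs = [c for _, _, c in item["labels"]]
        majority, count = Counter(labels).most_common(1)[0]
        if count == len(labels) and min(confs) >= conf_threshold:
            auto.append((item, majority))
        else:
            human.append(item)          # disagreement or low confidence
    return auto, human

items = [
    {"labels": [("m1", "clear", 0.95), ("m2", "clear", 0.91)]},
    {"labels": [("m1", "clear", 0.60), ("m2", "ambiguous", 0.72)]},
]
auto, human = route(items)
print(len(auto), "auto-labelled;", len(human), "sent to human review")
```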

[IR-3] WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers (SIGIR 2025)

Link: https://arxiv.org/abs/2507.00521
Authors: Mugeng Liu, Siqi Zhong, Qi Yang, Yudong Han, Xuanzhe Liu, Yun Ma
Subjects: Information Retrieval (cs.IR)
*Note: SIGIR 2025

Abstract:Approximate nearest neighbor search (ANNS) has become vital to modern AI infrastructure, particularly in retrieval-augmented generation (RAG) applications. Numerous in-browser ANNS engines have emerged to seamlessly integrate with popular LLM-based web applications, while addressing privacy protection and the challenges of heterogeneous device deployments. However, web browsers present unique challenges for ANNS, including computational limitations, external storage access issues, and memory utilization constraints, which state-of-the-art (SOTA) solutions fail to address comprehensively. We propose WebANNS, a novel ANNS engine specifically designed for web browsers. WebANNS leverages WebAssembly to overcome computational bottlenecks, designs a lazy loading strategy to optimize data retrieval from external storage, and applies a heuristic approach to reduce memory usage. Experiments show that WebANNS is fast and memory efficient, achieving up to 743.8× improvement in 99th percentile query latency over the SOTA engine, while reducing memory usage by up to 39%. Note that WebANNS decreases query time from 10 seconds to the 10-millisecond range in browsers, making in-browser ANNS practical with user-acceptable latency. Related DOI: https://doi.org/10.1145/3726302.3730115
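
The lazy-loading idea generalizes beyond browsers. The sketch below shows the pattern in plain Python (WebANNS itself targets WebAssembly; the file path, cache size, and brute-force search are stand-ins for its graph-based index):

```python
from collections import OrderedDict

import numpy as np

class LazyVectorStore:
    """Bounded in-memory LRU cache over vectors kept in external storage."""

    def __init__(self, path: str, dim: int, cache_size: int = 1024):
        self.data = np.memmap(path, dtype=np.float32, mode="r").reshape(-1, dim)
        self.cache = OrderedDict()   # vector id -> in-memory copy
        self.cache_size = cache_size

    def get(self, i: int) -> np.ndarray:
        if i in self.cache:
            self.cache.move_to_end(i)                # refresh LRU position
        else:
            self.cache[i] = np.array(self.data[i])   # load on first access
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)       # evict least recent
        return self.cache[i]

    def search(self, query: np.ndarray, candidates) -> int:
        """Brute force over candidates; a real engine walks an index graph."""
        return min(candidates,
                   key=lambda i: np.linalg.norm(self.get(i) - query))
```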

[IR-4] On Mitigating Data Sparsity in Conversational Recommender Systems

Link: https://arxiv.org/abs/2507.00479
Authors: Sixiao Zhang, Mingrui Liu, Cheng Long, Wei Yuan, Hongxu Chen, Xiangyu Zhao, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
*Note:

Abstract:Conversational recommender systems (CRSs) capture user preference through textual information in dialogues. However, they suffer from data sparsity on two fronts: the dialogue space is vast and linguistically diverse, while the item space exhibits long-tail and sparse distributions. Existing methods struggle with (1) generalizing to varied dialogue expressions due to underutilization of rich textual cues, and (2) learning informative item representations under severe sparsity. To address these problems, we propose a CRS model named DACRS. It consists of three modules, namely Dialogue Augmentation, Knowledge-Guided Entity Modeling, and Dialogue-Entity Matching. In the Dialogue Augmentation module, we apply a two-stage augmentation pipeline to augment the dialogue context to enrich the data and improve generalizability. In the Knowledge-Guided Entity Modeling module, we propose a knowledge graph (KG) based entity substitution and an entity similarity constraint to enhance the expressiveness of entity embeddings. In the Dialogue-Entity Matching module, we fuse the dialogue embedding with the mentioned entity embeddings through a dialogue-guided attention aggregation to acquire user embeddings that contain both the explicit and implicit user preferences. Extensive experiments on two public datasets demonstrate the state-of-the-art performance of DACRS.
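
The dialogue-guided attention aggregation can be sketched in a few lines. The shapes and dot-product scoring below are my assumptions from the abstract, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def dialogue_guided_user_embedding(dialogue: torch.Tensor,
                                   entities: torch.Tensor) -> torch.Tensor:
    """dialogue: (d,) context embedding; entities: (num_entities, d)."""
    scores = entities @ dialogue             # attention logits per entity
    weights = F.softmax(scores, dim=0)       # dialogue decides entity weights
    entity_summary = weights @ entities      # weighted mix of entity vectors
    return dialogue + entity_summary         # fuse explicit + implicit prefs

d = 64
user = dialogue_guided_user_embedding(torch.randn(d), torch.randn(5, d))
print(user.shape)  # torch.Size([64])
```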

[IR-5] Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training

Link: https://arxiv.org/abs/2507.00477
Authors: Qi Wang, Yixuan Cao, Yifan Liu, Jiangtao Zhao, Ping Luo
Subjects: Information Retrieval (cs.IR)
*Note:

Abstract:A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model’s knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.
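
The "read the docs" step is ordinary continual pre-training with a causal-LM objective on domain documents, followed later by rewriting fine-tuning. A schematic sketch follows; the base model, file name, and hyperparameters are placeholders, not the paper's setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder rewriter LM
tok.pad_token = tok.eos_token                      # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical corpus of professional documents, one passage per line.
docs = load_dataset("text", data_files={"train": "domain_docs.txt"})
train = docs["train"].map(
    lambda b: tok(b["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rr-rewriter", num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()   # "read the docs"; rewriting SFT would follow this step
```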

Attachment Download

Click to download the full list of today's papers