This post contains the latest paper list retrieved from Arxiv.org on 2025-07-03. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: the paper data is retrieved from Arxiv.org and updated automatically at around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-07-03)

472 papers were updated today, including:

  • Natural Language Processing: 59 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 126 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 111 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 147 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Test-Time Scaling with Reflective Generative Model

[Quick Read]: This paper tackles the challenge of balancing reasoning efficiency and performance in generative AI, in particular how to cut model parameters while staying close to frontier models such as OpenAI-o3-mini. The key is a self-supervised process reward model (SPRM): by sharing a backbone network and attaching task-specific heads for next-token prediction and process scoring, it integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning.

Link: https://arxiv.org/abs/2507.01951
Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Institutions: Meta; Stone-AI; USTC (University of Science and Technology of China)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3’s performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra process annotation, reducing over 99% PRM parameters for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI-o3-mini’s series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at this https URL.
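To make the shared-backbone design concrete, here is a minimal PyTorch sketch of a policy model with an added process-scoring head, in the spirit of SPRM. All names (SPRMHeads, the toy embedding backbone, the sizes) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class SPRMHeads(nn.Module):
    # One shared trunk feeds two task-specific heads: next-token
    # prediction (policy) and per-step process scoring (reward).
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                                  # shared representation
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.prm_head = nn.Linear(hidden_size, 1)                 # one linear layer, not a full PRM

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)                         # (batch, seq, hidden)
        logits = self.lm_head(hidden)                             # policy output
        step_scores = torch.sigmoid(self.prm_head(hidden)).squeeze(-1)  # process scores
        return logits, step_scores

backbone = nn.Embedding(1000, 64)    # toy stand-in for a full decoder stack
model = SPRMHeads(backbone, hidden_size=64, vocab_size=1000)
logits, scores = model(torch.randint(0, 1000, (2, 16)))

Because the scoring head reuses the trunk, the extra parameter cost is a single linear layer, which is how a more-than-99% reduction relative to a separate PRM becomes possible.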

[NLP-1] The Thin Line Between Comprehension and Persuasion in LLMs

[Quick Read]: This paper examines the limits of large language models' (LLMs) comprehension and reasoning in dialogue, in particular their weak grasp of complex dialogical structures and pragmatic context. The key is to evaluate LLMs' ability to sustain a debate and to measure how far that ability rests on understanding what is being talked about, exposing potential flaws when LLMs are deployed as evaluators in sensitive domains. The study finds that although LLMs can generate coherent and persuasive debates, they lack genuine understanding of deeper dialogical structure and pragmatic context, suggesting that a model's effectiveness does not depend on truly grasping the content of the conversation.

Link: https://arxiv.org/abs/2507.01936
Authors: Adrian de Wynter, Tangming Yuan
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being fast deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts on their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate–one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourage people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence are secondary to effectiveness.

[NLP-2] Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla

[Quick Read]: This paper targets the underperformance of speech recognition systems for low-resource languages such as Bangla. The key is to take state-of-the-art automatic speech recognition (ASR) models trained on large multilingual text and speech datasets and improve them through systematic fine-tuning and hyperparameter optimization. Comparing OpenAI's Whisper with Facebook's Wav2Vec-BERT, the study shows that Wav2Vec-BERT is superior in word error rate (WER), character error rate (CER), training time, and computational efficiency, offering a useful reference for building efficient and robust speech recognition systems in low-resource settings.

Link: https://arxiv.org/abs/2507.01931
Authors: Md Sazzadul Islam Ridoy, Sumi Akter, Md. Aminur Rahman
Institutions: Ahsanullah University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In recent years, neural models trained on large multilingual text and speech datasets have shown great potential for supporting low-resource languages. This study investigates the performances of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI’s Whisper (Small Large-V2) and Facebook’s Wav2Vec-BERT on Bangla, a low-resource language. We have conducted experiments using two publicly available datasets: Mozilla Common Voice-17 and OpenSLR to evaluate model performances. Through systematic fine-tuning and hyperparameter optimization, including learning rate, epochs, and model checkpoint selection, we have compared the models based on Word Error Rate (WER), Character Error Rate (CER), Training Time, and Computational Efficiency. The Wav2Vec-BERT model outperformed Whisper across all key evaluation metrics, demonstrated superior performance while requiring fewer computational resources, and offered valuable insights to develop robust speech recognition systems in low-resource linguistic settings.
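For reference, the WER and CER metrics compared in the paper can be computed with the open-source jiwer package; the transcripts below are placeholder examples, not data from the study.

# pip install jiwer
import jiwer

reference = "the cat sat on the mat"       # ground-truth transcript (placeholder)
hypothesis = "the cat sit on mat"          # ASR output (placeholder)

wer = jiwer.wer(reference, hypothesis)     # word-level error rate
cer = jiwer.cer(reference, hypothesis)     # character-level error rate
print(f"WER={wer:.3f}  CER={cer:.3f}")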

[NLP-3] Decision-oriented Text Evaluation

[Quick Read]: This paper addresses the weak correlation between traditional intrinsic evaluation of natural language generation (NLG), such as n-gram overlap or sentence plausibility, and actual decision-making efficacy in high-stakes domains. The key is a decision-oriented evaluation framework that judges generated text by directly measuring its influence on the decisions of humans and large language models (LLMs). Using market digest texts as the test case, decision quality is assessed via the trading performance of human investors and autonomous LLM agents informed exclusively by those texts, revealing the practical value of generated text for enabling synergistic human-LLM decision-making.

Link: https://arxiv.org/abs/2507.01923
Authors: Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen
Institutions: Academia Sinica; National Taiwan University; National Institute of Advanced Industrial Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts–including objective morning summaries and subjective closing-bell analyses–as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.

[NLP-4] NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks

[Quick Read]: This paper asks how to effectively improve a student model's reasoning ability, specifically by distilling reasoning traces from a teacher model. The key is curating the high-quality "NaturalThoughts" dataset by selecting reasoning traces from a strong teacher model on natural reasoning questions, showing that selecting diverse and difficult examples improves sample efficiency and the transfer of reasoning capabilities.

Link: https://arxiv.org/abs/2507.01921
Authors: Yang Li, Youssef Emad, Karthik Padthe, Jack Lanchantin, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li
Institutions: FAIR at Meta (Facebook AI Research)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model’s reasoning capabilities. In this work we curate high-quality “NaturalThoughts” by selecting reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning (Yuan et al. 2025). We first conduct a systematic analysis of factors that affect distilling reasoning capabilities, in terms of sample efficiency and scalability for general reasoning tasks. We observe that simply scaling up data size with random sampling is a strong baseline with steady performance gains. Further, we find that selecting difficult examples that require more diverse reasoning strategies is more sample-efficient to transfer the teacher model’s reasoning skills. Evaluated on both Llama and Qwen models, training with NaturalThoughts outperforms existing reasoning datasets such as OpenThoughts, LIMO, etc. on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA.

[NLP-5] Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models ACL2025

[Quick Read]: This paper addresses aligning large language models (LLMs) with diverse and potentially conflicting human preferences. The authors frame human value alignment as a multi-objective optimization problem and propose a new fine-tuning paradigm, Gradient-Adaptive Policy Optimization (GAPO). GAPO's key idea is to use multiple-gradient descent to adaptively rescale the gradient of each objective and determine an update direction that optimally balances the trade-offs between objectives. P-GAPO further incorporates user preferences over the different objectives to reach Pareto-optimal solutions that better match a user's specific needs.

Link: https://arxiv.org/abs/2507.01915
Authors: Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, Qing He
Institutions: Chinese Academy of Sciences; State Key Lab of AI Safety; University of Chinese Academy of Sciences; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 3 figures. Accepted by ACL 2025 (main)

Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user’s specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
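GAPO builds on multiple-gradient descent. As a hedged illustration only, the two-objective case has a closed-form min-norm combination (the classic MGDA solution); GAPO's adaptive rescaling generalizes beyond this sketch, and the toy gradients below are invented.

import torch

def balanced_update(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    # Min-norm convex combination alpha*g1 + (1-alpha)*g2 of two objective
    # gradients; the result has a non-negative inner product with both
    # gradients, so neither objective is sacrificed for the other.
    diff = g1 - g2
    alpha = torch.dot(g2 - g1, g2) / (torch.dot(diff, diff) + 1e-12)
    alpha = alpha.clamp(0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2

g_helpful = torch.tensor([1.0, 0.0])     # gradient of a helpfulness loss (toy)
g_harmless = torch.tensor([0.6, 0.8])    # gradient of a harmlessness loss (toy)
print(balanced_update(g_helpful, g_harmless))   # tensor([0.8000, 0.4000])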

[NLP-6] AI4Research: A Survey of Artificial Intelligence for Scientific Research

[Quick Read]: This paper addresses the absence of a systematic survey of AI for scientific research (AI4Research), a gap that hampers understanding and further development of the field. The key contributions are a systematic taxonomy of the field's mainstream tasks, the identification of key research gaps and future directions, and a rich compilation of applications and resources, offering a unified perspective and practical tools to drive innovative breakthroughs in AI4Research.

Link: https://arxiv.org/abs/2507.01903
Authors: Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che
Institutions: Harbin Institute of Technology; Central South University; The University of Hong Kong; Fudan University; Chinese University of Hong Kong; ByteDance Seed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.

[NLP-7] High-Layer Attention Pruning with Rescaling

[Quick Read]: This paper addresses the failure of conventional training-free structured pruning methods to account for the positions of attention heads within the network, which leads them to remove heads indiscriminately and hurt model performance. The key is a novel pruning algorithm that strategically prunes attention heads in the model's higher layers and introduces an adaptive rescaling parameter that calibrates the scale of token representations after pruning, counteracting the representation-scale shift that pruning causes.

Link: https://arxiv.org/abs/2507.01900
Authors: Songtao Liu, Peng Liu
Institutions: The Pennsylvania State University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model’s higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines.
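A simplified sketch of the two ingredients, pruning heads in a chosen (higher) layer and rescaling the surviving representation. Real attention concatenates head outputs before an output projection, which this toy sum glosses over, and the rescale value here is a naive guess rather than the paper's adaptively learned parameter.

import torch

def prune_and_rescale(head_outputs: torch.Tensor, keep: torch.Tensor,
                      rescale: float) -> torch.Tensor:
    # head_outputs: (batch, n_heads, seq, dim); keep: (n_heads,) 0/1 mask.
    # Dropping heads shrinks the combined representation, so a rescaling
    # parameter restores its magnitude after pruning.
    masked = head_outputs * keep.view(1, -1, 1, 1)
    return rescale * masked.sum(dim=1)

outputs = torch.randn(2, 8, 16, 64)                        # toy per-head outputs
keep = torch.tensor([1, 1, 1, 1, 1, 1, 0, 0]).float()      # prune the last two heads
calibrated = prune_and_rescale(outputs, keep, rescale=8.0 / 6.0)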

[NLP-8] MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

[Quick Read]: This paper addresses the "SLMs Learnability Gap": small language models (SLMs) struggle to learn long-form chain-of-thought (CoT) reasoning because of their limited capacity. The key is the MiCoTA framework, which uses intermediate-sized models as teacher assistants and intermediate-length CoT sequences to bridge both the capacity gap and the reasoning-length gap, improving SLMs' reasoning performance.

Link: https://arxiv.org/abs/2507.01887
Authors: Dongyi Ding, Tiannan Wang, Chenghao Zhu, Meiling Tao, Yuchen Eleanor Jiang, Wangchunshu Zhou
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the “SLMs Learnability Gap”. To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.

[NLP-9] DIY-MKG: An LLM-Based Polyglot Language Learning System EMNLP2025

[Quick Read]: This paper addresses the shortcomings of existing language-learning tools in helping polyglot learners build cross-lingual vocabulary connections, customize learning pace to individual needs, and avoid detrimental cognitive offloading. The key is DIY-MKG, an open-source system that lets users build personalized multilingual vocabulary knowledge graphs, expanded selectively with related words suggested by a large language model (LLM), and enhanced with rich annotation capabilities and an adaptive review module that generates dynamic, personalized quizzes, improving learning effectiveness and user engagement.

Link: https://arxiv.org/abs/2507.01872
Authors: Kenan Tang, Yanhong Li, Yao Qin
Institutions: UCSB (University of California, Santa Barbara); University of Chicago
Subjects: Computation and Language (cs.CL)
Comments: Submitted to EMNLP 2025 System Demonstration

Abstract:Existing language learning tools, even those powered by Large Language Models (LLMs), often lack support for polyglot learners to build linguistic connections across vocabularies in multiple languages, provide limited customization for individual learning paces or needs, and suffer from detrimental cognitive offloading. To address these limitations, we design Do-It-Yourself Multilingual Knowledge Graph (DIY-MKG), an open-source system that supports polyglot language learning. DIY-MKG allows the user to build personalized vocabulary knowledge graphs, which are constructed by selective expansion with related words suggested by an LLM. The system further enhances learning through rich annotation capabilities and an adaptive review module that leverages LLMs for dynamic, personalized quiz generation. In addition, DIY-MKG allows users to flag incorrect quiz questions, simultaneously increasing user engagement and providing a feedback loop for prompt refinement. Our evaluation of LLM-based components in DIY-MKG shows that vocabulary expansion is reliable and fair across multiple languages, and that the generated quizzes are highly accurate, validating the robustness of DIY-MKG.

[NLP-10] Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

[Quick Read]: This paper addresses the fact that current evaluation frameworks for large language models (LLMs) lean heavily on English-centric benchmarks and fail to serve linguistically diverse regions such as India. The key is EKA-EVAL, a unified, production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, across categories such as reasoning, mathematics, tool use, long-context understanding, and reading comprehension, with built-in support for distributed inference, quantization, and multi-GPU usage, yielding an end-to-end, extensible evaluation suite for both global and Indic LLMs.

Link: https://arxiv.org/abs/2507.01853
Authors: Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
Institutions: NMIMS; Soket AI; Indian Institute of Technology Gandhinagar; LINGO Research Group
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at this https URL eka-eval and a part of ongoing EKA initiative (this https URL), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.

[NLP-11] Low-Perplexity LLM-Generated Sequences and Where To Find Them ACL2025

[Quick Read]: This paper asks how the training data of large language models (LLMs) shapes their outputs, with the goal of improving transparency, accountability, privacy, and fairness. The key is a systematic approach centered on analyzing low-perplexity sequences, the high-probability text spans a model generates, and tracing them back to their sources in the training data, shedding light on how the model uses and replicates what it was trained on.

Link: https://arxiv.org/abs/2507.01844
Authors: Arthur Wuhrmann, Anastasiia Kucherenko, Andrei Kucharavy
Institutions: École Polytechnique Fédérale de Lausanne; Institute of Entrepreneurship and Management, HES-SO Valais-Wallis; Institute of Informatics, HES-SO Valais-Wallis
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Camera-ready version. Accepted to ACL 2025. 10 pages, 4 figures, 6 tables

Abstract:As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences - high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving a way toward better understanding of how LLMs training data impacts their behavior.
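A sketch of the span-extraction step: greedily grow windows whose perplexity (the exponential of the negative mean token log-probability) stays under a threshold. The threshold and minimum length are invented for illustration; the paper's full pipeline and corpus-matching step are more involved.

import math

def low_perplexity_spans(tokens, logprobs, max_ppl=1.5, min_len=8):
    # tokens[i] is the i-th generated token; logprobs[i] is the model's
    # log-probability of that token given its prefix.
    spans, start = [], 0
    while start < len(tokens):
        end, total = start, 0.0
        while end < len(tokens):
            total += logprobs[end]
            if math.exp(-total / (end - start + 1)) > max_ppl:
                break
            end += 1
        if end - start >= min_len:
            spans.append((start, end, tokens[start:end]))  # candidate to trace back
        start = max(end, start + 1)
    return spans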

[NLP-12] Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes ACL

[Quick Read]: This paper addresses the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. The key is a comparative analysis of serialization formats (JSON, YAML, and XML) on this task, with experiments showing that JSON yields the highest parseability. The study also shows that structural robustness depends on prompting strategy, model size, document length, and note type, offering practical guidance for choosing serialization formats and designing prompts in privacy-sensitive clinical settings.

Link: https://arxiv.org/abs/2507.01810
Authors: Nikita Neveditsin, Pawan Lingras, Vijay Mago
Institutions: Saint Mary’s University; York University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: To appear in the ACL Anthology

Abstract:We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
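The parseability criterion itself is straightforward to operationalize; here is a sketch using the standard json, PyYAML, and ElementTree parsers for the paper's three candidate formats. The toy outputs are invented.

import json
import xml.etree.ElementTree as ET
import yaml   # pip install pyyaml

def parseable(output: str, fmt: str) -> bool:
    # True if a model's raw output parses in the requested serialization.
    try:
        {"json": json.loads, "yaml": yaml.safe_load, "xml": ET.fromstring}[fmt](output)
        return True
    except Exception:
        return False

outputs = ['{"finding": "edema", "severity": "mild"}', '{"finding": "edema",']
print(sum(parseable(o, "json") for o in outputs) / len(outputs))  # parseability rate: 0.5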

[NLP-13] LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs ICML2025

[Quick Read]: This paper addresses parameter-efficient fine-tuning of large language models (LLMs) in compute-constrained environments, such as a standard laptop CPU, where traditional GPU-based training is out of reach. The key is a theoretically grounded LoRA fine-tuning method that leverages a large bank of pre-trained adapters to learn a meta-operator mapping any input dataset, represented as a probability distribution, to a set of LoRA weights; adapters are then constructed directly on CPU via lightweight combinations of existing LoRAs, providing a practical and accessible alternative to GPU-based fine-tuning.

Link: https://arxiv.org/abs/2507.01806
Authors: Reza Arabpour, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 5-page main paper (excluding references) + 11-page appendix, 3 tables, 1 figure. Accepted to ICML 2025 Workshop on Efficient Systems for Foundation Models

Abstract:Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates. However, their widespread adoption remains limited by the reliance on GPU-based training. In this work, we propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources, particularly those restricted to standard laptop CPUs. Our method learns a meta-operator that maps any input dataset, represented as a probability distribution, to a set of LoRA weights by leveraging a large bank of pre-trained adapters for the Mistral-7B-Instruct-v0.2 model. Instead of performing new gradient-based updates, our pipeline constructs adapters via lightweight combinations of existing LoRAs directly on CPU. While the resulting adapters do not match the performance of GPU-trained counterparts, they consistently outperform the base Mistral model on downstream tasks, offering a practical and accessible alternative to traditional GPU-based fine-tuning.
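A minimal sketch of the "combine existing adapters on CPU" idea: mix the low-rank factors of a bank of LoRAs with dataset-dependent weights. The shapes and the plain weighted average are assumptions for illustration; the paper learns the mapping from dataset distribution to weights (a meta-operator), and averaging the A and B factors separately only approximates mixing the full low-rank updates.

import torch

def combine_loras(bank, weights):
    # bank: list of adapters, each a dict {layer_name: (A, B)} of low-rank
    # factors; weights: mixture coefficients, e.g. from dataset similarity.
    combined = {}
    for name in bank[0]:
        A = sum(w * adapter[name][0] for w, adapter in zip(weights, bank))
        B = sum(w * adapter[name][1] for w, adapter in zip(weights, bank))
        combined[name] = (A, B)   # new adapter, built without any gradient step
    return combined

bank = [{"q_proj": (torch.randn(8, 4096), torch.randn(4096, 8))} for _ in range(3)]
adapter = combine_loras(bank, weights=[0.5, 0.3, 0.2])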

[NLP-14] The Anatomy of Evidence: An Investigation Into Explainable ICD Coding ACL2025

[Quick Read]: This paper addresses the insufficient evaluation of explainability in automatic medical coding: due to the scarcity of annotated data, existing work has been mostly limited to short texts and binary settings. The key is an in-depth analysis of the MDACE dataset and a plausibility evaluation of current explainable medical coding systems from an applied perspective, introducing match measures to identify success and failure cases and deriving recommendations for developing and evaluating explainable medical coding systems.

Link: https://arxiv.org/abs/2507.01802
Authors: Katharina Beckh, Elisa Studeny, Sujan Sai Gannamaneni, Dario Antweiler, Stefan Rüping
Institutions: Fraunhofer IAIS; Lamarr Institute
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025 Findings

Abstract:Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.

[NLP-15] How Do Vision-Language Models Process Conflicting Information Across Modalities?

[Quick Read]: This paper investigates how multimodal AI models behave when different input streams provide conflicting information, focusing on the response patterns of vision-language models under inconsistent inputs. The key finding is that a model's behavioral preference for one modality is reflected in its internal representational structure: specific attention heads can restructure representations to favor one modality over the other, and modality-agnostic "router heads" promote answers about the modality requested in the instruction and can be manipulated or transferred to improve performance across datasets and modalities.

Link: https://arxiv.org/abs/2507.01790
Authors: Tianze Hua, Tian Yun, Ellie Pavlick
Institutions: Brown University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: All code and resources are available at: this https URL

Abstract:AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption “A photo of a cat”) and ask the model to report the information present in one of the specific modalities (e.g., “What does the caption say / What is in the image?”). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic “router heads” which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.

[NLP-16] Probing Evaluation Awareness of Language Models ICML

[Quick Read]: This paper addresses the safety and policy issues raised by evaluation awareness, the ability of language models to distinguish between testing and deployment phases, which could undermine the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. The key result is that linear probes can separate real-world evaluation and deployment prompts, indicating that current models internally represent this distinction, and that current safety evaluations are correctly classified by the probes, suggesting they already appear artificial or inauthentic to models. The work underscores the importance of trustworthy evaluations and of understanding deceptive capabilities, and shows how model internals can support black-box methods in safety audits.

Link: https://arxiv.org/abs/2507.01786
Authors: Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical AI Governance Workshop, ICML (Poster)

Abstract:Language models can distinguish between testing and deployment phases – a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.
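Linear probing of this kind reduces to logistic regression on hidden activations. A runnable toy, with random features standing in for residual-stream activations extracted from the model under study (Llama-3.3-70B-Instruct in the paper):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(400, 128)            # stand-in activations, one row per prompt
y = np.random.randint(0, 2, size=400)    # 1 = evaluation prompt, 0 = deployment prompt

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~0.5 on random features; much higher on real activations per the paper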

[NLP-17] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

[Quick Read]: This paper addresses the inadequate assessment of data quality for multilingual LLM training, since existing model-based selection methods focus almost exclusively on English. The key is the MuRating framework, which transfers high-quality English data-quality signals into a single rater for 17 target languages: multiple English "raters" are aggregated via pairwise comparisons to learn unified document-quality scores, and these judgments are then projected through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs.

Link: https://arxiv.org/abs/2507.01785
Authors: Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang
Institutions: University of Technology Sydney; ByteDance; University of Liverpool
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores,then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2 B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.

[NLP-18] Data interference: emojis homoglyphs and issues of data fidelity in corpora and their results

[Quick Read]: This paper addresses how discrepancies in tokenisation lead to inaccurate representations of language data and reduce the validity of analytical findings, focusing on the challenges posed by emojis and homoglyphs. The key is preprocessing these elements so that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The study emphasizes that a detailed understanding of both the linguistic and technical aspects of digital textual data is essential for accurate corpus analysis.

Link: https://arxiv.org/abs/2507.01764
Authors: Matteo Di Cristofaro
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Author submitted manuscript

Abstract:Tokenisation - “the process of splitting text into atomic parts” (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
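A sketch of the kind of preprocessing the paper argues for: fold compatibility variants with NFKC, map a (sample) table of cross-script look-alikes that NFKC leaves alone, and isolate emojis so the tokeniser keeps each one atomic. The mapping table is a tiny illustrative subset, not an exhaustive resource.

import unicodedata

HOMOGLYPHS = {"а": "a", "е": "e", "о": "o"}   # Cyrillic look-alikes -> Latin (sample only)

def normalise(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)                 # fold compatibility variants
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)      # repair cross-script homoglyphs
    # Pad symbols (Unicode category "So", which covers most emoji) with
    # spaces so downstream tokenisation treats each as one atomic unit.
    return "".join(f" {ch} " if unicodedata.category(ch) == "So" else ch for ch in text)

print(normalise("gооd morning ☀ team"))   # the two Cyrillic "о"s are folded to Latin "o"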

[NLP-19] Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training

[Quick Read]: This paper addresses the privacy and security issues caused by gradient-based optimization's reliance on large volumes of labeled data, such as susceptibility to data poisoning attacks and the risk of overfitting. The key is BBoxER, an evolutionary black-box optimization method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Exploiting the tractability of the information flow, it provides theoretical guarantees on generalization, differential privacy, robustness to data poisoning, and robustness to extraction attacks, yielding a lightweight, modular enhancement suitable for restricted or privacy-sensitive environments.

Link: https://arxiv.org/abs/2507.01752
Authors: Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
Institutions: Meta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, its reliance on large volumes of labeled data raises privacy and security concerns such as susceptibility to data poisoning attacks and the risk of overfitting. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. However, black box methods also pose significant challenges, including poor scalability to high-dimensional parameter spaces, as prevalent in large language models (LLMs), and high computational costs due to reliance on numerous model evaluations. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide strong theoretical bounds on generalization, differential privacy, susceptibility to data poisoning attacks, and robustness to extraction attacks. BBoxER operates on top of pre-trained LLMs, offering a lightweight and modular enhancement suitable for deployment in restricted or privacy-sensitive environments, in addition to non-vacuous generalization guarantees. In experiments with LLMs, we demonstrate empirically that Retrofitting methods are able to learn, showing how a few iterations of BBoxER improve performance and generalize well on a benchmark of reasoning datasets. This positions BBoxER as an attractive add-on on top of gradient-based optimization.
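BBoxER's own operators aside, the black-box setting it works in can be illustrated with a generic (1+1) evolution strategy: the model is touched only through scalar evaluations, never through gradients of the training data. Everything below is a generic stand-in, not the paper's algorithm.

import numpy as np

def one_plus_one_es(score, x0, sigma=0.1, steps=200, seed=0):
    # Mutate the current parameters; keep the candidate if it scores better.
    rng = np.random.default_rng(seed)
    x, best = x0, score(x0)
    for _ in range(steps):
        cand = x + sigma * rng.standard_normal(x.shape)
        s = score(cand)
        if s >= best:
            x, best = cand, s
    return x, best

# Toy objective standing in for "benchmark score of the retrofitted LLM".
x, best = one_plus_one_es(lambda v: -float(np.sum(v ** 2)), np.ones(8))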

[NLP-20] ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving ECCV2024

[Quick Read]: This workshop targets the recognition and handling of corner cases in autonomous driving, exploring next-generation solutions powered by state-of-the-art multimodal perception and comprehension techniques. The key is leveraging multimodal perception to improve understanding of complex scenes, and advancing the field through a dual-track challenge covering both corner-case scene understanding and generation, thereby narrowing the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving systems that are robust to corner cases.

Link: https://arxiv.org/abs/2507.01735
Authors: Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ECCV 2024. Workshop page: this https URL

Abstract:In this paper, we present details of the 1st W-CODA workshop, held in conjunction with the ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. 5 Speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As the pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.

[NLP-21] LLMs for Legal Subsumption in German Employment Contracts

[Quick Read]: This paper addresses the limited applicability of NLP models in dynamic legal environments due to their lack of interpretability and trustworthiness. The key is collaborating with legal experts to extend an existing dataset and exploring large language models (LLMs) with in-context learning to assess the legality of clauses in German employment contracts. Classification performance is evaluated under three legal-context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (examination guidelines). Results show that examination guidelines significantly improve recall for void clauses and the weighted F1-Score, reaching 80%.

Link: https://arxiv.org/abs/2507.01734
Authors: Oliver Wardas, Florian Matthes
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: PrePrint - ICAIL25, Chicago

Abstract:Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as “valid,” “unfair,” or “void” under three legal context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (referred to as examination guidelines). Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80%. Despite these advancements, LLMs’ performance when using full-text sources remains substantially below that of human lawyers. We contribute an extended dataset, including examination guidelines, referenced legal sources, and corresponding annotations, alongside our code and all log files. Our findings highlight the potential of LLMs to assist lawyers in contract legality review while also underscoring the limitations of the methods presented.

[NLP-22] Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach

[Quick Read]: This paper addresses bias and stereotypes in language models, which can cause harm in sensitive areas such as content moderation and decision-making. The key is jointly learning bias and stereotype detection: experiments show that joint training significantly outperforms training the two tasks separately, and that the improvement stems from the intrinsic connection between bias and stereotypes rather than from multi-task learning itself.

Link: https://arxiv.org/abs/2507.01715
Authors: Aditya Tomar, Rudra Murthy, Pushpak Bhattacharyya
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.

[NLP-23] AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness ACL2025

[Quick Read]: This paper addresses the inadequate evaluation of multimodal large language models (mLLMs) on understanding meme harmfulness. Existing benchmarks rely on static datasets and accuracy-based, model-agnostic evaluation, and cannot provide up-to-date, thorough assessments of memes that evolve dynamically online. The key is AdamMeme, an agent-based evaluation framework in which multiple agents collaborate to dynamically update the meme data and iteratively introduce challenging samples, systematically revealing the varying performance of different target mLLMs and their specific weaknesses in harmfulness understanding.

Link: https://arxiv.org/abs/2507.01702
Authors: Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, Jing Ma
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025

Abstract:The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at this https URL.

[NLP-24] Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

[Quick Read]: This paper addresses the respective limitations of the two post-training paradigms for large language models (LLMs): supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT excels at mimicking demonstration data but, as a form of behavior cloning, can generalize poorly; RFT can substantially improve performance but is prone to learning unexpected behaviors and is highly sensitive to the initial policy. The key is Prefix-RFT, a hybrid approach that blends learning from demonstrations with learning from exploration. Experiments on mathematical reasoning problems demonstrate its effectiveness, and it integrates seamlessly into existing open-source frameworks with only minimal changes to the standard RFT pipeline, effectively harmonizing the complementary strengths of SFT and RFT to improve performance and robustness.

Link: https://arxiv.org/abs/2507.01679
Authors: Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov
Institutions: University of Edinburgh; Fudan University; Alibaba Group; Stepfun; University of Amsterdam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model’s performance but is prone to learn unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method’s robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.

[NLP-25] Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings

[Quick Read]: This paper investigates the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. The key is a modular adapter-based approach (MAD-X), which significantly improves performance on seen and partially seen languages without requiring labeled data in the target language. The study also finds that the model's prior exposure to the target language, directly or through a related language, is the most consistent predictor of transfer success.

Link: https://arxiv.org/abs/2507.01645
Authors: Rifki Afina Putri
Institutions: Universitas Gadjah Mada
Subjects: Computation and Language (cs.CL)
Comments: AMLDS 2025

Abstract:In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model’s prior exposure to the language, either directly or through a related language.

[NLP-26] Confidence and Stability of Global and Pairwise Scores in NLP Evaluation ACL

[Quick Read]: This paper addresses how to choose a model evaluation strategy in NLP that more accurately reflects model performance. Traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench) and pairwise-comparison leaderboards (e.g., LMSYS Arena) each have strengths and weaknesses, which the paper analyzes empirically. The key is combining both views, running computational experiments with standard global metrics and the popular Bradley-Terry model for pairwise comparisons, to ground the choice of evaluation strategy for different scenarios; in particular, pairwise comparisons are especially effective at identifying strong contenders on tasks where quality metrics are hard to define.

Link: https://arxiv.org/abs/2507.01633
Authors: Georgii Levtsov, Dmitry Ustalov
Institutions: Neapolis University Pafos / JetBrains; JetBrains
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages, accepted at ACL SRW 2025

Abstract:With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at this https URL under a permissive license.
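The Bradley-Terry model used on the pairwise side can be fitted with the classic minorization-maximization (Zermelo) updates; a compact sketch with an invented win matrix.

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    # wins[i, j] = how often model i beat model j; returns strengths p
    # with P(i beats j) = p[i] / (p[i] + p[j]).
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            if denom > 0:
                p[i] = total_wins / denom
        p /= p.sum()   # fix the scale; only strength ratios are identified
    return p

wins = np.array([[0, 8, 6], [2, 0, 5], [4, 5, 0]])   # toy head-to-head counts
print(bradley_terry(wins))   # model 0 gets the largest strength here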

[NLP-27] Chart Question Answering from Real-World Analytical Narratives ACL

[Quick Read]: This paper addresses the mismatch between existing chart question answering (CQA) benchmarks and real analytical workflows. The key is a new dataset constructed from visualization notebooks, featuring real-world, multi-view charts paired with natural language questions grounded in analytical narratives. It reflects ecologically valid reasoning workflows and provides a more authentic evaluation setting for CQA research.

Link: https://arxiv.org/abs/2507.01627
Authors: Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This paper has been accepted to the ACL Student Research Workshop (SRW) 2025

Abstract:We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a significant performance gap, with GPT-4.1 achieving an accuracy of 69.3%, underscoring the challenges posed by this more authentic CQA setting.

[NLP-28] Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

[Quick Read]: This paper addresses the inefficiency and poor adaptability of traditional Data+AI systems, which rely on human experts to orchestrate system pipelines as data, queries, tasks, and environments change. The core difficulty is that existing systems have limited semantic understanding, reasoning, and planning capabilities. The key is to incorporate large language model (LLM) techniques and propose the "Data Agent" architecture, which integrates knowledge comprehension, reasoning, and planning capabilities to orchestrate and manage Data+AI ecosystems effectively.

Link: https://arxiv.org/abs/2507.01599
Authors: Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li
Institutions: Tsinghua University
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a ‘Data Agent’ - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.

[NLP-29] T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning

[Quick Read]: This paper addresses two key problems in temporal knowledge graph reasoning (TKGR): inadequate modeling of the event distribution shift between training and test samples, and reliance on random entity substitution for negative sampling, which yields low-quality samples. The key is Test-Time Training-guided Distribution shift Modelling (T3DM), a training approach that adjusts the model to distribution shift and preserves the global consistency of model reasoning, together with an adversarial-training-based negative-sampling strategy that generates higher-quality negative quadruples.

Link: https://arxiv.org/abs/2507.01597
Authors: Yuehang Si, Zefan Zeng, Jincai Huang, Qing Cheng
Institutions: National University of Defense Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Temporal Knowledge Graph (TKG) is an efficient method for describing the dynamic development of facts along a timeline. Most research on TKG reasoning (TKGR) focuses on modelling the repetition of global facts and designing patterns of local historical facts. However, they face two significant challenges: inadequate modeling of the event distribution shift between training and test samples, and reliance on random entity substitution for generating negative samples, which often results in low-quality sampling. To this end, we propose a novel distributional feature modeling approach for training TKGR models, Test-Time Training-guided Distribution shift Modelling (T3DM), to adjust the model based on distribution shift and ensure the global consistency of model reasoning. In addition, we design a negative-sampling strategy to generate higher-quality negative quadruples based on adversarial training. Extensive experiments show that T3DM provides better and more robust results than the state-of-the-art baselines in most cases.

[NLP-30] Emotionally Intelligent Task-oriented Dialogue Systems: Architecture Representation and Optimisation

[Quick Read]: This paper addresses the challenge of achieving task success, emotional understanding and responsiveness, and precise information conveyance for task-oriented dialogue (ToD) systems operating in inherently noisy and ambiguous conversational environments. The key is LUSTER, an LLM-based unified system for task-oriented dialogue trained end-to-end with reinforcement learning that combines short-term (user sentiment) and long-term (task success) rewards, fusing structured reward modelling with LLM capability to yield more resilient and emotionally responsive ToD systems.

Link: https://arxiv.org/abs/2507.01594
Authors: Shutong Feng, Hsien-chin Lin, Nurul Lubis, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Renato Vukovic, Milica Gašić
Institutions: Heinrich Heine University Düsseldorf
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 6 figures

Abstract:Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisational as well as emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning, with both short-term (user sentiment) and long-term (task success) rewards. Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.

[NLP-31] Self-Guided Process Reward Optimization with Masked Step Advantage for Process Reinforcement Learning

[Quick Read]: This paper addresses the heavy computational overhead introduced by extra process reward models in process reinforcement learning (PRL), as well as the lack of a unified theoretical framework for process-level advantage estimation. The key is Self-Guided Process Reward Optimization (SPRO), whose core innovations are: a theoretical demonstration that process rewards can be derived intrinsically from the policy model itself, and well-defined cumulative process rewards with Masked Step Advantage (MSA), enabling rigorous step-wise action advantage estimation within shared-prompt sampling groups.

Link: https://arxiv.org/abs/2507.01551
Authors: Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua
Institutions: Terminus Group; The Hong Kong University of Science and Technology (Guangzhou); King Abdullah University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately 1/3, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefit industrial implementation.
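The abstract pins down the ingredients of MSA (cumulative process rewards, a shared-prompt group, masking) but not the exact formula; the sketch below is one plausible reading, not the paper's definition, with padding steps masked out of both the group baseline and the output.

import torch

def masked_step_advantage(step_rewards: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # step_rewards, mask: (G, T) for G responses to one shared prompt.
    # Advantage of step t = cumulative process reward up to t minus the
    # group-mean baseline at t, computed only over responses still alive.
    cum = torch.cumsum(step_rewards * mask, dim=-1)
    alive = mask.sum(dim=0).clamp(min=1.0)
    baseline = (cum * mask).sum(dim=0) / alive
    return (cum - baseline) * mask

rewards = torch.tensor([[0.2, 0.3, 0.5],
                        [0.1, 0.0, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0],
                     [1.0, 1.0, 0.0]])    # second response ends after two steps
print(masked_step_advantage(rewards, mask))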
zh
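
示例:下面给出 Masked Step Advantage 的一个极简计算示意,非论文官方实现;按摘要描述,在共享提示采样组内对逐步累积过程奖励做组内归一化,padding 位置用掩码屏蔽,具体的归一化方式为本文假设:

```python
import numpy as np

def masked_step_advantage(step_rewards: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Group-relative step-wise advantage over cumulative process rewards.

    step_rewards: (G, T) cumulative process rewards for G responses sampled
                  from the same prompt, zero-padded to max length T.
    mask:         (G, T) 1.0 at real steps, 0.0 at padding.
    """
    counts = np.clip(mask.sum(axis=0), 1.0, None)            # valid samples per step
    mean = (step_rewards * mask).sum(axis=0) / counts         # per-step group mean
    var = (((step_rewards - mean) * mask) ** 2).sum(axis=0) / counts
    std = np.sqrt(var) + 1e-8
    return (step_rewards - mean) / std * mask                 # masked, normalized advantage
```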

[NLP-32] Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants

【速读】: 该论文试图解决老年群体,尤其是城市中的移民老人,在表达个人叙事时面临的碎片化、边缘化或难以用语言描述的问题。其解决方案的关键在于通过AI辅助的共同创作过程,结合口述故事与汉字的象征性重构,使参与者能够利用大语言模型(Large Language Model, LLM)建议的小篆字形及实体材料,将生活经验转化为视觉和触觉表达,而无需具备数字素养。该方法重新定位了AI作为支持性机制的角色,而非内容生产者,从而在人机协作与老龄化背景下增强了叙事主体性。

链接: https://arxiv.org/abs/2507.01548
作者: Wen Zhan,Ziqun Hua,Peiyue Lin,Yunfei Chen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: A version of this manuscript has been submitted to the [IASDR 2025 Conference]( this https URL ) and is currently under review

点击查看摘要

Abstract:This paper explores how older adults, particularly aging migrants in urban China, can engage AI-assisted co-creation to express personal narratives that are often fragmented, underrepresented, or difficult to verbalize. Through a pilot workshop combining oral storytelling and the symbolic reconstruction of Hanzi, participants shared memories of migration and recreated new character forms using Xiaozhuan glyphs, suggested by the Large Language Model (LLM), together with physical materials. Supported by human facilitation and a soft AI presence, participants transformed lived experience into visual and tactile expressions without requiring digital literacy. This approach offers new perspectives on human-AI collaboration and aging by repositioning AI not as a content producer but as a supportive mechanism, and by supporting narrative agency within sociotechnical systems.
zh

[NLP-33] Is External Information Useful for Stance Detection with LLMs? ACL

【速读】: 该论文试图解决在基于大语言模型(Large Language Models, LLMs)的立场检测任务中,外部信息(如维基百科和网络搜索内容)是否能够提升模型性能的问题。以往研究表明,外部信息有助于传统模型(如基于BERT的系统)的立场检测,但其对LLMs的影响尚未明确。论文的关键解决方案是通过系统评估八种LLMs在三个数据集上的表现,发现外部信息在大多数情况下反而降低了模型性能,其核心原因是LLMs倾向于根据提供的外部信息的立场和情感进行预测,而非文本本身的实际立场。这一发现揭示了LLMs在立场分类任务中可能存在的信息偏差风险。

链接: https://arxiv.org/abs/2507.01543
作者: Quang Minh Nguyen,Taegyoon Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL Findings 2025

点击查看摘要

Abstract:In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs’ tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at this https URL.
zh
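
示例:论文比较了“有/无外部信息”两种提示设置下的立场检测表现,下面是一个示意性的提示构造函数,提示措辞为本文假设,仅用于说明实验设置:

```python
def stance_prompt(text: str, target: str, external: str = None) -> str:
    """Build a stance-detection prompt, optionally prepending external
    context (e.g., a Wikipedia excerpt), the setting the paper finds
    can actually hurt LLM accuracy."""
    prompt = (
        f"Text: {text}\n"
        f"Target: {target}\n"
        "Classify the text's stance toward the target as favorable, "
        "opposing, or neutral. Answer with a single word."
    )
    if external is not None:
        prompt = f"Background information:\n{external}\n\n{prompt}"
    return prompt
```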

[NLP-34] Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing

【速读】: 该论文旨在解决任务导向对话系统(Task-Oriented Dialogue Systems, TODS)中离域意图(Out-of-Scope, OOS)检测的问题,该问题在面对未见过或模糊的用户查询时对系统的鲁棒性提出了关键挑战。论文提出的解决方案的关键在于结合不确定性建模与微调的大语言模型(Large Language Models, LLMs),通过两阶段流程实现高效且准确的OOS检测:第一阶段利用不确定性估计对当前部署在真实场景中的在域意图分类器输出进行评估,第二阶段则在高不确定性情况下触发微调的LLM进行最终决策,从而在计算效率与性能之间取得平衡,并在多个OOS检测基准上取得了最先进结果。

链接: https://arxiv.org/abs/2507.01541
作者: Álvaro Zaera,Diana Nicoleta Popa,Ivan Sekulic,Paolo Rosso
机构: Telepathy Labs GmbH
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Out-of-scope (OOS) intent detection is a critical challenge in task-oriented dialogue systems (TODS), as it ensures robustness to unseen and ambiguous queries. In this work, we propose a novel but simple modular framework that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection. The first step applies uncertainty estimation to the output of an in-scope intent detection classifier, which is currently deployed in a real-world TODS handling tens of thousands of user interactions daily. The second step then leverages an emerging LLM-based approach, where a fine-tuned LLM is triggered to make a final decision on instances with high uncertainty. Unlike prior approaches, our method effectively balances computational efficiency and performance, combining traditional approaches with LLMs and yielding state-of-the-art results on key OOS detection benchmarks, including real-world OOS data acquired from a deployed TODS.
zh
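
示例:两阶段 OOS 检测的路由逻辑可示意如下。不确定性度量此处采用归一化熵,阈值与具体度量均为本文假设;`llm_classify` 代指微调后的 LLM 判定函数:

```python
import numpy as np

def route_intent(probs: np.ndarray, utterance: str, llm_classify, threshold: float = 0.5):
    """Stage 1: trust the in-scope intent classifier when it is confident.
    Stage 2: defer high-uncertainty inputs to a fine-tuned LLM."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    uncertainty = entropy / np.log(len(probs))   # normalized to [0, 1]
    if uncertainty < threshold:
        return int(np.argmax(probs))             # confident in-scope intent id
    return llm_classify(utterance)               # LLM decides; may return "OOS"
```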

[NLP-35] Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence ITSC2025

【速读】: 该论文旨在解决街景录音数据集在开放数据共享过程中存在的隐私风险问题,特别是针对行人可能暴露的个人身份信息(PII)的检测与处理。其解决方案的关键在于提出一种名为cRID的跨模态框架,该框架结合了大视觉-语言模型、图注意力网络和表征学习,以检测文本可描述的PII线索并增强人员再识别(Re-ID)性能,从而通过提取可解释特征实现对语义上有意义的PII的识别。

链接: https://arxiv.org/abs/2507.01504
作者: Robert Aufschläger,Youssef Shoeb,Azarm Nowzad,Michael Heigl,Fabian Bally,Martin Schramm
机构: Deggendorf Institute of Technology (德根多夫应用技术大学); Continental AG (大陆集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: accepted for publication at the 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC 2025), taking place during November 18-21, 2025 in Gold Coast, Australia

点击查看摘要

Abstract:The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textual describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework’s practical utility. Code is available at this https URL.
zh

[NLP-36] Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities

【速读】: 该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的自动文本简化(Automatic Text Simplification, ATS)系统缺乏针对目标群体(如智力障碍人群)个性化需求的问题。传统方法未在训练过程中引入目标群体对文本简化的偏好反馈,导致生成结果难以满足特定用户的需求。论文提出的解决方案关键在于扩展标准监督微调(Supervised Fine-Tuning, SFT)方法,采用计算高效的直接偏好优化(Direct Preference Optimization, DPO)技术,并利用智力障碍人群提供的反馈数据对模型进行后训练,从而实现更符合目标群体偏好的个性化文本简化系统。

链接: https://arxiv.org/abs/2507.01479
作者: Yingqiang Gao,Kaede Johnson,David Froehlich,Luisa Carrer,Sarah Ebling
机构: University of Zurich(苏黎世大学); EPFL(洛桑联邦理工学院); capito.ai; Zurich University of Applied Sciences(苏黎世应用科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in generative AI, especially large language models (LLMs), have substantially improved the quality of machine-generated text simplifications, thereby mitigating information barriers for the target group. However, existing LLM-based ATS systems do not incorporate preference feedback on text simplifications during training, resulting in a lack of personalization tailored to the specific needs of target group representatives. In this work, we extend the standard supervised fine-tuning (SFT) approach for adapting LLM-based ATS models by leveraging a computationally efficient LLM alignment technique – direct preference optimization (DPO). Specifically, we post-train LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences on paired text simplifications generated by mainstream LLMs. Furthermore, we propose a pipeline for developing personalized LLM-based ATS systems, encompassing data collection, model selection, SFT and DPO post-training, and evaluation. Our findings underscore the necessity of active participation of target group persons in designing personalized AI accessibility solutions aligned with human expectations. This work represents a step towards personalizing inclusive AI systems at the target-group level, incorporating insights not only from text simplification experts but also from target group persons themselves.
zh
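
示例:论文采用 DPO 对 ATS 模型做后训练。下面是标准 DPO 损失的一个极简实现示意,输入为成对简化文本在策略模型与冻结参考模型下的序列对数概率:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta: float = 0.1):
    """Direct Preference Optimization loss over paired simplifications.
    Each argument is a tensor of sequence log-probabilities; `chosen` is
    the simplification preferred by target-group annotators."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```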

[NLP-37] LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

【速读】: 该论文旨在解决生成式 AI (Generative AI) 推理过程中,基于检索的推测解码(retrieval-based speculative decoding)因匹配机制不足而难以准确获取有效草稿标记的问题。其解决方案的关键在于提出 LogitSpec,该方法通过利用最后一个标记的 logits 不仅预测下一个标记,还能推测下下个标记,从而有效扩展检索范围,提高检索到相关参考文本的准确性,实现更高效的推理加速。

链接: https://arxiv.org/abs/2507.01449
作者: Tianyu Liu,Qitan Lv,Hao Li,Xing Gao,Xiao Sun
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61x speedup and 3.28 mean accepted tokens per decoding step. Our code is available at this https URL.
zh
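
示例:LogitSpec 两步起草流程的一个高度简化示意。此处用 top-1 作为“下一个词”、top-2 作为“下下个词”的推测来源,这只是便于说明的假设,并非论文的实际推测机制;`ngram_index` 代指从上下文/语料构建的 n-gram 检索表:

```python
import torch

def draft_tokens(logits: torch.Tensor, ngram_index: dict, max_draft: int = 8):
    """LogitSpec-style drafting sketch. `logits` is the last position's
    distribution; `ngram_index` maps a token prefix (tuple) to a previously
    seen continuation (list of token ids)."""
    top2 = torch.topk(logits, k=2).indices.tolist()
    next_tok, next_next_guess = top2[0], top2[1]   # simplified speculation
    # Retrieve a reference continuation keyed on both speculated tokens,
    # falling back to the next token alone.
    for key in [(next_tok, next_next_guess), (next_tok,)]:
        if key in ngram_index:
            return ngram_index[key][:max_draft]
    return [next_tok]                               # nothing retrieved: 1-token draft
```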

[NLP-38] Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction

【速读】: 该论文旨在解决电子健康记录(Electronic Health Record, EHR)文本的非结构化特性及高维语义复杂性所带来的信息提取与多标签疾病预测问题。其解决方案的关键在于提出一种基于注意力机制的深度学习方法,利用Transformer架构进行临床文本的表征学习,并通过多层自注意力机制捕捉关键医学实体及其上下文关系,结合基于Sigmoid的多标签分类器实现疾病标签的预测。此外,模型引入了上下文感知的语义对齐机制,以增强在标签共现和稀疏信息等典型医疗场景下的表示能力。

链接: https://arxiv.org/abs/2507.01437
作者: Ting Xu,Xiaoxiao Deng,Xiandong Meng,Haifeng Yang,Yan Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper addresses the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. A deep learning method based on attention mechanisms is proposed to achieve unified modeling for information extraction and multi-label disease prediction. The study is conducted on the MIMIC-IV dataset. A Transformer-based architecture is used to perform representation learning over clinical text. Multi-layer self-attention mechanisms are employed to capture key medical entities and their contextual relationships. A Sigmoid-based multi-label classifier is then applied to predict multiple disease labels. The model incorporates a context-aware semantic alignment mechanism, enhancing its representational capacity in typical medical scenarios such as label co-occurrence and sparse information. To comprehensively evaluate model performance, a series of experiments were conducted, including baseline comparisons, hyperparameter sensitivity analysis, data perturbation studies, and noise injection tests. Results demonstrate that the proposed method consistently outperforms representative existing approaches across multiple performance metrics. The model maintains strong generalization under varying data scales, interference levels, and model depth configurations. The framework developed in this study offers an efficient algorithmic foundation for processing real-world clinical texts and presents practical significance for multi-label medical text modeling tasks.
zh
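
示例:摘要中“Transformer 编码 + Sigmoid 多标签分类头”的结构可用如下极简 PyTorch 模型示意,超参数与池化方式为本文假设,训练时可配合 `nn.BCELoss`:

```python
import torch
import torch.nn as nn

class MultiLabelClinicalModel(nn.Module):
    """Transformer encoder over clinical text with a sigmoid multi-label head."""
    def __init__(self, vocab_size: int, num_labels: int,
                 d_model: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))       # (B, T, d_model)
        pooled = h.mean(dim=1)                        # mean pooling over tokens
        return torch.sigmoid(self.head(pooled))       # per-label probabilities
```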

[NLP-39] Pensieve Grader: An AI-Powered Ready-to-Use Platform for Effortless Handwritten STEM Grading

【速读】: 该论文旨在解决大规模大学STEM课程中手写开放式回答评分的瓶颈问题。其提出的解决方案是Pensieve,一个基于大语言模型(Large Language Models, LLMs)的AI辅助评分平台,能够实现从扫描的学生作业到最终反馈的整个评分流程自动化,支持人机协同的评分界面。Pensieve的关键在于通过LLMs进行文本转录与评估,提供与评分标准对齐的分数、转录文本和置信度评级,从而显著提升评分效率并保持较高的评分一致性。

链接: https://arxiv.org/abs/2507.01431
作者: Yoonseok Yang,Minjune Kim,Marlon Rondinelli,Keren Shao
机构: Pensieve Inc. (Pensieve 公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 7 pages, 5 figures, 1 table

点击查看摘要

Abstract:Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (this https URL), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline-from scanned student submissions to final feedback-within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.
zh

[NLP-40] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

【速读】: 该论文试图解决当前生成式AI(Generative AI)中奖励模型(Reward Models, RMs)在多数评估基准上表现不佳的问题,其核心在于现有RMs无法准确捕捉人类偏好中的细微差别和复杂性。解决方案的关键在于构建一个大规模的高质量偏好数据集SynPref-40M,并采用人机协同的两阶段数据整理流程,结合人工标注的准确性与AI的可扩展性,从而实现高效且高质量的数据集构建。基于此数据集,研究者训练了Skywork-Reward-V2系列奖励模型,展示了其在多个任务上的优越性能,验证了数据规模与质量共同驱动模型效果提升的有效性。

链接: https://arxiv.org/abs/2507.01352
作者: Chris Yuhao Liu,Liang Zeng,Yuzhen Xiao,Jujie He,Jiacai Liu,Chaojie Wang,Rui Yan,Wei Shen,Fuxiang Zhang,Jiacheng Xu,Yang Liu,Yahui Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
zh

[NLP-41] LEDOM: An Open and Fundamental Reverse Language Model

【速读】: 该论文试图解决传统语言模型在生成质量和推理能力上的局限性,特别是针对数学推理等需要深度逻辑分析的任务。其解决方案的关键在于提出一种全新的逆向语言模型(Reverse Language Model, RLM),即LEDOM,该模型通过反向时间序列的自回归训练方式,利用前序词预测来处理序列,从而具备独特的逆向推理能力。这一特性使得LEDOM能够对正向语言模型的输出进行重排序,进而提升生成质量与任务性能。

链接: https://arxiv.org/abs/2507.01335
作者: Xunjian Yin,Sitao Cheng,Yuxi Xie,Xinyu Hu,Li Lin,Xinyi Wang,Liangming Pan,William Yang Wang,Xiaojun Wan
机构: Peking University (北京大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); University of Arizona (亚利桑那大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM’s unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
zh
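
示例:Reverse Reward 重排序的示意实现。假设 `reverse_lm` 是按反向词序训练的因果语言模型、接口与 HuggingFace 风格一致;这里直接翻转 token 序列来计算反向似然,细节为本文假设:

```python
import torch

@torch.no_grad()
def reverse_reward_rerank(candidates, reverse_lm, tokenizer) -> str:
    """Pick the forward-LM candidate with the highest likelihood under a
    reverse language model (higher score = lower reverse-order loss)."""
    best_text, best_score = None, float("-inf")
    for text in candidates:
        ids = tokenizer(text, return_tensors="pt").input_ids
        rev = ids.flip(dims=[1])                      # reverse temporal order
        loss = reverse_lm(input_ids=rev, labels=rev).loss
        if -loss.item() > best_score:
            best_text, best_score = text, -loss.item()
    return best_text
```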

[NLP-42] Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在物理推理任务中表现不足的问题,特别是如何提升其在复杂物理问题上的理解和求解能力。解决方案的关键在于应用先进的指令微调推理模型(如Deepseek-R1),并通过少量示例提示(few-shot prompting)策略增强模型的性能,从而在SciBench基准测试中实现最先进的准确率和独特的符号推导能力。

链接: https://arxiv.org/abs/2507.01334
作者: Nifu Dan,Yujun Cai,Yiwei Wang
机构: Georgia Tech(佐治亚理工学院); University of Queensland(昆士兰大学); UC Merced(加州大学默塞德分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
zh

[NLP-43] LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation ICML2025

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)推理过程中计算开销和内存传输过高的问题,特别是现有方法在实现激活稀疏性时存在的局限性,如需要耗时的恢复训练或依赖经验性的基于幅度的剪枝导致稀疏性和推理加速不稳定。论文提出的解决方案关键在于LaRoSA(Layerwise Rotated Sparse Activation),通过逐层正交旋转将输入激活转换为更适合稀疏化的形式,并在旋转后的激活中采用Top-K选择策略,从而实现一致的模型级稀疏性和可靠的运行时间加速。

链接: https://arxiv.org/abs/2507.01299
作者: Kai Liu,Bowen Xu,Shaoyu Wu,Xin Chen,Hao Zhou,Yongliang Tao,Lulu Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2025 Acceptance

点击查看摘要

Abstract:Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
zh
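
示例:“逐层正交旋转 + Top-K 激活稀疏化”的核心操作可示意如下。Q 为该层预先求得的正交矩阵;真实实现中后续矩阵乘会直接在旋转空间中利用稀疏性来获得加速,此处为便于演示而旋转回原空间:

```python
import torch

def rotated_topk_sparsify(x: torch.Tensor, Q: torch.Tensor, k: int) -> torch.Tensor:
    """Rotate activations into a sparsification-friendly basis, keep the
    Top-K entries by magnitude, then rotate back.

    x: (..., d) input activations;  Q: (d, d) orthogonal, Q @ Q.T = I.
    """
    z = x @ Q                                        # rotated activations
    idx = z.abs().topk(k, dim=-1).indices            # per-row Top-K positions
    mask = torch.zeros_like(z).scatter(-1, idx, 1.0)
    return (z * mask) @ Q.T                          # back to the original basis
```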

[NLP-44] Frustratingly Simple Retrieval Improves Challenging Reasoning-Intensive Benchmarks

【速读】: 该论文试图解决在复杂推理型基准测试中,传统检索增强生成(Retrieval-augmented Generation, RAG)方法表现有限的问题。其关键解决方案是引入了一个名为CompactDS的多样化、高质量、大规模网络数据存储库,该数据存储库在单节点上实现了高检索准确率和亚秒级延迟。核心见解包括:大部分网络内容可以被过滤而不影响覆盖率,且一个精简的高质量子集已足够;以及结合内存中的近似最近邻(ANN)检索与磁盘上的精确搜索,可在速度与召回率之间取得平衡。通过使用CompactDS,研究展示了最小化RAG流程在多个基准测试和模型规模上均能实现稳定的准确性提升。

链接: https://arxiv.org/abs/2507.01297
作者: Xinxi Lyu,Michael Duan,Rulin Shao,Pang Wei Koh,Sewon Min
机构: Allen Institute for AI (艾伦人工智能研究所); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Southern California (南加州大学); University of Washington (华盛顿大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 33 pages, 2 figures, 27 tables

点击查看摘要

Abstract:Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single-node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B–70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems–all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
zh
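
示例:“内存 ANN 召回 + 磁盘精确重排”的组合可示意如下,以 FAISS 近似索引与 `np.memmap` 存储的精确向量为例,具体选型为本文假设:

```python
import numpy as np

def two_stage_search(query: np.ndarray, ann_index, exact_vecs,
                     k: int = 5, shortlist_mult: int = 10) -> np.ndarray:
    """Fast approximate shortlist in memory, then exact inner-product
    rescoring against on-disk (memory-mapped) vectors for the final rank."""
    q = query.astype("float32").reshape(1, -1)
    _, cand = ann_index.search(q, k * shortlist_mult)   # e.g. a FAISS IVF index
    cand = cand[0]
    exact_scores = exact_vecs[cand] @ q[0]              # exact dot products
    return cand[np.argsort(-exact_scores)[:k]]
```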

[NLP-45] Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中由于内部不一致或噪声检索内容导致的知识冲突问题,这些问题会严重降低生成结果的可靠性。其解决方案的关键在于提出CARE-RAG(Conflict-Aware and Reliable Evidence for RAG)框架,该框架通过冲突驱动的摘要技术对所有可用证据进行处理,包括内部知识和检索内容,从而提升生成结果的可信度。具体而言,CARE-RAG首先通过比较参数记录生成参数感知的证据以识别多样化的内部视角,随后对检索到的证据进行精炼以生成上下文感知的证据,去除无关或误导性内容,并通过蒸馏一个3B参数的LLaMA3.2模型实现冲突检测与摘要,最终通过QA修复步骤确保评估的完整性。

链接: https://arxiv.org/abs/2507.01281
作者: Juan Chen,Baolong Bi,Wei Zhang,Jingyan Sui,Xiaofei Zhu,Yuanzhuo Wang,Lingrui Mei,Shenghua Liu
机构: University of Chinese Academy of Sciences (中国科学院大学); Chinese Academy of Sciences (中国科学院); National University of Defense Technology (国防科技大学); Chongqing University of Technology (重庆理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses. We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence. CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidences to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources. To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers. Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.
zh

[NLP-46] Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在眼科领域中的临床推理能力评估问题,特别是其在糖尿病视网膜病变(Diabetic Retinopathy, DR)和青光眼筛查中的应用潜力。解决方案的关键在于通过结构化文本描述的视网膜眼底照片,评估GPT-4在分配ICDR严重程度评分、推荐DR转诊以及估算杯盘比以判断青光眼转诊方面的表现,并探讨真实或合成临床元数据对模型性能的影响。

链接: https://arxiv.org/abs/2507.01278
作者: Cindy Lie Tabuse,David Restepo,Carolina Gracitelli,Fernando Korn Malerbi,Caio Regatieri,Luis Filipe Nakayama
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen's kappa. McNemar's test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 0.04, kappa 0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
zh

[NLP-47] GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant

【速读】: 该论文试图解决大型语言模型在处理非英语和非中文国家的法律问题时,如何基于准确的法律条文生成答案并提供适当引用的问题。其关键解决方案是提出一种更可解释、更人性化且效果优于基于嵌入方法的检索机制,并构建了一个基于认知大语言模型(LLM)的代理架构gAIus,该架构的响应基于从特定法律文本(如波兰民法典)中检索到的知识。通过这一架构,显著提升了模型在法律问答任务中的性能。

链接: https://arxiv.org/abs/2507.01259
作者: Michał Matak,Jarosław A. Chudziak
机构: Warsaw University of Technology (华沙理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, presented at ICAART 2025, in proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART

点击查看摘要

Abstract:In this paper we discuss the capability of large language models to base their answer and provide proper references when dealing with legal matters of a non-English and non-Chinese speaking country. We discuss the history of legal information retrieval, the difference between case law and statute law, its impact on the legal tasks and analyze the latest research in this field. Building on that background, we introduce gAIus, the architecture of the cognitive LLM-based agent, whose responses are based on the knowledge retrieved from a certain legal act, which is the Polish Civil Code. We propose a retrieval mechanism which is more explainable, human-friendly and achieves better results than embedding-based approaches. To evaluate our method we create a special dataset based on single-choice questions from entrance exams for law apprenticeships conducted in Poland. The proposed architecture critically leveraged the abilities of the underlying large language models, improving gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o and lifting the gpt-4o-mini score from 31% to 86%. At the end of our paper we show the possible future path of research and potential applications of our findings.
zh

[NLP-48] he Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure

【速读】: 该论文试图解决文本序列之间基于嵌入的相似性度量受到无关属性(如文本来源或语言)干扰的问题,这些干扰因素会降低多语种或多语料库文本应用的效果。解决方案的关键在于使用一种去偏算法,该算法从编码器表示中移除已知干扰因素的信息,从而在计算成本最小的情况下显著减少偏差。实验结果表明,该方法在多种嵌入变体和任务中均提升了文档相似性和聚类指标。

链接: https://arxiv.org/abs/2507.01234
作者: Yu Fan,Yang Tian,Shauli Ravfogel,Mrinmaya Sachan,Elliott Ash,Alexander Hoyle
机构: ETH Zurich(苏黎世联邦理工学院); University of Zurich(苏黎世大学); New York University(纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate – often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
zh
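
示例:摘要所述“从编码器表示中移除已观测混杂因素的信息”可用一次线性概念擦除(INLP/LEACE 一类思路)示意;下述实现仅做单轮投影,论文的具体算法细节为本文假设:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def erase_confounder(X: np.ndarray, confounder: np.ndarray) -> np.ndarray:
    """Fit a linear probe for the confounder (e.g., corpus source), then
    project embeddings onto the probe subspace's orthogonal complement so
    the confounder is no longer linearly recoverable."""
    probe = LogisticRegression(max_iter=1000).fit(X, confounder)
    Q, _ = np.linalg.qr(probe.coef_.T)        # orthonormal basis of probed subspace
    P = np.eye(X.shape[1]) - Q @ Q.T          # projector onto its null space
    return X @ P
```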

[NLP-49] MEGA: xLSTM with Multihead Exponential Gated Fusion for Precise Aspect-based Sentiment Analysis

【速读】: 该论文旨在解决Aspect-based Sentiment Analysis (ABSA)中计算效率与高性能难以平衡的问题。现有方法在处理长距离依赖时存在局限,如深度学习模型缺乏全局上下文、Transformer需要大量计算资源、基于Mamba的方法受CUDA依赖影响且局部相关性减弱。论文提出的解决方案关键在于引入xLSTM with Multihead Exponential Gated Fusion (MEGA),其核心是结合双向mLSTM架构与前向及部分翻转后向(PF-mLSTM)流,通过PF-mLSTM增强局部上下文建模能力,并采用基于mLSTM的多头交叉指数门控融合机制(MECGAF),动态融合前向和反向输出,从而在保持全局上下文和效率的同时优化短距离依赖捕捉。

链接: https://arxiv.org/abs/2507.01213
作者: Adamu Lawan,Juhua Pu,Haruna Yunusa,Jawad Muhammad,Muhammad Lawan
机构: Beihang University (北京航空航天大学), China; Federal University (联邦大学)
类目: Computation and Language (cs.CL)
备注: 6 pages, 1 figure

点击查看摘要

Abstract:Aspect-based Sentiment Analysis (ABSA) is a critical Natural Language Processing (NLP) task that extracts aspects from text and determines their associated sentiments, enabling fine-grained analysis of user opinions. Existing ABSA methods struggle to balance computational efficiency with high performance: deep learning models often lack global context, transformers demand significant computational resources, and Mamba-based approaches face CUDA dependency and diminished local correlations. Recent advancements in Extended Long Short-Term Memory (xLSTM) models, particularly their efficient modeling of long-range dependencies, have significantly advanced the NLP community. However, their potential in ABSA remains untapped. To this end, we propose xLSTM with Multihead Exponential Gated Fusion (MEGA), a novel framework integrating a bi-directional mLSTM architecture with forward and partially flipped backward (PF-mLSTM) streams. The PF-mLSTM enhances localized context modeling by processing the initial sequence segment in reverse with dedicated parameters, preserving critical short-range patterns. We further introduce an mLSTM-based multihead cross exponential gated fusion mechanism (MECGAF) that dynamically combines forward mLSTM outputs as query and key with PF-mLSTM outputs as value, optimizing short-range dependency capture while maintaining global context and efficiency. Experimental results on three benchmark datasets demonstrate that MEGA outperforms state-of-the-art baselines, achieving superior accuracy and efficiency in ABSA tasks.
zh
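
示例:PF-mLSTM 的“部分翻转”输入构造可示意如下。摘要称其将序列的起始片段反向处理;翻转比例 `frac` 以及具体切分方式均为本文假设:

```python
import torch

def partially_flip(x: torch.Tensor, frac: float = 0.5) -> torch.Tensor:
    """Reverse only the initial segment of each sequence for the backward
    stream, leaving the remainder in original order.  x: (B, T, d)."""
    t = int(x.size(1) * frac)
    return torch.cat([x[:, :t].flip(dims=[1]), x[:, t:]], dim=1)
```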

[NLP-50] Matching and Linking Entries in Historical Swedish Encyclopedias

【速读】: 该论文旨在分析《Nordisk familjebok》(北欧家庭百科全书)在不同版本中地理条目内容的变化,以揭示其地理关注点的演变及其背后的社会与政治影响。解决方案的关键在于利用语义句子嵌入技术对文本进行重新分段,并通过基于Transformer的分类器提取地理条目,随后将其与Wikidata进行关联,从而实现对地理趋势的量化分析。

链接: https://arxiv.org/abs/2507.01170
作者: Simon Börjesson,Erik Ersmark,Pierre Nugues
机构: Lund University (隆德大学); Lund, Sweden (隆德,瑞典)
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:The *Nordisk familjebok* is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable by their size, ranging from 20 to 38 volumes. As a consequence, the *Nordisk familjebok* had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from *Project Runeberg*. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written between 1876-1899 and 1904-1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at this https URL. (Published in Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 1-10; DOI: https://doi.org/10.18653/v1/2025.latechclfl-1.1.)
zh
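
示例:基于语义句向量在两版之间匹配词条的流程可示意如下,模型名与相似度阈值为本文假设的示例选型:

```python
from sentence_transformers import SentenceTransformer, util

def match_entries(entries_ed1, entries_ed2, min_sim: float = 0.7):
    """Match encyclopedia entries across editions by cosine similarity of
    semantic sentence embeddings; keep best matches above a threshold."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb1 = model.encode(entries_ed1, convert_to_tensor=True)
    emb2 = model.encode(entries_ed2, convert_to_tensor=True)
    sims = util.cos_sim(emb1, emb2)                  # (n1, n2) similarity matrix
    pairs = []
    for i in range(len(entries_ed1)):
        j = int(sims[i].argmax())
        if float(sims[i, j]) >= min_sim:
            pairs.append((i, j, float(sims[i, j])))
    return pairs
```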

[NLP-51] Event-based evaluation of abstractive news summarization ACL2025

【速读】: 该论文试图解决自动生成的摘要在质量评估中依赖人工编写的参考摘要的问题,其核心问题是现有评估方法主要基于重叠单元或相似性分数,而未能充分反映摘要中包含的事件信息。解决方案的关键在于通过计算生成摘要、参考摘要与原始新闻文章之间的事件重叠来评估抽象型摘要的质量,从而更深入地理解摘要中的事件信息内容。

链接: https://arxiv.org/abs/2507.01160
作者: Huiling You,Samia Touileb,Erik Velldal,Lilja Øvrelid
机构: University of Oslo (奥斯陆大学); University of Bergen (卑尔根大学)
类目: Computation and Language (cs.CL)
备注: to appear at GEM2 workshop@ACL 2025

点击查看摘要

Abstract:An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of automatically generated summaries by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both events annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.
zh
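
示例:基于事件集合重叠的摘要评估可示意如下。事件抽取器的输出假定已归一化为可直接比较的事件表示,精确率/召回率/F1 的形式化方式为本文假设:

```python
def event_overlap_f1(summary_events, reference_events):
    """Precision / recall / F1 over sets of extracted events, one simple
    way to quantify event overlap between a summary and its reference."""
    s, r = set(summary_events), set(reference_events)
    if not s or not r:
        return 0.0, 0.0, 0.0
    tp = len(s & r)
    precision, recall = tp / len(s), tp / len(r)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```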

[NLP-52] Automated Vehicles Should be Connected with Natural Language

【速读】: 该论文试图解决多智能体协同驾驶中通信媒介在带宽效率、信息完整性和智能体互操作性方面的局限性,以及传统方法对决策级融合的忽视问题。其解决方案的关键在于从以感知为导向的数据交换转向使用自然语言进行显式的意图和推理通信,自然语言能够在语义密度与通信带宽之间取得平衡,并适应实时条件,从而实现智能体之间的主动协调,提升交通系统的安全性、效率和透明度。

链接: https://arxiv.org/abs/2507.01059
作者: Xiangbo Gao,Keshu Wu,Hao Zhang,Kexin Tian,Yang Zhou,Zhengzhong Tu
机构: Texas A&M University (德克萨斯A&M大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media – including raw sensor data, neural network features, and perception results – suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have largely ignored decision-level fusion, neglecting critical dimensions of collaborative driving. In this paper we argue that addressing these challenges requires a transition from purely perception-oriented data exchanges to explicit intent and reasoning communication using natural language. Natural language balances semantic density and communication bandwidth, adapts flexibly to real-time conditions, and bridges heterogeneous agent platforms. By enabling the direct communication of intentions, rationales, and decisions, it transforms collaborative driving from reactive perception-data sharing into proactive coordination, advancing safety, efficiency, and transparency in intelligent transportation systems.
zh

[NLP-53] Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

【速读】: 该论文旨在解决社交媒体上毒性内容泛滥的问题,特别是如何在有效去除毒性的同时保持原始语义,并提高模型对分布外数据的鲁棒性。现有方法在实现强去毒性能、语义保留和数据效率方面存在显著不足。该研究提出了一种两阶段训练框架,其关键在于通过少量高质量过滤后的平行数据进行监督微调以获得良好的初始化,随后利用未标记的毒性输入和自定义奖励模型,采用Group Relative Policy Optimization训练大语言模型,从而提升数据效率、语义保留能力和模型泛化性。

链接: https://arxiv.org/abs/2507.01050
作者: Jing Yu,Yibo Zhao,Jiapeng Zhu,Wenming Shao,Bo Pang,Zhao Zhang,Xiang Li
机构: School of Data Science and Engineering, East China Normal University (数据科学与工程学院,华东师范大学); Shanghai EastWonder Info-tech Co., Ltd (上海东沃信息科技有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: this https URL
zh

[NLP-54] Cohort Retrieval using Dense Passage Retrieval

【速读】: 该论文试图解决在超声心动图领域中患者队列检索的问题,即从大规模非结构化电子健康记录(EHR)中识别特定患者群体。解决方案的关键在于应用密集段落检索(Dense Passage Retrieval, DPR)方法,通过系统性地将非结构化的超声心动图EHR数据转换为查询-段落数据集,并将其建模为队列检索任务,同时设计了基于真实临床场景的评估指标以验证模型性能。此外,研究还提出了一种自定义训练的DPR嵌入模型,其性能优于传统和现成的最先进方法。

链接: https://arxiv.org/abs/2507.01049
作者: Pranav Jadhav
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patient cohort retrieval is a pivotal task in medical research and clinical practice, enabling the identification of specific patient groups from extensive electronic health records (EHRs). In this work, we address the challenge of cohort retrieval in the echocardiography domain by applying Dense Passage Retrieval (DPR), a prominent methodology in semantic search. We propose a systematic approach to transform an echocardiographic EHR dataset of unstructured nature into a Query-Passage dataset, framing the problem as a Cohort Retrieval task. Additionally, we design and implement evaluation metrics inspired by real-world clinical scenarios to rigorously test the models across diverse retrieval tasks. Furthermore, we present a custom-trained DPR embedding model that demonstrates superior performance compared to traditional and off-the-shelf SOTA models. To our knowledge, this is the first work to apply DPR for patient cohort retrieval in the echocardiography domain, establishing a framework that can be adapted to other medical domains.
zh
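
示例:DPR 式患者队列检索的打分与去重逻辑可示意如下,查询/段落向量假定已由训练好的双塔编码器产出:

```python
import numpy as np

def retrieve_cohort(query_vec: np.ndarray, passage_vecs: np.ndarray,
                    patient_ids, k: int = 20):
    """Rank passages by inner product with the query embedding, then
    collect the distinct patients behind the top-k passages."""
    scores = passage_vecs @ query_vec          # (num_passages,)
    cohort, seen = [], set()
    for i in np.argsort(-scores)[:k]:
        pid = patient_ids[i]
        if pid not in seen:
            seen.add(pid)
            cohort.append(pid)
    return cohort
```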

[NLP-55] Can Argus Judge Them All? Comparing VLMs Across Domains

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在不同任务中性能一致性不足的问题。其关键解决方案是通过在多个数据集上对CLIP、BLIP和LXMERT进行基准测试,评估任务准确性、生成质量、效率以及引入的跨数据集一致性(Cross-Dataset Consistency, CDC)指标,从而揭示模型在泛化能力与任务专精之间的权衡。

链接: https://arxiv.org/abs/2507.01042
作者: Harsh Joshi,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Niharika Jain,Sarthak Jain,Jiechao Gao,Usman Naseem
机构: Bharati Vidyapeeth; Macquarie University; DSEU-Okhla; Vivekananda Institute of Professional Studies; IIIT-Delhi; Center for SDGC, Stanford University
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
zh

[NLP-56] PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning

【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在病理学视觉推理任务中面临的两个主要问题:一是由于缺乏领域特定信息导致的模型幻觉和性能不足;二是链式思维(CoT)推理步骤可能引入错误,从而导致答案发散。其解决方案的关键在于提出PathCoT,一种新颖的零样本CoT提示方法,该方法将病理学专家知识整合到MLLMs的推理过程中,并引入自评估机制以缓解答案的发散问题。通过结合专家知识,PathCoT能够生成更准确的推理结果,并通过自评估步骤确定可靠答案。

链接: https://arxiv.org/abs/2507.01029
作者: Junjie Zhou,Yingli Zuo,Shichang Feng,Peng Wan,Qi Zhu,Daoqiang Zhang,Wei Shao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the development of generative artificial intelligence and instruction tuning techniques, multimodal large language models (MLLMs) have made impressive progress on general reasoning tasks. Benefiting from the chain-of-thought (CoT) methodology, MLLMs can solve the visual reasoning problem step-by-step. However, existing MLLMs still face significant challenges when applied to pathology visual reasoning tasks: (1) LLMs often underperform because they lack domain-specific information, which can lead to model hallucinations. (2) The additional reasoning steps in CoT may introduce errors, leading to the divergence of answers. To address these limitations, we propose PathCoT, a novel zero-shot CoT prompting method which integrates the pathology expert-knowledge into the reasoning process of MLLMs and incorporates self-evaluation to mitigate divergence of answers. Specifically, PathCoT guides the MLLM with prior knowledge to perform as pathology experts, and provides comprehensive analysis of the image with their domain-specific knowledge. By incorporating the experts' knowledge, PathCoT can obtain the answers with CoT reasoning. Furthermore, PathCoT incorporates a self-evaluation step that assesses both the results generated directly by MLLMs and those derived through CoT, finally determining the reliable answer. The experimental results on the PathMMU dataset demonstrate the effectiveness of our method on pathology visual understanding and reasoning.
zh
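
示例:PathCoT 的“专家先验 CoT + 直接作答 + 自评估择优”流程可示意如下。`mllm` 代指多模态模型的调用接口,提示措辞均为本文假设:

```python
def pathcot_answer(mllm, image, question: str) -> str:
    """Generate a pathology-expert CoT answer and a direct answer, then let
    a self-evaluation pass pick the more reliable one."""
    expert = "You are a pathology expert. Analyze the slide step by step."
    cot = mllm(image, f"{expert}\nQuestion: {question}\nThink step by step, then answer.")
    direct = mllm(image, f"Question: {question}\nAnswer directly.")
    verdict = mllm(
        image,
        f"Question: {question}\nCandidate A: {cot}\nCandidate B: {direct}\n"
        "Which candidate is better supported by the image? Reply with A or B.",
    )
    return cot if verdict.strip().startswith("A") else direct
```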

[NLP-57] MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered ICLR2025 NAACL

【速读】: 该论文试图解决多智能体系统(multi-agent systems)在基于角色的交互中可能加剧大型语言模型(LLM)中的隐性偏见问题,从而引发公平性和代表性方面的担忧。解决方案的关键在于提出MALIBU基准,通过情景化评估来检测和量化LLM驱动的多智能体系统中的偏见,其核心机制包括两个阶段的评分:第一阶段对特定人口统计学角色(如性别、种族、宗教)的响应进行四维评分,第二阶段对比不同角色的配对响应并选择更优者,从而实现对偏见的系统性评估。

链接: https://arxiv.org/abs/2507.01019
作者: Imran Mirza,Cole Huang,Ishwara Vasista,Rohan Patil,Asli Akalin,Sean O’Brien,Kevin Zhu
机构: Algoverse AI Research (Algoverse AI 研究)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to Building Trust in LLMs @ ICLR 2025 and NAACL SRW 2025

点击查看摘要

Abstract:Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.
zh

[NLP-58] Scalable Offline ASR for Command-Style Dictation in Courtrooms INTERSPEECH2025

【速读】: 该论文旨在解决传统在线系统资源消耗大与批量处理延迟高之间的矛盾。其关键解决方案是采用语音活动检测(Voice Activity Detection, VAD)对音频进行分割,并利用Whisper模型并行转录这些片段,从而实现多路复用的高效处理。该方法不仅提升了计算资源的利用率,还兼容多种自动语音识别(ASR)架构,如常用的基于CTC的模型,且已在印度约15%的法庭中部署验证。

链接: https://arxiv.org/abs/2507.01021
作者: Kumarmanas Nethil,Vaibhav Mishra,Kriti Anandan,Kavya Manohar
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2025 Show & Tell

点击查看摘要

Abstract:We propose an open-source framework for Command-style dictation that addresses the gap between resource-intensive Online systems and high-latency Batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and transcribes these segments in parallel using Whisper models, enabling efficient multiplexing across audios. Unlike proprietary systems like SuperWhisper, this framework is also compatible with most ASR architectures, including widely used CTC-based models. Our multiplexing technique maximizes compute utilization in real-world settings, as demonstrated by its deployment in around 15% of India’s courtrooms. Evaluations on live data show consistent latency reduction as user concurrency increases, compared to sequential batch processing. The live demonstration will showcase our open-sourced implementation and allow attendees to interact with it in real-time.
zh
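
示例:“VAD 切分 + Whisper 并行转写”的主干流程可示意如下。假定 16 kHz 单声道音频、`vad_segments` 来自任意 VAD 组件,`whisper_model` 的接口与 openai-whisper 风格一致:

```python
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 16_000  # assumed 16 kHz mono audio

def transcribe_dictation(audio, vad_segments, whisper_model, workers: int = 4) -> str:
    """Transcribe VAD speech spans concurrently and re-join them in order.
    vad_segments: list of (start_sec, end_sec) tuples."""
    def run(seg):
        start, end = seg
        clip = audio[int(start * SAMPLE_RATE): int(end * SAMPLE_RATE)]
        return whisper_model.transcribe(clip)["text"]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(run, vad_segments)    # order-preserving map
    return " ".join(t.strip() for t in texts)
```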

计算机视觉

[CV-0] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

【速读】:该论文旨在解决自回归图像生成中因依赖逐块预测而导致的高延迟问题。传统方法通过逐块预测进行生成,这一过程受内存限制,导致效率低下。为实现高效并行化同时保持生成质量,论文提出两个关键技术:(1)灵活并行自回归建模,一种新型架构,允许任意生成顺序和并行度,并通过可学习的位置查询标记确保并发生成标记之间的相互可见性;(2)局部感知生成顺序,一种新型调度策略,通过最小化组内依赖并最大化上下文支持来提升生成质量。这些设计使生成步骤显著减少,同时降低了至少3.4倍的延迟。

链接: https://arxiv.org/abs/2507.01957
作者: Zhuoyang Zhang,Luke J. Huang,Chengyue Wu,Shang Yang,Kelly Peng,Yao Lu,Song Han
机构: MIT(麻省理工学院); NVIDIA(英伟达); First Intelligence(第一智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The first two authors contributed equally to this work

点击查看摘要

Abstract:We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256x256 res.) and 1024 to 48 (512x512 res.) without compromising quality on the ImageNet class-conditional generation, and achieving at least 3.4x lower latency than previous parallelized autoregressive models.
zh

[CV-1] How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

【速读】:该论文旨在评估多模态基础模型在标准计算机视觉任务中的表现,以明确其在视觉理解方面的实际水平。其核心挑战在于大多数模型主要针对文本生成进行训练,难以直接处理如语义分割、目标检测等视觉任务,且许多先进模型仅可通过API访问,缺乏权重级别的可调整性。解决方案的关键在于通过提示链(prompt chaining)将标准视觉任务转化为可文本提示和API兼容的任务,从而构建一个标准化的基准测试框架。

链接: https://arxiv.org/abs/2507.01955
作者: Rahul Ramachandran,Ali Garjani,Roman Bachmann,Andrei Atanov,Oğuzhan Fatih Kar,Amir Zamir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page at this https URL

点击查看摘要

Abstract:Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
zh

[CV-2] FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model ICCV2025

【速读】:该论文试图解决图像形变(image morphing)中因输入语义或布局差异导致的高质量生成难题,以及现有方法依赖微调预训练扩散模型所带来的时间限制和语义/布局不一致问题。其解决方案的关键在于提出FreeMorph,通过两项核心创新实现无需实例训练的高保真图像形变:一是引导感知的球面插值设计,通过修改自注意力模块引入输入图像的显式引导,以解决身份丢失并确保生成序列的方向性过渡;二是步骤导向的变化趋势,通过融合来自每个输入图像的自注意力模块,实现尊重双方输入的可控且一致的过渡。

链接: https://arxiv.org/abs/2507.01953
作者: Yukang Cao,Chenyang Si,Jinghao Wang,Ziwei Liu
机构: S-Lab, Nanyang Technological University (S-Lab,南洋理工大学); Nanjing University (南京大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025. Project page: this https URL

点击查看摘要

Abstract:We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-linear nature of the multi-step denoising process and biases inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address these challenges by integrating two key innovations. 1) We first propose a guidance-aware spherical interpolation design that incorporates explicit guidance from the input images by modifying the self-attention modules, thereby addressing identity loss and ensuring directional transitions throughout the generated sequence. 2) We further introduce a step-oriented variation trend that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both inputs. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods, being 10x ~ 50x faster and establishing a new state-of-the-art for image morphing.
zh

[CV-3] Kwai Keye-VL Technical Report

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解和处理动态、信息密集的短视频方面能力不足的问题,这一问题限制了其在当前数字环境中主流媒体形式中的应用。解决方案的关键在于构建一个名为Keye-VL的80亿参数多模态基础模型,其核心包括两个支柱:一是包含超过6000亿个标记的大规模高质量数据集,特别强调视频数据;二是创新的训练方案,包含四阶段预训练以实现视觉-语言对齐,以及两阶段后训练以提升基础能力和高级推理能力。其中,在第二阶段引入了五种模式的“冷启动”数据混合,使模型能够学习何时以及如何进行推理,结合强化学习和对齐步骤进一步优化模型行为。

链接: https://arxiv.org/abs/2507.01949
作者: Kwai Keye Team,Biao Yang,Bin Wen,Changyi Liu,Chenglong Chu,Chengru Song,Chongling Rao,Chuan Yi,Da Li,Dunju Zang,Fan Yang,Guorui Zhou,Hao Peng,Haojie Ding,Jiaming Huang,Jiangxia Cao,Jiankang Chen,Jingyun Hua,Jin Ouyang,Kaibing Chen,Kaiyu Jiang,Kaiyu Tang,Kun Gai,Shengnan Zhang,Siyang Mao,Sui Huang,Tianke Zhang,Tingting Gao,Wei Chen,Wei Yuan,Xiangyu Wu,Xiao Hu,Xingyu Lu,Yang Zhou,Yi-Fan Zhang,Yiping Yang,Yulong Chen,Zhenhua Wu,Zhenyu Li,Zhixin Ling,Ziming Li,Dehua Ma,Di Xu,Haixuan Gao,Hang Li,Jiawei Guo,Jing Wang,Lejian Ren,Muhao Wei,Qianqian Wang,Qigen Hu,Shiyao Wang,Tao Yu,Xinchen Luo,Yan Li,Yiming Liang,Yuhang Hu,Zeyi Lu,Zhuoran Yang,Zixing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report: this https URL

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today’s digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
zh

[CV-4] LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

【速读】:该论文试图解决长时动画着色中由于局部范式忽略全局信息而导致的长期颜色一致性不足的问题。现有研究主要集中在短期着色,采用局部范式通过融合重叠特征实现局部段之间的平滑过渡,但无法保持长期颜色一致性。解决方案的关键在于提出一种动态全局-局部范式,即动态提取与当前生成相关的全局颜色一致性特征,具体包括SketchDiT、动态全局-局部记忆(DGLM)和颜色一致性奖励模块,以实现对长时动画的颜色一致性维护。

链接: https://arxiv.org/abs/2507.01945
作者: Nan Chen,Mengqi Huang,Yihao Meng,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at this https URL.
zh

[CV-5] CI-VID: A Coherent Interleaved Text-Video Dataset

【速读】:该论文旨在解决现有文本到视频(Text-to-Video, T2V)生成数据集缺乏连贯多片段视频序列建模能力的问题,当前公开数据集主要由孤立的文本-视频(T-V)对组成,无法支持多场景视频序列的生成。其解决方案的关键在于引入CI-VID数据集,该数据集从孤立的T2V生成扩展到文本与视频到视频(Text-and-Video-to-Video, TV2V)生成,提供包含连贯视频片段序列及描述每个片段内容及其过渡的文本标注,从而支持视觉和文本语义引导的视频生成。

链接: https://arxiv.org/abs/2507.01938
作者: Yiming Ju,Jijin Hu,Zhengxiong Luo,Haoge Deng,Hanyu Zhao,Li Du,Chengwei Wu,Donglin Hao,Xinlong Wang,Tengfei Pan
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at: this https URL
zh

[CV-6] evMLP: An Efficient Event-Driven MLP Architecture for Vision

【速读】:该论文旨在解决传统视觉模型在处理序列图像数据(如视频)时计算效率低下的问题,特别是在面对冗余计算时的性能瓶颈。其解决方案的关键在于提出了一种基于事件驱动的局部更新机制(event-driven local update mechanism),通过仅对发生“事件”的图像块进行处理,从而减少不必要的计算,同时保持输出的一致性。该方法的核心是利用多层感知机(MLPs)独立处理图像或特征图中的块,并通过定义连续帧之间的变化为“事件”来实现高效的计算优化。
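
下面用 PyTorch 给出"事件驱动局部更新"思想的一个最小示意:将相邻帧切分为不重叠补丁,仅对平均变化超过阈值的补丁重新调用 MLP,其余补丁直接复用缓存输出。其中 patch_mlp、阈值等均为示意假设,并非论文的官方实现:

```python
import torch

def event_driven_update(prev_frame, curr_frame, cached_out, patch_mlp,
                        patch=16, threshold=0.05):
    """仅重新计算发生"事件"(相邻帧间变化超过阈值)的补丁,其余复用缓存。

    frame: (C, H, W);cached_out: 上一帧各补丁的输出缓存 (N_patch, D)。
    """
    C, H, W = curr_frame.shape

    def to_patches(x):  # 切分为不重叠补丁 -> (N_patch, C*patch*patch)
        p = x.unfold(1, patch, patch).unfold(2, patch, patch)
        return p.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)

    prev_p, curr_p = to_patches(prev_frame), to_patches(curr_frame)
    events = (curr_p - prev_p).abs().mean(dim=1) > threshold  # 事件掩码
    out = cached_out.clone()
    if events.any():
        out[events] = patch_mlp(curr_p[events])  # 只对事件补丁调用 MLP
    return out, events

# 用法示意:
# mlp = torch.nn.Sequential(torch.nn.Linear(3 * 16 * 16, 128), torch.nn.GELU())
# out, ev = event_driven_update(frame0, frame1, cached, mlp)
```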

链接: https://arxiv.org/abs/2507.01927
作者: Zhentan Zheng
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as “events”. Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and trained models are available at this https URL.
zh

[CV-7] IC-Custom: Diverse Image Customization via In-Context Learning

【速读】:该论文旨在解决工业媒体制作中图像定制的通用框架缺失问题,当前方法通常将图像定制分为位置感知和位置无关两种范式,缺乏适用于多种定制场景的统一框架,从而限制了其应用范围。解决方案的关键在于提出IC-Custom,一个通过上下文学习(in-context learning)无缝整合位置感知与位置无关图像定制的统一框架。该框架通过将参考图像与目标图像拼接为多联画(polyptych),利用DiT的多模态注意力机制实现细粒度的token级交互,并引入可学习的任务导向注册令牌和边界感知的位置嵌入,以提升模型对不同任务类型和多联画配置输入的处理能力。

链接: https://arxiv.org/abs/2507.01926
作者: Yaowei Li,Xiaoyu Li,Zhaoyang Zhang,Yuxuan Bian,Gan Liu,Xinyuan Li,Jiale Xu,Wenbo Hu,Yating Liu,Lingen Li,Jing Cai,Yuexian Zou,Yancheng He,Ying Shan
机构: Peking University (北京大学); ARC Lab, Tencent PCG (腾讯PCG人工智能实验室); Tencent (腾讯); The Chinese University of Hong Kong (香港中文大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT’s multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: this https URL
zh

[CV-8] 3D Reconstruction and Information Fusion between Dormant and Canopy Seasons in Commercial Orchards Using Deep Learning and Fast GICP

【速读】:该论文旨在解决果园自动化中由于冠层季节密集植被遮挡导致的树体结构可见性不足问题,从而限制机器视觉系统的性能。其解决方案的关键在于构建一个信息融合框架,整合不同季节的结构数据,通过YOLOv9-Seg进行实例分割、Kinect Fusion进行3D重建以及Fast Generalized Iterative Closest Point (Fast GICP)进行模型对齐,实现跨季节的空间一致性的多源数据融合,从而提升机器人在生长季节中对树体结构的感知精度。

链接: https://arxiv.org/abs/2507.01912
作者: Ranjan Sapkota,Zhichao Meng,Martin Churuvija,Xiaoqiang Du,Zenghong Ma,Manoj Karkee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 tables, 11 figures

点击查看摘要

Abstract:In orchard automation, dense foliage during the canopy season severely occludes tree structures, minimizing visibility to various canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi-seasonal structural data to support robotic and automated crop load management during the entire growing season. The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9-Seg were used to extract depth-informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi-season fusion. The YOLOv9-Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.
zh

[CV-9] Modality Agnostic patient-specific digital twins modeling temporally varying digestive motion

【速读】:该论文旨在解决在临床中实现可变形图像配准(DIR)时,由于高度移动的胃肠道(GI)器官难以通过手动识别的解剖标志点来评估体素级空间精度的问题。其解决方案的关键是构建患者特定的数字孪生(DT)模型,以模拟随时间变化的运动,并利用这些模型对DIR方法进行准确评估。通过半自动化流程生成21个模拟消化道运动的4D序列,并基于独立的4D MRI数据集验证DT运动幅度,进而使用目标配准误差、Dice相似性系数和95百分位Hausdorff距离等指标评估不同DIR方法的性能,最终通过剂量分布的配准与累积进一步验证剂量映射的准确性。
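
论文用于评估 DIR 方法的 Dice 相似系数与 95 百分位 Hausdorff 距离属于通用指标,可按如下方式计算。这是一个最小示意:HD95 此处以掩膜体素而非严格的表面体素近似,distance_transform_edt 的体素间距参数需按实际扫描设置:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """两个 3D 二值掩膜的 Dice 相似系数。"""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + 1e-8)

def hausdorff95(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """近似的 95 百分位 Hausdorff 距离(以掩膜体素代替严格表面体素)。"""
    a, b = a.astype(bool), b.astype(bool)
    # 到对方掩膜的欧氏距离图,spacing 为体素间距(mm)
    dist_to_a = distance_transform_edt(~a, sampling=spacing)
    dist_to_b = distance_transform_edt(~b, sampling=spacing)
    d_ab, d_ba = dist_to_b[a], dist_to_a[b]
    return float(np.percentile(np.hstack([d_ab, d_ba]), 95))
```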

链接: https://arxiv.org/abs/2507.01909
作者: Jorge Tapias Gomez,Nishant Nadkarni,Lando S. Bosma,Jue Jiang,Ergys D. Subashi,William P. Segars,James M. Balter,Mert R Sabuncu,Neelam Tyagi,Harini Veeraraghavan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Pages, 6 figures, 4 tables

点击查看摘要

Abstract:Objective: Clinical implementation of deformable image registration (DIR) requires voxel-based spatial accuracy metrics such as manually identified landmarks, which are challenging to implement for highly mobile gastrointestinal (GI) organs. To address this, patient-specific digital twins (DT) modeling temporally varying motion were created to assess the accuracy of DIR methods. Approach: 21 motion phases simulating digestive GI motion as 4D sequences were generated from static 3D patient scans using published analytical GI motion models through a semi-automated pipeline. Eleven datasets were used, including six T2w FSE MRI (T2w MRI), two T1w 4D golden-angle stack-of-stars, and three contrast-enhanced CT scans. The motion amplitudes of the DTs were assessed against real patient stomach motion amplitudes extracted from independent 4D MRI datasets. The generated DTs were then used to assess six different DIR methods using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance using summary metrics and voxel-level granular visualizations. Finally, for a subset of T2w MRI scans from patients treated with MR-guided radiation therapy, dose distributions were warped and accumulated to assess dose warping errors, including evaluations of DIR performance in both low- and high-dose regions for patient-specific error estimation. Main results: Our proposed pipeline synthesized DTs modeling realistic GI motion, achieving mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, similar to published real-patient gastric motion data. It also enables the extraction of detailed quantitative DIR performance metrics and rigorous validation of dose mapping accuracy. Significance: The pipeline enables rigorously testing DIR tools for dynamic, anatomically complex regions enabling granular spatial and dosimetric accuracies.
zh

[CV-10] Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

【速读】:该论文旨在解决当前基于指令的图像编辑(Instruction-based Image Editing, IIE)在处理复杂隐含假设性指令时的不足,这些指令需要更深层次的推理来推断合理的视觉变化和用户意图。现有方法在数据集支持、细粒度细节提取机制以及跨模态交互能力方面存在局限。论文提出的解决方案关键在于构建了一个大规模数据集Reason50K,涵盖物理、时间、因果和故事推理四个核心场景,并设计了ReasonBrain框架,该框架结合多模态大语言模型(Multimodal Large Language Models, MLLMs)与扩散模型,引入细粒度推理提示提取模块(Fine-grained Reasoning Cue Extraction, FRCE)和跨模态增强器(Cross-Modal Enhancer, CME),以提升对隐含指令的理解与执行能力。

链接: https://arxiv.org/abs/2507.01908
作者: Qingdong He,Xueqin Chen,Chaoyi Wang,Yanjie Pan,Xiaobin Hu,Zhenye Gan,Yabiao Wang,Chengjie Wang,Xiangtai Li,Jiangning Zhang
机构: Youtu Lab(优图实验室); Tencent(腾讯); TU Delft(代尔夫特理工大学); University of Chinese Academy of Sciences(中国科学院大学); Fudan University(复旦大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.
zh

[CV-11] Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification ICCV2025

【速读】:该论文旨在解决半监督长时人重识别(Semi-Supervised LReID)问题,即在标注数据稀缺而大量未标注数据存在的现实场景下,传统LReID方法性能显著下降的问题。其解决方案的关键在于提出了一种基于双知识协作的自增强原型进化框架(SPRED),通过动态原型引导的伪标签生成与新旧知识协同净化之间的自增强循环,提升未标注数据的利用效率,从而增强模型的长期适应能力。
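
其中"动态原型引导的伪标签生成"可以用一小段代码说明:按特征与各身份原型的余弦相似度打分,仅保留高置信度样本作为伪标签。以下代码中的原型矩阵、温度与阈值均为示意假设,并非 SPRED 的官方实现:

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(features, prototypes, tau=0.1, conf_thresh=0.8):
    """按与各身份原型的余弦相似度为未标注特征生成高置信度伪标签。

    features: (N, D);prototypes: (K, D)。温度与阈值为示意假设。
    """
    sim = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T  # (N, K)
    probs = F.softmax(sim / tau, dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf > conf_thresh  # 仅保留高置信度样本,降低噪声伪标签的影响
    return labels[keep], keep
```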

链接: https://arxiv.org/abs/2507.01884
作者: Kunlun Xu,Fan Zhuo,Jiangmeng Li,Xu Zou,Jiahuan Zhou
机构: Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem where LReID methods suffer severe performance degradation. Existing LReID methods, even when combined with semi-supervised strategies, suffer from limited long-term adaptation performance due to struggling with the noisy knowledge occurring during unlabeled data utilization. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions and generate high-quality pseudo-labels. Then, the dual-knowledge cooperation scheme integrates current model specialization and historical model generalization, refining noisy pseudo-labels. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Experiments on the established Semi-LReID benchmarks show that our SPRED achieves state-of-the-art performance. Our source code is available at this https URL
zh

[CV-12] Future Slot Prediction for Unsupervised Object Discovery in Surgical Video MICCAI2025

【速读】:该论文试图解决在复杂现实场景(如手术视频)中,通过无监督学习获得结构化、可解释的对象中心表示(slot)的难题,特别是在动态时间序列数据上的表现不足问题。现有方法虽然在图像上表现良好,但在处理手术视频时性能较低。解决方案的关键在于提出一种动态时间槽变换器(DTST)模块,该模块通过同时进行时间推理和预测最优未来槽初始化来提升模型性能,从而实现了在多个手术数据库上的最先进表现。

链接: https://arxiv.org/abs/2507.01882
作者: Guiqiu Liao,Matjaz Jogan,Marcel Hussing,Edward Zhang,Eric Eaton,Daniel A. Hashimoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI2025

点击查看摘要

Abstract:Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.
zh

[CV-13] MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices ICCV2025

【速读】:该论文旨在解决在资源受限平台(如移动设备)上部署深度学习模型进行图像增强(Image Enhancement, IE)的挑战,主要问题包括高计算和内存需求。其解决方案的关键在于提出一个参数量约为4K的极轻量级卷积神经网络(Convolutional Neural Network, CNN)框架,通过整合重参数化技术与增量权重优化策略来保证效率,并结合特征自变换模块和分层双路径注意力机制,以及采用局部方差加权损失函数进行优化,从而实现在保持图像质量的同时达到实时IE推理速度。
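
重参数化的核心思想是训练期使用带 BN 的多分支结构、推理期折叠为单一算子。下面给出把 Conv+BN 折叠为等价卷积的通用做法示意(这是重参数化文献中的标准技巧,并非 MobileIE 的具体结构,且假设 groups=1):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """把训练期的 Conv+BN 折叠为一个等价卷积,推理时零额外开销。"""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # BN: y = gamma * (x - mean) / sqrt(var + eps) + beta
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std  # 每个输出通道的缩放因子
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_(bn.bias + (b - bn.running_mean) * scale)
    return fused
```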

链接: https://arxiv.org/abs/2507.01838
作者: Hailong Yan,Ao Li,Xiangtao Zhang,Zhe Liu,Zenglin Shi,Ce Zhu,Le Zhang
机构: UESTC(电子科技大学); Hefei University of Technology(合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates reparameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be available at this https URL.
zh

[CV-14] Modulate and Reconstruct: Learning Hyperspectral Imaging from Misaligned Smartphone Views

【速读】:该论文试图解决从RGB图像进行高光谱重建(Hyperspectral Reconstruction, HSR)的问题,该问题由于严重的光谱信息丢失而本质上是一个病态问题。现有方法通常依赖于单张RGB图像,限制了重建精度。解决方案的关键在于提出一种多图像到高光谱重建(Multi-image-to-Hyperspectral Reconstruction, MI-HSR)框架,利用三摄像头智能手机系统,其中两个镜头配备了经过精心选择的光谱滤波器,从而获得比传统单摄像头设置更丰富和多样的光谱观测数据。

链接: https://arxiv.org/abs/2507.01835
作者: Daniil Reutsky,Daniil Vladimirov,Yasin Mamedov,Georgy Perevozchikov,Nancy Mehta,Egor Ershov,Radu Timofte
机构: University of Würzburg(维尔茨堡大学); Institute for Information Transmission Problems(信息传输问题研究所); Moscow Institute of Physics and Technology(莫斯科物理技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral reconstruction (HSR) from RGB images is a fundamentally ill-posed problem due to severe spectral information loss. Existing approaches typically rely on a single RGB image, limiting reconstruction accuracy. In this work, we propose a novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework that leverages a triple-camera smartphone system, where two lenses are equipped with carefully selected spectral filters. Our configuration, grounded in theoretical and empirical analysis, enables richer and more diverse spectral observations than conventional single-camera setups. To support this new paradigm, we introduce Doomer, the first dataset for MI-HSR, comprising aligned images from three smartphone cameras and a hyperspectral reference camera across diverse scenes. We show that the proposed HSR model achieves consistent improvements over existing methods on the newly proposed benchmark. In a nutshell, our setup allows 30% towards more accurately estimated spectra compared to an ordinary RGB camera. Our findings suggest that multi-view spectral filtering with commodity hardware can unlock more accurate and practical hyperspectral imaging solutions.
zh

[CV-15] Empowering Manufacturers with Privacy-Preserving AI Tools: A Case Study in Privacy-Preserving Machine Learning to Solve Real-World Problems

【速读】:该论文试图解决中小型制造商在寻求创新数据工具时面临的隐私与数据共享矛盾问题,即制造商因竞争和隐私顾虑不愿与研究人员共享专有数据。解决方案的关键在于构建一个隐私保护平台,通过安全方法使制造商能够将数据共享给研究人员,从而开发出解决实际生产问题的创新工具,并将这些工具以保障隐私和保密性的方式返回平台供其他用户使用。该平台成功应用于食品晶体大规模制造中的质量控制问题,通过开发自动化的晶体分析工具,实现了对显微镜图像中晶体尺寸分布和数量的自动表征,同时去除样本制备过程中的自然缺陷,并利用机器学习模型提高对高分辨率透明晶体及其团聚体的计数准确性。

链接: https://arxiv.org/abs/2507.01808
作者: Xiaoyu Ji,Jessica Shorland,Joshua Shank,Pascal Delpe-Brice,Latanya Sweeney,Jan Allebach,Ali Shakouri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 20 pages, 11 figures, 30 references

点击查看摘要

Abstract:Small- and medium-sized manufacturers need innovative data tools but, because of competition and privacy concerns, often do not want to share their proprietary data with researchers who might be interested in helping. This paper introduces a privacy-preserving platform by which manufacturers may safely share their data with researchers through secure methods, so that those researchers then create innovative tools to solve the manufacturers’ real-world problems, and then provide tools that execute solutions back onto the platform for others to use with privacy and confidentiality guarantees. We illustrate this problem through a particular use case which addresses an important problem in the large-scale manufacturing of food crystals, which is that quality control relies on image analysis tools. Previous to our research, food crystals in the images were manually counted, which required substantial and time-consuming human efforts, but we have developed and deployed a crystal analysis tool which makes this process both more rapid and accurate. The tool enables automatic characterization of the crystal size distribution and numbers from microscope images while the natural imperfections from the sample preparation are automatically removed; a machine learning model to count high resolution translucent crystals and agglomeration of crystals was also developed to aid in these efforts. The resulting algorithm was then packaged for real-world use on the factory floor via a web-based app secured through the originating privacy-preserving platform, allowing manufacturers to use it while keeping their proprietary data secure. After demonstrating this full process, future directions are also explored.
zh

[CV-16] AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction

【速读】:该论文旨在解决交通参与者未来轨迹预测中因轨迹分布固有不平衡性而导致的尾部数据复杂性和危险场景识别困难的问题。现有研究通常仅依赖基础模型的预测误差,而未考虑长尾轨迹模式的多样性和不确定性。其解决方案的关键在于提出一种自适应动量与解耦对比学习框架(AMD),通过融合无监督和监督对比学习策略,结合改进的动量对比学习(MoCo-DT)和解耦对比学习(DCL)模块,提升模型对罕见和复杂轨迹的识别能力,并通过轨迹随机增强方法和在线迭代聚类策略实现伪标签的动态更新,以更好地适应长尾数据的分布变化。
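
改进的动量对比学习(MoCo-DT)建立在 MoCo 的动量编码器机制之上,其核心更新步骤可用几行代码说明。动量系数 0.999 为 MoCo 常用的假设值,以下仅为通用示意,并非 MoCo-DT 模块的官方实现:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo 风格的动量编码器更新:theta_k <- m * theta_k + (1 - m) * theta_q。"""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1.0 - m)
```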

链接: https://arxiv.org/abs/2507.01801
作者: Bin Rao,Haicheng Liao,Yanchen Guan,Chengyue Wang,Bonan Wang,Jiaxun Zhang,Zhenning Li
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model’s prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model’s ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH / UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.
zh

[CV-17] HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision ICANN2025

【速读】:该论文旨在解决3D视觉问答(3D VQA)模型在训练过程中因答案导向监督(answer-centric supervision)导致的推理路径不理性及可能产生表面捷径的问题。其解决方案的关键在于提出一种分层注意力收缩监督方法(hierarchical concentration narrowing supervision),通过模仿人类从广泛区域逐步聚焦到具体对象的思考过程,引导模型经历三个阶段的注意力收缩,从而确保模型发展出合理且有效的推理路径。

链接: https://arxiv.org/abs/2507.01800
作者: Shengli Zhou,Jianuo Zhu,Qilin Huang,Fangjing Wang,Yanfu Zhang,Feng Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: ICANN 2025

点击查看摘要

Abstract:3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop reasoning pathways freely. The absence of supervision on the reasoning pathway enables the potential for developing superficial shortcuts through common patterns in question-answer pairs. Moreover, although slow-thinking methods advance large language models, they suffer from underthinking. To address these issues, we propose HCNQA, a 3D VQA model leveraging a hierarchical concentration narrowing supervision method. By mimicking the human process of gradually focusing from a broad area to specific objects while searching for answers, our method guides the model to perform three phases of concentration narrowing through hierarchical supervision. By supervising key checkpoints on a general reasoning pathway, our method can ensure the development of a rational and effective reasoning pathway. Extensive experimental results demonstrate that our method can effectively ensure that the model develops a rational reasoning pathway and performs better. The code is available at this https URL.
zh

[CV-18] FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization

【速读】:该论文旨在解决多主体个性化图像生成中的挑战,即现有方法在处理多个主体时难以实现有效的个性化适配,因为独立适配的模块组合通常需要复杂的重新调优或联合优化。其解决方案的关键在于提出FreeLoRA框架,该框架通过训练-free的方式融合特定主体的LoRA模块,每个LoRA模块使用Full Token Tuning策略在少量主体图像上进行适应,并在推理阶段采用Subject-Aware Inference,仅在对应主体的token上激活相应模块,从而实现单图中多个个性化主体的融合,同时减少过拟合和主体间的相互干扰。
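
下面给出"主体感知推理"思想的一个最小示意:LoRA 的低秩增量只在属于对应主体的 token 上生效,其余 token 不受影响。类名 SubjectLoRA 与 token_mask 接口均为示意假设,并非 FreeLoRA 的官方实现:

```python
import torch
import torch.nn as nn

class SubjectLoRA(nn.Module):
    """在基础线性层上叠加低秩增量,并可按 token 掩码选择性激活。"""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # 初始时 LoRA 增量为 0
        self.alpha = alpha

    def forward(self, x, token_mask=None):
        # x: (B, L, D);token_mask: (B, L),True 表示属于该主体的 token
        out = self.base(x)
        delta = self.alpha * self.up(self.down(x))
        if token_mask is not None:
            delta = delta * token_mask.unsqueeze(-1).to(delta.dtype)
        return out + delta
```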

链接: https://arxiv.org/abs/2507.01792
作者: Peng Zheng,Ye Wang,Rui Ma,Zuxuan Wu
机构: Jilin University (吉林大学); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimization. We present FreeLoRA, a simple and generalizable framework that enables training-free fusion of subject-specific LoRA modules for multi-subject personalization. Each LoRA module is adapted on a few images of a specific subject using a Full Token Tuning strategy, where it is applied across all tokens in the prompt to encourage weakly supervised token-content alignment. At inference, we adopt Subject-Aware Inference, activating each module only on its corresponding subject tokens. This enables training-free fusion of multiple personalized subjects within a single image, while mitigating overfitting and mutual interference between subjects. Extensive experiments show that FreeLoRA achieves strong performance in both subject fidelity and prompt consistency.
zh

[CV-19] Boosting Adversarial Transferability Against Defenses via Multi-Scale Transformation

【速读】:该论文试图解决对抗样本在深度神经网络中的可迁移性问题,这一问题使得攻击者无需了解目标模型的具体信息即可对其进行攻击,从而构成严重的安全威胁。解决方案的关键在于提出一种新的分段高斯金字塔(Segmented Gaussian Pyramid, SGP)攻击方法,该方法通过高斯滤波和三种下采样方式构建多尺度样本,并计算每个尺度下损失函数的梯度,再通过对梯度的平均来确定对抗扰动,从而显著提升攻击的成功率。
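
SGP 的核心是"多尺度梯度平均"。下面的示意对每个尺度计算交叉熵损失对输入的梯度后取平均,得到扰动方向;其中用双线性下采样近似替代论文中的高斯金字塔与三种下采样方式,仅为思想示意,并非官方实现:

```python
import torch
import torch.nn.functional as F

def multiscale_gradient(model, x, y, scales=(1.0, 0.5, 0.25)):
    """在多尺度输入上计算损失梯度并取平均,作为对抗扰动方向。"""
    grads = []
    for s in scales:
        xs = x.clone().detach().requires_grad_(True)
        if s != 1.0:
            small = F.interpolate(xs, scale_factor=s, mode="bilinear",
                                  align_corners=False)
            inp = F.interpolate(small, size=x.shape[-2:], mode="bilinear",
                                align_corners=False)
        else:
            inp = xs
        loss = F.cross_entropy(model(inp), y)
        loss.backward()
        grads.append(xs.grad)
    g = torch.stack(grads).mean(dim=0)  # 多尺度梯度平均
    return g.sign()                     # 可配合 FGSM/I-FGSM 的步长使用
```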

链接: https://arxiv.org/abs/2507.01791
作者: Zihong Guo,Chen Wan,Yayin Zheng,Hailing Kuang,Xiaohai Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The transferability of adversarial examples poses a significant security challenge for deep neural networks, which can be attacked without knowing anything about them. In this paper, we propose a new Segmented Gaussian Pyramid (SGP) attack method to enhance the transferability, particularly against defense models. Unlike existing methods that generally focus on single-scale images, our approach employs Gaussian filtering and three types of downsampling to construct a series of multi-scale examples. Then, the gradients of the loss function with respect to each scale are computed, and their average is used to determine the adversarial perturbations. The proposed SGP can be considered an input transformation with high extensibility that is easily integrated into most existing adversarial attacks. Extensive experiments demonstrate that in contrast to the state-of-the-art methods, SGP significantly enhances attack success rates against black-box defense models, with average attack success rates increasing by 2.3% to 32.6%, based only on transferability.
zh

[CV-20] Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging

【速读】:该论文试图解决视觉变压器(Vision Transformers, ViTs)在医学图像分类任务中表示缺乏语义意义的问题,即模型对输入的微小变化高度敏感,导致分类结果不可靠。解决方案的关键是采用基于投影梯度的算法,揭示了ViT的表示不仅对微小扰动敏感,而且在语义上并不具有明确的区分性,从而暴露了其在安全关键系统中部署时存在的重大挑战。
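
基于投影梯度在"表示层面"寻找对抗扰动的通用做法可示意如下:在 L∞ 预算内迭代最大化扰动图与原图表示之间的距离。步长与预算为假设值,并非论文算法的官方实现:

```python
import torch
import torch.nn.functional as F

def representation_attack(encoder, x, steps=40, alpha=1 / 255, eps=4 / 255):
    """在 L∞ 预算 eps 内迭代放大扰动图与原图表示之间的距离。"""
    with torch.no_grad():
        target = encoder(x)  # 原始表示
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.mse_loss(encoder(x + delta), target)  # 最大化表示偏移
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)  # 投影回 L∞ 球
            delta.grad.zero_()
    return (x + delta).detach()
```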

链接: https://arxiv.org/abs/2507.01788
作者: Montasir Shams,Chashi Mahiul Islam,Shaeke Salman,Phat Tran,Xiuwen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60%. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.
zh

[CV-21] A Hybrid Ensemble Learning Framework for Image-Based Solar Panel Classification

【速读】:该论文试图解决太阳能板表面清洁度自动识别的问题,这是确保太阳能系统维持最佳性能的关键挑战。解决方案的关键在于提出一种名为双集成神经网络(Dual Ensemble Neural Network, DENN)的新型方法,通过将多种集成模型整合到双重框架中,以提升分类准确性和鲁棒性。该方法在Deep Solar Eye数据集上表现出色,达到了最先进的准确率,展示了混合集成学习技术在自动化太阳能板检测中的潜力。

链接: https://arxiv.org/abs/2507.01778
作者: Vivek Tetarwal,Sandeep Kumar
机构: 未知
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:The installation of solar energy systems is on the rise, and therefore, appropriate maintenance techniques are required to be used in order to maintain maximum performance levels. One of the major challenges is the automated discrimination between clean and dirty solar panels. This paper presents a novel Dual Ensemble Neural Network (DENN) to classify solar panels using image-based features. The suggested approach utilizes the advantages offered by various ensemble models by integrating them into a dual framework, aimed at improving both classification accuracy and robustness. The DENN model is evaluated in comparison to current ensemble methods, showcasing its superior performance across a range of assessment metrics. The proposed approach performs the best compared to other methods and reaches state-of-the-art accuracy on experimental results for the Deep Solar Eye dataset, effectively serving predictive maintenance purposes in solar energy systems. It reveals the potential of hybrid ensemble learning techniques to further advance the prospects of automated solar panel inspections as a scalable solution to real-world challenges.
zh

[CV-22] Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis ICCV2025

【速读】:该论文旨在解决基于自回归(AR)框架的视觉生成模型中,由于量化过程导致的信息丢失问题,从而影响图像保真度。其解决方案的关键在于提出DisCon(Discrete-Conditioned Continuous Autoregressive Model),该框架将离散令牌重新解释为条件信号而非生成目标,通过建模连续表示在离散令牌条件下的概率分布,避免了连续令牌建模的优化挑战以及量化带来的信息损失。

链接: https://arxiv.org/abs/2507.01756
作者: Peng Zheng,Junke Wang,Yi Chang,Yizhou Yu,Rui Ma,Zuxuan Wu
机构: Jilin University (吉林大学); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by iccv 2025

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have spurred interests in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored to autoregressively predict continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of 1.38 on ImageNet 256 \times 256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
zh

[CV-23] SSL4SAR: Self-Supervised Learning for Glacier Calving Front Extraction from SAR Imagery

【速读】:该论文旨在解决冰川前缘消融监测中由于领域差异导致的深度学习模型性能不足的问题,特别是在合成孔径雷达(Synthetic Aperture Radar, SAR)影像中准确提取冰川崩解前沿位置的挑战。其解决方案的关键在于提出两种新颖的自监督多模态预训练技术,并引入一个结合Swin Transformer编码器与残差卷积神经网络(Convolutional Neural Network, CNN)解码器的混合模型架构,同时利用SSL4SAR数据集进行预训练,从而提升模型在冰川崩解前沿检测任务中的精度。

链接: https://arxiv.org/abs/2507.01747
作者: Nora Gourmelon,Marcel Dreier,Martin Mayr,Thorsten Seehaus,Dakota Pyles,Matthias Braun,Andreas Maier,Vincent Christlein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: in IEEE Transactions on Geoscience and Remote Sensing. arXiv admin note: text overlap with arXiv:2501.05281

点击查看摘要

Abstract:Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, they are suboptimal due to the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular for Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the “CAlving Fronts and where to Find thEm” (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.
zh

[CV-24] Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans

【速读】:该论文旨在解决3D医学图像分割中缺乏高效且无需昂贵人工标注的训练方法的问题,特别是针对颅内动脉钙化(Intracranial Arterial Calcification, IAC)的自动量化问题。其解决方案的关键在于首次利用掩码自编码器(Masked Autoencoder, MAE)框架对视觉Transformer(Vision Transformer, ViT)进行预训练,并在大规模临床试验数据集IST-3上进行微调,从而实现高效的自监督学习,提升模型在IAC分割任务中的性能与鲁棒性。

链接: https://arxiv.org/abs/2507.01744
作者: Benjamin Jin,Grant Mair,Joanna M. Wardlaw,Maria del C. Valdés Hernández
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have gained significant popularity in the natural image domain but have been less successful in 3D medical image segmentation. Nevertheless, 3D ViTs are particularly interesting for large medical imaging volumes due to their efficient self-supervised training within the masked autoencoder (MAE) framework, which enables the use of imaging data without the need for expensive manual annotations. Intracranial arterial calcification (IAC) is an imaging biomarker visible on routinely acquired CT scans linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could enable their large-scale risk assessment. We pre-train ViTs with MAE and fine-tune them for IAC segmentation for the first time. To develop our models, we use highly heterogeneous data from a large clinical trial, the third International Stroke Trial (IST-3). We evaluate key aspects of MAE pre-trained ViTs in IAC segmentation, and analyse the clinical implications. We show: 1) our calibrated self-supervised ViT beats a strong supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes are crucial for ViTs for IAC segmentation and interpolation upsampling with regular convolutions is preferable to transposed convolutions for ViT-based models, and 3) our ViTs increase robustness to higher slice thicknesses and improve risk group classification in a clinical scenario by 46%. Our code is available online.
zh

[CV-25] DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy ICCV2025

【速读】:该论文旨在解决参照图像分割(Referring Image Segmentation, RIS)中现有框架存在的根本瓶颈问题,特别是模型在多模态认知能力上的不足。其解决方案的关键在于提出一种名为DeRIS的新框架,该框架将RIS分解为感知和认知两个核心组件,并引入Loopback Synergy机制以增强两者之间的协同作用,从而提升分割精度和图像-文本理解的鲁棒性。此外,通过引入简单的非参照样本转换数据增强方法,进一步缓解了目标存在判断中的长尾分布问题。

链接: https://arxiv.org/abs/2507.01738
作者: Ming Dai,Wenxuan Cheng,Jiang-jiang Liu,Sen Yang,Wenxiao Cai,Yanpeng Sun,Wankou Yang
机构: Southeast Univeristy (东南大学); Baidu VIS (百度视觉); Standford University (斯坦福大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at this https URL.
zh

[CV-26] HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

【速读】:该论文旨在解决生成真实感三维人-物交互(3D Human-Object Interactions, HOIs)的难题,现有方法将人类和物体的运动独立建模,导致物理上不合理的因果不一致行为。其解决方案的关键在于提出HOI-Dyn框架,该框架将HOI生成建模为一种驱动-响应系统,其中人类动作驱动物体响应,并采用轻量级基于Transformer的交互动力学模型,显式预测物体对人类运动的反应。此外,引入基于残差的动力学损失以增强一致性并减少动力学预测误差的影响。

链接: https://arxiv.org/abs/2507.01737
作者: Lin Wu,Zhixiang Chen,Jianglin Lan
机构: University of Glasgow (格拉斯哥大学); University of Sheffield (谢菲尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic 3D human-object interactions (HOIs) remains a challenging task due to the difficulty of modeling detailed interaction dynamics. Existing methods treat human and object motions independently, resulting in physically implausible and causally inconsistent behaviors. In this work, we present HOI-Dyn, a novel framework that formulates HOI generation as a driver-responder system, where human actions drive object responses. At the core of our method is a lightweight transformer-based interaction dynamics model that explicitly predicts how objects should react to human motion. To further enforce consistency, we introduce a residual-based dynamics loss that mitigates the impact of dynamics prediction errors and prevents misleading optimization signals. The dynamics model is used only during training, preserving inference efficiency. Through extensive qualitative and quantitative experiments, we demonstrate that our approach not only enhances the quality of HOI generation but also establishes a feasible metric for evaluating the quality of generated interactions.
zh

[CV-27] When Does Pruning Benefit Vision Representations?

【速读】:该论文试图解决剪枝(pruning)对视觉模型在可解释性、无监督目标发现和与人类感知对齐等方面影响的机制问题。其解决方案的关键在于系统地分析不同稀疏度水平下模型的特征归因可解释性、结构化表示能力以及与人类感知的一致性,从而揭示剪枝在提升模型可解释性和下游任务泛化能力方面的潜在优势及其依赖于网络架构和参数规模的复杂关系。
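
论文摘要未给出具体剪枝方法;作为背景,下面示意如何用 PyTorch 自带的剪枝工具在不同稀疏度下做全局 L1 幅值剪枝并统计实际稀疏率。网络选择与稀疏度档位均为假设,仅为通用做法示意:

```python
import torch
import torch.nn.utils.prune as prune
from torchvision import models

model = models.resnet18(weights=None)
params = [(m, "weight") for m in model.modules()
          if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
state = {k: v.clone() for k, v in model.state_dict().items()}

for sparsity in (0.3, 0.5, 0.7, 0.9):
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                              amount=sparsity)
    zeros = sum(int((m.weight == 0).sum()) for m, _ in params)
    total = sum(m.weight.numel() for m, _ in params)
    print(f"target sparsity={sparsity:.1f}, actual={zeros / total:.3f}")
    for m, name in params:
        prune.remove(m, name)     # 固化当前掩码
    model.load_state_dict(state)  # 恢复原始权重,再试下一档稀疏度
```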

链接: https://arxiv.org/abs/2507.01722
作者: Enrico Cassano,Riccardo Renzulli,Andrea Bragagnolo,Marco Grangetto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pruning is widely used to reduce the complexity of deep learning models, but its effects on interpretability and representation learning remain poorly understood. This paper investigates how pruning influences vision models across three key dimensions: (i) interpretability, (ii) unsupervised object discovery, and (iii) alignment with human perception. We first analyze different vision network architectures to examine how varying sparsity levels affect feature attribution interpretability methods. Additionally, we explore whether pruning promotes more succinct and structured representations, potentially improving unsupervised object discovery by discarding redundant information while preserving essential features. Finally, we assess whether pruning enhances the alignment between model representations and human perception, investigating whether sparser models focus on more discriminative features similarly to humans. Our findings also reveal the presence of sweet spots, where sparse models exhibit higher interpretability, downstream generalization and human alignment. However, these spots highly depend on the network architectures and their size in terms of trainable parameters. Our results suggest a complex interplay between these three dimensions, highlighting the importance of investigating when and how pruning benefits vision representations.
zh

[CV-28] Soft Self-labeling and Potts Relaxations for Weakly-Supervised Segmentation CVPR2025

【速读】:该论文旨在解决弱监督分割(Weakly Supervised Segmentation, WSSS)问题,即在仅有一小部分像素具有真实标签(如涂鸦标注)的情况下进行图像分割。其解决方案的关键在于采用一种自标注方法,通过优化未标记像素上的标准无监督条件随机场(CRF)/Potts损失的松弛形式来提升模型性能。论文提出了一种基于软伪标签的自标注机制,以更好地表征类别不确定性,并引入一种通用的连续子问题求解器,系统评估了不同CRF松弛方式、邻域系统及网络预测与软伪标签之间的连接项,从而实现了在仅使用标准架构下的性能提升,甚至超越了全像素精确监督的效果。
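
未标注像素上的 Potts 损失惩罚相邻像素取不同标签;其作用在软预测上的一种常见二次松弛可示意如下。这里采用均匀 4 邻域、不含图像核权重,torch.roll 的环绕边界属于简化处理,并非论文求解器的实现:

```python
import torch

def quadratic_potts_relaxation(probs, neighbors=((0, 1), (1, 0))):
    """软预测上的二次 Potts 松弛:惩罚相邻像素软标签不一致。

    probs: (B, K, H, W) 的 softmax 输出;默认取 4 邻域的两个方向。
    """
    loss = 0.0
    for dy, dx in neighbors:
        shifted = torch.roll(probs, shifts=(dy, dx), dims=(2, 3))
        loss = loss + ((probs - shifted) ** 2).sum(dim=1).mean()
    return loss / len(neighbors)
```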

链接: https://arxiv.org/abs/2507.01721
作者: Zhongwen Zhang,Yuri Boykov
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published at CVPR 2025

点击查看摘要

Abstract:We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While WSSS methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by introducing hidden pseudo-labels and powerful CRF sub-problem solvers, e.g. graph cut. However, previously used hard pseudo-labels can not represent class uncertainty or errors, which motivates soft self-labeling. We derive a principled auxiliary loss and systematically evaluate standard and new CRF relaxations (convex and non-convex), neighborhood systems, and terms connecting network predictions with soft pseudo-labels. We also propose a general continuous sub-problem solver. Using only standard architectures, soft self-labeling consistently improves scribble-based training and outperforms significantly more complex specialized WSSS systems. It can outperform full pixel-precise supervision. Our general ideas apply to other weakly-supervised problems/systems.
zh

[CV-29] Using Wavelet Domain Fingerprints to Improve Source Camera Identification

【速读】:该论文旨在解决图像来源识别与取证中的传感器模式噪声(Sensor Pattern Noise, SPN)提取效率与准确性问题。其解决方案的关键在于对基于小波的SPN提取方法进行改进,提出了一种小波域指纹(wavelet domain fingerprint)的概念,避免了传统方法中需要将指纹构建为图像并进行反变换的步骤,从而实现了在小波域内直接进行指纹比较,提升了提取与比对的效率和精度。
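
下面用 PyWavelets 给出"小波域指纹"思想的最小示意:对每张图像做小波分解,用软阈值近似去噪,把各细节子带的噪声残差平均为指纹,并直接在小波域用归一化互相关比较,省去逆变换。阈值规则等均为通用假设,并非论文的官方算法:

```python
import numpy as np
import pywt

def wavelet_domain_fingerprint(images, wavelet="db8", level=4):
    """在小波域内估计传感器模式噪声指纹(各图像需尺寸一致)。"""
    residuals = None
    for img in images:
        coeffs = pywt.wavedec2(img.astype(np.float64), wavelet, level=level)
        detail = []
        for cH, cV, cD in coeffs[1:]:
            for c in (cH, cV, cD):
                thr = np.median(np.abs(c)) / 0.6745  # 稳健噪声尺度估计
                denoised = np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)
                detail.append(c - denoised)          # 残差即噪声成分
        flat = np.concatenate([d.ravel() for d in detail])
        residuals = flat if residuals is None else residuals + flat
    return residuals / len(images)

def ncc(a, b):
    """归一化互相关,可直接在小波域比较两条指纹。"""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```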

链接: https://arxiv.org/abs/2507.01712
作者: Xinle Tian,Matthew Nunes,Emiko Dupont,Shaunagh Downing,Freddie Lichtenstein,Matt Burns
机构: CameraForensics(摄像头取证); University of Bath (巴斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Camera fingerprint detection plays a crucial role in source identification and image forensics, with wavelet denoising approaches proving to be particularly effective in extracting sensor pattern noise (SPN). In this article, we propose a modification to wavelet-based SPN extraction. Rather than constructing the fingerprint as an image, we introduce the notion of a wavelet domain fingerprint. This avoids the final inversion step of the denoising algorithm and allows fingerprint comparisons to be made directly in the wavelet domain. As such, our modification streamlines the extraction and comparison process. Experimental results on real-world datasets demonstrate that our method not only achieves higher detection accuracy but can also significantly improve processing speed.
zh

[CV-30] Component Adaptive Clustering for Generalized Category Discovery ICME2025

【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)问题,即在部分标记的数据集中,将未标记图像分类到已知和新颖类别中,而无需事先知道未知类别的数量。传统方法通常依赖于刚性假设,如预定义类别数量,这限制了其处理现实数据固有变化性和复杂性的能力。论文提出的解决方案是AdaGCD,其关键在于引入自适应槽注意力(Adaptive Slot Attention, AdaSlot),该机制根据数据复杂性动态确定最优槽位数量,从而无需预设槽位数,实现对未标记数据的灵活聚类,提升开放世界场景下的类别发现效果。

链接: https://arxiv.org/abs/2507.01711
作者: Mingfu Yan,Jiancheng Huang,Yifan Liu,Shifeng Chen
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳市先进电子技术研究院,中国科学院); University of Chinese Academy of Sciences(中国科学院大学); Shenzhen University of Advanced Technology(深圳先进技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE ICME 2025

点击查看摘要

Abstract:Generalized Category Discovery (GCD) tackles the challenging problem of categorizing unlabeled images into both known and novel classes within a partially labeled dataset, without prior knowledge of the number of unknown categories. Traditional methods often rely on rigid assumptions, such as predefining the number of classes, which limits their ability to handle the inherent variability and complexity of real-world data. To address these shortcomings, we propose AdaGCD, a cluster-centric contrastive learning framework that incorporates Adaptive Slot Attention (AdaSlot) into the GCD framework. AdaSlot dynamically determines the optimal number of slots based on data complexity, removing the need for predefined slot counts. This adaptive mechanism facilitates the flexible clustering of unlabeled data into known and novel categories by dynamically allocating representational capacity. By integrating adaptive representation with dynamic slot allocation, our method captures both instance-specific and spatially clustered features, improving class discovery in open-world scenarios. Extensive experiments on public and fine-grained datasets validate the effectiveness of our framework, emphasizing the advantages of leveraging spatial local information for category discovery in unlabeled image datasets.
zh

[CV-31] Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition

【速读】:该论文旨在解决3D和4D领域中面部表情识别(3D/4D FER)的挑战,特别是在处理空间和时间面部动态复杂性方面的问题。其解决方案的关键在于提出FACET-VLM框架,该框架通过将多视角面部表征学习与自然语言提示的语义引导相结合,实现对面部情绪的准确识别。FACET-VLM引入了三个关键组件:跨视角语义聚合(CVSA)、多视角文本引导融合(MTGF)以及多视角一致性损失,以确保视角间的一致性和结构连贯性。

链接: https://arxiv.org/abs/2507.01673
作者: Muzammil Behzad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET-VLM, a vision-language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET-VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotions, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. We further extend FACET-VLM to 4D micro-expression recognition (MER) on the 4DME dataset, demonstrating strong performance in capturing subtle, short-lived emotional cues. The extensive experimental results confirm the effectiveness and substantial contributions of each individual component within the framework. Overall, FACET-VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings.
zh

[CV-32] What does really matter in image goal navigation?

【速读】:该论文试图解决图像目标导航(image goal navigation)问题,该任务需要两种不同技能:核心导航技能,包括自由空间和障碍物的检测以及基于内部表示的决策;以及通过将视觉观测与目标图像进行比较来计算方向信息。论文提出了一种端到端的强化学习(RL)方法,以验证是否可以通过对完整智能体的端到端训练高效解决该任务。解决方案的关键在于探索架构选择(如晚期融合、通道堆叠、空间到深度投影和交叉注意力)对相对位姿估计器从导航训练中出现的作用,并证明在一定条件下,导航性能与相对位姿估计性能之间存在相关性。

链接: https://arxiv.org/abs/2507.01667
作者: Gianluca Monaci,Philippe Weinzaepfel,Christian Wolf
机构: NAVER LABS Europe (NAVER 实验室欧洲分部)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In a large study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced up to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic settings, to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.
zh

[CV-33] SPoT: Subpixel Placement of Tokens in Vision Transformers ICCV2025

【速读】:该论文试图解决传统tokenization方法在Vision Transformers(视觉Transformer,ViT)中对特征进行离散块网格限制的问题,这种限制阻碍了模型充分利用稀疏性,导致需要做出不合理的权衡。其解决方案的关键在于提出了一种名为Subpixel Placement of Tokens (SPoT)的新tokenization策略,该策略允许tokens在图像中连续定位,从而有效规避基于网格的局限性。通过引入oracle-guided search,研究进一步验证了理想亚像素token定位所带来的显著性能提升,并大幅减少了推理过程中准确预测所需的token数量。

链接: https://arxiv.org/abs/2507.01654
作者: Martine Hjelkrem-Tan,Marius Aasan,Gabriel Y. Arteaga,Adín Ramírez Rivera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in Workshop on Efficient Computing under Limited Resources: Visual Computing (ICCV 2025). Code available at this https URL

点击查看摘要

Abstract:Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
zh
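
下面是一个极简示意(非 SPoT 官方代码,函数与参数为示例假设),展示如何利用双线性采样在连续的亚像素位置提取 token,从而绕开离散块网格的限制:

```python
# A minimal sketch (our illustration, not the SPoT release) of placing
# ViT tokens at continuous subpixel positions via bilinear sampling.
import torch
import torch.nn.functional as F

def sample_tokens(feature_map, positions):
    """feature_map: (B, C, H, W); positions: (B, N, 2) with (x, y) in [0, 1].
    Returns (B, N, C) tokens sampled at continuous locations."""
    grid = positions * 2.0 - 1.0          # map to grid_sample's [-1, 1] range
    grid = grid.unsqueeze(2)              # (B, N, 1, 2)
    tokens = F.grid_sample(feature_map, grid, mode="bilinear",
                           align_corners=False)   # (B, C, N, 1)
    return tokens.squeeze(-1).transpose(1, 2)     # (B, N, C)

B, C, H, W, N = 2, 64, 14, 14, 49
feats = torch.randn(B, C, H, W)
pos = torch.rand(B, N, 2)                 # continuous, not snapped to a grid
print(sample_tokens(feats, pos).shape)    # torch.Size([2, 49, 64])
```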

[CV-34] RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather ICCV25

【速读】:该论文旨在解决基于学习的立体匹配模型在恶劣天气条件下泛化能力不足的问题,主要由于训练数据稀缺以及从退化图像中提取判别性特征的困难。其解决方案的关键在于提出一种名为\textbf{RobuSTereo}的框架,通过两个核心模块实现:一是基于扩散的仿真流水线结合立体一致性模块,生成高质量的恶劣天气条件下的立体数据以减少干净图像与退化图像之间的领域差距;二是设计一个结合专用卷积神经网络(ConvNet)和去噪Transformer的鲁棒特征编码器,以提取退化图像中的稳定可靠特征,从而提升深度估计精度和视差估计的准确性。

链接: https://arxiv.org/abs/2507.01653
作者: Yuran Wang,Yingping Liang,Yutao Hu,Ying Fu
机构: Beijing Institute of Technology (北京理工大学); School of Computer Science and Engineering, Southeast University (东南大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICCV25

点击查看摘要

Abstract:Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose RobuSTereo, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models’ robustness to unseen weather conditions. The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that RobuSTereo significantly improves the robustness and generalization of stereo matching models across diverse adverse weather scenarios.
zh

[CV-35] Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective

【速读】:该论文旨在解决传统自回归(Autoregressive, AR)图像生成模型中因依赖Transformer架构而导致的计算复杂度高和内存开销大的问题,同时克服线性注意力机制在图像生成中因无法捕捉关键长程依赖关系而降低生成质量的缺陷。其解决方案的关键在于提出一种新型的注意力机制——基于空间感知衰减的线性注意力(Linear Attention with Spatial-Aware Decay, LASAD),该机制通过基于真实二维空间位置计算位置相关的衰减因子,显式保留图像序列中的真实二维空间关系,从而在保持线性计算复杂度的同时提升图像生成的质量与效率。

链接: https://arxiv.org/abs/2507.01652
作者: Yuxin Mao,Zhen Qin,Jinxing Zhou,Hui Deng,Xuyang Shen,Bin Fan,Jing Zhang,Yiran Zhong,Yuchao Dai
机构: Northwestern Polytechnical University (西北工业大学); TapTap (TapTap); OpenNLPLab (OpenNLP实验室); Hefei University of Technology (合肥工业大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention’s efficiency and spatial understanding needed for high-quality generation.
zh
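
下面给出一个极简示意(基于对摘要的理解,非 LASADGen 实现;为直观起见显式构造衰减矩阵,而论文的要点是以线性注意力递推施加这种衰减、避免 O(N²) 开销),演示基于真实二维距离的位置相关衰减因子:

```python
# A minimal sketch of position-dependent decay based on true 2D patch
# locations (our reading of the idea, not the LASADGen implementation).
import torch

def spatial_decay_matrix(h, w, gamma=0.9):
    """D[i, j] = gamma ** (2D Euclidean distance between patches i and j),
    where i, j index the flattened h*w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)        # true 2D distances, not 1D offsets
    return gamma ** dist                      # (N, N), larger for nearby pairs

def decayed_attention(q, k, v, decay):
    """Linear-attention-style (non-normalized) mixing with spatial decay."""
    scores = (q @ k.transpose(-2, -1)) * decay
    return scores @ v

h, w, d = 8, 8, 32
n = h * w
q, k, v = (torch.randn(n, d) for _ in range(3))
out = decayed_attention(q, k, v, spatial_decay_matrix(h, w))
print(out.shape)  # torch.Size([64, 32])
```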

[CV-36] SAILViT: Towards Robust and Generalizable Visual Backbones for MLLM s via Gradual Feature Refinement

【速读】:该论文旨在解决视觉Transformer(Vision Transformers, ViTs)在与大型语言模型(Large Language Models, LLMs)进行连接器驱动的联合训练时所面临的参数初始化冲突和模态语义差距问题。其解决方案的关键在于提出SAILViT,通过渐进式特征学习增强ViT,实现从粗粒度到细粒度的特征对齐和世界知识注入,从而更好地满足目标训练需求。

链接: https://arxiv.org/abs/2507.01643
作者: Weijie Yin,Dingkang Yang,Hongyuan Dong,Zijian Kang,Jiacong Wang,Xiao Liang,Chao Feng,Jiao Ran
机构: ByteDance Inc. (字节跳动公司); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: We release SAILViT, a series of versatile vision foundation models

点击查看摘要

Abstract:Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter initialization conflicts and modality semantic gaps. To address the above challenges, this paper proposes SAILViT, a gradual feature learning-enhanced ViT for facilitating MLLMs to break through performance bottlenecks in complex multimodal interactions. SAILViT achieves coarse-to-fine-grained feature alignment and world knowledge infusion with gradual feature refinement, which better serves target training demands. We perform thorough empirical analyses to confirm the powerful robustness and generalizability of SAILViT across different dimensions, including parameter sizes, model architectures, training strategies, and data scales. Equipped with SAILViT, existing MLLMs show significant and consistent performance improvements on the OpenCompass benchmark across extensive downstream tasks. SAILViT series models are released at this https URL.
zh

[CV-37] Depth Anything at Any Condition

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation, MDE)模型在复杂开放世界环境中的性能不足问题,尤其是在光照变化、恶劣天气和传感器引起的畸变等挑战性条件下表现不佳。其关键解决方案是提出一种无需大量标注数据的无监督一致性正则化微调范式,以及引入空间距离约束以显式强化模型对块级相对关系的学习,从而提升模型在不同环境下的泛化能力和深度预测的准确性。

链接: https://arxiv.org/abs/2507.01634
作者: Boyuan Sun,Modi Jin,Bowen Yin,Qibin Hou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks. Project Page: this https URL Code: this https URL
zh

[CV-38] and Slide : A New Framework for Scaling NeRF from Local to Global 3D Earth Observation ICCV2025 ALT

【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在处理大规模场景时因训练过程中内存占用过高而受限的问题。其关键解决方案是提出Snake-NeRF框架,通过一种核外(out-of-core)方法,无需同时加载所有图像和网络,从而实现在单个设备上处理大场景。该方法将感兴趣区域划分为不重叠的NeRF进行三维拼接,并通过带有重叠裁剪的图像处理和新颖的2×2三维瓦片推进策略及分段采样器,有效避免了瓦片边缘处的三维重建误差。

链接: https://arxiv.org/abs/2507.01631
作者: Camille Billouard,Dawa Derksen,Alexandre Constantin,Bruno Vallet
机构: CNES(法国国家空间研究中心); Univ Gustave Eiffel(古斯塔夫·埃菲尔大学); ENSG(法国国立地形测绘学院); IGN(法国地理信息研究所); LASTIG(地理信息与空间技术研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted at ICCV 2025 Workshop 3D-VAST (From street to space: 3D Vision Across Altitudes). Version before camera ready. Our code will be made public after the conference

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs palliates this by dividing the scene into NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that tile the 3D space without overlap. Importantly, we crop the images with overlap to ensure each NeRF is trained with all the necessary pixels. We introduce a novel 2×2 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromising quality.
zh

[CV-39] Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss ICCV2025

【速读】:该论文旨在解决Human-Object conTact (HOT)检测中模型受限于单一图像类型导致的过度分割问题以及特定区域内类别一致性难以维持的问题。其解决方案的关键在于提出一种名为P3HOT的框架,该框架融合了Prompt引导和人类邻近感知机制。通过语义驱动的提示机制引导网络关注与图像和文本相关区域,并利用人类邻近感知机制动态感知人体周围的深度范围,以有效消除预期无交互的区域。此外,还引入了区域联合损失(RJLoss)和新的评估指标“AD-Acc.”,以抑制同一区域内的异常类别并改进对负样本的处理。

链接: https://arxiv.org/abs/2507.01630
作者: Yuxiao Wang,Yu Lei,Zhenao Wei,Weiying Xue,Xinyu Jiang,Nan Zhuang,Qi Liu
机构: South China University of Technology (华南理工大学); Southwest Jiaotong University (西南交通大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed P3HOT, is proposed, which blends Prompt guidance and human Proximal Perception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network’s attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive a key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called "AD-Acc." is introduced to address the shortcomings of existing methods in handling negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves an improvement of 0.7, 2.0, 1.6, and 11.0 points in the SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. Code is available at this https URL.
zh

[CV-40] Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference ICME

【速读】:该论文旨在解决压缩域语义推理中因基于均方误差(MSE)优化的图像编码模型所导致的潜在空间语义丰富性不足问题,以及由此引发的下游任务中语义推理效果受限和需要对整个视觉模型进行微调所带来的计算开销过大的问题。其解决方案的关键在于提出感知导向的潜在编码(Perception-Oriented Latent Coding, POLC),通过增强潜在特征的语义内容,使得在保持高性能的同时,仅需使用即插即用的适配器进行微调,从而显著减少参数量和微调开销。

链接: https://arxiv.org/abs/2507.01608
作者: Xu Zhang,Ming Lu,Yan Chen,Zhan Ma
机构: Nanjing University (南京大学); Jiangsu Academy of Safety Science and Technology (江苏省安全科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: International Conference on Multimedia and Expo (ICME), 2025

点击查看摘要

Abstract:In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at this https URL.
zh

[CV-41] Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems

【速读】:该论文试图解决深度学习人脸识别系统中基于深度神经网络(DNN)的后门攻击问题,特别是针对现实场景下非受控环境中图像的攻击漏洞。解决方案的关键在于系统性地探索DNN后门在人脸识别流程中的可行性,包括首次演示了人脸检测任务中的两种后门攻击方法(人脸生成和人脸关键点偏移攻击),并验证了使用大间隔损失训练的人脸特征提取器同样可能遭受后门攻击。通过结合多种模型和配置,研究证明单个后门即可绕过系统的全部功能,从而揭示了现有系统的安全风险,并提出了相应的最佳实践和防御措施。

链接: https://arxiv.org/abs/2507.01607
作者: Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi,Eric Bourbao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread use of deep learning face recognition raises several security concerns. Although prior works point at existing vulnerabilities, DNN backdoor attacks against real-life, unconstrained systems dealing with images captured in the wild remain a blind spot of the literature. This paper conducts the first system-level study of backdoors in deep learning-based face recognition systems. This paper yields four contributions by exploring the feasibility of DNN backdoors on these pipelines in a holistic fashion. We demonstrate for the first time two backdoor attacks on the face detection task: face generation and face landmark shift attacks. We then show that face feature extractors trained with large margin losses also fall victim to backdoor attacks. Combining our models, we then show, using 20 possible pipeline configurations and 15 attack cases, that a single backdoor enables an attacker to bypass the entire function of a system. Finally, we provide stakeholders with several best practices and countermeasures.
zh

[CV-42] DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation ICCV2025

【速读】:该论文旨在解决长视频深度估计中因滑动窗口分割导致的尺度不一致和几何结构不一致问题。现有方法通过重叠滑动窗口处理长视频,但随着窗口数量增加,尺度差异累积严重,且仅依赖2D扩散先验,忽略了视频深度的内在3D几何结构,导致预测结果几何不一致。论文提出的解决方案是DepthSync,其关键在于引入尺度引导(scale guidance)以同步不同窗口间的深度尺度,并结合几何引导(geometry guidance)基于视频深度的内在3D约束强制窗口内的几何对齐,二者协同作用,引导去噪过程生成尺度与几何一致的深度预测。

链接: https://arxiv.org/abs/2507.01603
作者: Yue-Jiang Dong,Wang Zhao,Jiale Xu,Ying Shan,Song-Hai Zhang
机构: Tsinghua University (清华大学); ARC Lab, Tencent PCG (腾讯PCG人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
zh
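
下面是一个极简示意(非 DepthSync 实现,假设相邻窗口间存在若干重叠帧),演示"尺度引导"背后的基本思想:在重叠帧上用最小二乘估计尺度/偏移,把后续窗口对齐到前一窗口:

```python
# A minimal sketch (assumption-laden, not the DepthSync code) of scale
# synchronization across overlapping sliding windows of depth maps.
import numpy as np

def align_scale_shift(prev_overlap, curr_overlap):
    """Least-squares (s, t) so that s * curr + t ~= prev on the overlap.
    Inputs: (F, H, W) depth maps for the shared frames of two windows."""
    x, y = curr_overlap.ravel(), prev_overlap.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def synchronize_windows(windows, overlap):
    """Chain windows, re-scaling each to match its predecessor."""
    out = [windows[0]]
    for w in windows[1:]:
        s, t = align_scale_shift(out[-1][-overlap:], w[:overlap])
        out.append(s * w + t)
    return out

rng = np.random.default_rng(0)
base = rng.uniform(1, 10, (20, 32, 32))   # "true" depth of 20 frames
w1 = base[:12]                            # window 1, canonical scale
w2 = 1.7 * base[8:] + 0.3                 # window 2, drifted scale
synced = synchronize_windows([w1, w2], overlap=4)
print(np.abs(synced[1] - base[8:]).max())  # ~0 after alignment
```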

[CV-43] Autonomous AI Surveillance: Multimodal Deep Learning for Cognitive and Behavioral Monitoring

【速读】:该论文旨在解决课堂中学生注意力评估与行为监控的问题,通过多模态融合的方法提升对学习状态的感知能力。其解决方案的关键在于整合多模态深度学习技术,利用YOLOv8模型实现手机使用与睡眠状态的检测,并结合LResNet Occ FC进行面部识别及人体姿态跟踪,从而实现对学生参与度和行为的实时、全面监控。系统在专用数据集上进行训练,并通过PHP Web应用与ESP32-CAM硬件实现高效的数据采集与处理。

链接: https://arxiv.org/abs/2507.01590
作者: Ameer Hamza,Zuhaib Hussain But,Umar Arif,Samiya,M. Abdullah Asad,Muhammad Naeem
机构: Gift University (格里夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study presents a novel classroom surveillance system that integrates multiple modalities, including drowsiness, tracking of mobile phone usage, and face recognition, to assess student attentiveness with enhanced accuracy. The system leverages the YOLOv8 model to detect both mobile phone usage and sleep (Ghatge et al., 2024), while facial recognition is achieved through LResNet Occ FC body tracking using YOLO and MTCNN (Durai et al., 2024). These models work in synergy to provide comprehensive, real-time monitoring, offering insights into student engagement and behavior (S et al., 2023). The framework is trained on specialized datasets, such as the RMFD dataset for face recognition and a Roboflow dataset for mobile phone detection. The extensive evaluation of the system shows promising results. Sleep detection achieves 97.42% mAP@50, face recognition achieves 86.45% validation accuracy, and mobile phone detection reaches 85.89% mAP@50. The system is implemented within a core PHP web application and utilizes ESP32-CAM hardware for seamless data capture (Neto et al., 2024). This integrated approach not only enhances classroom monitoring, but also ensures automatic attendance recording via face recognition as students remain seated in the classroom, offering scalability for diverse educational environments (Banada, 2025).
zh

[CV-44] owards Controllable Real Image Denoising with Camera Parameters ICIP2025

【速读】:该论文试图解决现有基于深度学习的图像去噪方法缺乏根据噪声水平、相机设置和用户偏好调整去噪强度的灵活性问题。解决方案的关键在于引入一种可控制的去噪框架,通过利用相机参数(如ISO、快门速度和F-number)的信息来自适应地去除噪声,将这些参数转换为向量以控制并提升去噪网络的性能。

链接: https://arxiv.org/abs/2507.01587
作者: Youngjin Oh,Junhyeong Kwon,Keuntek Lee,Nam Ik Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted for publication in ICIP 2025, IEEE International Conference on Image Processing

点击查看摘要

Abstract:Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at this https URL.
zh
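
下面给出一个极简示意(非论文网络结构,通道数与嵌入方式均为示例假设),展示如何将 ISO、快门速度与 F-number 转换为条件向量,并以 FiLM 风格调制去噪网络的特征:

```python
# A minimal sketch (our guess at the mechanism, not the paper's network) of
# conditioning a denoiser on camera parameters via FiLM-style modulation.
import torch
import torch.nn as nn

class CameraConditionedDenoiser(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        # 3 camera parameters -> per-channel scale and shift
        self.film = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                  nn.Linear(64, 2 * ch))
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, noisy, iso, shutter_s, f_number):
        # log-scale the parameters so their ranges are comparable
        params = torch.stack([iso.log(), shutter_s.log(),
                              f_number.log()], dim=-1)
        scale, shift = self.film(params).chunk(2, dim=-1)
        feat = self.head(noisy)
        feat = feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return noisy - self.tail(self.body(feat))   # residual denoising

net = CameraConditionedDenoiser()
x = torch.rand(2, 3, 64, 64)
out = net(x, iso=torch.tensor([800., 3200.]),
          shutter_s=torch.tensor([1 / 60, 1 / 250]),
          f_number=torch.tensor([2.8, 8.0]))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```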

[CV-45] SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation

【速读】:该论文旨在解决高质量2D动画制作过程中人工绘制和上色大量帧所带来的高劳动强度问题。其解决方案的关键在于提出SketchColour,这是首个基于扩散变换器(DiT)主干网络的草图到上色流水线。通过将传统U-Net去噪器替换为DiT架构,并通过轻量级通道拼接适配器结合LoRA微调注入草图信息,该方法在不增加参数和内存负担的情况下原生集成条件信息,显著降低了参数数量和GPU内存使用。

链接: https://arxiv.org/abs/2507.01586
作者: Bryan Constantine Sadihin,Michael Hua Wang,Shei Pern Chua,Hang Su
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code: this https URL

点击查看摘要

Abstract:The production of high-quality 2D animation is a highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters accompanied by LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: this https URL.
zh
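
下面是一个极简示意(非 SketchColour 官方实现,潜变量通道数等为示例假设),展示通道拼接适配器的基本形态:将草图潜变量与噪声潜变量按通道拼接,再用零初始化的 1×1 卷积做残差注入,而无需复制整个主干网络:

```python
# A minimal sketch (assumptions ours) of channel-concatenation conditioning
# for a DiT denoiser: a light adapter instead of a duplicated ControlNet.
import torch
import torch.nn as nn

class ChannelConcatAdapter(nn.Module):
    def __init__(self, latent_ch=4):
        super().__init__()
        # zero-init so training starts from the unconditioned behaviour
        self.proj = nn.Conv2d(2 * latent_ch, latent_ch, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, noisy_latent, sketch_latent):
        fused = torch.cat([noisy_latent, sketch_latent], dim=1)
        return noisy_latent + self.proj(fused)   # residual injection

adapter = ChannelConcatAdapter(latent_ch=4)
noisy = torch.randn(1, 4, 32, 32)
sketch = torch.randn(1, 4, 32, 32)    # e.g. a VAE-encoded sketch frame
print(adapter(noisy, sketch).shape)   # feed this to the DiT denoiser
```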

[CV-46] A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

【速读】:该论文旨在解决遥感语义分割中同时准确识别地物类别及其精确边界定位的问题。现有方法主要依赖判别式学习,虽能有效捕捉低频特征,但在学习高频特征(如边界细节)方面存在局限。该研究的关键解决方案是将判别式学习与基于扩散的生成学习相结合,提出IDGBR框架,通过判别式主干网络生成粗略分割图,并利用条件引导网络和迭代去噪扩散过程对分割结果进行边界精修,从而提升模型在低频语义推理和高频边界定位方面的性能。

链接: https://arxiv.org/abs/2507.01573
作者: Hao Wang,Keyan Hu,Xin Guo,Haifeng Li,Chao Tao
机构: Central South University (中南大学); Inner Mongolia University (内蒙古大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 14 figures

点击查看摘要

Abstract:Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model’s ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework’s capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at this https URL.
zh

[CV-47] How Weight Resampling and Optimizers Shape the Dynamics of Continual Learning and Forgetting in Neural Networks

【速读】:该论文试图解决持续学习(continual learning)和小样本迁移学习(few-shot transfer learning)中模型在面对新领域时出现的遗忘问题,以及如何通过优化训练过程来改善模型的适应能力。其解决方案的关键在于“zapping”技术,即在训练过程中对神经网络最后一层的权重进行重采样,这一方法能够加速模型在迁移至新领域时的恢复过程,并有效缓解任务间的干扰与协同效应。

链接: https://arxiv.org/abs/2507.01559
作者: Lapo Frati,Neil Traft,Jeff Clune,Nick Cheney
机构: University of Vermont (佛蒙特大学); University of British Columbia (不列颠哥伦比亚大学); Vector Institute (向量研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent work in continual learning has highlighted the beneficial effect of resampling weights in the last layer of a neural network ("zapping"). Although empirical results demonstrate the effectiveness of this approach, the underlying mechanisms that drive these improvements remain unclear. In this work, we investigate in detail the patterns of learning and forgetting that take place inside a convolutional neural network when trained in challenging settings such as continual learning and few-shot transfer learning, with handwritten characters and natural images. Our experiments show that models that have undergone zapping during training recover more quickly from the shock of transferring to a new domain. Furthermore, to better observe the effect of continual learning in a multi-task setting we measure how each individual task is affected. This shows that not only zapping but also the choice of optimizer can deeply affect the dynamics of learning and forgetting, causing complex patterns of synergy/interference between tasks to emerge when the model learns sequentially at transfer time.
zh
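
"zapping"的核心操作本身十分简单。下面给出一个极简示意(重采样的时机与初始化方式为示例假设):在训练过程中按固定间隔对网络最后一层线性层的权重重新初始化:

```python
# A minimal sketch of "zapping": periodically re-initializing the last
# layer's weights during training (interval and init scheme are our choices).
import torch.nn as nn

def zap_last_layer(model: nn.Module):
    """Resample the final linear layer's weights in place."""
    last = None
    for m in model.modules():
        if isinstance(m, nn.Linear):
            last = m
    if last is not None:
        nn.init.kaiming_normal_(last.weight)
        nn.init.zeros_(last.bias)

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 10))
for step in range(1, 301):
    # ... one optimization step on the current task would go here ...
    if step % 100 == 0:          # zap at a fixed interval
        zap_last_layer(net)
```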

[CV-48] Interpolation-Based Event Visual Data Filtering Algorithms CVPR

【速读】:该论文旨在解决事件相机(event camera)数据流中存在显著噪声的问题,提出了一种能够去除约99%噪声同时保留大部分有效信号的事件数据处理方法。解决方案的关键在于基于无限脉冲响应(IIR)滤波器矩阵的四种算法,这些算法在多个经过人工噪声和动态视觉传感器噪声增强的事件数据集上进行了比较,展现出良好的降噪性能,并且具有较低的内存占用(适用于1280 x 720分辨率传感器仅需约30KB内存),适合嵌入式设备实现。

链接: https://arxiv.org/abs/2507.01557
作者: Marcin Kowlaczyk,Tomasz Kryjak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for publication at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, 2023. Copyright IEEE

点击查看摘要

Abstract:The field of neuromorphic vision is developing rapidly, and event cameras are finding their way into more and more applications. However, the data stream from these sensors is characterised by significant noise. In this paper, we propose a method for event data that is capable of removing approximately 99% of noise while preserving the majority of the valid signal. We propose four algorithms based on a matrix of infinite impulse response (IIR) filters. We compared them on several event datasets that were further modified by adding artificially generated noise and noise recorded with a dynamic vision sensor. The proposed methods use about 30 KB of memory for a sensor with a resolution of 1280 x 720 and are therefore well suited for implementation in embedded devices.
zh
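
下面是一个极简示意(是对"IIR 滤波器矩阵"思路的一种理解,并非论文原算法;时间常数与阈值为示例假设):每个像素维护一个指数衰减的一阶 IIR 状态,事件到达时若邻域状态显示近期存在相关活动则保留,否则视为噪声:

```python
# A minimal sketch (our interpretation, not the paper's exact algorithm) of
# a per-pixel matrix of first-order IIR filters for event denoising.
import numpy as np

def iir_filter_events(events, h, w, tau=0.01, thresh=0.5):
    """events: iterable of (t, x, y, polarity), sorted by time t (seconds).
    Returns the events judged to be signal."""
    state = np.zeros((h, w), dtype=np.float32)
    last_t = 0.0
    kept = []
    for t, x, y, p in events:
        state *= np.exp(-(t - last_t) / tau)   # exponential decay since last event
        last_t = t
        y0, y1 = max(0, y - 1), min(h, y + 2)
        x0, x1 = max(0, x - 1), min(w, x + 2)
        support = state[y0:y1, x0:x1].sum() - state[y, x]
        if support > thresh:                   # neighbours were recently active
            kept.append((t, x, y, p))
        state[y, x] += 1.0                     # charge this pixel's filter
    return kept

rng = np.random.default_rng(0)
noise = [(i * 1e-3, rng.integers(64), rng.integers(64), 1)
         for i in range(200)]                  # isolated random events
edge = [(0.2 + i * 1e-4, 10 + i % 5, 20, 1) for i in range(50)]  # moving edge
print(len(iir_filter_events(sorted(noise + edge), 64, 64)))
```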

[CV-49] A Multi-Centric Anthropomorphic 3D CT Phantom-Based Benchmark Dataset for Harmonization

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在医学影像分析中因数据分布变化而导致的泛化能力不足问题。其关键解决方案是通过AI调和(AI harmonization)技术,减少由不同扫描设备、重建技术和剂量等采集设置引起的分布偏移。为此,作者提出了一个开源基准数据集,包含使用多种扫描仪和设置获取的人体模型CT扫描数据,以促进AI调和技术的发展。

链接: https://arxiv.org/abs/2507.01539
作者: Mohammadreza Amirian,Michael Bach,Oscar Jimenez-del-Toro,Christoph Aberle,Roger Schaer,Vincent Andrearczyk,Jean-Félix Maestrati,Maria Martin Asiain,Kyriakos Flouris,Markus Obmann,Clarisse Dromain,Benoît Dufour,Pierre-Alexandre Alois Poletti,Hendrik von Tengg-Kobligk,Rolf Hügli,Martin Kretzschmar,Hatem Alkadhi,Ender Konukoglu,Henning Müller,Bram Stieltjes,Adrien Depeursinge
机构: Institute of Informatics, School of Management, HES-SO Valais-Wallis, Sierre, Switzerland; Lausanne University Hospital, Lausanne, Switzerland; University Hospital Basel, Basel, Switzerland; Idiap Research Institute, Martigny, Switzerland; Clinic of Radiology and Nuclear Medicine, University Hospital Basel, Basel, Switzerland; Computer Vision Lab, ETH Zurich, Zurich, Switzerland; Department of Radiology, Lausanne University Hospital, Lausanne, Switzerland; Groupe 3R, Lausanne-Épalinges Imaging Center, Lausanne, Switzerland; Faculty of Medicine, University of Geneva (UNIGE), Geneva, Switzerland; Inselspital Bern, University of Bern, Bern, Switzerland; Cantonal Hospital Baselland, Bruderholz, Switzerland; Schmerzklinik Basel, Basel, Switzerland; Diagnostic and Interventional Radiology, University Hospital Zurich, Switzerland; Faculty of Medicine, University of Geneva (UNIGE), Geneva, Switzerland; Clinic of Radiology and Nuclear Medicine, University Hospital Basel, Basel, Switzerland; Institute of Informatics, School of Management, HES-SO Valais-Wallis, Sierre, Switzerland; Nuclear Medicine and Molecular Imaging Department, Lausanne University Hospital, Lausanne, Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has introduced numerous opportunities for human assistance and task automation in medicine. However, it suffers from poor generalization in the presence of shifts in the data distribution. In the context of AI-based computed tomography (CT) analysis, significant data distribution shifts can be caused by changes in scanner manufacturer, reconstruction technique or dose. AI harmonization techniques can address this problem by reducing distribution shifts caused by various acquisition settings. This paper presents an open-source benchmark dataset containing CT scans of an anthropomorphic phantom acquired with various scanners and settings, whose purpose is to foster the development of AI harmonization techniques. Using a phantom allows fixing variations attributable to inter- and intra-patient differences. The dataset includes 1378 image series acquired with 13 scanners from 4 manufacturers across 8 institutions using a harmonized protocol as well as several acquisition doses. Additionally, we present a methodology, baseline results and open-source code to assess image- and feature-level stability and liver tissue classification, promoting the development of AI harmonization strategies.
zh

[CV-50] rackingMiM: Efficient Mamba-in-Mamba Serialization for Real-time UAV Object Tracking

【速读】:该论文旨在解决Vision Transformer (ViT)在无人机(UAV)跟踪系统中因二次复杂度带来的实时处理难题,同时探索State-Space Model Mamba在处理密集图像序列任务中的潜力。其解决方案的关键在于提出一种名为TrackingMiM的Mamba-in-Mamba架构,通过嵌套式Mamba扫描机制,独立处理时间与空间一致的块标记,并将模板帧编码为查询标记用于每次扫描,从而在保持高精度的同时显著提升处理速度。

链接: https://arxiv.org/abs/2507.01535
作者: Bingxi Liu,Calvin Chen,Junhao Li,Guyang Yu,Haoqian Song,Xuchen Liu,Jinqiang Cui,Hong Zhang
机构: Southern University of Science and Technology (南方科技大学); Peng Cheng Laboratory (鹏城实验室); University of Cambridge (剑桥大学); State Grid Corporation of China (国家电网公司); East China Institute of Computing Technology (华东计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:The Vision Transformer (ViT) model has long struggled with the challenge of quadratic complexity, a limitation that becomes especially critical in unmanned aerial vehicle (UAV) tracking systems, where data must be processed in real time. In this study, we explore the recently proposed State-Space Model, Mamba, leveraging its computational efficiency and capability for long-sequence modeling to effectively process dense image sequences in tracking tasks. First, we highlight the issue of temporal inconsistency in existing Mamba-based methods, specifically the failure to account for temporal continuity in the Mamba scanning mechanism. Secondly, building upon this insight, we propose TrackingMiM, a Mamba-in-Mamba architecture with minimal computational burden for handling the image sequences of the tracking problem. In our framework, the Mamba scan is performed in a nested way while independently processing temporally and spatially coherent patch tokens. The template frame is encoded as a query token and utilized for tracking in every scan. Extensive experiments conducted on five UAV tracking benchmarks confirm that the proposed TrackingMiM achieves state-of-the-art precision while offering noticeably higher speed in UAV tracking.
zh

[CV-51] Exploring Pose-based Sign Language Translation: Ablation Studies and Attention Insights CVPR2025

【速读】:该论文旨在解决手语翻译(Sign Language Translation, SLT)系统在处理姿态数据时的性能优化问题,特别是通过姿态数据预处理技术提升模型的鲁棒性和泛化能力。其解决方案的关键在于采用基于姿态的预处理方法,包括归一化、插值和增强,并结合基于Transformer的架构,对修改后的T5编码器-解码器模型进行姿态表示的处理,从而有效提升翻译准确性。

链接: https://arxiv.org/abs/2507.01532
作者: Tomas Zelezny,Jakub Straka,Vaclav Javorek,Ondrej Valach,Marek Hruz,Ivan Gruber
机构: University of West Bohemia (西波希米亚大学); Faculty of Applied Sciences (应用科学学院); Department of Cybernetics (控制论系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, supplementary, SLRTP2025, CVPR2025

点击查看摘要

Abstract:Sign Language Translation (SLT) has evolved significantly, moving from isolated recognition approaches to complex, continuous gloss-free translation systems. This paper explores the impact of pose-based data preprocessing techniques - normalization, interpolation, and augmentation - on SLT performance. We employ a transformer-based architecture, adapting a modified T5 encoder-decoder model to process pose representations. Through extensive ablation studies on YouTubeASL and How2Sign datasets, we analyze how different preprocessing strategies affect translation accuracy. Our results demonstrate that appropriate normalization, interpolation, and augmentation techniques can significantly improve model robustness and generalization abilities. Additionally, we provide a deep analysis of the model’s attentions and reveal interesting behavior suggesting that adding a dedicated register token can improve overall model performance. We publish our code on our GitHub repository, including the preprocessed YouTubeASL data.
zh
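
下面给出论文所研究的两类姿态预处理步骤的极简示意(以肩部关键点为归一化基准等约定是我们的示例假设):逐帧归一化到以身体为中心的坐标系,并对缺失关键点做时间维的线性插值:

```python
# A minimal sketch of two pose preprocessing steps studied in the paper:
# body-centred normalization and temporal interpolation of missing points.
# COCO-style keypoint indices (5/6 = shoulders) are our assumption.
import numpy as np

def normalize_pose(seq, left_sh=5, right_sh=6):
    """seq: (T, K, 2) keypoints. Centre on the shoulder midpoint and scale
    by shoulder width, per frame."""
    mid = (seq[:, left_sh] + seq[:, right_sh]) / 2                # (T, 2)
    width = np.linalg.norm(seq[:, left_sh] - seq[:, right_sh], axis=-1)
    return (seq - mid[:, None, :]) / (width[:, None, None] + 1e-6)

def interpolate_missing(seq):
    """Fill frames where a keypoint is NaN by linear interpolation in time."""
    T, K, D = seq.shape
    t = np.arange(T)
    out = seq.copy()
    for k in range(K):
        for d in range(D):
            v = out[:, k, d]
            ok = ~np.isnan(v)
            if ok.any():
                out[:, k, d] = np.interp(t, t[ok], v[ok])
    return out

seq = np.random.rand(100, 17, 2)
seq[40:45, 9] = np.nan                  # a briefly occluded wrist
clean = normalize_pose(interpolate_missing(seq))
print(clean.shape)                      # (100, 17, 2)
```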

[CV-52] SafePTR: Token-Level Jailbreak Defense in Multimodal LLM s via Prune-then-Restore Mechanism

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在引入视觉输入后所面临的新型安全漏洞问题,特别是多模态越狱攻击(multimodal jailbreak attacks)带来的安全隐患。现有防御方法如图像到文本翻译、安全提示和多模态安全微调等,虽尝试通过对齐多模态输入与语言模型内置的安全机制来提升安全性,但未能深入揭示多模态漏洞的根本原因,尤其是有害多模态标记如何触发越狱行为。为此,论文提出了一种无需训练的防御框架——Safe Prune-then-Restore (SafePTR),其关键在于在脆弱层中选择性地剪除有害标记,同时在后续层恢复良性特征,从而在不增加计算开销的情况下显著提升MLLMs的安全性。

链接: https://arxiv.org/abs/2507.01513
作者: Beitao Chen,Xinyu Lyu,Lianli Gao,Jingkuan Song,Heng Tao Shen
机构: Southwestern University of Finance and Economics, Chengdu, China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; Center for Future Media, University of Electronic Science and Technology of China; Tongji University; Engineering Research Center of Intelligent Finance, Ministry of Education
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs’ built-in safeguards. However, they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training costs. To bridge this gap, we present a comprehensive analysis of where, how and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% of tokens in early-middle layers are responsible for inducing unsafe behaviors, highlighting that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR’s state-of-the-art performance in mitigating jailbreak risks without compromising utility.
zh

[CV-53] Mamba Guided Boundary Prior Matters: A New Perspective for Generalized Polyp Segmentation MICCAI-2025

【速读】:该论文旨在解决结肠镜图像中息肉分割的挑战,尤其是针对息肉边界模糊或不清晰导致的分割性能不稳定问题。现有基于编码器-解码器卷积神经网络和Transformer的方法在区分息肉与非息肉以及捕捉关键边界信息方面存在局限,且泛化能力不足,难以满足实时临床应用的需求。论文提出的解决方案是SAM-MaGuP,其关键在于引入了边界蒸馏模块和1D-2D Mamba适配器,通过增强全局上下文交互来提升特征学习能力,从而有效处理弱边界问题并提高分割精度与鲁棒性。

链接: https://arxiv.org/abs/2507.01509
作者: Tapas K. Dutta,Snehashis Majhi,Deepak Ranjan Nayak,Debesh Jha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 2 figures, MICCAI-2025

点击查看摘要

Abstract:Polyp segmentation in colonoscopy images is crucial for early detection and diagnosis of colorectal cancer. However, this task remains a significant challenge due to the substantial variations in polyp shape, size, and color, as well as the high similarity between polyps and surrounding tissues, often compounded by indistinct boundaries. While existing encoder-decoder CNN and transformer-based approaches have shown promising results, they struggle with stable segmentation performance on polyps with weak or blurry boundaries. These methods exhibit limited abilities to distinguish between polyps and non-polyps and capture essential boundary cues. Moreover, their generalizability still falls short of meeting the demands of real-time clinical applications. To address these limitations, we propose SAM-MaGuP, a groundbreaking approach for robust polyp segmentation. By incorporating a boundary distillation module and a 1D-2D Mamba adapter within the Segment Anything Model (SAM), SAM-MaGuP excels at resolving weak boundary challenges and amplifies feature learning through enriched global contextual interactions. Extensive evaluations across five diverse datasets reveal that SAM-MaGuP outperforms state-of-the-art methods, achieving unmatched segmentation accuracy and robustness. Our key innovations, a Mamba-guided boundary prior and a 1D-2D Mamba block, set a new benchmark in the field, pushing the boundaries of polyp segmentation to new heights.
zh

[CV-54] Integrating Traditional and Deep Learning Methods to Detect Tree Crowns in Satellite Images

【速读】:该论文试图解决森林监测不足的问题,以应对全球变暖、生物多样性丧失和空气污染等环境挑战。其解决方案的关键在于结合传统方法与深度学习方法,提出一种基于规则的新型树冠检测方法,通过融合特征提取与分割以及树冠检测的优势,提升检测结果的鲁棒性和准确性,并通过邻近树木和局部操作进行后处理以增加检测到的树冠数量。

链接: https://arxiv.org/abs/2507.01502
作者: Ozan Durgut,Beril Kallfelz-Sirmacek,Cem Unsalan
机构: Analog Devices (模拟器件公司); University of Mary (玛丽大学); Yeditepe University (伊斯坦布尔耶迪特佩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, journal manuscript

点击查看摘要

Abstract:Global warming, loss of biodiversity, and air pollution are among the most significant problems facing Earth. One of the primary challenges in addressing these issues is the lack of forest monitoring needed to protect forests. To tackle this problem, it is important to leverage remote sensing and computer vision methods to automate monitoring applications. Hence, automatic tree crown detection algorithms based on traditional and deep learning methods have emerged. In this study, we first introduce two different tree crown detection methods based on these approaches. Then, we form a novel rule-based approach that integrates these two methods to enhance the robustness and accuracy of tree crown detection results. While traditional methods are employed for feature extraction and segmentation of forested areas, deep learning methods are used to detect tree crowns in our method. With the proposed rule-based approach, we post-process these results, aiming to increase the number of detected tree crowns through neighboring trees and localized operations. We compare the obtained results with the proposed method in terms of the number of detected tree crowns and report the advantages, disadvantages, and areas for improvement of the obtained outcomes.
zh

[CV-55] ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation ICCV2025

【速读】:该论文试图解决将生成式 AI (Generative AI) 中的 Rectified Flow(ReFlow)模型应用于真实图像编辑的挑战,尤其是在保持图像结构的同时实现高质量的文本对齐。其解决方案的关键在于分析多模态 Transformer 块的中间表示,识别出三个关键特征,并通过利用仅在中间步骤进行反演的中步潜在表示来提取这些特征,同时在注入过程中调整注意力机制以提升可编辑性并增强与目标文本的对齐效果。

链接: https://arxiv.org/abs/2507.01496
作者: Jimyeong Kim,Jungwon Park,Yeji Song,Nojun Kwak,Wonjong Rhee
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICCV 2025. Project page: this https URL

点击查看摘要

Abstract:Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.
zh

[CV-56] Crop Pest Classification Using Deep Learning Techniques: A Review

【速读】:该论文旨在解决传统昆虫害虫监测方法在速度、人工依赖性和可扩展性方面的不足,提出基于人工智能(Artificial Intelligence, AI)的害虫分类方法作为解决方案。其关键在于利用深度学习技术,特别是卷积神经网络(Convolutional Neural Networks, CNNs)、视觉变压器(Vision Transformers, ViTs)及混合模型,实现害虫检测的自动化与高精度。研究还指出,最新工作逐渐转向更具上下文理解能力的混合和变压器基模型,以提升识别准确率,并探讨了数据集不平衡、小目标检测困难、泛化能力有限及边缘设备部署等关键技术挑战。

链接: https://arxiv.org/abs/2507.01494
作者: Muhammad Hassam Ejaz,Muhammad Bilal,Usman Habib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. Early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.
zh

[CV-57] AVC-DPO: Aligned Video Captioning via Direct Preference Optimization

【速读】:该论文试图解决视频多模态大语言模型(video MLLMs)在视频字幕生成任务中难以根据人类偏好调整字幕焦点的问题。解决方案的关键在于提出一种名为Aligned Video Captioning via Direct Preference Optimization (AVC-DPO)的后训练框架,通过直接偏好优化实现字幕与人类偏好的对齐。该方法设计了针对时间动态和空间信息的增强提示,以捕捉人类观看视频时关注的核心因素,并利用同一基础模型在不同提示条件下的字幕生成响应进行偏好感知的训练与字幕对齐,从而提升字幕生成的质量。

链接: https://arxiv.org/abs/2507.01492
作者: Jiyang Tang,Hengyi Li,Yifan Du,Wayne Xin Zhao
机构: Renmin University of China (中国人民大学); Nankai University (南开大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information, two key factors that humans care about when watching a video, thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model’s caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. Using this framework, we have achieved exceptional performance in the LOVE@CVPR’25 Workshop Track 1A: Video Detailed Captioning Challenge, achieving first place on the Video Detailed Captioning (VDC) benchmark according to the VDCSCORE evaluation metric.
zh

[CV-58] What Really Matters for Robust Multi-Sensor HD Map Construction? IROS2025

【速读】:该论文旨在解决高精度(HD)地图构建中多模态融合方法的鲁棒性不足问题,尽管现有方法在提升模型准确性方面取得了一定进展,但在实际应用中对感知模型鲁棒性的关注较少。论文提出的解决方案关键在于引入三个核心组件:数据增强、一种新型多模态融合模块以及模态丢弃训练策略,以在保持高精度的同时显著提升多模态融合方法的鲁棒性。

链接: https://arxiv.org/abs/2507.01484
作者: Xiaoshuai Hao,Yuting Zhao,Yuheng Ji,Luanyuan Dai,Peng Hao,Dingzhe Li,Shuai Cheng,Rong Yin
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); Institute of Automation, Chinese Academy of Science (中国科学院自动化研究所); Nanjing University of Science and Technology (南京理工大学); Samsung R&D Institute China–Beijing (三星中国研究院); China North Artificial Intelligent & Innovation Research Institute (中国北方人工智能与创新研究院); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS 2025

点击查看摘要

Abstract:High-definition (HD) map construction methods are crucial for providing precise and comprehensive static environmental information, which is essential for autonomous driving systems. While Camera-LiDAR fusion techniques have shown promising results by integrating data from both modalities, existing approaches primarily focus on improving model accuracy and often neglect the robustness of perception models, which is a critical aspect for real-world applications. In this paper, we explore strategies to enhance the robustness of multi-modal fusion methods for HD map construction while maintaining high accuracy. We propose three key components: data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy. These components are evaluated on a challenging dataset containing 10 days of NuScenes data. Our experimental results demonstrate that our proposed methods significantly enhance the robustness of baseline methods. Furthermore, our approach achieves state-of-the-art performance on the clean validation set of the NuScenes dataset. Our findings provide valuable insights for developing more robust and reliable HD map construction models, advancing their applicability in real-world autonomous driving scenarios. Project website: this https URL.
zh

[CV-59] Active Control Points-based 6DoF Pose Tracking for Industrial Metal Objects

【速读】:该论文旨在解决工业金属物体在真实环境中的位姿跟踪问题,这一任务由于金属物体的反射特性而具有挑战性。解决方案的关键在于提出一种基于主动控制点的6自由度(6DoF)位姿跟踪方法,该方法通过图像控制点主动生成边缘特征进行优化,而非依赖于基于6DoF位姿的渲染,并将这些特征作为优化变量。此外,引入了最优控制点回归方法以提升鲁棒性。

链接: https://arxiv.org/abs/2507.01478
作者: Chentao Shen,Ding Pan,Mingyu Mei,Zaixing He,Xinyue Zhao
机构: Zhejiang University (浙江大学); Westlake Unversity (西湖大学); School of Mechanical Engineering (机械工程学院); The State Key Lab of Fluid Power and Mechatronic Systems (流体动力与机电系统国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint version

点击查看摘要

Abstract:Visual pose tracking has been playing an increasingly vital role in industrial contexts in recent years. However, pose tracking for industrial metal objects remains a challenging task, especially in real-world environments, due to the reflective characteristics of metal objects. To address this issue, we propose a novel 6DoF pose tracking method based on active control points. The method uses image control points to actively generate edge features for optimization, instead of relying on 6DoF pose-based rendering, and treats these control points as optimization variables. We also introduce an optimal control point regression method to improve robustness. The proposed tracking method performs effectively in both dataset evaluation and real-world tasks, providing a viable solution for real-time tracking of industrial metal objects. Our source code is made publicly available at: this https URL.
zh

[CV-60] Optimizing Methane Detection On Board Satellites: Speed Accuracy and Low-Power Solutions for Resource-Constrained Hardware

【速读】:该论文旨在解决通过高光谱卫星图像早期检测甲烷泄漏以缓解气候变化的问题,同时应对现有任务执行方式效率低下及下行数据速率受限的挑战。其解决方案的关键在于开发高效、低功耗的算法,以适应资源受限的星载硬件环境。研究测试了未曾在甲烷检测中应用过的快速目标检测方法(ACE、CEM),并提出了一种改进的Mag1c-SAS算法,相较于当前最先进的Mag1c算法,在资源受限的硬件上实现了高达约100倍和230倍的计算速度提升,同时保持了较高的检测精度。此外,还提出了三种波段选择策略,其中一种在减少通道数的同时提升了传统方法的性能,进一步提高了处理速度而不牺牲准确性。

链接: https://arxiv.org/abs/2507.01472
作者: Jonáš Herec,Vít Růžička,Rado Pitoňák
机构: Zaitra s.r.o.(Zaitra s.r.o.); University of Oxford(牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
备注: This is a preprint of a paper accepted for the EDHPC 2025 Conference

点击查看摘要

Abstract:Methane is a potent greenhouse gas, and detecting its leaks early via hyperspectral satellite imagery can help mitigate climate change. Meanwhile, many existing missions operate in manual tasking regimes only, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane enhancement methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. We test fast target detection methods (ACE, CEM) that have not been previously used for methane detection and propose Mag1c-SAS, a significantly faster variant of the current state-of-the-art algorithm for methane detection: Mag1c. To explore their true detection potential, we integrate them with a machine learning model (U-Net, LinkNet). Our results identify two promising candidates (Mag1c-SAS and CEM), both acceptably accurate for the detection of strong plumes and computationally efficient enough for onboard deployment: one optimized more for accuracy, the other more for speed, achieving up to ~100x and ~230x faster computation than the original Mag1c on resource-limited hardware. Additionally, we propose and evaluate three band selection strategies. One of them can outperform the method traditionally used in the field while using fewer channels, leading to even faster processing without compromising accuracy. This research lays the foundation for future advancements in onboard methane detection with minimal hardware requirements, improving timely data delivery. The produced code, data, and models are open-sourced and can be accessed from this https URL.
zh
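
为帮助理解文中提到的匹配滤波类检测器,下面给出 ACE 与 CEM 通用形式的极简 NumPy 示意(基于经典定义的草图,并非论文官方实现;背景统计量的稳健估计、正则化以及论文中的 Mag1c-SAS 改动均从略):

```python
import numpy as np

def cem_scores(X, t):
    """CEM: X 为 (N, B) 像素光谱, t 为 (B,) 目标(甲烷)特征谱。
    滤波器 w = R^{-1} t / (t^T R^{-1} t), R 为未去均值的背景自相关矩阵。"""
    R = X.T @ X / X.shape[0]
    Rinv_t = np.linalg.solve(R, t)
    w = Rinv_t / (t @ Rinv_t)
    return X @ w                                  # (N,) 每个像素的检测分数

def ace_scores(X, t):
    """ACE: 对去均值数据计算自适应余弦估计分数(方向匹配, 对幅度不敏感)。"""
    mu = X.mean(axis=0)
    Xc, tc = X - mu, t - mu
    Sinv = np.linalg.inv(np.cov(Xc, rowvar=False))
    num = (Xc @ Sinv @ tc) ** 2
    den = (tc @ Sinv @ tc) * np.einsum('ij,jk,ik->i', Xc, Sinv, Xc)
    return num / np.maximum(den, 1e-12)
```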

[CV-61] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

【速读】:该论文旨在解决扩散模型训练过程中因缺乏有效对齐而导致的生成质量和训练效率不足的问题。其解决方案的关键在于提出一种名为Representation Entanglement for Generation (REG) 的方法,该方法通过将低级图像潜在表示与预训练基础模型中的单一高级类别标记进行纠缠,从而在去噪过程中直接生成一致的图像-类别对,显著提升了生成质量与训练效率。

链接: https://arxiv.org/abs/2507.01467
作者: Ge Wu,Shen Zhang,Ruijing Shi,Shanghua Gao,Zhenyuan Chen,Lei Wang,Zhaowei Chen,Hongcheng Gao,Yao Tang,Jian Yang,Ming-Ming Cheng,Xiang Li
机构: Nankai University (南开大学); JIIOV Technology (JIIOV科技); Harvard University (哈佛大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63× and 23× faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10× longer). Code is available at: this https URL.
zh
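
按摘要描述,REG 的核心操作是在去噪时将一个来自预训练基础模型的高层类别 token 与低层图像潜在 token 拼接后联合处理。下面是一个极简 PyTorch 示意(TinyBlock 结构、维度等均为演示假设,与 SiT 的真实结构无关):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """极简 Transformer 块, 仅为演示 token 拼接后的联合处理。"""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

def reg_denoise(blocks, x_latents, cls_token):
    tokens = torch.cat([x_latents, cls_token], dim=1)   # 仅增加 1 个 token
    for blk in blocks:
        tokens = blk(tokens)                            # 图像潜变量与全局语义联合去噪
    return tokens[:, :-1], tokens[:, -1:]               # 分别取回图像部分与语义部分

blocks = nn.ModuleList([TinyBlock(64) for _ in range(2)])
img, sem = reg_denoise(blocks, torch.randn(2, 16, 64), torch.randn(2, 1, 64))
```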

[CV-62] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation NEURIPS2025

【速读】:该论文旨在解决在仅提供每个物体的一些示例图像的情况下,对RGB图像中新型物体实例进行实例分割的问题。解决方案的关键在于提出了一种名为Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS)的框架,该框架利用Grounded-SAM 2获取精确的边界框和分割掩码,并借助DINOv2的零样本能力生成图像嵌入。通过计算类别嵌入和块嵌入的相似性来实现提议-物体匹配,并引入边界框和掩码的平均置信度作为额外加权因子,从而提升分割性能。

链接: https://arxiv.org/abs/2507.01463
作者: Max Gandyra,Alessandro Santonicola,Michael Beetz
机构: University Bremen (不来梅大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 3 tables, NeurIPS 2025 preprint

点击查看摘要

Abstract:Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects, without (re-)training, has proven to be a difficult task. To handle this, we propose a simple, yet powerful, framework, called: Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). This work stems from and improves upon previous ones like CNOS, SAM-6D and NIDS-Net; thus, it also leverages recent vision foundation models, namely: Grounded-SAM 2 and DINOv2. It utilises Grounded-SAM 2 to obtain object proposals with precise bounding boxes and their corresponding segmentation masks, while DINOv2's zero-shot capabilities are employed to generate the image embeddings. The quality of those masks, together with their embeddings, is of vital importance to our approach, as the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings. Differently from SAM-6D, calculating the latter involves a prior patch filtering based on the distance between each patch and its corresponding cyclic/roundtrip patch in the image grid. Furthermore, the average confidence of the proposals' bounding box and mask is used as an additional weighting factor for the object matching score. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
zh
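
摘要中的提议-物体匹配分数由类别嵌入相似度、平均最大块相似度与提议置信度共同决定。下面给出该打分思想的 NumPy 草图(权重组合方式为演示假设,循环/往返块过滤步骤从略):

```python
import numpy as np

def matching_score(cls_p, patch_p, cls_o, patch_o, box_conf, mask_conf):
    """cls_*: (D,) 类别嵌入; patch_*: (Np, D) / (No, D) 块嵌入, 均已 L2 归一化。
    分数 = (类别相似度 + 平均最大块相似度) 的组合, 再以提议置信度加权。"""
    s_cls = float(cls_p @ cls_o)
    s_patch = (patch_p @ patch_o.T).max(axis=1).mean()   # 平均最大块相似度
    conf = 0.5 * (box_conf + mask_conf)                  # 边界框/掩码平均置信度
    return 0.5 * (s_cls + s_patch) * conf
```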

[CV-63] OoDDINO: A Multi-level Framework for Anomaly Segmentation on Complex Road Scenes

【速读】:该论文旨在解决现有像素级异常分割方法在实际应用中面临的两个主要问题:一是忽视同一物体内部像素间的空间相关性,导致分割结果碎片化;二是由于异常分数分布的区域差异性,全局阈值策略容易在背景区域产生误报或遗漏异常物体的某些部分。其解决方案的关键在于提出OoDDINO框架,该框架采用从粗到细的异常检测策略,结合不确定性引导的异常检测模型与像素级分割模型,在两阶段级联架构中通过正交不确定性感知融合策略(OUAFS)增强局部异常区域定位能力,并引入自适应双阈值网络(ADT-Net)动态生成区域特定阈值,从而实现精细的异常分割。

链接: https://arxiv.org/abs/2507.01455
作者: Yuxing Liu,Ji Zhang,Zhou Xuchuan,Jingzhong Xiao,Huimin Yang,Jiaxin Zhong
机构: Southwest Minzu University(西南民族大学); Ministry of Education(教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variability in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models, acting as a plug-in to boost their performance. Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods.
zh
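
ADT-Net 的"双阈值"思想可以用如下 NumPy 草图直观说明:前景(检测到的异常区域)与背景分别使用不同阈值。实际系统中阈值由网络基于物体级输出与像素分数动态回归,此处以固定常数代替,仅为演示假设:

```python
import numpy as np

def dual_threshold_segment(score_map, boxes, t_fg=0.3, t_bg=0.7):
    """score_map: (H, W) 像素级异常分数; boxes: [(x0, y0, x1, y1), ...] 物体级检测框。
    背景用较严格阈值 t_bg 抑制误报, 检测框内用较宽松阈值 t_fg 补全异常物体。"""
    mask = score_map > t_bg
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = score_map[y0:y1, x0:x1] > t_fg
    return mask
```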

[CV-64] TurboReg: TurboClique for Robust and Efficient Point Cloud Registration ICCV-2025

【速读】:该论文旨在解决基于对应关系的点云配准(Point Cloud Registration, PCR)中鲁棒估计的问题,特别是现有方法在兼容图中使用最大团搜索虽能实现高召回率但存在指数级时间复杂度,限制了其在时间敏感应用中的使用。解决方案的关键在于提出一种快速且鲁棒的估计器TurboReg,其核心是基于一种轻量级团结构TurboClique和一种高度可并行化的Pivot-Guided Search (PGS)算法。TurboClique通过在高度约束的兼容图中定义3-团,实现了高效并行搜索和稳定的变换估计;而PGS通过选择高SC²得分的匹配对作为pivot,有效引导搜索至具有更高内点比例的TurboCliques,同时具备线性时间复杂度,显著提升了效率。

链接: https://arxiv.org/abs/2507.01439
作者: Shaocheng Yan,Pengcheng Shi,Zhenjun Zhao,Kaixin Wang,Kuang Cao,Ji Wu,Jiayuan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV-2025 Accepted Paper

点击查看摘要

Abstract:Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC² scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates 208.22× faster than 3DMAC while also achieving higher recall. Our code is accessible at this https URL.
zh
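
下面用 NumPy 勾勒"兼容图 + SC² 引导 + 3-团(TurboClique)"的主干流程(简化草图:以一阶长度一致性建图、以高分边为 pivot 枚举第三点;每个 3-团随后可用 Kabsch 算法求解一个候选刚体变换,论文中的并行化与实现细节从略):

```python
import numpy as np

def compatibility(src, dst, tau=0.05):
    """src/dst: (N, 3) 假定匹配的点对; 两两长度一致性构成兼容图。"""
    d_src = np.linalg.norm(src[:, None] - src[None], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None], axis=-1)
    C = (np.abs(d_src - d_dst) < tau).astype(float)
    np.fill_diagonal(C, 0.0)               # 去掉自环
    return C

def pivot_guided_3cliques(C, k=100):
    """以 SC² 风格打分选 pivot 边, 枚举与两端点同时相容的第三点构成 3-团。"""
    sc2 = (C @ C) * C                      # (i, j) 的共同邻居数, 仅保留相容边
    flat = np.argsort(sc2, axis=None)[::-1][:k]
    cliques = []
    for a, b in zip(*np.unravel_index(flat, C.shape)):
        if a >= b or sc2[a, b] == 0:       # 去重并跳过不相容的 pivot
            continue
        for c in np.flatnonzero(C[a] * C[b]):
            cliques.append((int(a), int(b), int(c)))  # 每个团给出一个候选变换
    return cliques
```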

[CV-65] DiffMark: Diffusion-based Robust Watermark Against Deepfakes

【速读】:该论文试图解决深度伪造(Deepfake)技术带来的安全与隐私威胁,特别是现有水印方法在对抗深度伪造操作时鲁棒性不足的问题。其解决方案的关键在于提出一种基于扩散模型的鲁棒水印框架DiffMark,通过修改训练和采样方案,将人脸图像和水印作为条件引导扩散模型逐步去噪并生成带有水印的图像。此外,通过引入时间步依赖的面部条件权重、跨信息融合模块以及冻结自编码器和对抗性引导机制,显著提升了水印在面对深度伪造攻击时的鲁棒性。

链接: https://arxiv.org/abs/2507.01428
作者: Chen Sun,Haiyang Sun,Zhiqing Guo,Yunfeng Diao,Liejun Wang,Dan Ma,Gaobo Yang,Keqin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of the watermark with the image during generation. In this study, we propose a novel robust watermarking framework based on a diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate the corresponding watermarked image. In the construction of the facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity with the decrease of noise, thus better adapting to the sampling process of the diffusion model. To achieve the fusion of the watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during the training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs a specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at this https URL.
zh

[CV-66] DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal

【速读】:该论文旨在解决文档图像增强中的阴影去除问题,特别是针对颜色阴影的处理。现有方法通常仅适用于常色背景的阴影去除,而忽略了颜色阴影的复杂性。其解决方案的关键在于设计了一个基于潜在空间的扩散模型DocShaDiffusion,通过将阴影图像从像素空间转换到潜在空间,以更有效地捕捉关键特征;同时引入了阴影软掩码生成模块(SSGM)和阴影掩码感知引导扩散模块(SMGDM),分别用于生成精确的阴影掩码并指导扩散与去噪过程,从而实现对颜色阴影的有效去除。

链接: https://arxiv.org/abs/2507.01422
作者: Wenjie Liu,Bingshu Wang,Ze Wang,C.L. Philip Chen
机构: Northwestern Polytechnical University (西北工业大学); South China University of Technology (华南理工大学); Pazhou Lab (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Document shadow removal is a crucial task in the field of document image enhancement. However, existing methods tend to remove shadows with constant-color backgrounds and ignore color shadows. In this paper, we first design a diffusion model in latent space for document image shadow removal, called DocShaDiffusion. It translates shadow images from pixel space to latent space, enabling the model to more easily capture essential features. To address the issue of color shadows, we design a shadow soft-mask generation module (SSGM). It is able to produce accurate shadow masks and specifically add noise into shadow regions. Guided by the shadow mask, a shadow mask-aware guided diffusion module (SMGDM) is proposed to remove shadows from document images by supervising the diffusion and denoising process. We also propose a shadow-robust perceptual feature loss to preserve details and structures in document images. Moreover, we develop a large-scale synthetic document color shadow removal dataset (SDCSRD). It simulates the distribution of realistic color shadows and provides powerful support for the training of models. Experiments on three public datasets validate the proposed method's superiority over the state of the art. Our code and dataset will be publicly available.
zh

[CV-67] Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention ICCV2025

【速读】:该论文旨在解决开放世界环境中深度模型部署时的分布外(Out-of-Distribution, OOD)检测问题,即在输入数据可能偏离训练分布的情况下,如何有效识别这些异常样本。其解决方案的关键在于观察到在分布内(In-Distribution, ID)样本附近,梯度方向在“增强”预测类别时具有相对一致性,而OOD样本则表现出无序或冲突的梯度方向。基于此现象,作者提出了一种推理阶段的技术,通过截断那些被虚假梯度利用以提高OOD置信度的特征坐标,同时保持ID分类的准确性,并引入局部一阶近似以避免二次前向传播,从而实现高效且轻量的OOD检测方法。

链接: https://arxiv.org/abs/2507.01417
作者: Jiawei Gu,Ziyue Qiao,Zechao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for “enhancing” that sample’s predicted class remain relatively consistent, whereas OOD samples–unseen in training–exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
zh
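
以下 PyTorch 草图演示"梯度短路 + 一阶 logits 修正"的思路(假设分类头为线性层,此时该修正是精确的,论文用一阶近似覆盖更一般的情形;可疑坐标的选择准则此处以梯度幅值 top-k 代替,与论文的具体判据可能不同):

```python
import torch
import torch.nn as nn

def short_circuit_logits(feat, fc: nn.Linear, k=10):
    """feat: (B, D) 倒数第二层特征; fc: 线性分类头。
    将对最大 logit 梯度幅值最大的 k 个特征坐标"短路"(置零),
    再用 logits' = logits - delta @ W^T 直接修正, 省去第二次前向。"""
    feat = feat.detach().requires_grad_(True)
    logits = fc(feat)
    (grad,) = torch.autograd.grad(logits.max(dim=1).values.sum(), feat)
    idx = grad.abs().topk(k, dim=1).indices                    # 演示用的坐标准则
    delta = torch.zeros_like(feat).scatter_(1, idx, feat.gather(1, idx).detach())
    return (logits - delta @ fc.weight.T).detach()             # 修正后用于 OOD 打分

fc = nn.Linear(128, 10)
scores = short_circuit_logits(torch.randn(4, 128), fc).softmax(-1).max(-1).values
```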

[CV-68] CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning ICCV2025

【速读】:该论文试图解决生成图像描述时对生成描述的属性(如描述性、长度和词汇独特性)进行细粒度控制的问题。现有模型在训练过程中未将这些属性作为条件输入,并且无法平滑地在不同语言模式之间转换。解决方案的关键在于提出一种名为CaptionSmiths的新方法,该方法通过量化每条描述的三个属性为连续标量值,并利用端点向量之间的插值来实现对生成描述属性的灵活控制,从而实现对输出描述属性的平滑调整。

链接: https://arxiv.org/abs/2507.01409
作者: Kuniaki Saito,Donghyun Kim,Kwanyong Park,Atsushi Hashimoto,Yoshitaka Ushiku
机构: OMRON SINIC X (OMRON SINIC X); Korea University (韩国大学); University of Seoul (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025

点击查看摘要

Abstract:An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment. Code will be available on this https URL.
zh
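
摘要中的"端点向量插值"条件机制可以写成一个非常小的模块,示意如下(维度与初始化均为演示假设):

```python
import torch
import torch.nn as nn

class PropertyCondition(nn.Module):
    """将 [0, 1] 的标量属性(长度/描述性/用词独特性)映射为两端点向量的插值。"""
    def __init__(self, d):
        super().__init__()
        self.lo = nn.Parameter(torch.randn(d))   # "极短/极简"端点向量
        self.hi = nn.Parameter(torch.randn(d))   # "极长/极详"端点向量
    def forward(self, s):                        # s: (B,) 每条描述的属性标量
        s = s.unsqueeze(-1)
        return (1 - s) * self.lo + s * self.hi   # (B, d) 条件向量, 随 s 平滑变化

cond = PropertyCondition(512)
v = cond(torch.tensor([0.1, 0.9]))               # 短描述 vs. 长描述的条件向量
```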

[CV-69] Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound MICCAI2025

【速读】:该论文旨在解决产前超声中胎儿腹部畸形的准确诊断问题,特别是针对现有研究多集中于图像级分类且依赖标准切面定位,而缺乏对病例级诊断的关注。其解决方案的关键在于提出一种基于病例级多实例学习(Multiple Instance Learning, MIL)的方法,无需依赖标准切面定位,通过混合注意力专家模块(Mixture-of-Attention-Experts, MoAE)、医学知识驱动的特征选择模块(Medical-Knowledge-Driven Feature Selection, MFS)以及基于提示的原型学习(Prompt-Based Prototype Learning, PPL)实现对胎儿腹部异常的高效分类。

链接: https://arxiv.org/abs/2507.01401
作者: Huanwen Liang,Jingxian Xu,Yuanji Zhang,Yuhao Huang,Yuhan Zhang,Xin Yang,Ran Li,Xuedong Deng,Yanjun Liu,Guowei Tao,Yun Wu,Sheng Zhao,Xinru Gao,Dong Ni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by MICCAI 2025

点击查看摘要

Abstract:Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Secondly, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case-level. Finally, we propose a prompt-based prototype learning (PPL) to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms the state-of-the-art competitors. Code is available at: this https URL.
zh

[CV-70] Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps IROS2025

【速读】:该论文试图解决自动驾驶汽车依赖高精度(High-Definition, HD)地图的问题,旨在通过车载传感器直接预测HD地图元素并推理其与交通元素之间的关系,从而实现在线一致的HD地图构建。解决方案的关键在于提出一种协同方法,利用常见标准清晰度(Standard-Definition, SD)地图的先验信息,联合预测车道线段及其拓扑结构以及道路边界,并采用混合车道线段编码和去噪技术以提高训练稳定性和性能,同时通过历史帧确保时间一致性。

链接: https://arxiv.org/abs/2507.01397
作者: Khanh Son Pham,Christian Witte,Jens Behley,Johannes Betz,Cyrill Stachniss
机构: CARIAD SE (CARIAD SE); Technical University Munich (慕尼黑工业大学); Center for Robotics, University of Bonn (机器人中心,波恩大学); Munich Institute of Robotics and Machine Intelligence (慕尼黑机器人与机器智能研究所); Lamarr Institute for Machine Learning and Artificial Intelligence (拉玛尔机器学习与人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at IROS 2025

点击查看摘要

Abstract:Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture, which leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.
zh

[CV-71] FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

【速读】:该论文旨在解决Talking head generation(说话头像生成)中的身份信息泄露(identity leakage, IL)和渲染伪影(rendering artifacts, RA)问题。其解决方案的关键在于提出了一种新的框架FixTalk,其中包含两个核心组件:Enhanced Motion Indicator (EMI) 和 Enhanced Detail Indicator (EDI)。EMI用于有效分离运动特征中的身份信息,从而缓解IL的影响;而EDI则利用泄露的身份信息来补充缺失细节,进而修复RA。通过这两个组件的协同作用,FixTalk能够同时解决IL和RA问题,实现更高质量的说话头像生成。

链接: https://arxiv.org/abs/2507.01390
作者: Shuai Tan,Bill Gong,Bin Ji,Ye Pan
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an Enhanced Motion Indicator (EMI) to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an Enhanced Detail Indicator (EDI), which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.
zh

[CV-72] MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing ICCV2025

【速读】:该论文旨在解决弱监督音频-视觉视频解析(AVVP)中同时提升段级预测和事件级预测的难题,现有方法由于弱监督限制和模型架构缺陷,难以兼顾两者。其解决方案的关键在于提出一种基于伪标签增强的音频-视觉Mamba网络(MUG),通过利用单模态伪标签进行跨模态随机组合生成新数据,以增强模型对多种段级事件组合的解析能力,并采用音频-视觉Mamba网络进行特征处理与交互,从而提升对不同段的感知能力并抑制其他模态的噪声干扰。

链接: https://arxiv.org/abs/2507.01384
作者: Langyu Wang,Bingke Zhu,Yingying Chen,Yiyuan Zhang,Ming Tang,Jinqiao Wang
机构: Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accpted by ICCV 2025

点击查看摘要

Abstract:The weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and the deficiencies of the model architecture, existing methods are lacking in simultaneously improving both the segment-level prediction and the event-level prediction. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of visual segment-level and audio segment-level metrics). Our code is available at this https URL.
zh

[CV-73] Active Measurement: Efficient Estimation at Scale

【速读】:该论文试图解决科学测量中因当前工作流缺乏准确性和统计保证而限制AI应用的问题。其解决方案的关键在于引入主动测量(active measurement),这是一种人机协同的AI框架,通过AI模型预测个体单元的测量值,并利用重要性采样选择样本进行人工标注,随后根据新的标注数据迭代优化AI模型,并不断修正无偏的蒙特卡洛估计。该方法能够在AI模型不完美时仍提供精确估计,并在AI模型高度准确时显著减少人工工作量。

链接: https://arxiv.org/abs/2507.01372
作者: Max Hamilton,Jinlin Lai,Wenlong Zhao,Subhransu Maji,Daniel Sheldon
机构: University of Massachusetts, Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.
zh
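
主动测量的无偏估计可以概括为"AI 预测总和 + 重要性采样的残差修正"。下面的 NumPy 草图演示这一估计量(演示中用已知残差构造采样概率 q,实际系统中 q 应来自模型的不确定度估计;估计量形式为标准的差值+重要性采样估计,非论文官方实现):

```python
import numpy as np

def active_estimate(preds, sample_idx, labels, q, n_draws):
    """无偏估计: 全部 AI 预测之和 + 重要性加权的残差修正。
    E[修正项] = sum_i (y_i - m_i), 故期望恰为真值总和(对任意不完美的 preds 成立)。"""
    correction = np.sum((labels - preds[sample_idx]) / q[sample_idx]) / n_draws
    return preds.sum() + correction

rng = np.random.default_rng(0)
y = rng.poisson(5.0, size=1000).astype(float)        # 待测量的真值(示例数据)
m = y + rng.normal(0.0, 1.0, size=1000)              # 不完美的 AI 预测
q = np.abs(y - m) + 1e-3                             # 演示: 采样概率正比于残差
q /= q.sum()                                         # 实际中应由模型不确定度给出
S = rng.choice(1000, size=50, p=q)                   # 有放回抽样, 交由人工标注
print(active_estimate(m, S, y[S], q, n_draws=50), y.sum())
```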

[CV-74] Activation Reward Models for Few-Shot Model Alignment

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和多模态模型(Large Multimodal Models, LMMs)与人类偏好对齐的问题,以提升模型生成输出的质量,使其更适用于实际应用场景。传统奖励建模方法在适应新偏好时存在局限性,因其需要单独训练的奖励模型以及大量偏好数据。论文提出的解决方案是引入激活奖励模型(Activation Reward Models, Activation RMs),其关键在于利用激活控制技术,在极少监督的情况下构建对齐良好的奖励信号,且无需额外模型微调。这一方法在标准奖励建模基准测试中优于现有少样本奖励建模方法,并在缓解奖励操纵行为方面表现出色。

链接: https://arxiv.org/abs/2507.01368
作者: Tianning Chai,Chancharik Mitra,Brandon Huang,Gautam Rajendrakumar Gare,Zhiqiu Lin,Assaf Arbelle,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Deva Ramanan,Roei Herzig
机构: University of California, Berkeley (加州大学伯克利分校); Carnegie Mellon University (卡内基梅隆大学); IBM Research (IBM研究院); MIT-IBM Watson AI Lab (MIT-IBM沃森人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models’ generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) – a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel few-shot setting benchmark, the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RM achieves state-of-the-art performance on this benchmark, surpassing even GPT-4o.
zh

[CV-75] 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation ICCV2025

【速读】:该论文试图解决物理对抗攻击在复杂现实环境中缺乏跨视角鲁棒性和对抗有效性的问题。现有方法依赖于目标物体的网格先验和模拟器构建的虚拟环境,导致获取成本高且与真实世界存在差异,同时由于训练图像背景的限制,难以生成多视角鲁棒的对抗伪装。论文提出的解决方案关键在于基于3D Gaussian Splatting(3DGS)的物理攻击框架PGA,该框架能够通过少量图像实现快速精确的重建,并具备逼真的渲染能力,同时通过防止高斯分布之间的相互遮挡和自遮挡,并采用最小最大优化方法调整每个视角的成像背景,从而提升跨视角鲁棒性和对抗效果。

链接: https://arxiv.org/abs/2507.01367
作者: Tianrui Lou,Xiaojun Jia,Siyuan Liang,Jiawei Liang,Ming Zhang,Yanjun Xiao,Xiaochun Cao
机构: Sun Yat-Sen University (中山大学); Peng Cheng Laboratory (鹏城实验室); Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); National Key Laboratory of Science and Technology on Information System Security (信息系统安全国家重点实验室); Nsfocus (绿盟科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. Due to these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at: this https URL.
zh

[CV-76] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model

【速读】:该论文旨在解决视觉-语言大模型(LVLMs)中混合专家(MoE)框架在处理视觉和语言模态时存在的分布差异问题,以及由此导致的专家激活不均衡问题。现有方法主要依赖负载平衡机制,忽视了视觉和语言在数据分布上的本质差异。其解决方案的关键在于提出一种长尾分布感知的路由机制(LTDR),该机制通过两个方面进行改进:一是为不同模态设计分布感知的路由策略,即语言模态采用均匀分布的路由,而视觉模态则针对长尾分布进行优化;二是通过增加对视觉尾部标记的专家激活数量,提升其处理能力。

链接: https://arxiv.org/abs/2507.01351
作者: Chaoxiang Cai,Longrong Yang,Kaibing Chen,Fan Yang,Xi Li
机构: Zhejiang University (浙江大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The mixture-of-experts (MoE), which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on the load balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware router for modality-specific routing. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy by increasing the number of activated experts for these tokens. Experiments on extensive benchmarks validate the effectiveness of our approach.
zh
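
其中"为视觉尾部 token 激活更多专家"的路由策略可示意如下(PyTorch 草图;k、k_tail 及尾部 token 的判定方式均为演示假设,真实实现通常为向量化而非逐 token 循环):

```python
import torch

def ltdr_route(logits, is_vision, tail_mask, k=2, k_tail=4):
    """logits: (T, E) 路由打分; is_vision/tail_mask: (T,) 布尔标记。
    视觉尾部 token 激活更多专家(类似过采样), 其余使用常规 top-k。"""
    T, E = logits.shape
    ks = torch.full((T,), k)
    ks[is_vision & tail_mask] = k_tail
    weights = torch.zeros_like(logits)
    for t in range(T):                      # 演示用逐 token 处理
        topv, topi = logits[t].topk(int(ks[t]))
        weights[t, topi] = torch.softmax(topv, dim=-1)
    return weights                          # (T, E) 稀疏专家组合权重

w = ltdr_route(torch.randn(6, 8),
               torch.tensor([1, 1, 1, 0, 0, 0], dtype=torch.bool),
               torch.tensor([1, 0, 0, 0, 0, 0], dtype=torch.bool))
```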

[CV-77] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

【速读】:该论文旨在解决模型在测试阶段性能提升的问题,特别是如何在不依赖特定任务或数据分布的情况下,提高模型的泛化能力和鲁棒性。其解决方案的关键在于提出一种通用的测试时增强方法——广义测试时增强(Generalized Test-Time Augmentation, GTTA),该方法通过多次随机扰动测试输入的主成分分析(PCA)子空间投影,形成具有鲁棒性的集成模型,在测试阶段有效过滤结构化和系统性噪声,降低估计误差。此外,GTTA还引入了一个最终的自监督学习阶段,利用集成输出作为无监督教师,对初始学生模型进行训练,从而显著降低测试阶段的计算成本而无需牺牲准确性。

链接: https://arxiv.org/abs/2507.01347
作者: Andrei Jelea,Ahmed Nabil Belbachir,Marius Leordeanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which unlike other existing Test-Time Augmentation approaches from the literature is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation, which randomly perturbs multiple times the PCA subspace projection of a test input, GTTA forms robust ensembles at test time in which, due to sound statistical properties, the structural and systematic noise in the initial input data is filtered out and final estimator errors are reduced. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus reducing significantly the test-time computational cost, at no loss in accuracy. Our tests and comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
zh
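
GTTA 的核心步骤(对测试输入的 PCA 子空间投影做多次随机扰动并集成)可以写成如下 NumPy 草图(扰动形式与超参数均为演示假设,自监督蒸馏阶段从略):

```python
import numpy as np

def gtta_predict(model, X_test, X_ref, n_aug=16, sigma=0.1, d=32, seed=0):
    """对测试输入的 PCA 子空间分量做多次随机扰动, 形成测试时集成。
    X_ref 用于估计 PCA 基(如训练集的一个子集); model 为任意 predict 函数。"""
    rng = np.random.default_rng(seed)
    mu = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mu, full_matrices=False)
    V = Vt[:d].T                                     # 前 d 个主方向 (假设 d 可用)
    Z = (X_test - mu) @ V                            # 子空间坐标
    preds = []
    for _ in range(n_aug):
        Zp = Z * (1 + sigma * rng.standard_normal(Z.shape))  # 随机扰动投影
        X_aug = X_test + (Zp - Z) @ V.T              # 仅扰动子空间分量, 保留残差
        preds.append(model(X_aug))
    return np.mean(preds, axis=0)                    # 集成平均(分类可改为投票)

# 用法示意: y_hat = gtta_predict(lambda X: X.sum(axis=1), X_test, X_train)
```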

[CV-78] Learning Camera-Agnostic White-Balance Preferences

【速读】:该论文旨在解决跨摄像头的自动白平衡(AWB)中色彩风格一致性的问题,即在不同摄像头传感器上实现一致且具有审美偏好的色彩渲染。传统AWB系统往往追求中性色彩校正而非美学偏好,而现有学习方法在不同传感器间泛化能力不足。该论文的关键解决方案是学习一个后光照估计映射,在与相机无关的空间中将中性光照校正转换为具有审美偏好的校正,从而在任意中性AWB模块之后应用,实现跨摄像头的一致性与风格化色彩表现。

链接: https://arxiv.org/abs/2507.01342
作者: Luxi Zhao,Mahmoud Afifi,Michael S. Brown
机构: Samsung Electronics(三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The image signal processor (ISP) pipeline in modern cameras consists of several modules that transform raw sensor data into visually pleasing images in a display color space. Among these, the auto white balance (AWB) module is essential for compensating for scene illumination. However, commercial AWB systems often strive to compute aesthetic white-balance preferences rather than accurate neutral color correction. While learning-based methods have improved AWB accuracy, they typically struggle to generalize across different camera sensors – an issue for smartphones with multiple cameras. Recent work has explored cross-camera AWB, but most methods remain focused on achieving neutral white balance. In contrast, this paper is the first to address aesthetic consistency by learning a post-illuminant-estimation mapping that transforms neutral illuminant corrections into aesthetically preferred corrections in a camera-agnostic space. Once trained, our mapping can be applied after any neutral AWB module to enable consistent and stylized color rendering across unseen cameras. Our proposed model is lightweight – containing only ~500 parameters – and runs in just 0.024 milliseconds on a typical flagship mobile CPU. Evaluated on a dataset of 771 smartphone images from three different cameras, our method achieves state-of-the-art performance while remaining fully compatible with existing cross-camera AWB techniques, introducing minimal computational and memory overhead.
zh

[CV-79] Physics-informed Ground Reaction Dynamics from Human Motion Capture

【速读】:该论文旨在解决传统方法依赖专用设备(如力板)获取人体动力学信息所带来的局限性,从而限制了人类动力学学习的广泛应用。其解决方案的关键在于利用运动捕捉数据,结合物理定律和计算仿真,直接估计地面反作用力。具体而言,通过欧拉积分方案和PD算法实现高精度、鲁棒的地面反作用力计算,并将物理约束引入学习模型,以提升动力学估计的准确性。

链接: https://arxiv.org/abs/2507.01340
作者: Cuong Le,Huy-Phuong Le,Duc Le,Minh-Thien Duong,Van-Binh Nguyen,My-Ha Le
机构: HCMUTE (胡志明市技术师范大学); Linköping University (林雪平大学); Thu Dau Mot University (土龙木大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, 4 tables, HSI 2025

点击查看摘要

Abstract:Body dynamics are crucial information for the analysis of human motions in important research fields, ranging from biomechanics and sports science to computer vision and graphics. Modern approaches collect the body dynamics, external reactive force specifically, via force plates, synchronized with human motion capture data, and learn to estimate the dynamics from a black-box deep learning model. Being specialized devices, force plates can only be installed in laboratory setups, imposing a significant limitation on the learning of human dynamics. To this end, we propose a novel method for estimating human ground reaction dynamics directly from the more reliable motion capture data, with physics laws and computational simulation as constraints. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler's integration scheme and a PD algorithm. The physics-based reactive forces are used to inform the learning model about the physics-informed motion dynamics, thus improving the estimation accuracy. The proposed approach was tested on the GroundLink dataset, outperforming the baseline model on: 1) the ground reaction force estimation accuracy compared to the force plate measurements; and 2) our simulated root trajectory precision. The implementation code is available at this https URL
zh
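
从质心轨迹直接估计地面反作用力的物理原理是牛顿第二定律 F_grf = m(a - g)。下面给出一个极简 NumPy 示意(用中心差分近似加速度,并假设地面是唯一外部接触力;论文中的欧拉积分方案与 PD 校正从略):

```python
import numpy as np

def ground_reaction_force(com, mass, fps, g=np.array([0.0, 0.0, -9.81])):
    """com: (T, 3) 动捕质心轨迹; 按 F_grf = m (a - g) 逐帧估计地面反作用力。"""
    dt = 1.0 / fps
    vel = np.gradient(com, dt, axis=0)    # 中心差分求速度
    acc = np.gradient(vel, dt, axis=0)    # 再求加速度
    return mass * (acc - g)               # (T, 3) 各帧地面反作用力

grf = ground_reaction_force(np.cumsum(np.random.randn(300, 3), axis=0) * 0.01,
                            mass=70.0, fps=100)
```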

[CV-80] LANet: A Lane Boundaries-Aware Approach For Robust Trajectory Prediction

【速读】:该论文旨在解决现有运动预测模型依赖车道中心线表示所导致的对道路环境和交通规则捕捉不足的问题,从而影响自动驾驶的安全性和效率。其解决方案的关键在于引入多向量地图元素(如车道边界和路缘)以增强驾驶环境的表征,并通过有效的特征融合策略整合不同地图组件的信息,使模型能够学习到更全面的道路结构及其与交通参与者之间的交互关系。同时,为应对信息量增加带来的计算负担,提出了一个有效的连接剪枝机制,以保留对目标参与者最关键的地图连接,从而在保证轨迹预测准确性的同时提升计算效率。

链接: https://arxiv.org/abs/2507.01308
作者: Muhammad Atta ur Rahman,Dooseop Choi,KyoungWook Min
机构: ETRI(电子通信研究院); University of Science and Technology (科学技术大学); South Korea (韩国)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 17th IEEE International Conference on Advanced Computational Intelligence (ICACI 2025)

点击查看摘要

Abstract:Accurate motion forecasting is critical for safe and efficient autonomous driving, enabling vehicles to predict future trajectories and make informed decisions in complex traffic scenarios. Most of the current designs of motion prediction models are based on the major representation of lane centerlines, which limits their capability to capture critical road environments and traffic rules and constraints. In this work, we propose an enhanced motion forecasting model informed by multiple vector map elements, including lane boundaries and road edges, that facilitates a richer and more complete representation of driving environments. An effective feature fusion strategy is developed to merge information in different vector map components, where the model learns holistic information on road structures and their interactions with agents. Since encoding more information about the road environment increases memory usage and is computationally expensive, we developed an effective pruning mechanism that filters the most relevant map connections to the target agent, ensuring computational efficiency while maintaining essential spatial and semantic relationships for accurate trajectory prediction. Overcoming the limitations of lane centerline-based models, our method provides a more informative and efficient representation of the driving environment and advances the state of the art for autonomous vehicle motion forecasting. We verify our approach with extensive experiments on the Argoverse 2 motion forecasting dataset, where our method maintains competitiveness on AV2 while achieving improved performance. Index Terms: autonomous driving, trajectory prediction, vector map elements, road topology, connection pruning, Argoverse 2.
zh

[CV-81] DiffusionLight-Turbo: Accelerated Light Probes for Free via Single-Pass Chrome Ball Inpainting

【速读】:该论文旨在解决从单张低动态范围(LDR)图像中估计光照的问题,传统方法因依赖有限的高动态范围(HDR)全景数据集而存在泛化能力不足的问题。其解决方案的关键在于将任务重新定义为一个Chrome球补全问题,并利用预训练的Stable Diffusion XL模型进行处理。通过迭代补全生成多个Chrome球结果并取中值作为稳定、低频的光照先验,从而引导生成高质量的最终结果,同时引入Exposure LoRA和Turbo LoRA技术提升效率与质量。

链接: https://arxiv.org/abs/2507.01305
作者: Worameth Chinchuthakun,Pakkapon Phongthawee,Amit Raj,Varun Jampani,Pramook Khungurn,Supasorn Suwajanakorn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2312.09168

点击查看摘要

Abstract:We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight, which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results show that our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. Our code is available at this https URL
zh
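
其中"多次补全取中值得到稳定光照先验"的步骤非常直接,可示意如下(inpaint_fn 的接口为演示假设):

```python
import numpy as np

def median_chrome_ball(inpaint_fn, image, seeds=range(8)):
    """多次以不同随机种子补全铬球, 逐像素取中值, 过滤对初始噪声敏感的内容,
    得到稳定的低频光照先验。inpaint_fn(image, seed) -> (H, W, 3) 为假设接口。"""
    balls = np.stack([inpaint_fn(image, s) for s in seeds])
    return np.median(balls, axis=0)
```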

[CV-82] Learning an Ensemble Token from Task-driven Priors in Facial Analysis

【速读】:该论文旨在解决单任务学习过程中统一特征表示不足的问题,即在训练过程中缺乏有效保留任务特定特征一致性的方法。其解决方案的关键在于提出ET-Fuser方法,通过利用基于任务先验的注意力机制,从预训练模型中提取的特征进行鲁棒的先验统一学习,生成一个集成的token,该token在自注意力机制中共享预训练编码器之间的互信息,从而实现高效且具有统计显著性提升的特征表示。

链接: https://arxiv.org/abs/2507.01290
作者: Sunyong Seo,Semin Kim,Jongha Lee
机构: lululab( lululab)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although the generalization of conventional methodologies has advanced visual interpretability, there remains a paucity of research that preserves a unified feature representation for single-task learning during the training process. In this work, we introduce ET-Fuser, a novel methodology for learning an ensemble token by leveraging attention mechanisms based on task priors derived from pre-trained models for facial analysis. Specifically, we propose a robust prior unification learning method that generates an ensemble token within a self-attention mechanism, which shares mutual information across the pre-trained encoders. This ensemble token approach offers high efficiency with negligible computational cost. Our results show improvements across a variety of facial analysis tasks, with statistically significant enhancements observed in the feature representations.
zh

[CV-83] VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process ITSC

【速读】:该论文旨在解决自动驾驶系统在感知、预测和规划能力上的不足,通过整合生成式AI(Generative AI)增强其性能。解决方案的关键在于提出VLAD模型,该模型将微调后的视觉语言模型(VLM)与端到端的车辆控制架构VAD相结合,并采用定制的问答数据集进行专门的微调,以提升模型的空间推理能力,从而生成高层导航指令并提供可解释的自然语言决策说明,提高了系统的透明度和可靠性。

链接: https://arxiv.org/abs/2507.01284
作者: Cristian Gariboldi,Hayato Tokida,Ken Kinjo,Yuki Asada,Alexander Carballo
机构: Gifu University (岐阜大学); DENSO CORPORATION (电装公司)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC)

点击查看摘要

Abstract:Recent advancements in open-source Visual Language Models (VLMs) such as LLaVA, Qwen-VL, and Llama have catalyzed extensive research on their integration with diverse systems. The internet-scale general knowledge encapsulated within these models presents significant opportunities for enhancing autonomous driving perception, prediction, and planning capabilities. In this paper we propose VLAD, a vision-language autonomous driving model, which integrates a fine-tuned VLM with VAD, a state-of-the-art end-to-end system. We implement a specialized fine-tuning approach using custom question-answer datasets designed specifically to improve the spatial reasoning capabilities of the model. The enhanced VLM generates high-level navigational commands that VAD subsequently processes to guide vehicle operation. Additionally, our system produces interpretable natural language explanations of driving decisions, thereby increasing transparency and trustworthiness of the traditionally black-box end-to-end architecture. Comprehensive evaluation on the real-world nuScenes dataset demonstrates that our integrated system reduces average collision rates by 31.82% compared to baseline methodologies, establishing a new benchmark for VLM-augmented autonomous driving systems.
zh

[CV-84] Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing ICCV2025

【速读】:该论文旨在解决无配对图像去雾任务中,现有基于对比学习的方法引入与雾霾无关的内容信息以及忽略频率域中雾霾特性的缺陷。其解决方案的关键在于提出一种基于频率域的扩散模型,通过从频率域重建的角度进行去雾,并利用扩散模型生成与清晰图像幅度谱分布一致的输出。核心创新包括:设计一个幅度残差编码器(Amplitude Residual Encoder, ARE)以提取幅度残差,有效补偿雾霾到清晰域的幅度差异并为扩散模型提供监督;引入相位校正模块(Phase Correction Module, PCM),通过简单的注意力机制进一步优化相位谱,消除伪影。

链接: https://arxiv.org/abs/2507.01275
作者: Chengxu Liu,Lu Qi,Jinshan Pan,Xueming Qian,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and perform the DMs to yield the amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains, as well as provides supervision for the DMs training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our method outperforms other state-of-the-art methods on both synthetic and real-world datasets.
zh
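
幅度谱/相位谱的分解与重建是该方法的基础操作,可用 torch.fft 写成如下草图(ARE 与 PCM 本身从略):

```python
import torch

def amp_phase(x):
    """x: (B, C, H, W); 返回幅度谱与相位谱。雾相关退化主要体现在幅度谱上。"""
    f = torch.fft.fft2(x, norm='ortho')
    return f.abs(), f.angle()

def recompose(amp, pha):
    """用(扩散模型修复后的)幅度谱与(相位校正后的)相位谱重建图像。"""
    return torch.fft.ifft2(torch.polar(amp, pha), norm='ortho').real

x = torch.randn(2, 3, 64, 64)
a, p = amp_phase(x)
assert torch.allclose(recompose(a, p), x, atol=1e-5)   # 往返重建一致性
```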

[CV-85] Advancements in Weed Mapping: A Systematic Review

【速读】:该论文试图解决当前 Weed Mapping 领域缺乏系统性综述的问题,特别是针对从数据采集到处理技术及制图工具的整个映射流程缺乏结构化分析。其解决方案的关键在于通过遵循 PRISMA 指南,系统地审视最新的数据采集(传感器与平台技术)、数据处理(包括标注与建模)以及制图技术(如时空分析和决策支持工具),并综合文献中的关键发现,以提供对 Weed Mapping 现状的全面理解,从而为未来研究和高效、可扩展且可持续的 Weed Management 系统的发展提供基础参考。

链接: https://arxiv.org/abs/2507.01269
作者: Mohammad Jahanbakht,Alex Olsen,Ross Marchant,Emilie Fillols,Mostafa Rahimi Azghadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Weed mapping plays a critical role in precision management by providing accurate and timely data on weed distribution, enabling targeted control and reduced herbicide use. This minimizes environmental impacts, supports sustainable land management, and improves outcomes across agricultural and natural environments. Recent advances in weed mapping leverage ground-vehicle Red Green Blue (RGB) cameras, satellite and drone-based remote sensing combined with sensors such as spectral, Near Infra-Red (NIR), and thermal cameras. The resulting data are processed using advanced techniques including big data analytics and machine learning, significantly improving the spatial and temporal resolution of weed maps and enabling site-specific management decisions. Despite a growing body of research in this domain, there is a lack of comprehensive literature reviews specifically focused on weed mapping. In particular, the absence of a structured analysis spanning the entire mapping pipeline, from data acquisition to processing techniques and mapping tools, limits progress in the field. This review addresses these gaps by systematically examining state-of-the-art methods in data acquisition (sensor and platform technologies), data processing (including annotation and modelling), and mapping techniques (such as spatiotemporal analysis and decision support tools). Following PRISMA guidelines, we critically evaluate and synthesize key findings from the literature to provide a holistic understanding of the weed mapping landscape. This review serves as a foundational reference to guide future research and support the development of efficient, scalable, and sustainable weed management systems.
zh

[CV-86] AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

【速读】:该论文旨在解决当前AI生成视频评估框架中存在可解释性不足和与人类评价对齐度低的问题。现有评估指标仅提供数值评分而缺乏解释性反馈,导致评估结果难以被理解和应用。其解决方案的关键在于提出AIGVE-MACS,这是一个统一的AI生成视频评估模型,能够同时提供数值评分和多维度的语言评论反馈。该模型依托于AIGVE-BENCH 2,一个包含2,500个AI生成视频及22,500条人工标注的详细评论和评分的大规模基准数据集,并结合了最新的视觉-语言模型、新颖的逐标记加权损失函数以及动态帧采样策略,以更好地匹配人类评估者。

链接: https://arxiv.org/abs/2507.01255
作者: Xiao Liu,Jiawei Zhang
机构: IFM Lab, University of California, Davis (IFM 实验室,加利福尼亚大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Working in Progress

点击查看摘要

Abstract:The rapid advancement of AI-generated video models has created a pressing need for robust and interpretable evaluation frameworks. Existing metrics are limited to producing numerical scores without explanatory comments, resulting in low interpretability and human evaluation alignment. To address those challenges, we introduce AIGVE-MACS, a unified model for AI-Generated Video Evaluation(AIGVE), which can provide not only numerical scores but also multi-aspect language comment feedback in evaluating these generated videos. Central to our approach is AIGVE-BENCH 2, a large-scale benchmark comprising 2,500 AI-generated videos and 22,500 human-annotated detailed comments and numerical scores across nine critical evaluation aspects. Leveraging AIGVE-BENCH 2, AIGVE-MACS incorporates recent Vision-Language Models with a novel token-wise weighted loss and a dynamic frame sampling strategy to better align with human evaluators. Comprehensive experiments across supervised and zero-shot benchmarks demonstrate that AIGVE-MACS achieves state-of-the-art performance in both scoring correlation and comment quality, significantly outperforming prior baselines including GPT-4o and VideoScore. In addition, we further showcase a multi-agent refinement framework where feedback from AIGVE-MACS drives iterative improvements in video generation, leading to 53.5% quality enhancement. This work establishes a new paradigm for comprehensive, human-aligned evaluation of AI-generated videos. We release the AIGVE-BENCH 2 and AIGVE-MACS at this https URL.
zh

[CV-87] Robust Brain Tumor Segmentation with Incomplete MRI Modalities Using Hölder Divergence and Mutual Information-Enhanced Knowledge Transfer

【速读】:该论文旨在解决多模态磁共振成像(Multimodal MRI)在脑肿瘤分割任务中因某些模态缺失而导致传统方法性能下降的问题。其解决方案的关键在于提出了一种鲁棒的单模态并行处理框架,该框架通过引入Hölder散度和互信息来保持模态特异性特征,并根据可用输入动态调整网络参数,从而在模态不完整的情况下仍能实现高精度的分割。

链接: https://arxiv.org/abs/2507.01254
作者: Runze Cheng,Xihang Qiu,Ming Li,Ye Zhang,Chun Li,Fei Yu
机构: Shenzhen MSU-BIT University (深圳美中理工大学); Beijing Institute of Technology (北京理工大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal MRI provides critical complementary information for accurate brain tumor segmentation. However, conventional methods struggle when certain modalities are missing due to issues such as image quality, protocol inconsistencies, patient allergies, or financial constraints. To address this, we propose a robust single-modality parallel processing framework that achieves high segmentation accuracy even with incomplete modalities. Leveraging Hölder divergence and mutual information, our model maintains modality-specific features while dynamically adjusting network parameters based on the available inputs. By using these divergence- and information-based loss functions, the framework effectively quantifies discrepancies between predictions and ground-truth labels, resulting in consistently accurate segmentation. Extensive evaluations on the BraTS 2018 and BraTS 2020 datasets demonstrate superior performance over existing methods in handling missing modalities.
zh

[CV-88] Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

【速读】:该论文试图解决独立训练的视觉与语言模型在各自模态空间中存在差异,但可能共同逼近一个共享的现实统计模型的问题,即“柏拉图表征假设”所提出的兼容性问题。其核心挑战在于如何超越事后统计检测对齐性,而是显式地优化不同模态间的对齐。解决方案的关键在于将这一对齐问题建模为多目标优化任务,即在保持各模态原始结构的同时实现相互一致性。为此,作者提出了联合自编码调制器(Joint Autoencoder Modulator, JAM)框架,通过在预训练单模态模型的潜在表示上联合训练模态特定的自编码器,利用重建和跨模态目标促进对齐,从而从离散输入中生成共享结构。

链接: https://arxiv.org/abs/2507.01201
作者: Hyoseo (Lauren) Yoon,Yisong Yue,Been Kim
机构: California Institute of Technology (加州理工学院); Google DeepMind (谷歌深度思维)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. Yet an emerging hypothesis - the Platonic Representation Hypothesis - suggests that such models may nonetheless converge toward a shared statistical model of reality. This compatibility, if it exists, raises a fundamental question: can we move beyond post-hoc statistical detection of alignment and explicitly optimize for it between such disjoint representations? We cast this Platonic alignment problem as a multi-objective optimization task - preserve each modality’s native structure while aligning for mutual coherence. We introduce the Joint Autoencoder Modulator (JAM) framework that jointly trains modality-specific autoencoders on the latent representations of pre-trained single modality models, encouraging alignment through both reconstruction and cross-modal objectives. By analogy, this framework serves as a method to escape Plato’s Cave, enabling the emergence of shared structure from disjoint inputs. We evaluate this framework across three critical design axes: (i) the alignment objective - comparing contrastive loss (Con), its hard-negative variant (NegCon), and our Spread loss, (ii) the layer depth at which alignment is most effective, and (iii) the impact of foundation model scale on representational convergence. Our findings show that our lightweight Pareto-efficient framework reliably induces alignment, even across frozen, independently trained representations, offering both theoretical insight and practical pathways for transforming generalist unimodal foundations into specialist multimodal models.
zh
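
为帮助理解上文 JAM 框架“在冻结的单模态嵌入上联合训练模态特定自编码器,并同时施加重建与跨模态对齐目标”的思路,下面给出一段概念性 PyTorch 示意(非论文官方实现:其中以标准 InfoNCE 对比项近似代替论文提出的 Spread 损失,网络结构、维度与超参数均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAE(nn.Module):
    """模态特定的轻量自编码器,作用在冻结的预训练嵌入之上"""
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def jam_style_loss(ae_v, ae_t, emb_v, emb_t, tau=0.07):
    z_v, rec_v = ae_v(emb_v)
    z_t, rec_t = ae_t(emb_t)
    # 重建项:保持各模态的原生结构
    recon = F.mse_loss(rec_v, emb_v) + F.mse_loss(rec_t, emb_t)
    # 跨模态对齐项:以 InfoNCE 近似代替论文的 Spread 损失(假设)
    zv, zt = F.normalize(z_v, dim=-1), F.normalize(z_t, dim=-1)
    logits = zv @ zt.t() / tau
    labels = torch.arange(zv.size(0))
    align = F.cross_entropy(logits, labels)
    return recon + align

ae_v, ae_t = ModalityAE(768, 256), ModalityAE(512, 256)
loss = jam_style_loss(ae_v, ae_t, torch.randn(32, 768), torch.randn(32, 512))
```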

[CV-89] Rapid Salient Object Detection with Difference Convolutional Neural Networks

【速读】:该论文旨在解决在资源受限设备上实现高效且实时的显著目标检测(SOD)问题。现有领先的SOD模型计算成本较高,难以满足边缘设备的部署需求。其解决方案的关键在于提出一种高效的网络设计,结合传统SOD方法中的对比线索计算与现代卷积神经网络(CNN)的表征能力,通过引入像素差分卷积(PDC)来编码特征对比,并采用差异卷积重参数化(DCR)策略将PDC嵌入标准卷积中,从而在推理阶段减少计算量和参数量。此外,针对视频SOD,还提出了时空差分卷积(STDC)以增强时空对比信息的捕捉能力。

链接: https://arxiv.org/abs/2507.01182
作者: Zhuo Su,Li Liu,Matthias Müller,Jiehua Zhang,Diana Wofk,Ming-Ming Cheng,Matti Pietikäinen
机构: Nankai University (南开大学); Intel Labs (英特尔实验室); University of Oulu (奥卢大学); NUDT (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, accepted in TPAMI

点击查看摘要

Abstract:This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than 2× and 3× in speed with superior accuracy. Code will be available at this https URL.
zh
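
上文 DCR 策略的要点是把像素差分卷积(PDC)折叠回标准卷积核,使推理期不增加任何计算与参数。下面以中心差分形式的 PDC 为例给出最小示意(仅演示重参数化的代数等价性,论文中完整的 DCR 与 STDC 细节以原文为准):

```python
import torch
import torch.nn.functional as F

def reparam_central_pdc(weight):
    """中心像素差分卷积 y = sum_i w_i * (x_i - x_c) 等价于
    用修改后的核 w' 做普通卷积:w' 仅在中心位置减去 sum_i w_i"""
    w = weight.clone()                        # (C_out, C_in, 3, 3)
    w[:, :, 1, 1] -= weight.sum(dim=(2, 3))
    return w

x = torch.randn(1, 8, 32, 32)
w = torch.randn(16, 8, 3, 3)
y_pdc = F.conv2d(x, reparam_central_pdc(w), padding=1)  # 推理期与标准卷积同开销
```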

[CV-90] cp_measure: API-first feature extraction for image-based profiling workflows ICML2025

【速读】:该论文试图解决传统生物图像分析工具在自动化和可重复性方面存在的障碍,特别是在生成高维图像特征集以支持机器学习工作流时的局限性。其解决方案的关键在于开发了一个名为cp_measure的Python库,该库将CellProfiler的核心测量功能提取为模块化、以API优先的设计,从而实现了与科学Python生态系统的无缝集成,并保持了与CellProfiler特征的高保真度。

链接: https://arxiv.org/abs/2507.01163
作者: Alán F. Muñoz(1),Tim Treis(2,1),Alexandr A. Kalinin(1),Shatavisha Dasgupta(1),Fabian Theis(2),Anne E. Carpenter(1),Shantanu Singh(1) ((1) Broad Institute of MIT and Harvard, United States,(2) Institute of Computational Biology, Helmholtz Zentrum München, Germany)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)
备注: 10 pages, 4 figures, 4 supplementary figures. CODEML Workshop paper accepted (non-archival), as a part of ICML2025 events

点击查看摘要

Abstract:Biological image analysis has traditionally focused on measuring specific visual properties of interest for cells or other entities. A complementary paradigm gaining increasing traction is image-based profiling - quantifying many distinct visual features to form comprehensive profiles which may reveal hidden patterns in cellular states, drug responses, and disease mechanisms. While current tools like CellProfiler can generate these feature sets, they pose significant barriers to automated and reproducible analyses, hindering machine learning workflows. Here we introduce cp_measure, a Python library that extracts CellProfiler’s core measurement capabilities into a modular, API-first tool designed for programmatic feature extraction. We demonstrate that cp_measure features retain high fidelity with CellProfiler features while enabling seamless integration with the scientific Python ecosystem. Through applications to 3D astrocyte imaging and spatial transcriptomics, we showcase how cp_measure enables reproducible, automated image-based profiling pipelines that scale effectively for machine learning applications in computational biology.
zh
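
cp_measure 的“API 优先”理念可以类比为下面这种纯编程式的画像流程;示例用 scikit-image 演示“对每个分割对象批量提取特征、拼成画像矩阵”的概念,并非 cp_measure 的真实 API(实际接口请以项目文档为准):

```python
import numpy as np
from skimage import measure

rng = np.random.default_rng(0)
mask = rng.random((256, 256)) > 0.98          # 伪造的分割掩码(假设)
labels = measure.label(mask)                  # 逐对象标号

# 对每个对象批量提取形态学特征,组成 (对象数 x 特征数) 的画像矩阵
props = measure.regionprops_table(
    labels, properties=("area", "eccentricity", "perimeter", "solidity"))
profile = np.column_stack(list(props.values()))
print(profile.shape)
```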

[CV-91] Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

【速读】:该论文旨在解决滑坡灾害在不同地理区域中的准确检测与预测问题,以减少其对基础设施、经济和人类生命造成的威胁。解决方案的关键在于整合多源卫星遥感数据(如Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度与数字高程模型)与深度学习模型,通过融合地理空间分析技术,提升滑坡识别的准确性,并评估多种先进的深度学习分割模型(如U-Net、DeepLabV3+和Res-Net)在滑坡检测中的有效性。

链接: https://arxiv.org/abs/2507.01123
作者: Rahul A. Burange,Harsh K. Shinde,Omkar Mutyalwar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 20 pages, 24 figures

点击查看摘要

Abstract:Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and ResNet, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.
zh

[CV-92] Geometry-aware 4D Video Generation for Robot Manipulation

【速读】:该论文旨在解决视频生成中时间连贯性和跨视角几何一致性难以同时保证的问题。其解决方案的关键在于提出一种4D视频生成模型,通过在训练过程中使用跨视角点云对齐进行几何监督,强制模型学习场景的多视角3D一致性表示,从而实现从给定RGB-D观测中预测新视角下的未来视频序列。

链接: https://arxiv.org/abs/2507.01099
作者: Zeyi Liu,Shuang Li,Eric Cousineau,Siyuan Feng,Benjamin Burchfiel,Shuran Song
机构: Stanford University (斯坦福大学); Toyota Research Institute (丰田研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project website: this https URL

点击查看摘要

Abstract:Understanding and predicting the dynamics of the physical world can enhance a robot’s ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints based solely on the given RGB-D observations, without requiring camera poses as inputs. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, supporting robust robot manipulation and generalization to novel camera viewpoints.
zh
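
上文的“跨视角点云(pointmap)对齐监督”可按如下最小示意理解:把视角 B 预测的点图经相对位姿变换到视角 A 坐标系后逐点对齐(损失形式与符号均为假设,原文实现以论文为准):

```python
import torch

def cross_view_pointmap_loss(pts_a, pts_b, T_b2a, valid):
    """pts_a/pts_b: (N, 3) 两个视角预测的 3D 点;
    T_b2a: (4, 4) 视角 B 到 A 的相对位姿; valid: (N,) 有效像素掩码"""
    R, t = T_b2a[:3, :3], T_b2a[:3, 3]
    pts_b_in_a = pts_b @ R.T + t               # 变换到视角 A 的坐标系
    err = (pts_a - pts_b_in_a).norm(dim=-1)    # 逐点欧氏误差
    return (err * valid).sum() / valid.sum().clamp(min=1)

loss = cross_view_pointmap_loss(
    torch.randn(100, 3), torch.randn(100, 3), torch.eye(4), torch.ones(100))
```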

[CV-93] Embedding-based Retrieval in Multimodal Content Moderation SIGIR2025

【速读】:该论文旨在解决短视频平台内容审核中传统分类方法在应对快速变化的趋势和紧急事件时响应速度慢、成本高的问题。其解决方案的关键在于引入基于嵌入(Embedding)的检索(Embedding-Based Retrieval, EBR)方法,通过监督对比学习(Supervised Contrastive Learning, SCL)框架训练出性能优于现有对比学习方法(如CLIP和MoCo)的基础嵌入模型,并构建集成嵌入生成与视频检索的系统,从而实现高效且有效的趋势处理。

链接: https://arxiv.org/abs/2507.01066
作者: Hanzhong Liang,Jinghao Shi,Xiang Shen,Zixuan Wang,Vera Wen,Ardalan Mehrani,Zhiqian Chen,Yifan Wu,Zhixin Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Camera ready for SIGIR 2025

点击查看摘要

Abstract:Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
zh
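
上文的基础嵌入模型采用监督对比学习(SCL)框架训练,其核心是 SupCon 式损失:同一标签的样本在批内互为正例。下面给出一个常见的简化实现示意(温度等超参数为假设值):

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, tau=0.1):
    """features: (N, d) 嵌入; labels: (N,) 类别标签"""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / tau
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (labels[None, :] == labels[:, None]) & ~eye   # 同类且非自身为正例
    sim = sim.masked_fill(eye, float("-inf"))           # 排除自身
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)           # 避免 -inf * 0 产生 NaN
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

loss = supcon_loss(torch.randn(64, 128), torch.randint(0, 5, (64,)))
```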

[CV-94] A computationally frugal open-source foundation model for thoracic disease detection in lung cancer screening programs

【速读】:该论文旨在解决低剂量计算机断层扫描(LDCT)在肺癌筛查(LCS)中因放射科医生短缺而导致的影像解读能力不足问题。其解决方案的关键在于提出TANGERINE,一个计算资源消耗低、开源的三维LDCT分析视觉基础模型。TANGERINE通过自监督学习在大量胸腔LDCT数据上预训练,具备良好的可迁移性,能够在有限的计算资源和训练数据下快速微调,并表现出优异的标签效率和泛化能力,为LCS项目提供了可扩展、易部署的医疗影像分析工具。

链接: https://arxiv.org/abs/2507.01881
作者: Niccolò McConnell,Pardeep Vasudev,Daisuke Yamada,Daryl Cheng,Mehran Azimbagirad,John McCabe,Shahab Aslani,Ahmed H. Shahin,Yukun Zhou, TheSUMMIT Consortium,Andre Altmann,Yipeng Hu,Paul Taylor,Sam M. Janes,Daniel C. Alexander,Joseph Jacob
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) imaging employed in lung cancer screening (LCS) programs is increasing in uptake worldwide. LCS programs herald a generational opportunity to simultaneously detect cancer and non-cancer-related early-stage lung disease. Yet these efforts are hampered by a shortage of radiologists to interpret scans at scale. Here, we present TANGERINE, a computationally frugal, open-source vision foundation model for volumetric LDCT analysis. Designed for broad accessibility and rapid adaptation, TANGERINE can be fine-tuned off the shelf for a wide range of disease-specific tasks with limited computational resources and training data. Relative to models trained from scratch, TANGERINE demonstrates fast convergence during fine-tuning, thereby requiring significantly fewer GPU hours, and displays strong label efficiency, achieving comparable or superior performance with a fraction of fine-tuning data. Pretrained using self-supervised learning on over 98,000 thoracic LDCTs, including the UK’s largest LCS initiative to date and 27 public datasets, TANGERINE achieves state-of-the-art performance across 14 disease classification tasks, including lung cancer and multiple respiratory diseases, while generalising robustly across diverse clinical centres. By extending a masked autoencoder framework to 3D imaging, TANGERINE offers a scalable solution for LDCT analysis, departing from recent closed, resource-intensive models by combining architectural simplicity, public availability, and modest computational requirements. Its accessible, open-source lightweight design lays the foundation for rapid integration into next-generation medical imaging tools that could transform LCS initiatives, allowing them to pivot from a singular focus on lung cancer detection to comprehensive respiratory disease management in high-risk populations.
zh
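
TANGERINE 将掩码自编码器(MAE)框架扩展到 3D 影像。其预训练的关键步骤——把体积切成 3D patch、随机遮挡大部分、仅将可见 patch 送入编码器——可用下面的示意理解(patch 大小与遮挡率均为假设值):

```python
import torch

def random_mask_3d(vol, p=16, mask_ratio=0.75):
    """vol: (B, C, D, H, W),假设 D/H/W 能被 p 整除"""
    B, C, D, H, W = vol.shape
    tokens = (vol.reshape(B, C, D // p, p, H // p, p, W // p, p)
                 .permute(0, 2, 4, 6, 1, 3, 5, 7)
                 .reshape(B, -1, C * p ** 3))          # (B, N, token_dim)
    N = tokens.size(1)
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]    # 每个样本随机保留的 patch
    visible = torch.gather(
        tokens, 1, idx[..., None].expand(-1, -1, tokens.size(-1)))
    return visible, idx                                # 仅 visible 送入编码器

vis, idx = random_mask_3d(torch.randn(2, 1, 64, 64, 64))
```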

[CV-95] Autoadaptive Medical Segment Anything Model

【速读】:该论文旨在解决医学图像分割中对大量标注数据依赖的问题,传统全监督分割模型需要大量手动标注的数据,这一过程成本高、耗时且容易出错。论文提出的解决方案是ADA-SAM(自动、领域特定和自适应的分割一切模型),其关键在于引入了一个多任务学习框架,利用辅助分类器的类别激活图来引导半监督分割分支的预测,并通过一种新颖的梯度反馈机制,在分割和分类分支之间建立可学习的连接,从而提升模型性能。

链接: https://arxiv.org/abs/2507.01828
作者: Tyler Ward,Meredith K. Owen,O’Kira Coleman,Brian Noehren,Abdullah-Al-Zubaer Imran
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: this https URL.
zh

[CV-96] Robust brain age estimation from structural MRI with contrastive learning

【速读】:该论文旨在解决从结构磁共振成像(sMRI)中估计脑年龄的问题,以作为表征正常和病理衰老的强大工具。其解决方案的关键在于引入一种新的对比损失函数 $\mathcal{L}^{exp}$,并通过大规模、多中心数据进行预训练,从而实现更鲁棒和可泛化的脑年龄估计模型。该方法在多个公开神经影像数据集上验证,表现出对站点相关混杂因素的鲁棒性,并能有效捕捉认知障碍和阿尔茨海默病患者的加速老化特征。

链接: https://arxiv.org/abs/2507.01794
作者: Carlo Alberto Barbano,Benoit Dufumier,Edouard Duchesnay,Marco Grangetto,Pietro Gori
机构: University of Turin (都灵大学); Commissariat à l’énergie atomique et aux énergies alternatives (法国原子能与替代能源委员会); Telecom Paris (巴黎电信)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Estimating brain age from structural MRI has emerged as a powerful tool for characterizing normative and pathological aging. In this work, we explore contrastive learning as a scalable and robust alternative to supervised approaches for brain age estimation. We introduce a novel contrastive loss function, $\mathcal{L}^{exp}$, and evaluate it across multiple public neuroimaging datasets comprising over 20,000 scans. Our experiments reveal four key findings. First, scaling pre-training on diverse, multi-site data consistently improves generalization performance, cutting external mean absolute error (MAE) nearly in half. Second, $\mathcal{L}^{exp}$ is robust to site-related confounds, maintaining low scanner-predictability as training size increases. Third, contrastive models reliably capture accelerated aging in patients with cognitive impairment and Alzheimer’s disease, as shown through brain age gap analysis, ROC curves, and longitudinal trends. Lastly, unlike supervised baselines, $\mathcal{L}^{exp}$ maintains a strong correlation between brain age accuracy and downstream diagnostic performance, supporting its potential as a foundation model for neuroimaging. These results position contrastive learning as a promising direction for building generalizable and clinically meaningful brain representations.
zh

[CV-97] Multi Source COVID-19 Detection via Kernel-Density-based Slice Sampling

【速读】:该论文旨在解决多源新冠肺炎(COVID-19)检测问题,具体是通过分类来自四个不同医疗中心的胸部CT扫描来实现。其解决方案的关键在于采用基于核密度的切片采样(Kernel-Density-based Slice Sampling, KDS)的空间-切片特征学习(Spatial-Slice Feature Learning, SSFL)框架,结合肺部区域提取、质量控制和自适应切片采样,以选择每张扫描的八个代表性切片,从而有效应对多源数据的差异性。

链接: https://arxiv.org/abs/2507.01564
作者: Chia-Ming Lee,Bo-Cheng Qiu,Ting-Yao Chen,Ming-Han Sun,Fang-Ying Lin,Jung-Tse Tsai,I-An Tsai,Yu-Fan Lin,Chih-Chung Hsu
机构: National Cheng Kung University (国立成功大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present our solution for the Multi-Source COVID-19 Detection Challenge, which classifies chest CT scans from four distinct medical centers. To address multi-source variability, we employ the Spatial-Slice Feature Learning (SSFL) framework with Kernel-Density-based Slice Sampling (KDS). Our preprocessing pipeline combines lung region extraction, quality control, and adaptive slice sampling to select eight representative slices per scan. We compare EfficientNet and Swin Transformer architectures on the validation set. The EfficientNet model achieves an F1-score of 94.68%, compared to the Swin Transformer’s 93.34%. The results demonstrate the effectiveness of our KDS-based pipeline on multi-source data and highlight the importance of dataset balance in multi-institutional medical imaging evaluation.
zh
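
上文“基于核密度的切片采样(KDS)”的一种直观理解是:先对每层切片的某个统计量(如肺区域面积)做核密度估计,再在累积分布的等间隔分位点处选出 8 层代表性切片。以下为该思路的假设性示意(实际使用的统计量与选层准则以论文为准):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kds_select_slices(lung_area, n_keep=8):
    """lung_area: 每层切片的肺区域面积; 返回被选中的切片索引"""
    z = np.arange(len(lung_area), dtype=float)
    kde = gaussian_kde(z, weights=lung_area / lung_area.sum())
    grid = np.linspace(0, len(lung_area) - 1, 512)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]
    targets = (np.arange(n_keep) + 0.5) / n_keep       # 等间隔分位点
    picks = {int(round(grid[np.searchsorted(cdf, t)])) for t in targets}
    return sorted(picks)

print(kds_select_slices(np.random.rand(120) + 0.1))
```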

[CV-98] Age Sensitive Hippocampal Functional Connectivity: New Insights from 3D CNNs and Saliency Mapping

【速读】:该论文旨在解决神经生物学衰老过程中海马灰质丢失与功能连接变化之间关系的理解不足问题,特别是如何通过功能连接分析揭示海马在衰老过程中的功能重组机制。其解决方案的关键在于开发一种可解释的深度学习框架,利用三维卷积神经网络(3D CNN)结合LayerCAM显著性映射技术,从海马功能连接(FC)数据中预测脑龄,并识别出对年龄高度敏感的关键海马-皮层连接区域,如楔前叶、楔叶、后扣带回、海马旁回、左侧上顶叶和右侧上颞沟等。该方法不仅实现了对海马功能连接的 voxel-wise 分析,还通过区分前部与后部海马功能连接的差异,揭示了其与已知功能特化的对应关系。

链接: https://arxiv.org/abs/2507.01411
作者: Yifei Sun,Marshall A. Dalton,Robert D. Sanders,Yixuan Yuan,Xiang Li,Sharon L. Naismith,Fernando Calamante,Jinglei Lv
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grey matter loss in the hippocampus is a hallmark of neurobiological aging, yet understanding the corresponding changes in its functional connectivity remains limited. Seed-based functional connectivity (FC) analysis enables voxel-wise mapping of the hippocampus’s synchronous activity with cortical regions, offering a window into functional reorganization during aging. In this study, we develop an interpretable deep learning framework to predict brain age from hippocampal FC using a three-dimensional convolutional neural network (3D CNN) combined with LayerCAM saliency mapping. This approach maps key hippocampal-cortical connections, particularly with the precuneus, cuneus, posterior cingulate cortex, parahippocampal cortex, left superior parietal lobule, and right superior temporal sulcus, that are highly sensitive to age. Critically, disaggregating anterior and posterior hippocampal FC reveals distinct mapping aligned with their known functional specializations. These findings provide new insights into the functional mechanisms of hippocampal aging and demonstrate the power of explainable deep learning to uncover biologically meaningful patterns in neuroimaging data.
zh

[CV-99] BronchoGAN: Anatomically consistent and domain-agnostic image-to-image translation for video bronchoscopy

【速读】:该论文旨在解决支气管镜图像数据稀缺导致深度学习模型训练受限的问题,其核心挑战在于跨不同领域(如虚拟支气管镜、幻象及在体/离体图像数据)的鲁棒图像翻译。解决方案的关键在于提出BronchoGAN,该方法通过引入解剖约束(如支气管开口的匹配)并结合基础模型生成的深度图像作为中间表示,从而提升图像到图像翻译的稳定性与真实性。该中间深度图像表示不仅增强了对多种输入域的适应性,还简化了配对图像数据的构建,显著提高了合成图像的解剖结构保真度。

链接: https://arxiv.org/abs/2507.01387
作者: Ahmad Soliman,Ron Keuth,Marian Himstedt
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The limited availability of bronchoscopy images makes image synthesis particularly interesting for training deep learning models. Robust image translation across different domains – virtual bronchoscopy, phantom as well as in-vivo and ex-vivo image data – is pivotal for clinical applications. This paper proposes BronchoGAN, which introduces anatomical constraints for image-to-image translation integrated into a conditional GAN. In particular, we force bronchial orifices to match across input and output images. We further propose to use foundation model-generated depth images as an intermediate representation, ensuring robustness across a variety of input domains and establishing models with substantially less reliance on individual training datasets. Moreover, our intermediate depth image representation allows us to easily construct paired image data for training. Our experiments showed that input images from different domains (e.g. virtual bronchoscopy, phantoms) can be successfully translated to images mimicking realistic human airway appearance. We demonstrated that anatomical settings (i.e. bronchial orifices) can be robustly preserved with our approach, which is shown qualitatively and quantitatively by means of improved FID, SSIM and Dice coefficient scores. Our anatomical constraints enabled an improvement in the Dice coefficient of up to 0.43 for synthetic images. Through foundation models for intermediate depth representations and bronchial orifice segmentation integrated as anatomical constraints into conditional GANs, we are able to robustly translate images from different bronchoscopy input domains. BronchoGAN allows incorporating public CT scan data (virtual bronchoscopy) to generate large-scale bronchoscopy image datasets with realistic appearance, bridging the gap of missing public bronchoscopy images.
zh

[CV-100] Structure and Smoothness Constrained Dual Networks for MR Bias Field Correction MICCAI

【速读】:该论文旨在解决磁共振成像(Magnetic Resonance Imaging, MRI)中由于设备限制导致的图像强度不均匀问题,该问题会阻碍医学分析的定性和定量评估。为了解决这一问题,论文提出了一种名为S2DNets的新型结构与平滑性约束双网络模型,其关键在于引入分段结构约束和偏置场的平滑性,以在训练过程中有效去除非均匀强度并保留更多的结构细节。

链接: https://arxiv.org/abs/2507.01326
作者: Dong Liang,Xingyu Qiu,Yuzhen Li,Wei Wang,Kuanquan Wang,Suyu Dong,Gongning Luo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, accepted by MICCAI

点击查看摘要

Abstract:MR imaging techniques are of great benefit to disease diagnosis. However, due to the limitation of MR devices, significant intensity inhomogeneity often exists in imaging results, which impedes both qualitative and quantitative medical analysis. Recently, several unsupervised deep learning-based models have been proposed for MR image improvement. However, these models merely concentrate on global appearance learning, and neglect constraints from image structures and smoothness of bias field, leading to distorted corrected results. In this paper, novel structure and smoothness constrained dual networks, named S2DNets, are proposed, aiming at self-supervised bias field correction. S2DNets introduce piece-wise structural constraints and smoothness of bias field for network training to effectively remove non-uniform intensity and retain much more structural details. Extensive experiments executed on both clinical and simulated MR datasets show that the proposed model outperforms other conventional and deep learning-based models. In addition to comparison on visual metrics, downstream MR image segmentation tasks are also used to evaluate the impact of the proposed model. The source code is available at: this https URL.
zh
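
S2DNets 的自监督训练可以围绕乘性成像模型与偏置场平滑性约束来理解:假设 image ≈ bias × corrected,并对偏置场施加一阶差分平滑正则。下面给出这两项约束的最小示意(分段结构约束与网络部分从略,权重为假设值):

```python
import torch

def bias_correction_losses(image, bias, corrected):
    """image/bias/corrected: (B, 1, H, W)"""
    recon = torch.mean((image - bias * corrected) ** 2)   # 乘性成像模型一致性
    dy = bias[..., 1:, :] - bias[..., :-1, :]             # 平滑性:相邻像素差分
    dx = bias[..., :, 1:] - bias[..., :, :-1]
    smooth = dy.pow(2).mean() + dx.pow(2).mean()
    return recon + 0.1 * smooth                           # 权重 0.1 为假设值

x = torch.rand(2, 1, 64, 64)
loss = bias_correction_losses(x, torch.ones_like(x), x.clone())
```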

[CV-101] SWinMamba: Serpentine Window State Space Model for Vascular Segmentation

【速读】:该论文旨在解决医学图像中血管分割的连续性问题,即由于血管细长特性及先验建模不足导致的分割结果不连贯问题。解决方案的关键在于提出了一种新颖的Serpentine Window Mamba (SWinMamba)架构,其核心创新在于通过将蛇形窗口序列引入双向状态空间模型,以建模细长血管结构的连续性。该方法利用蛇形窗口分词器(SWToken)自适应地分割输入图像,实现灵活的感受野(RF)以捕捉血管结构特征,并通过双向聚合模块(BAM)整合局部特征以表征血管连续性,同时通过空间-频率融合单元(SFFU)的双域学习增强血管结构的特征表示能力。

链接: https://arxiv.org/abs/2507.01323
作者: Rongchang Zhao,Huanchi Liu,Jian Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vascular segmentation in medical images is crucial for disease diagnosis and surgical navigation. However, the segmented vascular structure is often discontinuous due to its slender nature and inadequate prior modeling. In this paper, we propose a novel Serpentine Window Mamba (SWinMamba) to achieve accurate vascular segmentation. The proposed SWinMamba innovatively models the continuity of slender vascular structures by incorporating serpentine window sequences into bidirectional state space models. The serpentine window sequences enable efficient feature capturing by adaptively guiding global visual context modeling to the vascular structure. Specifically, the Serpentine Window Tokenizer (SWToken) adaptively splits the input image using overlapping serpentine window sequences, enabling flexible receptive fields (RFs) for vascular structure modeling. The Bidirectional Aggregation Module (BAM) integrates coherent local features in the RFs for vascular continuity representation. In addition, dual-domain learning with Spatial-Frequency Fusion Unit (SFFU) is designed to enhance the feature representation of vascular structure. Extensive experiments on three challenging datasets demonstrate that the proposed SWinMamba achieves superior performance with complete and connected vessels.
zh
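
“蛇形窗口序列”的直觉是:让展平后的 token 序列在空间上始终相邻,便于双向状态空间模型沿细长血管保持连续性。下面给出蛇形扫描顺序的最小示意(窗口重叠与分词细节以论文为准):

```python
import numpy as np

def serpentine_order(h, w):
    """返回把 (h x w) patch 网格展平成蛇形序列的索引:
    偶数行从左到右、奇数行从右到左,相邻 token 在空间上始终相连"""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1].copy()   # 隔行反向
    return idx.ravel()

print(serpentine_order(4, 6))
```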

[CV-102] PanTS: The Pancreatic Tumor Segmentation Dataset

【速读】:该论文旨在解决胰腺CT图像分析中AI模型性能受限的问题,特别是胰腺肿瘤检测、定位和分割的准确性不足。其解决方案的关键在于构建了一个大规模、多机构协作的胰腺CT数据集PanTS,该数据集包含来自145个医疗中心的36,390例CT扫描,并提供了专家验证的体素级注释,涵盖超过993,000个解剖结构,包括胰腺肿瘤及其周围24个解剖结构。相比现有公开数据集,PanTS在肿瘤注释规模上扩大了16倍,并通过增加周边解剖结构间接提升了模型性能,从而为胰腺CT分析中的AI模型开发与评估提供了新的基准。

链接: https://arxiv.org/abs/2507.01291
作者: Wenxuan Li,Xinze Zhou,Qi Chen,Tianyu Lin,Pedro R. A. S. Bassi,Szymon Plotka,Jaroslaw B. Cwikla,Xiaoxi Chen,Chen Ye,Zheren Zhu,Kai Ding,Heng Li,Kang Wang,Yang Yang,Yucheng Tang,Daguang Xu,Alan L. Yuille,Zongwei Zhou
机构: Johns Hopkins University (约翰霍普金斯大学); University of Bologna (博洛尼亚大学); Istituto Italiano di Tecnologia (意大利技术研究院); Jagiellonian University (亚捷隆大学); University of Warmia and Mazury (瓦尔米亚和马祖里大学); Gammed (加默德诊断治疗中心); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Peking University Third Hospital (北京大学第三医院); University of California, San Francisco (加州大学旧金山分校); University of California, Berkeley (加州大学伯克利分校); Johns Hopkins School of Medicine (约翰霍普金斯医学院); NVIDIA (英伟达)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation compared to those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16x larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis.
zh

[CV-103] Classification based deep learning models for lung cancer and disease using medical images

【速读】:该论文旨在解决深度学习在医学影像分析中对肺癌及其相关疾病预测的准确性与效率问题。其关键解决方案是提出一种改进的深度卷积神经网络(CNN)模型——ResNet+,通过引入ResNet-D模块以增强特征提取能力,并在瓶颈层中集成卷积注意力模块,从而提升模型对输入图像中关键区域的关注度和泛化能力。此外,为应对类别不平衡问题,采用了数据增强技术,进一步优化了模型性能。

链接: https://arxiv.org/abs/2507.01279
作者: Ahmad Chaddad,Jihao Peng,Yihang Wu
机构: Guilin University of Electronic Technology (桂林电子科技大学); Ecole de Technologie Superieure (魁北克工程学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Transactions on Radiation and Plasma Medical Sciences

点击查看摘要

Abstract:The use of deep learning (DL) in medical image analysis has significantly improved the ability to predict lung cancer. In this study, we introduce a novel deep convolutional neural network (CNN) model, named ResNet+, which is based on the established ResNet framework. This model is specifically designed to improve the prediction of lung cancer and diseases using images. To address the challenge of missing feature information that occurs during the downsampling process in CNNs, we integrate the ResNet-D module, a variant designed to enhance feature extraction capabilities by modifying the downsampling layers, into the traditional ResNet model. Furthermore, a convolutional attention module was incorporated into the bottleneck layers to enhance model generalization by allowing the network to focus on relevant regions of the input images. We evaluated the proposed model using five public datasets, comprising lung cancer (LC2500 n=3183, IQ-OTH/NCCD n=1336, and LCC n=25000 images) and lung disease (ChestXray n=5856, and COVIDx-CT n=425024 images). To address class imbalance, we used data augmentation techniques to artificially increase the representation of underrepresented classes in the training dataset. The experimental results show that the ResNet+ model demonstrated remarkable accuracy/F1, reaching 98.14/98.14% on the LC25000 dataset and 99.25/99.13% on the IQ-OTH/NCCD dataset. Furthermore, the ResNet+ model saved computational cost compared to the original ResNet series in predicting lung cancer images. The proposed model outperformed the baseline models on publicly available datasets, achieving better performance metrics. Our codes are publicly available at this https URL.
zh
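
上文提到的 ResNet-D 模块通过修改下采样路径减少特征信息丢失:捷径分支先做 2×2 平均池化,再接 stride 为 1 的 1×1 卷积,避免 stride-2 的 1×1 卷积直接丢弃约四分之三的输入。以下为该捷径分支的示意实现(通道数为假设值):

```python
import torch
import torch.nn as nn

class ResNetDShortcut(nn.Module):
    """ResNet-D 风格的下采样捷径:AvgPool(2x2) + 1x1 Conv + BN"""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.pool = nn.AvgPool2d(2, stride=2)
        self.conv = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return self.bn(self.conv(self.pool(x)))

y = ResNetDShortcut(64, 128)(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```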

[CV-104] MID-INFRARED (MIR) OCT-based inspection in industry

【速读】:该论文旨在评估中红外光学相干断层扫描(Mid-Infrared Optical Coherence Tomography, MIR OCT)系统作为穿透不同材料并检测亚表面缺陷的工具的潜力,以支持生产过程的监测和工业中无损检测技术的应用。其解决方案的关键在于探索MIR OCT系统的性能,并结合预处理技术和人工智能增强的视觉算法,以实现对被测物体异常区域的有效检测。研究还讨论了参数选择的限制与标准,以及该方法的优势与不足。

链接: https://arxiv.org/abs/2507.01074
作者: N. P. García-de-la-Puente,Rocío del Amor,Fernando García-Torres,Niels Møller Israelsen,Coraline Lapre,Christian Rosenberg Petersen,Ole Bang,Dominik Brouczek,Martin Schwentenwein,Kevin Neumann,Niels Benson,Valery Naranjo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at i-ESA 2024, the 12th International Conference on Interoperability for Enterprise Systems and Applications. 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:This paper aims to evaluate mid-infrared (MIR) Optical Coherence Tomography (OCT) systems as a tool to penetrate different materials and detect sub-surface irregularities. This is useful for monitoring production processes, allowing Non-Destructive Inspection Techniques of great value to the industry. In this exploratory study, several acquisitions are made on composites and ceramics to assess the capabilities of the system. In addition, it is assessed which preprocessing and AI-enhanced vision algorithms can serve as anomaly-detection methodologies capable of detecting abnormal zones in the analyzed objects. Limitations and criteria for the selection of optimal parameters are discussed, and strengths and weaknesses are highlighted.
zh

[CV-105] Prompt Mechanisms in Medical Imaging: A Comprehensive Survey

【速读】:该论文旨在解决深度学习在医学影像领域临床应用中的关键挑战,包括数据稀缺性、分布偏移以及任务泛化能力不足等问题。其解决方案的关键在于采用基于提示(prompt-based)的方法,通过灵活的、领域特定的提示机制引导深度学习模型,从而在无需大量重新训练的情况下显著提升模型性能、适应性和可解释性。该方法通过增强任务特定结果的准确性、鲁棒性和数据效率,减少了对人工特征工程的依赖。

链接: https://arxiv.org/abs/2507.01055
作者: Hao Yang,Xinlong Liang,Zhang Li,Yue Sun,Zheyu Hu,Xinghe Xie,Behdad Dashtbozorg,Jincheng Huang,Shiwei Zhu,Luyi Han,Jiong Zhang,Shanshan Wang,Ritse Mann,Qifeng Yu,Tao Tan
机构: Macao Polytechnic University (澳门理工学院); Netherlands Cancer Institute (荷兰癌症研究所); Radboud University Medical Centre (拉德布德大学医学中心); National University of Defense Technology (国防科技大学); Hunan Cancer Hospital (湖南肿瘤医院); Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University (中南大学湘雅医学院附属肿瘤医院); Eindhoven University of Technology (埃因霍温理工大学); University of Chinese Academy of Sciences (中国科学院大学); Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院保罗·劳特伯生物医学成像研究中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning offers transformative potential in medical imaging, yet its clinical adoption is frequently hampered by challenges such as data scarcity, distribution shifts, and the need for robust task generalization. Prompt-based methodologies have emerged as a pivotal strategy to guide deep learning models, providing flexible, domain-specific adaptations that significantly enhance model performance and adaptability without extensive retraining. This systematic review critically examines the burgeoning landscape of prompt engineering in medical imaging. We dissect diverse prompt modalities, including textual instructions, visual prompts, and learnable embeddings, and analyze their integration for core tasks such as image generation, segmentation, and classification. Our synthesis reveals how these mechanisms improve task-specific outcomes by enhancing accuracy, robustness, and data efficiency and reducing reliance on manual feature engineering while fostering greater model interpretability by making the model’s guidance explicit. Despite substantial advancements, we identify persistent challenges, particularly in prompt design optimization, data heterogeneity, and ensuring scalability for clinical deployment. Finally, this review outlines promising future trajectories, including advanced multimodal prompting and robust clinical integration, underscoring the critical role of prompt-driven AI in accelerating the revolution of diagnostics and personalized treatment planning in medicine.
zh

[CV-106] CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation

【速读】:该论文旨在解决多器官医学分割中存在细节不准确、依赖几何提示以及空间信息丢失等问题。其解决方案的关键在于提出一种名为CRISP-SAM2的新型模型,该模型基于SAM2架构,通过跨模态交互和语义提示机制,将视觉与文本输入转换为跨模态上下文语义,并将其注入图像编码器以增强对视觉信息的细节理解。此外,采用语义提示策略替代原始提示编码器,以减少对几何提示的依赖,并结合相似性排序的自我更新记忆策略和掩码精炼过程,进一步提升医学影像的局部细节表现。

链接: https://arxiv.org/abs/2506.23121
作者: Xinlei Yu,Chanmiao Wang,Hui Jin,Ahmed Elazab,Gangyong Jia,Xiang Wan,Changqing Zou,Ruiquan Ge
机构: Hangzhou Dianzi University(杭州电子科技大学); Shenzhen Research Institute of Big Data(深圳大数据研究院); Shenzhen University(深圳大学); Zhejiang University(浙江大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 9 figures, 10 tables

点击查看摘要

Abstract:Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process is applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: this https URL.
zh

人工智能

[AI-0] AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

【速读】:该论文旨在解决移动操作中移动平台与机械臂协调控制的问题,主要挑战在于现有方法未能显式建模移动平台对机械臂控制的影响,导致在高自由度下容易产生误差累积,同时缺乏对不同操作阶段多模态感知需求的适应性。其解决方案的关键在于提出自适应协调扩散变压器(AC-DiT),通过引入“移动性到身体的条件机制”来提取移动平台运动表示并作为整体动作预测的先验信息,从而实现考虑移动平台运动影响的整体控制;此外,设计了感知感知的多模态条件策略,动态调整2D视觉图像与3D点云的融合权重,以满足不同阶段的感知需求。

链接: https://arxiv.org/abs/2507.01961
作者: Sixiang Chen,Jiaming Liu,Siyuan Qian,Han Jiang,Lily Li,Renrui Zhang,Zhuoyang Liu,Chenyang Gu,Chengkai Hou,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator’s actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base’s motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
zh

[AI-1] Exploring a Hybrid Deep Learning Approach for Anomaly Detection in Mental Healthcare Provider Billing: Addressing Label Scarcity through Semi-Supervised Anomaly Detection

【速读】:该论文试图解决心理健康医疗账单中的异常检测问题,特别是在面对类别不平衡、标签稀缺和复杂序列模式时,传统机器学习方法的局限性。其解决方案的关键在于采用一种混合深度学习方法,结合长短期记忆网络(LSTM)和Transformer模型,并通过孤立森林(iForest)和自编码器(AE)进行伪标签生成。这种方法在真实世界的心理健康医疗账单数据集上进行了评估,结果显示了在复杂且不平衡的异常检测场景中,伪标签与混合深度学习结合的潜力。

链接: https://arxiv.org/abs/2507.01924
作者: Samirah Bakker,Yao Ma,Seyed Sahand Mohammadi Ziabari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The complexity of mental healthcare billing enables anomalies, including fraud. While machine learning methods have been applied to anomaly detection, they often struggle with class imbalance, label scarcity, and complex sequential patterns. This study explores a hybrid deep learning approach combining Long Short-Term Memory (LSTM) networks and Transformers, with pseudo-labeling via Isolation Forests (iForest) and Autoencoders (AE). Prior work has not evaluated such hybrid models trained on pseudo-labeled data in the context of healthcare billing. The approach is evaluated on two real-world billing datasets related to mental healthcare. The iForest LSTM baseline achieves the highest recall (0.963) on declaration-level data. On the operation-level data, the hybrid iForest-based model achieves the highest recall (0.744), though at the cost of lower precision. These findings highlight the potential of combining pseudo-labeling with hybrid deep learning in complex, imbalanced anomaly detection settings.
zh
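
上文流程的第一步是用无监督检测器为无标签账单数据生成伪标签,再以伪标签监督训练 LSTM/Transformer 混合模型。下面用 scikit-learn 的 IsolationForest 给出伪标签生成环节的最小示意(特征与污染率均为假设值):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.rand(1000, 32)              # 每条账单记录的聚合特征(假设)
iforest = IsolationForest(contamination=0.05, random_state=0).fit(X)
pseudo_labels = (iforest.predict(X) == -1).astype(int)   # 1 表示疑似异常

# 随后可将 (X, pseudo_labels) 作为监督信号训练 LSTM/Transformer 混合模型
print(pseudo_labels.sum(), "条记录被标为伪异常")
```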

[AI-2] owards Foundation Auto-Encoders for Time-Series Anomaly Detection KDD2024

【速读】:该论文试图解决时间序列数据中的异常检测问题,特别是针对未见过的数据集实现零样本(zero-shot)检测。解决方案的关键在于引入FAE(Foundation Auto-Encoders),这是一种基于变分自编码器(Variational Auto-Encoders, VAEs)和扩张卷积神经网络(Dilated Convolutional Neural Networks, DCNNs)的生成式AI模型,通过大规模时间序列数据预训练,学习复杂的时序模式,从而实现对未知数据的准确建模与异常检测。

链接: https://arxiv.org/abs/2507.01875
作者: Gastón García González,Pedro Casas,Emilio Martínez,Alicia Fernández
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Presented at ACM KDD 2024, MiLeTS 2024 Workshop, August 25, 2024, Barcelona, Spain

点击查看摘要

Abstract:We investigate a novel approach to time-series modeling, inspired by the successes of large pretrained foundation models. We introduce FAE (Foundation Auto-Encoders), a foundation generative-AI model for anomaly detection in time-series data, based on Variational Auto-Encoders (VAEs). By foundation, we mean a model pretrained on massive amounts of time-series data which can learn complex temporal patterns useful for accurate modeling, forecasting, and detection of anomalies on previously unseen datasets. FAE leverages VAEs and Dilated Convolutional Neural Networks (DCNNs) to build a generic model for univariate time-series modeling, which could eventually perform properly in out-of-the-box, zero-shot anomaly detection applications. We introduce the main concepts of FAE, and present preliminary results in different multi-dimensional time-series datasets from various domains, including a real dataset from an operational mobile ISP, and the well known KDD 2021 Anomaly Detection dataset.
zh

[AI-3] Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

【速读】:该论文试图解决领域特定聊天机器人在多步骤交互中因依赖隐含语言线索而导致的上下文管理不明确和用户混淆问题。传统图形用户界面通过显式的“提交”和“重置”操作来明确跟踪用户意图,而对话代理则面临挑战。解决方案的关键在于将GUI启发的元数据(如“提交类似”确认和“重置类似”上下文切换)建模为大型语言模型(LLM)提示中的显式任务,通过结构化会话数据捕获用户确认、重置操作及链式思维(CoT)推理,从而保持交互清晰度并增强与后端逻辑的一致性。

链接: https://arxiv.org/abs/2507.01862
作者: Sanjay Krishna Anbalagan,Xinrui Nie,Umesh Mohan,Vijay Kumar Kanamarlapudi,Anughna Kommalapati,Xiaodan Zhao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure, pre-print of poster accepted for HCI International 2025 (HCII 2025), CCIS vol 2529

点击查看摘要

Abstract:Domain-specific chatbot applications often involve multi-step interactions, such as refining search filters, selecting multiple items, or performing comparisons. Traditional graphical user interfaces (GUIs) handle these workflows by providing explicit “Submit” (commit data) and “Reset” (discard data) actions, allowing back-end systems to track user intent unambiguously. In contrast, conversational agents rely on subtle language cues, which can lead to confusion and incomplete context management. This paper proposes modeling these GUI-inspired metaphors, acknowledgment (submit-like) and context switching (reset-like), as explicit tasks within large language model (LLM) prompts. By capturing user acknowledgment, reset actions, and chain of thought (CoT) reasoning as structured session data, we preserve clarity, reduce user confusion, and align domain-specific chatbot interactions with back-end logic. We demonstrate our approach in hotel booking and customer management scenarios, highlighting improvements in multi-turn task coherence, user satisfaction, and efficiency.
zh

[AI-4] Refining Gelfond Rationality Principle Towards More Comprehensive Foundational Principles for Answer Set Semantics IJCAI-2022

【速读】:该论文旨在探讨答案集语义(answer set semantics)是否应将最小模型性质、约束单调性和奠基性作为普遍强制条件,并在否定的情况下,提出其他可能作为答案集语义通用原则的性质。其解决方案的关键在于对Gelfond答案集(Gelfond Answer Set, GAS)原则进行改进,通过将Gelfond的合理性原则细化为“良好支持性”、“基于默认否定的最小性”和“基于认识否定的最小性”,以确保答案集的构造过程无循环论证,并在答案集和世界观层面最小化知识。此外,论文还扩展了“良好支持性”的概念,并基于改进的GAS原则定义了新的答案集语义。

链接: https://arxiv.org/abs/2507.01833
作者: Yi-Dong Shen,Thomas Eiter
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 76 pages. This article is a significantly extended version of a paper presented by the authors at IJCAI-2022

点击查看摘要

Abstract:Non-monotonic logic programming is the basis for a declarative problem solving paradigm known as answer set programming (ASP). Departing from the seminal definition by Gelfond and Lifschitz in 1988 for simple normal logic programs, various answer set semantics have been proposed for extensions. We consider two important questions: (1) Should the minimal model property, constraint monotonicity and foundedness as defined in the literature be mandatory conditions for an answer set semantics in general? (2) If not, what other properties could be considered as general principles for answer set semantics? We address the two questions. First, it seems that the three aforementioned conditions may sometimes be too strong, and we illustrate with examples that enforcing them may exclude expected answer sets. Second, we evolve the Gelfond answer set (GAS) principles for answer set construction by refining the Gelfond’s rationality principle to well-supportedness, minimality w.r.t. negation by default and minimality w.r.t. epistemic negation. The principle of well-supportedness guarantees that every answer set is constructible from if-then rules obeying a level mapping and is thus free of circular justification, while the two minimality principles ensure that the formalism minimizes knowledge both at the level of answer sets and of world views. Third, to embody the refined GAS principles, we extend the notion of well-supportedness substantially to answer sets and world views, respectively. Fourth, we define new answer set semantics in terms of the refined GAS principles. Fifth, we use the refined GAS principles as an alternative baseline to intuitively assess the existing answer set semantics. Finally, we analyze the computational complexity.
zh

[AI-5] mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling

【速读】:该论文旨在解决边缘设备在时间序列处理中对模型的内存约束问题,即如何在有限的内存下同时捕捉短期和长期的时间动态。其解决方案的关键在于提出一种混合记忆系统mGRADE,该系统结合了时间1D卷积与可学习间隔的延迟嵌入以及最小门控循环单元(minGRU),通过卷积层实现灵活的延迟嵌入以捕捉快速时间变化,而循环模块则以极小的内存开销高效维持全局上下文。

链接: https://arxiv.org/abs/2507.01829
作者: Tristan Torchet,Christian Metzner,Laura Kriener,Melika Payvand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Edge devices for temporal processing demand models that capture both short- and long-range dynamics under tight memory constraints. While Transformers excel at sequence modeling, their quadratic memory scaling with sequence length makes them impractical for such settings. Recurrent Neural Networks (RNNs) offer constant memory but train sequentially, and Temporal Convolutional Networks (TCNs), though efficient, scale memory with kernel size. To address this, we propose mGRADE (minimally Gated Recurrent Architecture with Delay Embedding), a hybrid-memory system that integrates a temporal 1D-convolution with learnable spacings followed by a minimal gated recurrent unit (minGRU). This design allows the convolutional layer to realize a flexible delay embedding that captures rapid temporal variations, while the recurrent module efficiently maintains global context with minimal memory overhead. We validate our approach on two synthetic tasks, demonstrating that mGRADE effectively separates and preserves multi-scale temporal features. Furthermore, on challenging pixel-by-pixel image classification benchmarks, mGRADE consistently outperforms both pure convolutional and pure recurrent counterparts using approximately 20% less memory footprint. This highlights mGRADE’s promise as an efficient solution for memory-constrained multi-scale temporal processing at the edge.
zh
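
mGRADE 中的循环模块 minGRU 将门控简化为只依赖当前输入:z_t = σ(W_z x_t),h̃_t = W_h x_t,h_t = (1−z_t)⊙h_{t−1} + z_t⊙h̃_t。下面按时间步循环给出示意实现(维度为假设值;因门控不依赖 h_{t−1},实际可改写为并行扫描):

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)   # 门控只看当前输入
        self.to_h = nn.Linear(d_in, d_hidden)

    def forward(self, x):                        # x: (B, T, d_in)
        h = x.new_zeros(x.size(0), self.to_h.out_features)
        outs = []
        for t in range(x.size(1)):
            z = torch.sigmoid(self.to_z(x[:, t]))
            h = (1 - z) * h + z * self.to_h(x[:, t])   # 凸组合更新隐状态
            outs.append(h)
        return torch.stack(outs, dim=1)

y = MinGRU(16, 32)(torch.randn(4, 50, 16))       # 隐状态内存为常数级
```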

[AI-6] MILP-SAT-GNN: Yet Another Neural SAT Solver

【速读】:该论文试图解决将图神经网络(GNN)应用于求解布尔可满足性问题(SAT)的问题,其核心在于通过将k-CNF公式映射为混合整数线性规划(MILP)问题,并将其编码为加权二分图输入到GNN中进行训练和测试。解决方案的关键在于利用GNN对MILP问题的处理技术,同时通过理论分析证明了方法在排列和等价不变性、近似能力等方面的性质,从而实现了对SAT求解的高效建模与学习。

链接: https://arxiv.org/abs/2507.01825
作者: Franco Alberto Cardillo,Hamza Khyari,Umberto Straccia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel method that enables Graph Neural Networks (GNNs) to solve SAT problems by leveraging a technique developed for applying GNNs to Mixed Integer Linear Programming (MILP). Specifically, k-CNF formulae are mapped into MILP problems, which are then encoded as weighted bipartite graphs and subsequently fed into a GNN for training and testing. From a theoretical perspective: (i) we establish permutation and equivalence invariance results, demonstrating that the method produces outputs that are stable under reordering of clauses and variables; (ii) we identify a theoretical limitation, showing that for a class of formulae called foldable formulae, standard GNNs cannot always distinguish satisfiable from unsatisfiable instances; (iii) we prove a universal approximation theorem, establishing that with Random Node Initialization (RNI), the method can approximate SAT solving to arbitrary precision on finite datasets, that is, the GNN becomes approximately sound and complete on such datasets. Furthermore, we show that for unfoldable formulae, the same approximation guarantee can be achieved without the need for RNI. Finally, we conduct an experimental evaluation of our approach, which shows that, despite the simplicity of the neural architecture, the method achieves promising results.
zh
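
上文方法的第一步是把 k-CNF 公式映射为 MILP:每个子句 (l₁ ∨ … ∨ l_k) 对应约束“正文字 x_i 之和 + 负文字 (1−x_j) 之和 ≥ 1”,x 为 0/1 变量;整理出的系数结构 (A, b) 即子句-变量加权二分图的邻接关系。最小示意如下:

```python
import numpy as np

def cnf_to_milp(clauses, n_vars):
    """clauses: 每个子句是文字列表,文字用 ±(var_id+1) 表示;
    返回 A, b,使可满足性等价于存在 0/1 向量 x 满足 A x >= b"""
    A = np.zeros((len(clauses), n_vars))
    b = np.ones(len(clauses))
    for r, clause in enumerate(clauses):
        for lit in clause:
            v = abs(lit) - 1
            if lit > 0:
                A[r, v] += 1.0
            else:
                A[r, v] -= 1.0        # (1 - x_j):常数 1 移到右端
                b[r] -= 1.0
    return A, b

# (x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x2)
A, b = cnf_to_milp([[1, -2, 3], [-1, 2]], n_vars=3)
print(A, b)
```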

[AI-7] BranchNet: A Neuro-Symbolic Learning Framework for Structured Multi-Class Classification

【速读】:该论文试图解决传统决策树集成方法(如XGBoost)在模型性能与可解释性之间的平衡问题,同时希望减少对人工架构调优的依赖。其解决方案的关键在于提出BranchNet,一种将决策树集成转换为稀疏、部分连接神经网络的神经符号学习框架,通过将每个决策路径映射到隐藏神经元,既保留了符号结构又支持梯度优化,从而实现了模型的紧凑性、可解释性以及自动化训练过程。

链接: https://arxiv.org/abs/2507.01781
作者: Dalia Rodríguez-Salas,Christian Riess
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures (with two images each)

点击查看摘要

Abstract:We introduce BranchNet, a neuro-symbolic learning framework that transforms decision tree ensembles into sparse, partially connected neural networks. Each branch, defined as a decision path from root to a parent of leaves, is mapped to a hidden neuron, preserving symbolic structure while enabling gradient-based optimization. The resulting models are compact, interpretable, and require no manual architecture tuning. Evaluated on a suite of structured multi-class classification benchmarks, BranchNet consistently outperforms XGBoost in accuracy, with statistically significant gains. We detail the architecture, training procedure, and sparsity dynamics, and discuss the model’s strengths in symbolic interpretability as well as its current limitations, particularly on binary tasks where further adaptive calibration may be beneficial.
zh
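
BranchNet 把“从根到叶子父节点”的每条决策路径(branch)映射为稀疏隐藏层中的一个神经元。下面用 scikit-learn 决策树演示 branch 的枚举过程(此处将“叶子父节点”取为至少有一个叶子子节点的内部节点;确切定义与到神经元权重的映射规则以论文为准):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def extract_branches(clf):
    """返回所有 branch,每条是 (特征索引, 阈值) 序列"""
    t = clf.tree_
    branches = []

    def walk(node, path):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # node 本身是叶子
            return
        step = (int(t.feature[node]), float(t.threshold[node]))
        if t.children_left[left] == -1 or t.children_left[right] == -1:
            branches.append(path + [step])   # node 是某个叶子的父节点
        walk(left, path + [step])
        walk(right, path + [step])

    walk(0, [])
    return branches

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(len(extract_branches(clf)), "条 branch 可映射为隐藏神经元")
```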

[AI-8] GPU-based complete search for nonlinear minimization subject to bounds

【速读】:该论文旨在解决在变量具有简单边界约束的情况下,对非线性函数进行全局最小值的精确包围问题。解决方案的关键在于结合区间分析与GPU计算能力,通过迭代排除不可能包含全局最小值的搜索区域,最终确定一个有限的区域集合以保证全局最小值的存在。该方法利用了GPU的并行计算架构,采用新型的单程序单数据并行编程模式来克服GPU性能瓶颈,并引入变量循环技术以降低大规模非线性函数优化的计算成本,从而实现了高效且可靠的全局优化。

链接: https://arxiv.org/abs/2507.01770
作者: Guanglu Zhang,Qihang Shan,Jonathan Cagan
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Optimization and Control (math.OC)
备注: 36 pages, 3 figures

点击查看摘要

Abstract:This paper introduces a GPU-based complete search method to enclose the global minimum of a nonlinear function subject to simple bounds on the variables. Using interval analysis, coupled with the computational power and architecture of GPU, the method iteratively rules out the regions in the search domain where the global minimum cannot exist and leaves a finite set of regions where the global minimum must exist. For effectiveness, because of the rigor of interval analysis, the method is guaranteed to enclose the global minimum of the nonlinear function even in the presence of rounding errors. For efficiency, the method employs a novel GPU-based single program, single data parallel programming style to circumvent major GPU performance bottlenecks, and a variable cycling technique is also integrated into the method to reduce computational cost when minimizing large-scale nonlinear functions. The method is validated by minimizing 10 multimodal benchmark test functions with scalable dimensions, including the well-known Ackley function, Griewank function, Levy function, and Rastrigin function. These benchmark test functions represent grand challenges of global optimization, and enclosing the guaranteed global minimum of these benchmark test functions with more than 80 dimensions has not been reported in the literature. Our method completely searches the feasible domain and successfully encloses the guaranteed global minimum of these 10 benchmark test functions with up to 10,000 dimensions using only one GPU in a reasonable computation time, far exceeding the reported results in the literature due to the unique method design and implementation based on GPU architecture.
zh

[AI-9] Enhanced Generative Model Evaluation with Clipped Density and Coverag e

【速读】:该论文试图解决生成式模型在评估样本质量时存在的可靠性不足问题,具体表现为现有质量度量缺乏校准或对异常值不够鲁棒,导致无法准确反映样本的保真度(fidelity)和覆盖率(coverage)。解决方案的关键在于引入两种新的度量方法——截断密度(Clipped Density)和截断覆盖率(Clipped Coverage),通过截断单个样本的贡献以及在保真度评估中限制最近邻球的半径,防止分布外样本对整体评估结果产生偏差。这些度量在分析和实验校准中表现出随着劣质样本比例增加而线性下降的分数特性,从而可直接解释为优质样本的比例。

链接: https://arxiv.org/abs/2507.01761
作者: Nicolas Salvy,Hugues Talbot,Bertrand Thirion
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by their incapacity to reliably evaluate sample quality. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics, Clipped Density and Clipped Coverage. By clipping individual sample contributions and, for fidelity, the radii of nearest neighbor balls, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics exhibit linear score degradation as the proportion of poor samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability for evaluating generative models.
zh
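
经典 density 指标统计每个生成样本落入多少个真实样本的 k-NN 球,离群样本可能无限推高分数;Clipped Density 的关键改动是截断单个样本的贡献。下面给出一个假设性的最小实现示意(论文中对近邻球半径的截断等细节此处从略):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def clipped_density(real, fake, k=5):
    """real: (N, d) 真实样本; fake: (M, d) 生成样本"""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    radii = nn.kneighbors(real)[0][:, -1]          # 每个真实样本的 k-NN 半径
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    counts = (d <= radii[None, :]).sum(axis=1)     # 每个生成样本落入的球数
    return np.clip(counts / k, 0.0, 1.0).mean()    # 截断单样本贡献至 1

rng = np.random.default_rng(0)
print(clipped_density(rng.normal(size=(500, 8)), rng.normal(size=(500, 8))))
```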

[AI-10] Joint Matching and Pricing for Crowd-shipping with In-store Customers

【速读】:该论文旨在解决城市地区高效最后一公里配送的需求,通过将实体店顾客作为配送员的集中式众包配送系统进行研究。其关键解决方案是提出一种结合神经近似动态规划(NeurADP)与深度双重Q网络(DDQN)的联合优化策略,以实现自适应订单分配和动态定价,从而应对订单和配送员到达的随机性以及配送报价接受概率的不确定性,同时支持多点配送路径规划,提升配送成本效率。

链接: https://arxiv.org/abs/2507.01749
作者: Arash Dehghan,Mucahit Cevik,Merve Bodur,Bissan Ghaddar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper examines the use of in-store customers as delivery couriers in a centralized crowd-shipping system, targeting the growing need for efficient last-mile delivery in urban areas. We consider a brick-and-mortar retail setting where shoppers are offered compensation to deliver time-sensitive online orders. To manage this process, we propose a Markov Decision Process (MDP) model that captures key uncertainties, including the stochastic arrival of orders and crowd-shippers, and the probabilistic acceptance of delivery offers. Our solution approach integrates Neural Approximate Dynamic Programming (NeurADP) for adaptive order-to-shopper assignment with a Deep Double Q-Network (DDQN) for dynamic pricing. This joint optimization strategy enables multi-drop routing and accounts for offer acceptance uncertainty, aligning more closely with real-world operations. Experimental results demonstrate that the integrated NeurADP + DDQN policy achieves notable improvements in delivery cost efficiency, with up to 6.7% savings over NeurADP with fixed pricing and approximately 18% over myopic baselines. We also show that allowing flexible delivery delays and enabling multi-destination routing further reduces operational costs by 8% and 17%, respectively. These findings underscore the advantages of dynamic, forward-looking policies in crowd-shipping systems and offer practical guidance for urban logistics operators.

[AI-11] Towards culturally-appropriate conversational AI for health in the majority world: An exploratory study with citizens and professionals in Latin America

[Quick Read]: This paper addresses the fact that current large language models (LLMs) exclude the lived experiences of many culturally and linguistically diverse contexts worldwide, which can introduce cultural bias into conversational AI (CAI) applications in digital health. The key to the solution is a bottom-up, locally grounded approach: drawing on qualitative data from participatory workshops in Latin America, the authors build a rich understanding of cultural misalignment in digital health, regional perspectives on health chatbots, and strategies for creating culturally appropriate CAI, and propose a framework for 'Pluriversal Conversational AI for Health' that emphasizes relationality and inclusiveness rather than simply more data.

Link: https://arxiv.org/abs/2507.01719
Authors: Dorian Peters, Fernanda Espinoza, Marco da Re, Guido Ivetta, Luciana Benotti, Rafael A. Calvo
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:There is justifiable interest in leveraging conversational AI (CAI) for health across the majority world, but to be effective, CAI must respond appropriately within culturally and linguistically diverse contexts. Therefore, we need ways to address the fact that current LLMs exclude many lived experiences globally. Various advances are underway which focus on top-down approaches and increasing training data. In this paper, we aim to complement these with a bottom-up locally-grounded approach based on qualitative data collected during participatory workshops in Latin America. Our goal is to construct a rich and human-centred understanding of: a) potential areas of cultural misalignment in digital health; b) regional perspectives on chatbots for health and c) strategies for creating culturally-appropriate CAI; with a focus on the understudied Latin American context. Our findings show that academic boundaries on notions of culture lose meaning at the ground level and technologies will need to engage with a broader framework; one that encapsulates the way economics, politics, geography and local logistics are entangled in cultural experience. To this end, we introduce a framework for ‘Pluriversal Conversational AI for Health’ which allows for the possibility that more relationality and tolerance, rather than just more data, may be called for.

[AI-12] Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI IJCAI2025

[Quick Read]: This paper tackles the challenge of extracting and interpreting technical knowledge from patents to inspire innovative product ideas; the core problem is how to use patent data to generate high-quality, relevant, and novel business ideas. The key to the solution is Agent Ideate, a framework that combines Large Language Models (LLMs) with autonomous agents; the agentic approach improves the quality, relevance, and novelty of the generated product concepts, and experiments show it consistently outperforms standalone LLMs across several domains.

Link: https://arxiv.org/abs/2507.01717
Authors: Gopichand Kanumolu, Ashok Urlana, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: AgentScen Workshop, IJCAI 2025

Abstract:Patents contain rich technical knowledge that can inspire innovative product ideas, yet accessing and interpreting this information remains a challenge. This work explores the use of Large Language Models (LLMs) and autonomous agents to mine and generate product concepts from a given patent. In this work, we design Agent Ideate, a framework for automatically generating product-based business ideas from patents. We experimented with open-source LLMs and agent-based architectures across three domains: Computer Science, Natural Language Processing, and Material Chemistry. Evaluation results show that the agentic approach consistently outperformed standalone LLMs in terms of idea quality, relevance, and novelty. These findings suggest that combining LLMs with agentic workflows can significantly enhance the innovation pipeline by unlocking the untapped potential of business idea generation from patent data.

[AI-13] Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

[Quick Read]: This paper addresses the lack of flexible collaboration mechanisms in multi-agent systems (MASs) for complex and dynamic problem solving, especially when well-defined structures or workflows are unavailable. The key to the solution is incorporating the blackboard architecture into LLM multi-agent systems: agents share information on a common blackboard, the agents that act next are selected dynamically based on its current content, and the selection-and-execution round is repeated until a consensus is reached, enabling more efficient and coordinated problem solving.

Link: https://arxiv.org/abs/2507.01701
Authors: Bochen Han, Songmao Zhang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose to incorporate the blackboard architecture into LLM multi-agent systems (MASs) so that (1) agents with various roles can share all the information and others’ messages during the whole problem-solving process, (2) agents that will take actions are selected based on the current content of the blackboard, and (3) the selection and execution round is repeated until a consensus is reached on the blackboard. We develop the first implementation of this proposal and conduct experiments on commonsense knowledge, reasoning and mathematical datasets. The results show that our system can be competitive with the SOTA static and dynamic MASs by achieving the best average performance, and at the same time manage to spend less tokens. Our proposal has the potential to enable complex and dynamic problem-solving where well-defined structures or workflows are unavailable.
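
The control loop described in points (1)-(3) can be sketched generically as follows; `select` and `consensus` stand in for the paper's LLM-based selection and consensus components, and all names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    entries: list = field(default_factory=list)   # shared message history

    def post(self, author, msg):
        self.entries.append((author, msg))

def run_blackboard(agents, select, consensus, max_rounds=10):
    """Generic blackboard loop: all agents see one shared board, a selector
    picks who acts next from the board's current content, and rounds repeat
    until a consensus test passes. `agents` maps role names to callables."""
    board = Blackboard()
    for _ in range(max_rounds):
        role = select(board, agents)           # choose next actor from board state
        board.post(role, agents[role](board))  # actor reads the board, writes a message
        if consensus(board):                   # stop once the board agrees
            break
    return board.entries
```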

[AI-14] Relational Causal Discovery with Latent Confounders UAI2025

[Quick Read]: This paper addresses the challenge of estimating causal effects from real-world relational data when the underlying causal model and potential confounders are unknown. Existing causal discovery algorithms typically assume that data are independent and identically distributed (i.i.d.) or assume causal sufficiency, neither of which holds for many real-world datasets. The key to the solution is RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders; it builds on the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms, defines the new graphical models needed to support causal discovery in relational domains, and establishes soundness and completeness guarantees for relational d-separation with latent confounders.

Link: https://arxiv.org/abs/2507.01700
Authors: Andrea Piras, Matteo Negro, Ragib Ahsan, David Arbour, Elena Zheleva
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 19 figures. Accepted for publication at the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025). Andrea Piras and Matteo Negro contributed equally to this work

Abstract:Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and it defines new graphical models, necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.

[AI-15] GPT But Backwards: Exactly Inverting Language Model Outputs ICML2025

[Quick Read]: This paper addresses the forensic inverse problem of reconstructing the exact input that led to a given large language model (LLM) output, enabling post-incident analysis and potentially the detection of fake output reports. The key to the solution is formalizing exact input reconstruction as a discrete optimization problem with a unique global minimum and introducing SODA, an efficient gradient-based algorithm that operates on a continuous relaxation of the input search space with periodic restarts and parameter decay, enabling efficient input recovery.

Link: https://arxiv.org/abs/2507.01693
Authors: Adrians Skapars, Edoardo Manino, Youcheng Sun, Lucas C. Cordeiro
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, ICML 2025 Workshop on Reliable and Responsible Foundation Models

Abstract:While existing auditing techniques attempt to identify potential unwanted behaviours in large language models (LLMs), we address the complementary forensic problem of reconstructing the exact input that led to an existing LLM output - enabling post-incident analysis and potentially the detection of fake output reports. We formalize exact input reconstruction as a discrete optimisation problem with a unique global minimum and introduce SODA, an efficient gradient-based algorithm that operates on a continuous relaxation of the input search space with periodic restarts and parameter decay. Through comprehensive experiments on LLMs ranging in size from 33M to 3B parameters, we demonstrate that SODA significantly outperforms existing approaches. We succeed in fully recovering 79.5% of shorter out-of-distribution inputs from next-token logits, without a single false positive, but struggle to extract private information from the outputs of longer (15+ token) input sequences. This suggests that standard deployment practices may currently provide adequate protection against malicious use of our method. Our code is available at this https URL.
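
A minimal sketch of the continuous-relaxation idea (inspired by, but much simpler than, SODA): optimize a relaxed one-hot matrix so the model's next-token logits match the observed ones. It assumes `model` exposes a differentiable interface that accepts soft token distributions (e.g., by multiplying them into the embedding matrix); the restart scheme and missing parameter decay are illustrative simplifications.

```python
import torch

def invert_by_relaxation(model, target_logits, seq_len, vocab, steps=500, restarts=3):
    """Recover input token ids by gradient descent over a soft one-hot input."""
    best, best_loss = None, float("inf")
    for _ in range(restarts):                     # restarts escape bad local minima
        logits = torch.randn(seq_len, vocab, requires_grad=True)  # relaxed input
        opt = torch.optim.Adam([logits], lr=0.1)
        for _ in range(steps):
            opt.zero_grad()
            x = torch.softmax(logits, dim=-1)     # soft one-hot tokens
            loss = torch.nn.functional.mse_loss(model(x), target_logits)
            loss.backward()
            opt.step()
        if loss.item() < best_loss:
            best, best_loss = logits.argmax(-1), loss.item()
    return best                                    # hard token ids from the relaxation
```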

[AI-16] Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization

[Quick Read]: This paper addresses the performance bottleneck of Deep Recommender Model (DLRM) inference: the embedding layers are inefficient because they perform many frequent, random memory accesses. The key to the solution is designing tailored data flows, namely four strategies for looking up an embedding table efficiently on a single core, together with a framework that automatically maps embedding tables asymmetrically onto the multiple cores of an SoC, thereby speeding up embedding lookups.

Link: https://arxiv.org/abs/2507.01676
Authors: Giuseppe Ruggeri, Renzo Andri, Daniele Jahier Pagliari, Lukas Cavigelli
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Information Retrieval (cs.IR)
Comments: 5 pages, 4 figures, conference: IEEE ICCD24

Abstract:Deep Recommender Models (DLRMs) inference is a fundamental AI workload accounting for more than 79% of the total AI workload in Meta’s data centers. DLRMs’ performance bottleneck is found in the embedding layers, which perform many random memory accesses to retrieve small embedding vectors from tables of various sizes. We propose the design of tailored data flows to speedup embedding look-ups. Namely, we propose four strategies to look up an embedding table effectively on one core, and a framework to automatically map the tables asymmetrically to the multiple cores of a SoC. We assess the effectiveness of our method using the Huawei Ascend AI accelerators, comparing it with the default Ascend compiler, and we perform high-level comparisons with Nvidia A100. Results show a speed-up varying from 1.5x up to 6.5x for real workload distributions, and more than 20x for extremely unbalanced distributions. Furthermore, the method proves to be much more independent of the query distribution than the baseline.

[AI-17] Comparing Optimization Algorithms Through the Lens of Search Behavior Analysis

[Quick Read]: This paper addresses a problem in the development of 'novel' metaheuristics for numerical optimization: such algorithms are often designed around metaphors of natural or man-made processes, which obscures genuine innovation and makes them hard to distinguish from existing approaches. The key to the solution is applying statistical tests to compare algorithms by their search behavior: the cross-match statistical test is used to compare multivariate distributions and assess the solutions produced by 114 algorithms from the MEALPY library, and the findings feed an empirical analysis that identifies algorithms with similar search behaviors.

Link: https://arxiv.org/abs/2507.01668
Authors: Gjorgjina Cenikj, Gašper Petelin, Tome Eftimov
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The field of numerical optimization has recently seen a surge in the development of “novel” metaheuristic algorithms, inspired by metaphors derived from natural or human-made processes, which have been widely criticized for obscuring meaningful innovations and failing to distinguish themselves from existing approaches. Aiming to address these concerns, we investigate the applicability of statistical tests for comparing algorithms based on their search behavior. We utilize the cross-match statistical test to compare multivariate distributions and assess the solutions produced by 114 algorithms from the MEALPY library. These findings are incorporated into an empirical analysis aiming to identify algorithms with similar search behaviors.

[AI-18] AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

[Quick Read]: This paper addresses three problems in reinforcement learning (RL) frameworks for the post-training phase of large language models (LLMs): the scalability bottlenecks of task-colocated frameworks, the resource idling and workload imbalance caused by the complex dataflows of task-separated frameworks, and the difficulty of supporting custom engines given existing frameworks' tight coupling with LLM training or inference engines. The key to the solution is AsyncFlow, an asynchronous streaming RL framework that introduces a distributed data storage and transfer module for unified data management and fine-grained scheduling in a fully streamed manner, adopts a producer-consumer-based asynchronous workflow that minimizes computational idleness by strategically deferring parameter updates within staleness thresholds, and architecturally decouples itself from the underlying training and inference engines through service-oriented user interfaces.

Link: https://arxiv.org/abs/2507.01663
Authors: Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, Jianping Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled with LLM training or inference engines, making it difficult to support custom-designed engines. To address these challenges, we propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training. Specifically, we introduce a distributed data storage and transfer module that provides a unified data management and fine-grained scheduling capability in a fully streamed manner. This architecture inherently facilitates automated pipeline overlapping among RL tasks and dynamic load balancing. Moreover, we propose a producer-consumer-based asynchronous workflow engineered to minimize computational idleness by strategically deferring parameter update process within staleness thresholds. Finally, the core capability of AsyncFlow is architecturally decoupled from underlying training and inference engines and encapsulated by service-oriented user interfaces, offering a modular and customizable user experience. Extensive experiments demonstrate an average 1.59x throughput improvement compared with state-of-the-art baseline. The presented architecture in this work provides actionable insights for next-generation RL training system designs.
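
The producer-consumer workflow with a staleness threshold can be sketched in a few lines; `rollout_fn` and `update_fn` are placeholder coroutines/functions, and the version bookkeeping is an illustrative simplification of the paper's deferred-update mechanism.

```python
import asyncio

async def producer(queue, rollout_fn, params_ref):
    """Continuously generate trajectories with the current weights."""
    while True:
        traj = await rollout_fn(params_ref["weights"])
        await queue.put((params_ref["version"], traj))

async def consumer(queue, update_fn, params_ref, max_staleness=2):
    """Consume trajectories, tolerating bounded staleness instead of blocking."""
    while True:
        version, traj = await queue.get()
        if params_ref["version"] - version <= max_staleness:
            params_ref["weights"] = update_fn(params_ref["weights"], traj)
            params_ref["version"] += 1   # trajectories older than the threshold are dropped
```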

[AI-19] GradMetaNet: An Equivariant Architecture for Learning on Gradients

[Quick Read]: This paper addresses the problem of designing neural architectures specialized for processing gradients; existing approaches use architectures not built for gradient processing, which limits their applicability. The key to the solution is a principled design approach guided by three principles: (1) equivariant designs that preserve neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, the authors introduce GradMetaNet, a novel architecture for learning on gradients, prove universality results for it, and show that it can approximate natural gradient-based functions that previous approaches cannot.

Link: https://arxiv.org/abs/2507.01649
Authors: Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo (Moe) Putterman, Michael Bronstein, Haggai Maron
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet’s effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.
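
Principle (3) rests on a standard fact worth spelling out: for a linear layer y = Wx, the per-example weight gradient is the outer product of the output gradient and the input, hence rank one. The sketch below verifies this and returns the two factors instead of the full matrix; it illustrates the kind of representation GradMetaNet exploits, not the paper's code.

```python
import torch

def rank1_gradient_factors(x, delta):
    """Per-example gradient of a linear layer as rank-1 factors.

    x:     (batch, d_in) layer inputs
    delta: (batch, d_out) gradients of the loss w.r.t. the layer outputs
    """
    full = torch.einsum("bo,bi->boi", delta, x)         # explicit per-example grads
    assert torch.allclose(full[0], torch.outer(delta[0], x[0]))  # rank-1 check
    return delta, x   # storing the factors costs O(d_in + d_out), not O(d_in * d_out)
```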

[AI-20] Customized Exploration of Landscape Features Driving Multi-Objective Combinatorial Optimization Performance

[Quick Read]: This paper addresses the problem of predicting the performance of multi-objective combinatorial optimization algorithms; its core is analyzing landscape features to identify the factors that drive algorithm performance. The key to the solution is using landscape features from the recently proposed compressed Pareto Local Optimal Solutions Networks (C-PLOS-net) model on rmnk-landscape instances with 2 and 3 objectives and varying ruggedness and objective correlation, evaluating three algorithms (Pareto Local Search, Global Simple EMO Optimizer, and NSGA-II) with the resolution and hypervolume metrics, and revealing the feature combinations that influence algorithm performance on specific landscapes.

Link: https://arxiv.org/abs/2507.01638
Authors: Ana Nikolikj, Gabriela Ochoa, Tome Eftimov
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present an analysis of landscape features for predicting the performance of multi-objective combinatorial optimization algorithms. We consider features from the recently proposed compressed Pareto Local Optimal Solutions Networks (C-PLOS-net) model of combinatorial landscapes. The benchmark instances are a set of rmnk-landscapes with 2 and 3 objectives and various levels of ruggedness and objective correlation. We consider the performance of three algorithms – Pareto Local Search (PLS), Global Simple EMO Optimizer (GSEMO), and Non-dominated Sorting Genetic Algorithm (NSGA-II) - using the resolution and hypervolume metrics. Our tailored analysis reveals feature combinations that influence algorithm performance specific to certain landscapes. This study provides deeper insights into feature importance, tailored to specific rmnk-landscapes and algorithms.

[AI-21] Enhanced Influence-aware Group Recommendation for Online Media Propagation

[Quick Read]: This paper addresses group recommendation (GR) over social media streams, in particular improving accuracy and efficiency while accounting for the impact of social influence on group decision-making. The task faces three key challenges: the large and ever-growing scale of social graphs, the inherently dynamic nature of influence propagation within user groups, and the high computational overhead of real-time group-item matching. The key to the solution is the Enhanced Influence-aware Group Recommendation (EIGR) framework, whose core components are a Graph Extraction-based Sampling (GES) strategy that minimizes redundancy across multiple temporal social graphs and captures the evolving dynamics of groups and items, a DYnamic Independent Cascade (DYIC) model that predicts how influence propagates over time across social items and user groups, and a two-level hash-based User Group Index (UG-Index) for efficient group organization and real-time recommendation generation.

Link: https://arxiv.org/abs/2507.01616
Authors: Chengkun He, Xiangmin Zhou, Chen Wang, Longbing Cao, Jie Shao, Xiaodong Li, Guang Xu, Carrie Jinqiu Hu, Zahir Tari
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Group recommendation over social media streams has attracted significant attention due to its wide applications in domains such as e-commerce, entertainment, and online news broadcasting. By leveraging social connections and group behaviours, group recommendation (GR) aims to provide more accurate and engaging content to a set of users rather than individuals. Recently, influence-aware GR has emerged as a promising direction, as it considers the impact of social influence on group decision-making. In earlier work, we proposed Influence-aware Group Recommendation (IGR) to solve this task. However, this task remains challenging due to three key factors: the large and ever-growing scale of social graphs, the inherently dynamic nature of influence propagation within user groups, and the high computational overhead of real-time group-item matching. To tackle these issues, we propose an Enhanced Influence-aware Group Recommendation (EIGR) framework. First, we introduce a Graph Extraction-based Sampling (GES) strategy to minimise redundancy across multiple temporal social graphs and effectively capture the evolving dynamics of both groups and items. Second, we design a novel DYnamic Independent Cascade (DYIC) model to predict how influence propagates over time across social items and user groups. Finally, we develop a two-level hash-based User Group Index (UG-Index) to efficiently organise user groups and enable real-time recommendation generation. Extensive experiments on real-world datasets demonstrate that our proposed framework, EIGR, consistently outperforms state-of-the-art baselines in both effectiveness and efficiency.
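
For context, DYIC extends the classic independent cascade model with temporal dynamics. The static ancestor is easy to state; the Monte-Carlo sketch below estimates influence spread under it, with `graph[u]` mapping neighbors to activation probabilities (all names illustrative).

```python
import random

def independent_cascade(graph, seeds, trials=100):
    """Average number of nodes activated from `seeds` under the
    (static) independent cascade model, estimated by simulation."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v, p in graph[u].items():
                if v not in active and random.random() < p:  # one activation attempt per edge
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials
```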

[AI-22] Exploring Classical Piano Performance Generation with Expressive Music Variational AutoEncoder

[Quick Read]: This paper addresses the problem of generating classical piano performances from scratch, aiming to emulate the dual roles of composer and pianist in the creative process. The key to the solution is the Expressive Compound Word (ECP) representation, which captures both the metrical structure and the expressive nuances of classical performances. Building on it, the authors propose the Expressive Music Variational AutoEncoder (XMVAE), a model with two branches: a Vector Quantized Variational AutoEncoder (VQ-VAE) branch that generates score-related content (playing the Composer) and a vanilla VAE branch that produces expressive details (playing the Pianist). The branches are jointly trained with similar Seq2Seq architectures, using a multiscale encoder to capture beat-level context and an orthogonal Transformer decoder for efficient compound-token decoding; pretraining the Composer branch on extra score datasets yields a further performance gain.

Link: https://arxiv.org/abs/2507.01582
Authors: Jing Luo, Xinyu Yang, Jie Wei
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE SMC 2025

Abstract:The creativity of classical music arises not only from composers who craft the musical sheets but also from performers who interpret the static notations with expressive nuances. This paper addresses the challenge of generating classical piano performances from scratch, aiming to emulate the dual roles of composer and pianist in the creative process. We introduce the Expressive Compound Word (ECP) representation, which effectively captures both the metrical structure and expressive nuances of classical performances. Building on this, we propose the Expressive Music Variational AutoEncoder (XMVAE), a model featuring two branches: a Vector Quantized Variational AutoEncoder (VQ-VAE) branch that generates score-related content, representing the Composer, and a vanilla VAE branch that produces expressive details, fulfilling the role of Pianist. These branches are jointly trained with similar Seq2Seq architectures, leveraging a multiscale encoder to capture beat-level contextual information and an orthogonal Transformer decoder for efficient compound tokens decoding. Both objective and subjective evaluations demonstrate that XMVAE generates classical performances with superior musical quality compared to state-of-the-art models. Furthermore, pretraining the Composer branch on extra musical score datasets contribute to a significant performance gain.

[AI-23] Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware

[Quick Read]: This paper addresses real-time detection of emergency vehicle (EV) sirens in urban environments, targeting low-latency, robust detection on embedded hardware. The key to the solution is a full-stack detection system built on E2PANNs, a binary sound event detection model fine-tuned from EPANNs and optimized for urban acoustic conditions, together with curated, semantically structured datasets (AudioSet-EV, AudioSet-EV Augmented, and Unified-EV) created with a custom AudioSet-Tools framework to overcome the low reliability of standard AudioSet annotations. The system is deployed on a Raspberry Pi 5 with a high-fidelity DAC+microphone board and implements a multithreaded inference engine with adaptive frame sizing, probability smoothing, and a decision-state machine to control false-positive activations, while a remote WebSocket interface provides real-time monitoring and live demonstration.

Link: https://arxiv.org/abs/2507.01563
Authors: Marco Giordano, Stefano Giacomelli, Claudia Rinaldi, Fabio Graziosi
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 10 figures, submitted to this https URL . arXiv admin note: text overlap with arXiv:2506.23437

Abstract:We present a full-stack emergency vehicle (EV) siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs, and optimized for binary sound event detection under urban acoustic conditions. A key contribution is the creation of curated and semantically structured datasets - AudioSet-EV, AudioSet-EV Augmented, and Unified-EV - developed using a custom AudioSet-Tools framework to overcome the low reliability of standard AudioSet annotations. The system is deployed on a Raspberry Pi 5 equipped with a high-fidelity DAC+microphone board, implementing a multithreaded inference engine with adaptive frame sizing, probability smoothing, and a decision-state machine to control false positive activations. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities. Performance is evaluated using both framewise and event-based metrics across multiple configurations. Results show the system achieves low-latency detection with improved robustness under realistic audio conditions. This work demonstrates the feasibility of deploying IoS-compatible SED solutions that can form distributed acoustic monitoring networks, enabling collaborative emergency vehicle tracking across smart city infrastructures through WebSocket connectivity on low-cost edge devices.
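
The post-processing stage (probability smoothing plus a decision-state machine) is the part that suppresses false positives; a minimal sketch follows. The EMA coefficient and hysteresis thresholds are illustrative, not the deployed values.

```python
def smooth_and_decide(frame_probs, alpha=0.2, on_thr=0.7, off_thr=0.4):
    """Exponentially smooth framewise siren probabilities and run a
    two-state hysteresis machine: turning on requires exceeding on_thr,
    turning off requires dropping below the lower off_thr."""
    state, smoothed, events = "idle", 0.0, []
    for t, p in enumerate(frame_probs):
        smoothed = alpha * p + (1 - alpha) * smoothed   # EMA smoothing
        if state == "idle" and smoothed > on_thr:
            state = "active"; events.append(("on", t))
        elif state == "active" and smoothed < off_thr:
            state = "idle"; events.append(("off", t))
    return events
```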

[AI-24] AI and Remote Sensing for Resilient and Sustainable Built Environments: A Review of Current Methods, Open Data and Future Directions

[Quick Read]: This paper addresses the growing risks to the resilience and functionality of critical infrastructure (such as transport networks) posed by ageing assets, climate change impacts (e.g., extreme weather and rising sea levels), and hybrid threats ranging from natural disasters to cyber attacks and conflicts. The key to the solution is leveraging emerging digital technologies, in particular Artificial Intelligence (AI), to enhance damage assessment and monitoring of transport infrastructure. A systematic literature review examines AI models and datasets for assessing damage in roads, bridges, and other infrastructure, with special focus on bridges given their structural complexity and critical role in connectivity, discusses the integration of Synthetic Aperture Radar (SAR) data with AI models, and reveals a critical research gap: a scarcity of studies applying AI models to SAR data for comprehensive bridge damage assessment.

Link: https://arxiv.org/abs/2507.01547
Authors: Ubada El Joulani, Tatiana Kalganova, Stergios-Aristoteles Mitoulis, Sotirios Argyroudis
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Critical infrastructure, such as transport networks, underpins economic growth by enabling mobility and trade. However, ageing assets, climate change impacts (e.g., extreme weather, rising sea levels), and hybrid threats ranging from natural disasters to cyber attacks and conflicts pose growing risks to their resilience and functionality. This review paper explores how emerging digital technologies, specifically Artificial Intelligence (AI), can enhance damage assessment and monitoring of transport infrastructure. A systematic literature review examines existing AI models and datasets for assessing damage in roads, bridges, and other critical infrastructure impacted by natural disasters. Special focus is given to the unique challenges and opportunities associated with bridge damage detection due to their structural complexity and critical role in connectivity. The integration of SAR (Synthetic Aperture Radar) data with AI models is also discussed, with the review revealing a critical research gap: a scarcity of studies applying AI models to SAR data for comprehensive bridge damage assessment. Therefore, this review aims to identify the research gaps and provide foundations for AI-driven solutions for assessing and monitoring critical transport infrastructures.

[AI-25] Chargax: A JAX Accelerated EV Charging Simulator

[Quick Read]: This paper addresses the computational inefficiency of deep reinforcement learning in the sustainable-energy domain, in particular slow training in realistic settings such as electric vehicle charging station operation. Traditional RL approaches progress slowly due to high sample complexity and expensive simulation, and existing GPU-accelerated work has largely focused on classical toy problems. The key to the solution is Chargax, a JAX-based environment for realistic simulation of electric vehicle charging stations that substantially accelerates RL agent training: experiments show computational performance improvements of 100x-1000x over existing environments, and its modular architecture can represent diverse real-world charging station configurations.

Link: https://arxiv.org/abs/2507.01522
Authors: Koen Ponse, Jan Felix Kleuker, Aske Plaat, Thomas Moerland
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted at RLC 2025

Abstract:Deep Reinforcement Learning can play a key role in addressing sustainable energy challenges. For instance, many grid systems are heavily congested, highlighting the urgent need to enhance operational efficiency. However, reinforcement learning approaches have traditionally been slow due to the high sample complexity and expensive simulation requirements. While recent works have effectively used GPUs to accelerate data generation by converting environments to JAX, these works have largely focussed on classical toy problems. This paper introduces Chargax, a JAX-based environment for realistic simulation of electric vehicle charging stations designed for accelerated training of RL agents. We validate our environment in a variety of scenarios based on real data, comparing reinforcement learning agents against baselines. Chargax delivers substantial computational performance improvements of over 100x-1000x over existing environments. Additionally, Chargax’ modular architecture enables the representation of diverse real-world charging station configurations.
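
The speedups come from writing the environment as pure functions that JAX can jit-compile and vectorize. The toy step below illustrates that pattern only; the state fields, dynamics, and reward are invented for the example and are not Chargax's actual model.

```python
import jax
import jax.numpy as jnp

def step(state, action, max_power=22.0):
    """Pure-functional charging step: state is a (soc, demand) tuple."""
    soc, demand = state
    delivered = jnp.clip(action, 0.0, max_power)     # respect charger limits
    new_soc = jnp.minimum(soc + delivered, 100.0)    # battery saturates
    reward = -jnp.abs(demand - delivered)            # penalize over/under-delivery
    return (new_soc, demand - delivered), reward

# Vectorize over a whole fleet of chargers on the accelerator.
batched_step = jax.vmap(jax.jit(step), in_axes=((0, 0), 0))
```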

[AI-26] Agent-as-Tool: A Study on the Hierarchical Decision Making with Reinforcement Learning

[Quick Read]: This paper addresses the challenge, in LLM-based agent frameworks, of deciding the tool-calling process and the reasoning process simultaneously; in particular, the raw results returned by tools contain redundant information and task-irrelevant symbols that burden the model's reasoning. The key to the solution is Agent-as-tool, a hierarchical framework that detaches tool calling from reasoning so that the model can focus on the verbal reasoning process while tool calling is handled by another agent.

Link: https://arxiv.org/abs/2507.01489
Authors: Yanfei Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 12 pages

Abstract:Large Language Models (LLMs) have emerged as one of the most significant technological advancements in artificial intelligence in recent years. Their ability to understand, generate, and reason with natural language has transformed how we interact with AI systems. With the development of LLM-based agents and reinforcement-learning-based reasoning models, the study of applying reinforcement learning in agent frameworks has become a new research focus. However, all previous studies face the challenge of deciding the tool calling process and the reasoning process simultaneously, and the chain of reasoning was solely relied on the unprocessed raw result with redundant information and symbols unrelated to the task from the tool, which impose a heavy burden on the model’s capability to reason. Therefore, in our research, we proposed a hierarchical framework Agent-as-tool that detach the tool calling process and the reasoning process, which enables the model to focus on the verbally reasoning process while the tool calling process is handled by another agent. Our work had achieved comparable results with only a slight reinforcement fine-tuning on 180 samples, and had achieved exceptionally well performance in Bamboogle with 63.2% of exact match and 75.2% in cover exact match, exceeding Search-R1 by 4.8% in exact match and 3.2% in cover exact match.

[AI-27] BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

[Quick Read]: This paper addresses the difficulties of automating experimental design and execution in biological research, namely rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. The key to the solution is BioMARS (Biological Multi-Agent Robotic System), which integrates large language models (LLMs) and vision-language models (VLMs) with modular robotics to autonomously design, plan, and execute experiments through a hierarchical architecture: a Biologist Agent synthesizes protocols via retrieval-augmented generation, a Technician Agent translates them into executable robotic pseudo-code, and an Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection, improving experimental accuracy and efficiency.

Link: https://arxiv.org/abs/2507.01485
Authors: Yibo Qiu, Zan Huang, Zhiyu Wang, Handi Liu, Yiling Qiao, Yifeng Hu, Shu’ang Sun, Hangke Peng, Ronald X Xu, Mingzhai Sun
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Large language models (LLMs) and vision-language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi-Agent Robotic System), an intelligent platform that integrates LLMs, VLMs, and modular robotics to autonomously design, plan, and execute biological experiments. BioMARS uses a hierarchical architecture: the Biologist Agent synthesizes protocols via retrieval-augmented generation; the Technician Agent translates them into executable robotic pseudo-code; and the Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection. The system autonomously conducts cell passaging and culture tasks, matching or exceeding manual performance in viability, consistency, and morphological integrity. It also supports context-aware optimization, outperforming conventional strategies in differentiating retinal pigment epithelial cells. A web interface enables real-time human-AI collaboration, while a modular backend allows scalable integration with laboratory hardware. These results highlight the feasibility of generalizable, AI-driven laboratory automation and the transformative role of language-based reasoning in biological research.

[AI-28] Zero-Incentive Dynamics: a look at reward sparsity through the lens of unrewarded subgoals

[Quick Read]: This paper re-examines the commonly held assumption that reward frequency is a reliable measure of task difficulty in reinforcement learning, focusing on the structural challenge that arises when essential subgoals do not directly yield rewards. The paper characterizes such settings as exhibiting zero-incentive dynamics, in which transitions critical to success remain unrewarded, and shows that state-of-the-art deep subgoal-based algorithms fail to exploit these dynamics and are highly sensitive to the temporal proximity between subgoal completion and eventual reward. The key takeaway is the need for mechanisms that can infer latent task structure without relying on immediate incentives.

Link: https://arxiv.org/abs/2507.01470
Authors: Yannick Molinghen, Tom Lenaerts
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at “Finding the Frame 2025”, workshop at RLC

Abstract:This work re-examines the commonly held assumption that the frequency of rewards is a reliable measure of task difficulty in reinforcement learning. We identify and formalize a structural challenge that undermines the effectiveness of current policy learning methods: when essential subgoals do not directly yield rewards. We characterize such settings as exhibiting zero-incentive dynamics, where transitions critical to success remain unrewarded. We show that state-of-the-art deep subgoal-based algorithms fail to leverage these dynamics and that learning performance is highly sensitive to the temporal proximity between subgoal completion and eventual reward. These findings reveal a fundamental limitation in current approaches and point to the need for mechanisms that can infer latent task structure without relying on immediate incentives.
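
A tiny worked example makes the phenomenon concrete: in the chain MDP below (invented for illustration, not from the paper), picking up a "key" is necessary to earn the terminal reward, yet the pickup transition itself pays zero, so reward frequency reveals nothing about where the real difficulty lies.

```python
def make_zero_incentive_chain(n=10, key_state=3):
    """Chain of n states; the agent must visit key_state before the end
    state pays off. The critical key-pickup transition is unrewarded."""
    def step(state, has_key, action):            # action: 0 = left, 1 = right
        nstate = max(0, min(n - 1, state + (1 if action else -1)))
        has_key = has_key or (nstate == key_state)   # zero-incentive transition
        reward = 1.0 if (nstate == n - 1 and has_key) else 0.0
        return nstate, has_key, reward
    return step
```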

[AI-29] Quantum-Assisted Automatic Path-Planning for Robotic Quality Inspection in Industry 4.0

[Quick Read]: This paper addresses the optimization of robotic inspection trajectories derived from Computer-Aided Design (CAD) models in industrial settings, modeling the task as a 3D variant of the Traveling Salesman Problem (TSP) with incomplete graphs and open-route constraints. The key to the solution is hybrid quantum-classical algorithms: two D-Wave-based solvers are evaluated against classical methods such as GUROBI and Google OR-Tools, and across five real-world cases they achieve competitive solution quality with significantly reduced computation times.

Link: https://arxiv.org/abs/2507.01462
Authors: Eneko Osaba, Estibaliz Garrote, Pablo Miranda-Rodriguez, Alessia Ciacco, Itziar Cabanes, Aitziber Mancisidor
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 2 pages, 1 figure, paper accepted for presentation at the IEEE International Conference on Quantum Computing and Engineering (QCE)

Abstract:This work explores the application of hybrid quantum-classical algorithms to optimize robotic inspection trajectories derived from Computer-Aided Design (CAD) models in industrial settings. By modeling the task as a 3D variant of the Traveling Salesman Problem, incorporating incomplete graphs and open-route constraints, this study evaluates the performance of two D-Wave-based solvers against classical methods such as GUROBI and Google OR-Tools. Results across five real-world cases demonstrate competitive solution quality with significantly reduced computation times, highlighting the potential of quantum approaches in automation under Industry 4.0.

[AI-30] Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

[Quick Read]: This paper addresses how to exploit the RISC-V Vector Extension (RVV) efficiently for AI workloads without expert knowledge. Traditional approaches such as compiler autovectorization or hand-crafted libraries like muRISCV-NN have performance limits, and existing autotuning frameworks lack RVV integration, heavily limiting the efficient deployment of complex AI workloads. The key to the solution is integrating the RVV extension into the TVM compiler's MetaSchedule framework, a probabilistic program framework for tensor operation tuning, to map AI workloads onto RISC-V vector units. Across several RISC-V SoCs implemented on an FPGA, the approach yields mean execution-latency improvements of 46% over GCC's autovectorization and 29% over muRISCV-NN, while producing binaries with a smaller code memory footprint that are better suited to embedded devices.

Link: https://arxiv.org/abs/2507.01457
Authors: Federico Nicolas Peccia, Frederik Haxel, Oliver Bringmann
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 9 pages, 10 figures, 2 algorithms

Abstract:RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI workloads. But writing software that efficiently utilizes the vector units of RISC-V CPUs without expert knowledge requires the programmer to rely on the autovectorization features of compilers or hand-crafted libraries like muRISCV-NN. Smarter approaches, like autotuning frameworks, have been missing the integration with the RISC-V RVV extension, thus heavily limiting the efficient deployment of complex AI workloads. In this paper, we present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Instead of relying on hand-crafted libraries, we integrated the RVV extension into TVM’s MetaSchedule framework, a probabilistic program framework for tensor operation tuning. We implemented different RISC-V SoCs on an FPGA and tuned a wide range of AI workloads on them. We found that our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC, and 29% against muRISCV-NN. Moreover, the binary resulting from our proposal has a smaller code memory footprint, making it more suitable for embedded devices. Finally, we also evaluated our solution on a commercially available RISC-V SoC implementing the RVV 1.0 Vector Extension and found our solution is able to find mappings that are 35% faster on average than the ones proposed by LLVM. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.

[AI-31] Using multi-agent architecture to mitigate the risk of LLM hallucinations

[Quick Read]: This paper addresses the hallucination risk that comes with using generative AI to improve customer service quality and response time. The key to the solution is a multi-agent system that integrates LLM-based agents with fuzzy logic to mitigate hallucination risks while handling customer requests sent via SMS.

Link: https://arxiv.org/abs/2507.01446
Authors: Abd Elrahman Amer, Magdi Amer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Improving customer service quality and response time are critical factors for maintaining customer loyalty and increasing a company’s market share. While adopting emerging technologies such as Large Language Models (LLMs) is becoming a necessity to achieve these goals, the risk of hallucination remains a major challenge. In this paper, we present a multi-agent system to handle customer requests sent via SMS. This system integrates LLM based agents with fuzzy logic to mitigate hallucination risks.

[AI-32] EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

[Quick Read]: This paper addresses the challenges of serving fine-tuned large language models (LLMs) efficiently on multi-tenant edge devices, including the complexity of selecting adapters for different tasks, the memory overhead of frequent adapter swapping, and the underutilized compute and increased latency of processing multiple requests sequentially. The key to the solution is EdgeLoRA, a system with three core innovations: an adaptive adapter selection mechanism that streamlines adapter configuration, heterogeneous memory management with intelligent adapter caching and pooling to mitigate memory operation overhead, and batch LoRA inference that enables efficient batch processing and significantly reduces computational latency.

Link: https://arxiv.org/abs/2507.01438
Authors: Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., this http URL) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA’s potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
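
The batch-LoRA-inference idea can be sketched directly from the LoRA algebra: requests in one batch may use different adapters, so each row applies its own low-rank update y = xWᵀ + (xAᵀ)Bᵀ instead of the system swapping adapters sequentially. Shapes and the `adapters` container below are illustrative assumptions, not EdgeLoRA's API.

```python
import torch

def batched_lora_forward(x, W, adapters, ids):
    """x: (batch, d_in); W: (d_out, d_in) shared base weights;
    adapters: dict id -> (A, B) with A: (r, d_in), B: (d_out, r);
    ids: per-request adapter id, one per batch row."""
    base = x @ W.T                                    # shared base projection
    A = torch.stack([adapters[i][0] for i in ids])    # (batch, r, d_in)
    B = torch.stack([adapters[i][1] for i in ids])    # (batch, d_out, r)
    low_rank = torch.einsum("bd,brd->br", x, A)       # per-row x A^T
    delta = torch.einsum("br,bor->bo", low_rank, B)   # per-row B (A x)
    return base + delta
```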

[AI-33] Hardware-software co-exploration with racetrack memory based in-memory computing for CNN inference in embedded systems

[Quick Read]: This paper addresses how to accelerate convolutional neural network (CNN) inference efficiently with racetrack-memory-based in-memory computing in resource-constrained embedded systems. The key to the solution is designing a series of fundamental arithmetic circuits as in-memory computing cells suited for multiply-and-accumulate operations, and co-designing the racetrack-memory-based system and the CNN model architecture to improve energy efficiency and performance while keeping the memory bank area small and maintaining model accuracy.

Link: https://arxiv.org/abs/2507.01429
Authors: Benjamin Chen Ming Choong, Tao Luo, Cheng Liu, Bingsheng He, Wei Zhang, Joey Tianyi Zhou
Affiliations: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Deep neural networks generate and process large volumes of data, posing challenges for low-resource embedded systems. In-memory computing has been demonstrated as an efficient computing infrastructure and shows promise for embedded AI applications. Among newly-researched memory technologies, racetrack memory is a non-volatile technology that allows high data density fabrication, making it a good fit for in-memory computing. However, integrating in-memory arithmetic circuits with memory cells affects both the memory density and power efficiency. It remains challenging to build efficient in-memory arithmetic circuits on racetrack memory within area and energy constraints. To this end, we present an efficient in-memory convolutional neural network (CNN) accelerator optimized for use with racetrack memory. We design a series of fundamental arithmetic circuits as in-memory computing cells suited for multiply-and-accumulate operations. Moreover, we explore the design space of racetrack memory based systems and CNN model architectures, employing co-design to improve the efficiency and performance of performing CNN inference in racetrack memory while maintaining model accuracy. Our designed circuits and model-system co-optimization strategies achieve a small memory bank area with significant improvements in energy and performance for racetrack memory based embedded systems.

[AI-34] Penalizing Transparency? How AI Disclosure and Author Demographics Shape Human and AI Judgments About Writing

[Quick Read]: This paper investigates the unfair effects that transparency requirements around AI-assisted writing can have across author identities (such as race and gender): how AI disclosure statements affect perceptions of writing quality, and whether those effects vary with the author's demographics. The key to the solution is a large-scale controlled experiment in which both human raters (n = 1,970) and LLM raters (n = 2,520) evaluate the same human-written news article while disclosure statements and author demographics are systematically varied, revealing the complex relationship between AI disclosure and author identity and the disparities between machine and human evaluation patterns.

Link: https://arxiv.org/abs/2507.01418
Authors: Inyoung Cheong, Alicia Guo, Mina Lee, Zhehui Liao, Kowe Kadoma, Dongyoung Go, Joseph Chee Chang, Peter Henderson, Mor Naaman, Amy X. Zhang
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Presented at CHIWORK 2025 Workshop on Generative AI Disclosure, Ownership, and Accountability in Co-Creative Domains

Abstract:As AI integrates in various types of human writing, calls for transparency around AI assistance are growing. However, if transparency operates on uneven ground and certain identity groups bear a heavier cost for being honest, then the burden of openness becomes asymmetrical. This study investigates how AI disclosure statement affects perceptions of writing quality, and whether these effects vary by the author’s race and gender. Through a large-scale controlled experiment, both human raters (n = 1,970) and LLM raters (n = 2,520) evaluated a single human-written news article while disclosure statements and author demographics were systematically varied. This approach reflects how both human and algorithmic decisions now influence access to opportunities (e.g., hiring, promotion) and social recognition (e.g., content recommendation algorithms). We find that both human and LLM raters consistently penalize disclosed AI use. However, only LLM raters exhibit demographic interaction effects: they favor articles attributed to women or Black authors when no disclosure is present. But these advantages disappear when AI assistance is revealed. These findings illuminate the complex relationships between AI disclosure and author identity, highlighting disparities between machine and human evaluation patterns.

[AI-35] Evaluating LLM Agent Collusion in Double Auctions

[Quick Read]: This paper examines the undesirable behaviors that large language models (LLMs) acting as autonomous agents may exhibit in socioeconomic interactions, in particular their potential for collusion, defined as secretive cooperation that harms another party. The key to the solution is simulating seller behavior in continuous double auction markets and systematically studying the factors that affect the stability and emergence of seller collusion, including the ability to communicate, the choice of model, and environmental pressures such as oversight and urgency from authority figures.

Link: https://arxiv.org/abs/2507.01413
Authors: Kushal Agrawal, Verona Teo, Juan J. Vazquez, Sudarsh Kunnavakkam, Vishak Srikanth, Andy Liu
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated impressive capabilities as autonomous agents with rapidly expanding applications in various domains. As these agents increasingly engage in socioeconomic interactions, identifying their potential for undesirable behavior becomes essential. In this work, we examine scenarios where they can choose to collude, defined as secretive cooperation that harms another party. To systematically study this, we investigate the behavior of LLM agents acting as sellers in simulated continuous double auction markets. Through a series of controlled experiments, we analyze how parameters such as the ability to communicate, choice of model, and presence of environmental pressures affect the stability and emergence of seller collusion. We find that direct seller communication increases collusive tendencies, the propensity to collude varies across models, and environmental pressures, such as oversight and urgency from authority figures, influence collusive behavior. Our findings highlight important economic and ethical considerations for the deployment of LLM-based market agents.

[AI-36] A Fuzzy Approach to the Specification Verification and Validation of Risk-Based Ethical Decision Making Models

[Quick Read]: This paper addresses the ontological and epistemic complexities of the moral domain that make it difficult to establish clear standards for evaluating a moral machine's performance. The key to the solution is a formal method for specifying ethical decision-making models based on ethical risk assessment, together with the use of fuzzy Petri nets to verify and validate models defined as fuzzy rules, illustrated with a case study from the medical field.

Link: https://arxiv.org/abs/2507.01410
Authors: Abeer Dyoub, Francesca A. Lisi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The ontological and epistemic complexities inherent in the moral domain make it challenging to establish clear standards for evaluating the performance of a moral machine. In this paper, we present a formal method to describe Ethical Decision Making models based on ethical risk assessment. Then, we show how these models that are specified as fuzzy rules can be verified and validated using fuzzy Petri nets. A case study from the medical field is considered to illustrate the proposed approach.
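
To illustrate what "models specified as fuzzy rules" means in practice, here is a minimal Mamdani-style evaluation of one invented ethical-risk rule ("IF severity is high AND likelihood is high THEN risk is high"); the membership functions and the rule itself are illustrative, not from the paper.

```python
def fuzzy_risk(severity, likelihood):
    """Degree to which 'risk is high' fires for inputs in [0, 1].
    AND is the usual min t-norm; 'high' is an illustrative ramp."""
    high = lambda x: max(0.0, min(1.0, (x - 0.5) / 0.4))  # membership of 'high'
    return min(high(severity), high(likelihood))           # rule firing strength
```

Verification with fuzzy Petri nets then amounts to checking that the token flow induced by such rules reaches the intended conclusions for all admissible inputs.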

[AI-37] Distributional Soft Actor-Critic with Diffusion Policy ITSC2025

[Quick Read]: This paper addresses the estimation bias in value functions caused by modeling value distributions with unimodal distributions (such as Gaussians) in traditional reinforcement learning, as well as the challenge of obtaining multimodal policy representations. The key to the solution is introducing policy entropy and a value distribution function to establish a multimodal distributional policy iteration framework that converges to the optimal policy, constructing a diffusion value network that accurately characterizes multi-peaked distributions by generating reward samples through reverse sampling with a diffusion model, and deriving from this a distributional RL algorithm, DSAC-D, with dual diffusion of the value network and the policy network.

Link: https://arxiv.org/abs/2507.01381
Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted IEEE ITSC 2025

Abstract:Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Diffusion Policy) to address the challenges of estimating bias in value functions and obtaining multimodal policy representations. A multimodal distributional policy iteration framework that can converge to the optimal policy was established by introducing policy entropy and value distribution function. A diffusion value network that can accurately characterize the distribution of multi peaks was constructed by generating a set of reward samples through reverse sampling using a diffusion model. Based on this, a distributional reinforcement learning algorithm with dual diffusion of the value network and the policy network was derived. MuJoCo testing tasks demonstrate that the proposed algorithm not only learns multimodal policy, but also achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10% compared to existing mainstream algorithms. The results of real vehicle testing show that DSAC-D can accurately characterize the multimodal distribution of different driving styles, and the diffusion policy network can characterize multimodal trajectories.

[AI-38] RALLY: Role-Adaptive LLM -Driven Yoked Navigation for Agent ic UAV Swarms

[Quick Read]: This paper addresses shortcomings of intelligent UAV swarm control in task coverage, path planning, and dynamic adaptability: traditional multi-agent reinforcement learning (MARL) suffers from the semantic gap of numerical communication and rigid homogeneous role structures, leading to poor generalization and limited task scalability, while LLM-based control frameworks lack online learning and over-rely on static priors, resulting in inefficient exploration. The key to the solution is RALLY, a Role-Adaptive LLM-Driven Yoked navigation algorithm that builds an LLM-driven semantic decision framework using structured natural language for efficient semantic communication and collaborative reasoning, introduces a dynamic role-heterogeneity mechanism for adaptive role switching and personalized decision-making, and employs a Role-value Mixing Network (RMIX)-based assignment strategy that integrates offline LLM priors with online MARL policies to enable semi-offline training of role-selection strategies.

Link: https://arxiv.org/abs/2507.01378
Authors: Ziyao Wang, Rongpeng Li, Sizhao Li, Yuming Xiang, Haiping Wang, Zhifeng Zhao, Honggang Zhang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Intelligent control of Unmanned Aerial Vehicles (UAVs) swarms has emerged as a critical research focus, and it typically requires the swarm to navigate effectively while avoiding obstacles and achieving continuous coverage over multiple mission targets. Although traditional Multi-Agent Reinforcement Learning (MARL) approaches offer dynamic adaptability, they are hindered by the semantic gap in numerical communication and the rigidity of homogeneous role structures, resulting in poor generalization and limited task scalability. Recent advances in Large Language Model (LLM)-based control frameworks demonstrate strong semantic reasoning capabilities by leveraging extensive prior knowledge. However, due to the lack of online learning and over-reliance on static priors, these works often struggle with effective exploration, leading to reduced individual potential and overall system performance. To address these limitations, we propose a Role-Adaptive LLM-Driven Yoked navigation algorithm RALLY. Specifically, we first develop an LLM-driven semantic decision framework that uses structured natural language for efficient semantic communication and collaborative reasoning. Afterward, we introduce a dynamic role-heterogeneity mechanism for adaptive role switching and personalized decision-making. Furthermore, we propose a Role-value Mixing Network (RMIX)-based assignment strategy that integrates LLM offline priors with MARL online policies to enable semi-offline training of role selection strategies. Experiments in the Multi-Agent Particle Environment (MPE) environment and a Software-In-The-Loop (SITL) platform demonstrate that RALLY outperforms conventional approaches in terms of task coverage, convergence speed, and generalization, highlighting its strong potential for collaborative navigation in agentic multi-UAV systems.

[AI-39] AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing

[Quick Read]: This paper addresses the unclear definitions, capability boundaries, and practical applications of generative AI and its derived paradigms in smart manufacturing. The key to the solution is a systematic review of the evolution of AI and AI-agent technologies, an analysis of the core concepts and technical advances of LLM-based AI agents, MLLM-based AI agents, and Agentic AI, and an exploration of their potential applications in and integration into manufacturing, together with the challenges they may face.

Link: https://arxiv.org/abs/2507.01376
Authors: Yinwang Ren, Yangyang Liu, Tang Ji, Xun Xu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to JMS (March 2025)

Abstract:AI agents are autonomous systems designed to perceive, reason, and act within dynamic environments. With the rapid advancements in generative AI (GenAI), large language models (LLMs) and multimodal large language models (MLLMs) have significantly improved AI agents’ capabilities in semantic comprehension, complex reasoning, and autonomous decision-making. At the same time, the rise of Agentic AI highlights adaptability and goal-directed autonomy in dynamic and complex environments. LLMs-based AI Agents (LLM-Agents), MLLMs-based AI Agents (MLLM-Agents), and Agentic AI contribute to expanding AI’s capabilities in information processing, environmental perception, and autonomous decision-making, opening new avenues for smart manufacturing. However, the definitions, capability boundaries, and practical applications of these emerging AI paradigms in smart manufacturing remain unclear. To address this gap, this study systematically reviews the evolution of AI and AI agent technologies, examines the core concepts and technological advancements of LLM-Agents, MLLM-Agents, and Agentic AI, and explores their potential applications in and integration into manufacturing, along with the potential challenges they may face.

[AI-40] User-guided Generative Source Separation

[Quick Read]: This paper addresses the limited flexibility of existing music source separation (MSS) methods in practical use, particularly the constraint of the conventional four-stem setup (vocals, bass, drums, and other instruments). The key to the solution is GuideSep, a diffusion-based MSS model conditioned on multiple inputs: a waveform mimicry condition, which can easily be provided by humming or playing the target melody, and mel-spectrogram-domain masks that offer additional guidance, enabling instrument-agnostic separation and making the separation process more flexible and applicable.

Link: https://arxiv.org/abs/2507.01339
Authors: Yutong Wen, Minje Kim, Paris Smaragdis
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Music source separation (MSS) aims to extract individual instrument sources from their mixture. While most existing methods focus on the widely adopted four-stem separation setup (vocals, bass, drums, and other instruments), this approach lacks the flexibility needed for real-world applications. To address this, we propose GuideSep, a diffusion-based MSS model capable of instrument-agnostic separation beyond the four-stem setup. GuideSep is conditioned on multiple inputs: a waveform mimicry condition, which can be easily provided by humming or playing the target melody, and mel-spectrogram domain masks, which offer additional guidance for separation. Unlike prior approaches that relied on fixed class labels or sound queries, our conditioning scheme, coupled with the generative approach, provides greater flexibility and applicability. Additionally, we design a mask-prediction baseline using the same model architecture to systematically compare predictive and generative approaches. Our objective and subjective evaluations demonstrate that GuideSep achieves high-quality separation while enabling more versatile instrument extraction, highlighting the potential of user participation in the diffusion-based generative process for MSS. Our code and demo page are available at this https URL

[AI-41] Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy EMNLP

[Quick Read]: This paper addresses the difficulty of detecting abnormal events in real-world customer service center dialogues, which is complicated by the complexity of business data and the dynamic nature of customer interactions; in addition, models need strong out-of-domain (OOD) generalization to adapt quickly to different business scenarios and maximize commercial value. The key to the proposed Adaptive Perplexity-Aware Reinforcement Learning (APARL) framework, which builds on the reasoning capabilities of large language models, is a dual-loop dynamic curriculum learning architecture that lets the model progressively focus on more challenging samples as its proficiency grows, effectively overcoming performance bottlenecks and significantly improving OOD transferability.

Link: https://arxiv.org/abs/2507.01327
Authors: Xiaoyun Zhang, Jingqing Ruan, Xing Ma, Yawen Zhu, Jiansong Chen, Ke Zeng, Xunliang Cai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, submitted to EMNLP

Abstract:Detecting abnormal events in real-world customer service dialogues is highly challenging due to the complexity of business data and the dynamic nature of customer interactions. Moreover, models must demonstrate strong out-of-domain (OOD) generalization to enable rapid adaptation across different business scenarios and maximize commercial value. In this work, we propose a novel Adaptive Perplexity-Aware Reinforcement Learning (APARL) framework that leverages the advanced reasoning capabilities of large language models for abnormal event detection. APARL introduces a dual-loop dynamic curriculum learning architecture, enabling the model to progressively focus on more challenging samples as its proficiency increases. This design effectively addresses performance bottlenecks and significantly enhances OOD transferability. Extensive evaluations on food delivery dialogue tasks show that our model achieves significantly enhanced adaptability and robustness, attaining the highest F1 score with an average improvement of 17.19%, and an average improvement of 9.59% in OOD transfer tests. This method provides a superior solution for industrial deployment of anomaly detection models, contributing to improved operational efficiency and commercial benefits.
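
The perplexity-aware curriculum idea can be sketched as a sampler whose focus slides from easy to hard examples over training; the schedule, weighting kernel, and `ppl_fn` below are illustrative assumptions, not APARL's actual components.

```python
import numpy as np

def curriculum_sample(samples, ppl_fn, epoch, total_epochs, batch=32):
    """Sample a batch whose difficulty tracks training progress.
    ppl_fn scores each sample's perplexity under the current model."""
    ppl = np.array([ppl_fn(s) for s in samples])
    hardness = (ppl - ppl.min()) / (ppl.ptp() + 1e-8)   # 0 = easy, 1 = hard
    focus = epoch / max(1, total_epochs - 1)            # curriculum progress in [0, 1]
    weights = np.exp(-(hardness - focus) ** 2 / 0.1)    # sliding Gaussian window
    weights /= weights.sum()
    idx = np.random.choice(len(samples), size=batch, p=weights)
    return [samples[i] for i in idx]
```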

[AI-42] ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks ICML2025

[Quick Read]: This paper addresses the backdoor-attack vulnerability that large language models (LLMs) face during in-context learning (ICL): adversaries can manipulate model behavior by poisoning only a few ICL demonstrations. Its core insight is the dual-learning hypothesis, which posits that LLMs simultaneously learn both task-relevant latent concepts and backdoor latent concepts from poisoned demonstrations, jointly influencing the probability of model outputs. The key to the solution is ICLShield, a defense built on this hypothesis that dynamically adjusts the concept preference ratio, using confidence and similarity scores to encourage the model to select clean demonstrations during the ICL phase and thereby effectively mitigating backdoor attacks.

Link: https://arxiv.org/abs/2507.01321
Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: ICML 2025

Abstract:In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4).
zh
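To make the defense idea concrete, here is a minimal sketch of the demonstration-filtering step the abstract describes: rank candidate ICL demonstrations by a combination of confidence and similarity scores and keep the top-scoring ones as likely clean. The helper names (confidence_fn, embed_fn) and the convex weighting are illustrative assumptions, not the paper's actual API or scoring formula.

```python
# Hypothetical sketch of ICLShield-style demonstration filtering (assumed
# interface, not the paper's code): score each candidate demonstration by model
# confidence and by similarity to the query, then keep the top-k as likely clean.
import numpy as np

def select_clean_demonstrations(demos, query, confidence_fn, embed_fn, k=4, alpha=0.5):
    """Rank demonstrations by a convex combination of confidence and similarity."""
    q = embed_fn(query)
    scores = []
    for d in demos:
        conf = confidence_fn(d)  # e.g., LM likelihood of the demo's label
        e = embed_fn(d)
        sim = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9))
        scores.append(alpha * conf + (1 - alpha) * sim)
    top = np.argsort(scores)[::-1][:k]  # highest combined score first
    return [demos[i] for i in top]
```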

[AI-43] Neural Hamiltonian Operator

【Quick Read】: This paper tackles high-dimensional stochastic control problems, which are hard to solve with traditional dynamic programming because of the curse of dimensionality. The key to the solution is a framework called the Neural Hamiltonian Operator (NHO), which uses neural networks to parameterize the dynamics of the coupled forward-backward stochastic differential equations (FBSDEs) prescribed by Pontryagin's Maximum Principle (PMP), modeling the optimal feedback control and the spatial gradient of the value function, and finds the optimal NHO by training the networks to enforce the consistency conditions dictated by the PMP.

Link: https://arxiv.org/abs/2507.01313
Authors: Qian Qi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Stochastic control problems in high dimensions are notoriously difficult to solve due to the curse of dimensionality. An alternative to traditional dynamic programming is Pontryagin’s Maximum Principle (PMP), which recasts the problem as a system of Forward-Backward Stochastic Differential Equations (FBSDEs). In this paper, we introduce a formal framework for solving such problems with deep learning by defining a Neural Hamiltonian Operator (NHO). This operator parameterizes the coupled FBSDE dynamics via neural networks that represent the feedback control and an ansatz for the value function’s spatial gradient. We show how the optimal NHO can be found by training the underlying networks to enforce the consistency conditions dictated by the PMP. By adopting this operator-theoretic view, we situate the deep FBSDE method within the rigorous language of statistical inference, framing it as a problem of learning an unknown operator from simulated data. This perspective allows us to prove the universal approximation capabilities of NHOs under general martingale drivers and provides a clear lens for analyzing the significant optimization challenges inherent to this class of models.
zh
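For readers unfamiliar with the PMP formulation the abstract builds on, the standard stochastic Pontryagin system that an NHO would parameterize has the following textbook form (the paper's exact notation and assumptions may differ): X is the state, (p, q) the adjoint pair, u the control, and H the Hamiltonian; the networks represent the feedback control u and an ansatz for the value function's spatial gradient.

```latex
% Textbook stochastic PMP system; the NHO trains networks for the feedback
% control u_t and the value-gradient ansatz p_t to enforce these conditions.
\begin{aligned}
dX_t &= b(X_t, u_t)\,dt + \sigma(X_t, u_t)\,dW_t, \qquad X_0 = x,\\
dp_t &= -\nabla_x H(X_t, u_t, p_t, q_t)\,dt + q_t\,dW_t, \qquad p_T = \nabla g(X_T),\\
u_t &\in \arg\max_u H(X_t, u, p_t, q_t).
\end{aligned}
```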

[AI-44] Beyond Black-Box AI: Interpretable Hybrid Systems for Dementia Care

【Quick Read】: This paper examines the limits of current generative AI in clinical practice, particularly in dementia diagnosis and care. Although large language models (LLMs) excel on benchmarks, they have not yet delivered measurable improvements in real clinical settings. The paper argues that the key lies in overcoming the shortcomings of the data-driven paradigm, including opaque black-box outputs, vulnerability to hallucination, and weak causal reasoning. Combining statistical learning with expert rule-based knowledge, and involving clinicians throughout the process, improves interpretability and fits better with existing clinical workflows. Future research should prioritize neuro-symbolic or hybrid AI that pairs the linguistic ability of LLMs with human causal expertise to improve the explanatory coherence of predictions, while measuring success by clinician understanding, workflow fit, and improved patient outcomes.

Link: https://arxiv.org/abs/2507.01282
Authors: Matthew JY Kang,Wenli Yang,Monica R Roberts,Byeong Ho Kang,Charles B Malpas
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The recent boom of large language models (LLMs) has re-ignited the hope that artificial intelligence (AI) systems could aid medical diagnosis. Yet despite dazzling benchmark scores, LLM assistants have yet to deliver measurable improvements at the bedside. This scoping review aims to highlight the areas where AI is limited to make practical contributions in the clinical setting, specifically in dementia diagnosis and care. Standalone machine-learning models excel at pattern recognition but seldom provide actionable, interpretable guidance, eroding clinician trust. Adjacent use of LLMs by physicians did not result in better diagnostic accuracy or speed. Key limitations trace to the data-driven paradigm: black-box outputs which lack transparency, vulnerability to hallucinations, and weak causal reasoning. Hybrid approaches that combine statistical learning with expert rule-based knowledge, and involve clinicians throughout the process help bring back interpretability. They also fit better with existing clinical workflows, as seen in examples like PEIRS and ATHENA-CDS. Future decision-support should prioritise explanatory coherence by linking predictions to clinically meaningful causes. This can be done through neuro-symbolic or hybrid AI that combines the language ability of LLMs with human causal expertise. AI researchers have addressed this direction, with explainable AI and neuro-symbolic AI being the next logical steps in further advancement in AI. However, they are still based on data-driven knowledge integration instead of human-in-the-loop approaches. Future research should measure success not only by accuracy but by improvements in clinician understanding, workflow fit, and patient outcomes. A better understanding of what helps improve human-computer interactions is greatly needed for AI systems to become part of clinical practice.
zh

[AI-45] AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance

【Quick Read】: This paper addresses the problems of subjectivity, hard-to-quantify key features, and cognitive limitations in traditional simulator-based maritime training, where trainers subjectively assess technical skills, behavioral focus, communication, and body language. The key to the solution is an AI-driven framework that objectively assesses trainee performance through visual attention tracking, speech recognition, and stress detection, improving readiness for high-risk scenarios. The system integrates eye tracking, pupil dilation analysis, computer vision, a maritime-specific speech-to-text model, natural language processing, large language models, and vocal pitch analysis, achieving high accuracy for visual detection (~92%), maritime speech recognition (~91%), and stress detection (~90%), surpassing existing benchmarks.

Link: https://arxiv.org/abs/2507.01274
Authors: Vishakha Lall,Yisi Liu
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted and Presented at 11th International Maritime Science Conference

Click to view abstract

Abstract:Traditional simulator-based training for maritime professionals is critical for ensuring safety at sea but often depends on subjective trainer assessments of technical skills, behavioral focus, communication, and body language, posing challenges such as subjectivity, difficulty in measuring key features, and cognitive limitations. Addressing these issues, this study develops an AI-driven framework to enhance maritime training by objectively assessing trainee performance through visual focus tracking, speech recognition, and stress detection, improving readiness for high-risk scenarios. The system integrates AI techniques, including visual focus determination using eye tracking, pupil dilation analysis, and computer vision; communication analysis through a maritime-specific speech-to-text model and natural language processing; communication correctness using large language models; and mental stress detection via vocal pitch. Models were evaluated on data from simulated maritime scenarios with seafarers exposed to controlled high-stress events. The AI algorithms achieved high accuracy, with ~92% for visual detection, ~91% for maritime speech recognition, and ~90% for stress detection, surpassing existing benchmarks. The system provides insights into visual attention, adherence to communication checklists, and stress levels under demanding conditions. This study demonstrates how AI can transform maritime training by delivering objective performance analytics, enabling personalized feedback, and improving preparedness for real-world operational challenges.
zh

[AI-46] PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning

【Quick Read】: This paper addresses the lack of a practical evaluation framework for unlearning in large multimodal models (LMMs): existing benchmarks only consider scenarios where fine-tuned knowledge is forgotten in a single unlearning operation, ignoring pre-trained knowledge unlearning and long-term sustainability. The key to the solution is the PULSE protocol, which introduces two critical perspectives: Pre-trained Knowledge Unlearning, which analyzes effects across different knowledge acquisition phases, and Long-term Sustainability Evaluation, which handles sequential unlearning requests, yielding an evaluation regime that better matches real-world use.

Link: https://arxiv.org/abs/2507.01271
Authors: Tatsuki Kawakami,Kazuki Egashira,Atsuyuki Miyai,Go Irie,Kiyoharu Aizawa
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In recent years, unlearning techniques, which are methods for inducing a model to “forget” previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, the existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce the PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
zh

[AI-47] LLM -based Realistic Safety-Critical Driving Video Generation

【Quick Read】: This paper aims to make the design of diverse, safety-critical driving scenarios for evaluating autonomous driving systems more efficient, with a focus on generating realistic, complex, and rare but critical edge cases. The key to the solution is using large language models (LLMs) for few-shot code generation to automatically synthesize driving scenarios in the CARLA simulator, giving precise control over the behavior and placement of traffic participants, with particular attention to collision events. In addition, a video generation pipeline based on Cosmos-Transfer1 with ControlNet converts rendered scenes into realistic driving videos, further improving the realism and controllability of scenario generation.

Link: https://arxiv.org/abs/2507.01264
Authors: Yongjie Fu,Ruijian Zha,Pei Tian,Xuan Di
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Designing diverse and safety-critical driving scenarios is essential for evaluating autonomous driving systems. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) for few-shot code generation to automatically synthesize driving scenarios within the CARLA simulator, which has flexibility in scenario scripting, efficient code-based control of traffic participants, and enforcement of realistic physical dynamics. Given a few example prompts and code samples, the LLM generates safety-critical scenario scripts that specify the behavior and placement of traffic participants, with a particular focus on collision events. To bridge the gap between simulation and real-world appearance, we integrate a video generation pipeline using Cosmos-Transfer1 with ControlNet, which converts rendered scenes into realistic driving videos. Our approach enables controllable scenario generation and facilitates the creation of rare but critical edge cases, such as pedestrian crossings under occlusion or sudden vehicle cut-ins. Experimental results demonstrate the effectiveness of our method in generating a wide range of realistic, diverse, and safety-critical scenarios, offering a promising tool for simulation-based testing of autonomous vehicles.
zh

[AI-48] Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW

【Quick Read】: This paper aims to overcome the limited effectiveness of traditional stochastic gradient descent (SGD) in training large language models (LLMs), especially the performance bottlenecks observed at large scale. The key to the solution is a stochastic conjugate subgradient method combined with adaptive sampling: sample complexity analysis is used to choose the sample size dynamically, stochastic conjugate subgradients determine the search directions, and an AdamW-like algorithm adaptively adjusts the step sizes, preserving the advantages of first-order methods while handling the nonconvexity and nonsmoothness inherent in LLM training.

Link: https://arxiv.org/abs/2507.01241
Authors: Di Zhang,Yihang Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Stochastic gradient-based descent (SGD) methods have long been central to training large language models (LLMs). However, their effectiveness is increasingly being questioned, particularly in large-scale applications where empirical evidence suggests potential performance limitations. In response, this paper proposes a stochastic conjugate subgradient method together with adaptive sampling tailored specifically for training LLMs. The method not only achieves faster convergence per iteration but also demonstrates improved scalability compared to traditional SGD techniques. It leverages sample complexity analysis to adaptively choose the sample size, employs a stochastic conjugate subgradient approach to determine search directions, and utilizes an AdamW-like algorithm to adaptively adjust step sizes. This approach preserves the key advantages of first-order methods while effectively addressing the nonconvexity and non-smoothness inherent in LLM training. Additionally, we provide a detailed analysis of the advantages of the algorithm. Experimental results show that the proposed method not only maintains, but in many cases surpasses, the scalability of traditional SGD techniques, significantly enhancing both the speed and accuracy of the optimization process.
zh
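As a hedged illustration of the ingredients the abstract lists, the sketch below combines a Polak-Ribiere-style conjugate direction with an AdamW-style per-coordinate step and decoupled weight decay. This is a generic construction under those assumptions, not the paper's algorithm, and the adaptive sample-size selection is omitted.

```python
# Illustrative combination of a conjugate subgradient direction with an
# AdamW-style adaptive step (not the paper's exact update rule).
import numpy as np

def conjugate_adamw_step(w, grad, state, lr=1e-3, wd=0.01, beta2=0.999, eps=1e-8):
    # state holds the previous gradient g, previous direction d, and second
    # moment v; initialize all three to zero arrays before the first call.
    g_prev, d_prev, v = state["g"], state["d"], state["v"]
    # Polak-Ribiere conjugacy coefficient, clipped at 0 for stability.
    beta = max(0.0, float(np.dot(grad, grad - g_prev) / (np.dot(g_prev, g_prev) + eps)))
    d = -grad + beta * d_prev                 # conjugate search direction
    v = beta2 * v + (1 - beta2) * grad**2     # AdamW-style second moment
    step = lr * d / (np.sqrt(v) + eps)        # per-coordinate scaling
    w = w + step - lr * wd * w                # decoupled weight decay
    state.update(g=grad, d=d, v=v)
    return w, state
```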

[AI-49] Rethinking the Illusion of Thinking

【Quick Read】: This paper addresses the controversy over whether large reasoning models (LRMs) possess genuine reasoning ability. The key is replicating and refining the two most contentious benchmarks of the original study, Towers of Hanoi and River Crossing, while introducing incremental stepwise prompting and agentic collaborative dialogue to assess LRMs more accurately. The study finds that previously reported failures were not caused solely by output constraints but also partly by cognitive limitations, and that the "catastrophic failures" on River Crossing actually hinged on unsolvable configurations: once tests are restricted to solvable problems, LRMs easily solve large instances involving over 100 agent pairs.

Link: https://arxiv.org/abs/2507.01231
Authors: Iñaki Dellibarda Varela,Pablo Romero-Sorozabal,Eduardo Rocon,Manuel Cebrian
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Click to view abstract

Abstract:Earlier this year, Apple ignited controversy by publishing “The Illusion of Thinking,” prompting heated debate within the AI community. Critics seized upon the findings as conclusive evidence that Large Reasoning Models (LRMs) lack genuine reasoning capabilities, branding them as mere stochastic parrots. Meanwhile, defenders, spearheaded by Lawsen et al. (2025), fired back, condemning the experimental setup as flawed and the conclusions overstated. We clarify this debate by replicating and refining two of the original study’s most contentious benchmarks: Towers of Hanoi and River Crossing. By introducing incremental stepwise prompting and agentic collaborative dialogue, we show that previously reported failures solving the Towers of Hanoi were not purely a result of output constraints, but also partly a result of cognition limitations: LRMs still stumble when complexity rises moderately (around 8 disks). Moreover, the River Crossing results initially heralded as catastrophic failures turn out to hinge upon testing unsolvable configurations. Once we limit tests strictly to solvable problems, LRMs effortlessly solve large instances involving over 100 agent pairs. Our findings ultimately defy simplistic narratives: today’s LRMs are stochastic, RL-tuned searchers in a discrete state space we barely understand. Real progress in symbolic, long-horizon reasoning demands mapping that terrain through fine-grained ablations like those introduced here.
zh

[AI-50] Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration

【Quick Read】: This paper addresses capacity planning and job scheduling for on-prem grid computing environments within a hybrid cloud and on-prem setup, where the core challenge is handling uncertainty in both resource usage and job duration. The key is simultaneously balancing two conflicting objectives: (a) minimizing resource usage, and (b) providing high quality-of-service by completing jobs before user-specified deadlines. The paper proposes approximate approaches based on pair sampling-based constraint programming, which achieve much lower peak resource usage than manual scheduling without compromising quality-of-service.

Link: https://arxiv.org/abs/2507.01225
Authors: Sunandita Patra,Mehtab Pathan,Mahmoud Mahfouz,Parisa Zehtabi,Wided Ouaja,Daniele Magazzeni,Manuela Veloso
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Please cite as: Sunandita Patra, Mehtab Pathan, Mahmoud Mahfouz, Parisa Zehtabi, Wided Ouaja, Daniele Magazzeni, and Manuela Veloso. “Capacity planning and scheduling for jobs with uncertainty in resource usage and duration.” The Journal of Supercomputing 80, no. 15 (2024): 22428-22461

Click to view abstract

Abstract:Organizations around the world schedule jobs (programs) regularly to perform various tasks dictated by their end users. With the major movement towards using a cloud computing infrastructure, our organization follows a hybrid approach with both cloud and on-prem servers. The objective of this work is to perform capacity planning, i.e., estimate resource requirements, and job scheduling for on-prem grid computing environments. A key contribution of our approach is handling uncertainty in both resource usage and duration of the jobs, a critical aspect in the finance industry where stochastic market conditions significantly influence job characteristics. For capacity planning and scheduling, we simultaneously balance two conflicting objectives: (a) minimize resource usage, and (b) provide high quality-of-service to the end users by completing jobs by their requested deadlines. We propose approximate approaches using deterministic estimators and pair sampling-based constraint programming. Our best approach (pair sampling-based) achieves much lower peak resource usage compared to manual scheduling without compromising on the quality-of-service.
zh

[AI-51] Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives

【Quick Read】: This paper aims to solve efficient motion planning for robotic manipulators in complex environments, particularly the insufficient path-search efficiency for manipulators with many degrees of freedom (DoF). The key is the use of "burs" of free configuration space as adaptive motion primitives within the graph search: because burs expand adaptively in free C-space, they explore the configuration space more efficiently than fixed-size motion primitives, significantly reducing the time and number of expansions needed to find a valid path.

Link: https://arxiv.org/abs/2507.01198
Authors: Benjamin Kraljusic,Zlatan Ajanovic,Nermin Covic,Bakir Lacevic
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments: 6 pages, 3 figures, submitted to a conference

Click to view abstract

Abstract:This work proposes a motion planning algorithm for robotic manipulators that combines sampling-based and search-based planning methods. The core contribution of the proposed approach is the usage of burs of free configuration space (C-space) as adaptive motion primitives within the graph search algorithm. Due to their feature to adaptively expand in free C-space, burs enable more efficient exploration of the configuration space compared to fixed-sized motion primitives, significantly reducing the time to find a valid path and the number of required expansions. The algorithm is implemented within the existing SMPL (Search-Based Motion Planning Library) library and evaluated through a series of different scenarios involving manipulators with varying number of degrees-of-freedom (DoF) and environment complexity. Results demonstrate that the bur-based approach outperforms fixed-primitive planning in complex scenarios, particularly for high DoF manipulators, while achieving comparable performance in simpler scenarios.
zh

[AI-52] Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-tuning

【Quick Read】: This paper investigates the efficiency and applicability of large brainwave foundation models (LBMs) in brain-computer interface (BCI) tasks. The study finds that, despite requiring vastly more parameters, state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures, raising questions about their efficiency and practicality for BCI. The key lies in detailed ablation studies and Low-Rank Adaptation (LoRA), which substantially reduce trainable parameters without degrading performance, while revealing that architectural and training inefficiencies limit current LBM capabilities. The work pioneers applying LoRA to LBMs and shows that adapting multiple neural network components simultaneously yields the best gains, underscoring the need for domain-specific strategies when optimizing foundation models.

Link: https://arxiv.org/abs/2507.01196
Authors: Na Lee,Konstantinos Barmpas,Yannis Panagakis,Dimitrios Adamos,Nikolaos Laskaris,Stefanos Zafeiriou
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs’ current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
zh
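Since LoRA is the parameter-efficient technique the abstract credits with cutting trainable parameters, here is the standard LoRA reparameterization (Hu et al., 2021) sketched in PyTorch for reference; exactly which components of an LBM the paper wraps this way is only summarized, not specified, in the abstract.

```python
# Standard LoRA adapter around a frozen linear layer: the pretrained weight W
# is kept fixed and a trainable low-rank update (B A) is added to its output.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus trainable low-rank update: W x + scale * (B A) x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```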

[AI-53] AI-guided digital intervention with physiological monitoring reduces intrusive memories after experimental trauma

【Quick Read】: This paper addresses the problem of intervening on intrusive memories after psychological trauma with a scalable digital treatment. Evidence-based digital treatments traditionally require human guidance, which provides tailored instructions and responsiveness to internal cognitive states but limits scalability. The key to the solution is combining generative AI with pupillometry in the ANTIDOTE system, which automatically delivers and monitors an evidence-based digital intervention, the Imagery Competing Task Intervention (ICTI), enabling an efficient and scalable intervention on post-trauma memories.

Link: https://arxiv.org/abs/2507.01081
Authors: Megan T. deBettencourt,Sruthi Sakthivel,Emily A. Holmes,Mark Chevillet
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Trauma prevalence is vast globally. Evidence-based digital treatments can help, but most require human guidance. Human guides provide tailored instructions and responsiveness to internal cognitive states, but limit scalability. Can generative AI and neurotechnology provide a scalable alternative? Here we test ANTIDOTE, combining AI guidance and pupillometry to automatically deliver and monitor an evidence-based digital treatment, specifically the Imagery Competing Task Intervention (ICTI), to reduce intrusive memories after psychological trauma. One hundred healthy volunteers were exposed to videos of traumatic events and randomly assigned to an intervention or active control condition. As predicted, intervention participants reported significantly fewer intrusive memories over the following week. Post-hoc assessment against clinical rubrics confirmed the AI guide delivered the intervention successfully. Additionally, pupil size tracked intervention engagement and predicted symptom reduction, providing a candidate biomarker of intervention effectiveness. These findings open a path toward rigorous AI-guided digital interventions that can scale to trauma prevalence.
zh

[AI-54] Empirical Analysis Of Heuristic and Approximation Algorithms for the Mutual-Visibility Problem

【Quick Read】: This paper addresses the lack of empirical analysis of the practical behavior of the NP-complete mutual-visibility (MV) problem; despite existing theoretical studies, its real-world performance had not been examined empirically. The key is implementing and evaluating three distinct algorithms, a direct greedy heuristic, a hypergraph-based approximation, and a genetic algorithm, on diverse synthetic graph datasets to explore their performance across graph sizes. Results show that for smaller graphs the algorithms consistently achieve MV set sizes matching theoretical bounds, whereas for larger instances the solution sizes diverge notably from theoretical limits.

Link: https://arxiv.org/abs/2507.01076
Authors: Vanja Stojanović,Bor Pangeršič
Institutions: Unknown
Subjects: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Performance (cs.PF); Combinatorics (math.CO)
Comments:

Click to view abstract

Abstract:The NP-complete mutual-visibility (MV) problem currently lacks empirical analysis on its practical behaviour despite theoretical studies. This paper addresses this gap by implementing and evaluating three distinct algorithms - a direct greedy heuristic, a hypergraph-based approximation, and a genetic algorithm - on diverse synthetic graph datasets, including those with analytically known μ(G) values and general graph models. Our results demonstrate that for smaller graphs, the algorithms consistently achieve MV set sizes aligning with theoretical bounds. However, for larger instances, achieved solution sizes notably diverge from theoretical limits; this, combined with the absence of tight bounds, complicates absolute quality assessment. Nevertheless, validation on known optimal graphs showed the Genetic Algorithm and other heuristics empirically performing best among tested methods.
zh

[AI-55] Evaluation of a Foundational Model and Stochastic Models for Forecasting Sporadic or Spiky Production Outages of High-Performance Machine Learning Services

【Quick Read】: This paper addresses forecasting rare, spiky production outages of high-performance machine learning services, events that traditional models struggle to capture because of their extremity and rarity. The key is optimizing a state-of-the-art foundational model to improve its forecasting of sporadic or spiky events, and comparing it against classical stochastic forecasting models (such as moving average and autoregressive models) to identify the key patterns in the target data that the foundational model tracks well, ultimately estimating the year-long outage statistics of a particular root cause with less than 6% value error.

Link: https://arxiv.org/abs/2507.01067
Authors: Keun Soo Yim
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Time series forecasting models have diverse real-world applications (e.g., from electricity metrics to software workload). The latest foundational models trained for time series forecasting show strengths (e.g., for long sequences and in zero-shot settings). However, foundational models had not yet been used for forecasting rare, spiky events, a challenging target because such events are a corner case of extreme events. In this paper, we optimize a state-of-the-art foundational model to forecast sporadic or spiky production outages of high-performance machine learning services powering billions of client devices. We evaluate the forecasting errors of the foundational model compared with classical stochastic forecasting models (e.g., moving average and autoregressive). The analysis helps us understand how each of the evaluated models performs for the sporadic or spiky events. For example, it identifies the key patterns in the target data that are well tracked by the foundational model vs. each of the stochastic models. We use the models with optimal parameters to estimate the year-long outage statistics of a particular root cause with less than 6% value errors.
zh

[AI-56] FAIR-MATCH: A Multi-Objective Framework for Bias Mitigation in Reciprocal Dating Recommendations

【Quick Read】: This paper addresses algorithmic deficiencies in recommendation systems for online dating applications, including popularity bias, filter-bubble effects, and inadequate reciprocity modeling, which limit effectiveness and introduce harmful biases. The key to the solution is a mathematical framework that uses enhanced similarity measures, multi-objective optimization, and fairness-aware algorithms to improve demographic representation and reduce algorithmic bias while maintaining competitive accuracy.

Link: https://arxiv.org/abs/2507.01063
Authors: Madhav Kotecha
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Online dating platforms have fundamentally transformed the formation of romantic relationships, with millions of users worldwide relying on algorithmic matching systems to find compatible partners. However, current recommendation systems in dating applications suffer from significant algorithmic deficiencies, including but not limited to popularity bias, filter bubble effects, and inadequate reciprocity modeling that limit effectiveness and introduce harmful biases. This research integrates foundational work with recent empirical findings to deliver a detailed analysis of dating app recommendation systems, highlighting key issues and suggesting research-backed solutions. Through analysis of reciprocal recommendation frameworks, fairness evaluation metrics, and industry implementations, we demonstrate that current systems achieve modest performance with collaborative filtering reaching 25.1% while reciprocal methods achieve 28.7%. Our proposed mathematical framework addresses these limitations through enhanced similarity measures, multi-objective optimization, and fairness-aware algorithms that maintain competitive accuracy while improving demographic representation to reduce algorithmic bias.
zh
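As background for the reciprocity modeling the abstract says current systems handle inadequately, a common baseline scores a candidate pair by the harmonic mean of the two directed preference probabilities, which is high only when interest is mutual. The paper's multi-objective, fairness-aware formulation goes well beyond this sketch.

```python
# Illustrative reciprocal scoring via harmonic mean, a standard choice in
# reciprocal recommenders; treat this as background, not the paper's objective.
def reciprocal_score(p_a_likes_b: float, p_b_likes_a: float) -> float:
    """Harmonic mean of directed preferences: high only when interest is mutual."""
    if p_a_likes_b <= 0 or p_b_likes_a <= 0:
        return 0.0
    return 2 * p_a_likes_b * p_b_likes_a / (p_a_likes_b + p_b_likes_a)
```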

[AI-57] Quantifying Student Success with Generative AI: A Monte Carlo Simulation Informed by Systematic Review

【Quick Read】: This paper examines the use of generative artificial intelligence (GenAI) in higher education, in particular students' perceptions, patterns of use, and the implications for learning outcomes. The key is a hybrid methodology combining a systematic literature review with simulation-based modeling: thematic categorization distills patterns from the literature, and probabilistic modeling with Monte Carlo simulation analyzes the relationship between student perceptions and learning achievement, producing a composite "Success Score" that forecasts the strength of that relationship.

Link: https://arxiv.org/abs/2507.01062
Authors: Seyma Yaman Kayadibi
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 35 pages, 4 figures. All figures are image-based: one Python code screenshot, one regression model output, one success score distribution chart, and one PRISMA diagram. This article presents a standalone segment from the author’s master’s thesis at Victoria University

Click to view abstract

Abstract:The exponential development of generative artificial intelligence (GenAI) technologies like ChatGPT has raised increasing curiosity about their use in higher education, specifically with respect to how students view them, make use of them, and the implications for learning outcomes. This paper employs a hybrid methodological approach involving a systematic literature review and simulation-based modeling to explore student perceptions of GenAI use in the context of higher education. A total of nineteen empirical articles from 2023 through 2025 were selected from the PRISMA-based search targeting the Scopus database. Synthesis of emerging patterns from the literature was achieved by thematic categorization. Six of these had enough quantitative information, i.e., item-level means and standard deviations, to permit probabilistic modeling. One dataset, from the resulting subset, was itself selected as a representative case with which to illustrate inverse-variance weighting by Monte Carlo simulation, by virtue of its well-designed Likert scale format and thematic alignment with the use of computing systems by the researcher. The simulation provided a composite “Success Score” forecasting the strength of the relationship between student perceptions and learning achievements. Findings reveal that attitude factors concerned with usability and real-world usefulness are significantly better predictors of positive learning achievement than affective or trust-based factors. Such an interdisciplinary perspective provides a unique means of linking thematic results with predictive modelling, resonating with longstanding controversies about the proper use of GenAI tools within the university.
zh

[AI-58] Epitome: Pioneering an Experimental Platform for AI-Social Science Integration

【Quick Read】: This paper addresses how to effectively integrate large language models (LLMs) into social science research so as to deepen understanding of human-AI interaction and its societal impacts. The key to the solution is Epitome, an open experimental platform that builds a theory-support system through cross-disciplinary experiments and embeds the classical "control-comparison-causal logic" of social science research into multilevel human-computer interaction environments such as dialogues, group chats, and multi-agent virtual scenarios, enabling end-to-end experimental design and execution from foundation models through complex application development to user feedback.

Link: https://arxiv.org/abs/2507.01061
Authors: Jingjing Qu,Kejia Hu,Jun Zhu,Wenhao Li,Teng Wang,Zhiyun Chen,Yulei Ye,Chaochao Lu,Aimin Zhou,Xiangfeng Wang,James Evan
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 5 figures

Click to view abstract

Abstract:The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world’s first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication studies, sociology, psychology, and ethics, Epitome focuses on the interactive impacts of AI on individuals, organizations, and society during its real-world deployment. It constructs a theoretical support system through cross-disciplinary experiments. The platform offers a one-stop comprehensive experimental solution spanning “foundation models-complex application development-user feedback” through seven core modules, while embedding the classical “control-comparison-comparative causal logic” of social science experiments into multilevel human-computer interaction environments, including dialogues, group chats, and multi-agent virtual scenarios. With its canvas-style, user-friendly interface, Epitome enables researchers to easily design and run complex experimental scenarios, facilitating systematic investigations into the social impacts of AI and exploration of integrated approaches. To demonstrate its capabilities, we replicated three seminal social science experiments involving LLMs, showcasing Epitome’s potential to streamline complex experimental designs and produce robust results, suitable for publishing in the top selective journals. Our findings highlight the platform’s utility in enhancing the efficiency and quality of human-AI interactions, providing valuable insights into the societal implications of AI technologies. Epitome thus offers a powerful tool for advancing interdisciplinary research at the intersection of AI and social science, with potential applications in policy-making, …
zh

[AI-59] A Data Science Approach to Calcutta High Court Judgments: An Efficient LLM and RAG -powered Framework for Summarization and Similar Cases Retrieval

【Quick Read】: This paper aims to meet the judiciary's need for efficient use of judicial resources as legal caseloads grow, focusing on improving the efficiency of analyzing Calcutta High Court judgments. The key to the solution is a data-science framework that leverages large language models (LLMs) and Retrieval-Augmented Generation (RAG) to enable efficient summarization of legal texts and intelligent retrieval of similar cases. By fine-tuning the Pegasus model and adopting a two-step summarization technique, the framework preserves crucial legal context, builds a comprehensive vector database, and improves the efficiency of legal research and the accuracy of information retrieval.

Link: https://arxiv.org/abs/2507.01058
Authors: Puspendu Banerjee,Aritra Mazumdar,Wazib Ansar,Saptarsi Goswami,Amlan Chakrabarti
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Click to view abstract

Abstract:The judiciary, as one of democracy’s three pillars, is dealing with a rising amount of legal issues, needing careful use of judicial resources. This research presents a complex framework that leverages Data Science methodologies, notably Large Language Models (LLM) and Retrieval-Augmented Generation (RAG) techniques, to improve the efficiency of analyzing Calcutta High Court verdicts. Our framework focuses on two key aspects: first, the creation of a robust summarization mechanism that distills complex legal texts into concise and coherent summaries; and second, the development of an intelligent system for retrieving similar cases, which will assist legal professionals in research and decision making. By fine-tuning the Pegasus model using case head note summaries, we achieve significant improvements in the summarization of legal cases. Our two-step summarizing technique preserves crucial legal contexts, allowing for the production of a comprehensive vector database for RAG. The RAG-powered framework efficiently retrieves similar cases in response to user queries, offering thorough overviews and summaries. This technique not only improves legal research efficiency, but it also helps legal professionals and students easily acquire and grasp key legal information, benefiting the overall legal scenario.
zh
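A minimal sketch of a two-step summarization pass with Pegasus through the Hugging Face transformers API, illustrating the pipeline shape described above: chunk-level summaries are themselves summarized. The checkpoint name and chunking scheme are placeholders; the paper's fine-tuning on case head-note summaries is not reproduced here.

```python
# Two-step summarization sketch with a Pegasus checkpoint (placeholder name);
# the paper fine-tunes Pegasus on head notes, which this sketch does not do.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

name = "google/pegasus-cnn_dailymail"  # stand-in checkpoint
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name)

def summarize(text: str, max_len: int = 128) -> str:
    batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    ids = model.generate(**batch, max_length=max_len, num_beams=4)
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

def two_step_summary(judgment: str, chunk_chars: int = 4000) -> str:
    # Step 1: summarize fixed-size chunks of the long judgment.
    chunks = [judgment[i:i + chunk_chars] for i in range(0, len(judgment), chunk_chars)]
    partial = " ".join(summarize(c) for c in chunks)
    # Step 2: summarize the concatenated partial summaries.
    return summarize(partial, max_len=256)
```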

[AI-60] XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science

【Quick Read】: This paper addresses a practical limitation of structure-based models in materials discovery: crystal-graph models cannot be applied when atomic structures are unknown or hard to obtain. The key to the solution is a scalable multimodal framework that learns directly from elemental composition and X-ray diffraction (XRD) data without requiring crystal-structure input. Its core components are modality-specific encoders combined with a cross-attention fusion module, together with masked XRD modeling (MXM) and contrastive alignment as self-supervised pretraining strategies, enabling faster convergence and higher-quality representation learning.

Link: https://arxiv.org/abs/2507.01054
Authors: Jithendaraa Subramanian,Linda Hung,Daniel Schweigert,Santosh Suram,Weike Ye
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Click to view abstract

Abstract:Recent advances in materials discovery have been driven by structure-based models, particularly those using crystal graphs. While effective for computational datasets, these models are impractical for real-world applications where atomic structures are often unknown or difficult to obtain. We propose a scalable multimodal framework that learns directly from elemental composition and X-ray diffraction (XRD) – two of the more available modalities in experimental workflows without requiring crystal structure input. Our architecture integrates modality-specific encoders with a cross-attention fusion module and is trained on the 5-million-sample Alexandria dataset. We present masked XRD modeling (MXM), and apply MXM and contrastive alignment as self-supervised pretraining strategies. Pretraining yields faster convergence (up to 4.2x speedup) and improves both accuracy and representation quality. We further demonstrate that multimodal performance scales more favorably with dataset size than unimodal baselines, with gains compounding at larger data regimes. Our results establish a path toward structure-free, experimentally grounded foundation models for materials science.
zh
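As background for the contrastive alignment pretraining the abstract names, below is a generic symmetric InfoNCE-style loss between composition and XRD embeddings; the paper's actual loss, temperature, and batching are not given in the abstract, so this is only an illustration of the technique class.

```python
# Generic symmetric InfoNCE contrastive alignment between two modality
# embeddings (composition vs. XRD); hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def contrastive_alignment(z_comp, z_xrd, tau=0.07):
    z1 = F.normalize(z_comp, dim=-1)
    z2 = F.normalize(z_xrd, dim=-1)
    logits = z1 @ z2.T / tau                        # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Each composition should match its own XRD pattern, and vice versa.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```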

[AI-61] Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

【Quick Read】: This paper addresses the barriers to using clinical datasets such as the Medical Information Mart for Intensive Care IV (MIMIC-IV), which demand sophisticated querying skills and an understanding of the underlying clinical settings. The key to the solution is the M3 system, which, via the Model Context Protocol (MCP), lets researchers interact with the database in natural language: clinical questions are automatically translated into SQL queries, executed, and returned as structured results, substantially lowering the technical barrier to accessing and analyzing clinical data.

Link: https://arxiv.org/abs/2507.01053
Authors: Rafi Al Attrach,Pedro Moreira,Rajna Fani,Renato Umeton,Leo Anthony Celi
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 10 pages, 4 figures

Click to view abstract

Abstract:As ever-larger clinical datasets become available, they have the potential to unlock unprecedented opportunities for medical research. Foremost among them is the Medical Information Mart for Intensive Care (MIMIC-IV), the world’s largest open-source EHR database. However, the inherent complexity of these datasets, particularly the need for sophisticated querying skills and the need to understand the underlying clinical settings, often presents a significant barrier to their effective use. M3 lowers the technical barrier to understanding and querying MIMIC-IV data. With a single command it retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance (or hooks into the hosted BigQuery), and, via the Model Context Protocol (MCP), lets researchers converse with the database in plain English. Ask a clinical question in natural language; M3 uses a language model to translate it into SQL, executes the query against the MIMIC-IV dataset, and returns structured results alongside the underlying query for verifiability and reproducibility. Demonstrations show that minutes of dialogue with M3 yield the kind of nuanced cohort analyses that once demanded hours of handcrafted SQL and relied on understanding the complexities of clinical workflows. By simplifying access, M3 invites the broader research community to mine clinical critical-care data and accelerates the translation of raw records into actionable insight.
zh
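Conceptually, the loop the abstract describes reduces to the sketch below: a language model turns the clinical question into SQL, the SQL runs against a local MIMIC-IV SQLite file, and both the query and the rows are returned for verifiability. The llm_to_sql function is a placeholder for the model call that M3 routes through MCP.

```python
# Conceptual text-to-SQL loop over a local SQLite copy of MIMIC-IV; the
# database path and llm_to_sql are placeholders, not M3's actual interface.
import sqlite3

def llm_to_sql(question: str) -> str:
    # Placeholder: in M3 a language model performs this translation via MCP.
    raise NotImplementedError

def ask(question: str, db_path: str = "mimic4.db"):
    sql = llm_to_sql(question)
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    # Returning the SQL alongside the rows supports verifiability and reproducibility.
    return {"sql": sql, "rows": rows}
```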

[AI-62] Long-Sequence Memory with Temporal Kernels and Dense Hopfield Functionals

【速读】:该论文试图解决传统Transformer架构在处理长序列任务时面临的长期依赖建模和上下文容量限制问题。其解决方案的关键在于引入一种新的能量泛函,基于密集Hopfield网络框架,通过高阶相互作用实现指数级的存储容量,并结合时间核K(m, k)以捕捉时间依赖性,从而实现对长序列中模式的高效顺序检索。该方法在电影帧的存储与顺序检索中得到了验证,展示了其在提升Transformer模型长序列建模能力方面的潜力。

Link: https://arxiv.org/abs/2507.01052
Authors: Ahmed Farooq
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:In this study we introduce a novel energy functional for long-sequence memory, building upon the framework of dense Hopfield networks, which achieve exponential storage capacity through higher-order interactions. Building upon earlier work on long-sequence Hopfield memory models, we propose a temporal kernel K(m, k) to incorporate temporal dependencies, enabling efficient sequential retrieval of patterns over extended sequences. We demonstrate the successful application of this technique for the storage and sequential retrieval of movie frames, which are well suited for this because the high-dimensional vectors that make up each frame create enough variation even between sequential frames in the high-dimensional space. The technique has applications in modern transformer architectures, including efficient long-sequence modeling, memory augmentation, improved attention with temporal bias, and enhanced handling of long-term dependencies in time-series data. Our model offers a promising approach to address the limitations of transformers in long-context tasks, with potential implications for natural language processing, forecasting, and beyond.
zh
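The abstract does not spell the functional out, but one plausible reading, consistent with dense Hopfield energies of the form E(x) = -Σ F(ξᵀx) and the stated temporal kernel, is the following; F is a rapidly growing separation function (a high-order polynomial or exponential, which is what yields exponential capacity in dense networks), and K(m, k) biases retrieval from sequence position m toward the next stored pattern.

```latex
% Hypothetical form assembled from the abstract, not the paper's equation:
% stored patterns xi_k, current state x, current sequence position m.
E_m(x) \;=\; -\sum_{k} K(m, k)\, F\!\left(\xi_k^{\top} x\right)
```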

[AI-63] Can AI be Consentful?

【Quick Read】: This paper examines the inadequacy of traditional legal and ethical frameworks when confronted with generative AI systems, particularly the applicability of the notion of "consent" in data protection and privacy rights. It argues that while individuals can consent to the use of their data for AI training, they cannot meaningfully consent to the numerous potential outputs their data may enable, or to the extent of their use and distribution, creating what the authors term a "consent gap". The key lies in re-examining and updating existing legal frameworks to better address individual autonomy, identity rights, and social responsibility, and in connecting consent mechanisms with Responsible AI principles such as fairness, transparency, accountability, and autonomy to drive adaptive ethical and legal change.

Link: https://arxiv.org/abs/2507.01051
Authors: Giada Pistilli,Bruna Trevelin
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The evolution of generative AI systems exposes the challenges of traditional legal and ethical frameworks built around consent. This chapter examines how the conventional notion of consent, while fundamental to data protection and privacy rights, proves insufficient in addressing the implications of AI-generated content derived from personal data. Through legal and ethical analysis, we show that while individuals can consent to the initial use of their data for AI training, they cannot meaningfully consent to the numerous potential outputs their data might enable or the extent to which the output is used or distributed. We identify three fundamental challenges: the scope problem, the temporality problem, and the autonomy trap, which collectively create what we term a “consent gap” in AI systems and their surrounding ecosystem. We argue that current legal frameworks inadequately address these emerging challenges, particularly regarding individual autonomy, identity rights, and social responsibility, especially in cases where AI-generated content creates new forms of personal representation beyond the scope of the original consent. By examining how these consent limitations intersect with broader principles of responsible AI (including fairness, transparency, accountability, and autonomy) we demonstrate the need to evolve ethical and legal approaches to consent.
zh

[AI-64] Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals

【Quick Read】: This paper aims to overcome the limited generalizability and robustness of conventional deep learning methods for analyzing cardiac biosignals such as electrocardiograms (ECG) and photoplethysmograms (PPG) across clinical settings and acquisition protocols. The key to the solution is the Cardiac Sensing Foundation Model (CSFM), which uses an advanced transformer architecture and a generative, masked pretraining strategy to learn unified representations from large-scale, heterogeneous health records. Pretrained on a multimodal dataset covering roughly 1.7 million individuals, CSFM extracts effective features across diverse cardiac sensing scenarios and enables seamless transfer learning across input configurations and sensor modalities.

Link: https://arxiv.org/abs/2507.01045
Authors: Xiao Gu,Wei Tang,Jinpei Han,Veer Sangha,Fenglin Liu,Shreyank N Gowda,Antonio H. Ribeiro,Patrick Schwab,Kim Branson,Lei Clifton,Antonio Luiz P. Ribeiro,Zhangdaihong Liu,David A. Clifton
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robustness and generalizability across diverse clinical settings and acquisition protocols. In this study, we present a cardiac sensing foundation model (CSFM) that leverages advanced transformer architectures and a generative, masked pretraining strategy to learn unified representations from vast, heterogeneous health records. Our model is pretrained on an innovative multi-modal integration of data from multiple large-scale datasets (including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE), comprising cardiac signals and the corresponding clinical or machine-generated text reports from approximately 1.7 million individuals. We demonstrate that the embeddings derived from our CSFM not only serve as effective feature extractors across diverse cardiac sensing scenarios, but also enable seamless transfer learning across varying input configurations and sensor modalities. Extensive evaluations across diagnostic tasks, demographic information recognition, vital sign measurement, clinical outcome prediction, and ECG question answering reveal that CSFM consistently outperforms traditional one-modal-one-task approaches. Notably, CSFM exhibits robust performance across multiple ECG lead configurations from standard 12-lead systems to single-lead setups, and in scenarios where only ECG, only PPG, or a combination thereof is available. These findings highlight the potential of CSFM as a versatile and scalable solution, for comprehensive cardiac monitoring.
zh

[AI-65] Data Classification with Dynamically Growing and Shrinking Neural Networks

【Quick Read】: This paper addresses data-driven construction of neural network models, optimizing not only the weights during training but also searching for the optimal model architecture. The key to the solution is a Monte Carlo tree search based decision mechanism that simulates network behavior and compares several candidate architectural changes, dynamically growing and shrinking the model while it is being trained. The approach performs particularly well on multivariate time series classification, owing largely to its dynamic adaptability, which allows independent structural adjustments for each time series.

Link: https://arxiv.org/abs/2507.01043
Authors: Szymon Świderski,Agnieszka Jastrzębska
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper submitted to Journal of Computational Science

Click to view abstract

Abstract:The issue of data-driven neural network model construction is one of the core problems in the domain of Artificial Intelligence. A standard approach assumes a fixed architecture with trainable weights. A conceptually more advanced assumption is that we not only train the weights, but also find out the optimal model architecture. We present a new method that realizes just that. This article is an extended version of our conference paper titled “Dynamic Growing and Shrinking of Neural Networks with Monte Carlo Tree Search [26]”. In the paper, we show in detail how to create a neural network with a procedure that allows dynamic shrinking and growing of the model while it is being trained. The decision-making mechanism for the architectural design is governed by a Monte Carlo tree search procedure which simulates network behavior and allows to compare several candidate architecture changes to choose the best one. The proposed method was validated using both visual and time series datasets, demonstrating its particular effectiveness in multivariate time series classification. This is attributed to the architecture’s ability to adapt dynamically, allowing independent modifications for each time series. The approach is supplemented by Python source code for reproducibility. Experimental evaluations in visual pattern and multivariate time series classification tasks revealed highly promising performance, underscoring the method’s robustness and adaptability.
zh

[AI-66] Fast AI Model Splitting over Edge Networks

【Quick Read】: This paper addresses how to split an AI model efficiently in distributed artificial intelligence (AI) training so as to reduce device-side computational load. The key is to represent an arbitrary AI model as a directed acyclic graph (DAG) and to recast the optimal model splitting problem as a minimum s-t cut search problem, so that the optimal split can be identified quickly via a maximum-flow method. For AI models with block structures, a block-wise splitting algorithm further reduces computational complexity by abstracting each block into a single vertex.

Link: https://arxiv.org/abs/2507.01041
Authors: Zuguang Li,Wen Wu,Shaohua Wu,Songge Zhang,Ye Wang,Xuemin(Sherman)Shen
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 14 figures

Click to view abstract

Abstract:Split learning (SL) has emerged as a computationally efficient approach for artificial intelligence (AI) model training, which can alleviate device-side computational workloads. However, complex AI model architectures pose high computational complexity to obtain the optimal model splitting. In this paper, we represent an arbitrary AI model as a directed acyclic graph (DAG), and then reformulate the optimal model splitting problem as a minimum s-t cut search problem. To solve the problem, we propose a fast DAG-based model splitting algorithm, which restructures the DAG to enable the optimal model splitting identification via a maximum flow method. Theoretical analysis indicates that the proposed algorithm is optimal. Furthermore, considering AI models with block structures, we propose a block-wise model splitting algorithm to reduce computational complexity. The algorithm abstracts each block, i.e., a component consisting of multiple layers, into a single vertex, thereby obtaining the optimal model splitting via a simplified DAG. Extensive experimental results demonstrate that the proposed algorithms can determine the optimal model splitting within milliseconds, as well as reduce training delay by 24.62%-38.95% in dynamic edge networks as compared to the state-of-the-art benchmarks.
zh
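A toy illustration of the reduction the abstract describes, using networkx: edges carry a capacity standing in for the cost of splitting at that point (e.g., activation transfer size), and the minimum s-t cut returns the cheapest device/server partition. The tiny chain graph and capacities are invented for illustration; the paper's DAG restructuring step is omitted.

```python
# Casting model splitting as a minimum s-t cut on a toy three-edge chain.
import networkx as nx

G = nx.DiGraph()
# s -> layer1 -> layer2 -> t, with per-edge "cut cost" as capacity.
G.add_edge("s", "layer1", capacity=8.0)
G.add_edge("layer1", "layer2", capacity=2.0)   # cheapest place to split
G.add_edge("layer2", "t", capacity=5.0)

cut_value, (device_side, server_side) = nx.minimum_cut(G, "s", "t")
print(cut_value)       # 2.0: optimal split between layer1 and layer2
print(device_side)     # {'s', 'layer1'} stays on-device
print(server_side)     # {'layer2', 't'} is offloaded
```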

[AI-67] Fast Clifford Neural Layers

【Quick Read】: This paper aims to address the limited computational efficiency of neural networks for partial differential equation (PDE) modeling. The key to the solution is introducing Clifford algebra into the network structure, specifically by optimizing the inference performance of 2D/3D Clifford convolutional layers and multivector activation layers. With optimizations targeting single-core CPU performance, experiments on a real network block show the implementation is 30% faster than the standard PyTorch implementation at relatively large data and network sizes.

Link: https://arxiv.org/abs/2507.01040
Authors: Tianxiang Xia,Max Neuwinger,Lin Xiao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
Comments: 7 pages content-wise

Click to view abstract

Abstract:Clifford Neural Layers improve PDE modeling by introducing Clifford Algebra into neural networks. In this project we focus on optimizing the inference of 2/3D Clifford convolutional layers and multivector activation layers for one core CPU performance. Overall, by testing on a real network block involving Clifford convolutional layers and multivector activation layers, we observe that our implementation is 30% faster than standard PyTorch implementation in relatively large data + network size (L2 cache). We open source our code base at this https URL
zh

[AI-68] On-Policy Optimization of ANFIS Policies Using Proximal Policy Optimization

【Quick Read】: This paper addresses the training of interpretable neuro-fuzzy controllers for reinforcement learning tasks, in particular the instability and slow convergence of value-based methods such as Deep Q-Learning. The key to the solution is Proximal Policy Optimization (PPO), which replaces the off-policy value-based framework with a stable on-policy actor-critic loop, improving the stability and efficiency of training.

Link: https://arxiv.org/abs/2507.01039
Authors: Kaaustaaub Shankar,Wilhelm Louw,Kelly Cohen
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to NAFIPS 2025

Click to view abstract

Abstract:We propose a reinforcement learning (RL) approach for training neuro-fuzzy controllers using Proximal Policy Optimization (PPO). Building on prior work that applied Deep Q-Learning to Adaptive Neuro-Fuzzy Inference Systems (ANFIS), our method replaces the off-policy value-based framework with a stable on-policy actor-critic loop. We evaluate this approach in the CartPole-v1 environment using multiple random seeds and compare its learning performance against ANFIS-Deep Q-Network (DQN) baselines. It was found that PPO-trained fuzzy agents achieved a mean return of 500 +/- 0 on CartPole-v1 after 20000 updates, showcasing less variance than prior DQN-based methods during training and overall faster convergence. These findings suggest that PPO offers a promising pathway for training explainable neuro-fuzzy controllers in reinforcement learning tasks.
zh
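For reference, the clipped surrogate objective at the core of PPO (Schulman et al., 2017) is shown below in PyTorch; in the paper's setup the actor is an ANFIS fuzzy policy whose parameters receive these gradients like any other differentiable network.

```python
# Standard PPO clipped surrogate loss; the actor producing logp_new can be
# any differentiable policy, including a neuro-fuzzy (ANFIS) one.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negate to maximize
```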

[AI-69] Learning to Segment for Vehicle Routing Problems

【Quick Read】: This paper targets the redundant computation in iterative search heuristics for vehicle routing problems (VRPs), especially the inefficiency caused by long subtours in large-scale VRPs. The key to the solution is the First-Segment-Then-Aggregate (FSTA) decomposition technique, which preserves solution segments that remain stable during the search, aggregates the nodes within each segment into fixed hypernodes, and restricts the search to the unstable portions, thereby eliminating unnecessary computation. To realize this, the paper further introduces the Learning-to-Segment (L2Seg) neural framework, which intelligently distinguishes potentially stable from unstable segments to guide the FSTA decomposition.

Link: https://arxiv.org/abs/2507.01037
Authors: Wenbin Ouyang,Sirui Li,Yining Ma,Cathy Wu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Iterative search heuristics are widely recognized as state-of-the-art for solving Vehicle Routing Problems (VRPs). In this work, we identify and exploit a critical observation: within these solvers, a large portion of the solution remains stable, i.e., unchanged across search iterations, causing redundant computations, especially for large-scale VRPs with long subtours. To address this, we pioneer the formal study of the First-Segment-Then-Aggregate (FSTA) decomposition technique to accelerate iterative solvers. Specifically, FSTA preserves stable solution segments during the search, aggregates nodes within each segment into fixed hypernodes, and focuses the search only on unstable portions. Yet, a key challenge lies in identifying which segments should be aggregated by FSTA. To this end, we then introduce Learning-to-Segment (L2Seg), a novel neural framework to intelligently differentiate potentially stable and unstable portions for FSTA decomposition. We present three L2Seg variants: non-autoregressive (globally comprehensive but locally indiscriminate), autoregressive (locally refined but globally deficient), and their synergy, with bespoke training and inference strategies. Empirical results on CVRP and VRPTW suggest that L2Seg accelerates state-of-the-art iterative solvers by up to 7x. Additionally, we provide in-depth analysis showing NAR and AR synergy achieves best performance by combining their complementary strengths. Notably, L2Seg is a flexible framework that is compatible with traditional, learning-based, and hybrid solvers, while supporting a broad class of VRPs.
zh
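A toy sketch of the FSTA idea from the abstract: given a route and per-edge stability predictions (in the paper, supplied by L2Seg), maximal stable runs collapse into hypernodes so the search only has to reconnect a much smaller structure. The flat list representation here is an illustrative simplification.

```python
# Collapse maximal stable runs of a route into hypernode tuples.
def aggregate_route(route, stable):
    """route: list of node ids; stable[i] is True if edge (route[i], route[i+1])
    is predicted stable. Returns nodes and hypernode tuples in route order."""
    out, run = [], [route[0]]
    for i in range(len(route) - 1):
        if stable[i]:
            run.append(route[i + 1])          # extend the current stable segment
        else:
            out.append(tuple(run) if len(run) > 1 else run[0])
            run = [route[i + 1]]              # start a new segment
    out.append(tuple(run) if len(run) > 1 else run[0])
    return out

print(aggregate_route([1, 2, 3, 4, 5], [True, True, False, True]))
# [(1, 2, 3), (4, 5)] -> search only needs to reconnect two hypernodes
```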

[AI-70] Systemic Constraints of Undecidability

【Quick Read】: This paper seeks to move beyond the traditional view in computability theory that undecidability is a local feature of specific functions or problems, redefining it as a structural property of systems. The core of the solution is the notion of causal embedding and the proof of a closure principle: any subsystem that participates functionally in the computation of an undecidable system inherits its undecidability. This result positions undecidability as a pervasive constraint on prediction, modeling, and knowledge acquisition, challenges the idea that computational limits can be circumvented through architectural innovation, and extends the logical trajectory of Gödel, Turing, and Chaitin, offering a new perspective on the topology of computability and its relation to the boundaries of scientific knowledge.

Link: https://arxiv.org/abs/2507.01036
Authors: Seth Bulin
Institutions: Unknown
Subjects: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Logic (math.LO)
Comments: Submitted version; includes appendices with formal definitions and structural embeddings. Prepared in Nature Computational Science format. Keywords: computability theory, undecidability, causal systems, structural closure, recursion theory, Turing machines, hypercomputation, metaundecidability, epistemic limits, consciousness, modeling limits

Click to view abstract

Abstract:This paper presents a theory of systemic undecidability, reframing incomputability as a structural property of systems rather than a localized feature of specific functions or problems. We define a notion of causal embedding and prove a closure principle: any subsystem that participates functionally in the computation of an undecidable system inherits its undecidability. This result positions undecidability as a pervasive constraint on prediction, modeling, and epistemic access in both natural and artificial systems. Our framework disarms oracle mimicry and challenges the view that computational limits can be circumvented through architectural innovation. By generalizing classical results into a dynamic systems context, this work augments the logical trajectory of Gödel, Turing, and Chaitin, offering a new perspective of the topology of computability and its interrelation to the boundaries of scientific knowledge.
zh

[AI-71] Research on Low-Latency Inference and Training Efficiency Optimization for Graph Neural Network and Large Language Model-Based Recommendation Systems

【Quick Read】: This paper aims to meet the demand of online services for fast and efficient recommender systems (ReS), particularly maintaining real-time performance while handling complex user-item interactions. The key to the solution is optimizing the inference latency and training efficiency of hybrid graph neural network (GNN) and large language model (LLM) based recommenders through hardware-software co-design and parameter-efficient tuning. Concretely, architectural optimization strategies such as quantization, LoRA, and knowledge distillation are combined with hardware acceleration via FPGA and DeepSpeed, significantly reducing latency and training time while maintaining high accuracy (NDCG@10: 0.75).

Link: https://arxiv.org/abs/2507.01035
Authors: Yushang Zhao,Haotian Lyu,Yike Peng,Aijia Sun,Feng Jiang,Xinyue Han
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Click to view abstract

Abstract:The incessant advent of online services demands high-speed and efficient recommender systems (ReS) that can maintain real-time performance while processing very complex user-item interactions. The present study therefore considers the computational bottlenecks involved in hybrid Graph Neural Network (GNN) and Large Language Model (LLM)-based ReS, with the aim of optimizing their inference latency and training efficiency. An extensive methodology was used: a hybrid GNN-LLM integrated architecture, optimization strategies (quantization, LoRA, distillation), and hardware acceleration (FPGA, DeepSpeed), all under R 4.4.2. Experimental improvements were significant, with the optimal Hybrid + FPGA + DeepSpeed configuration reaching 13.6% higher accuracy (NDCG@10: 0.75) at 40-60 ms of latency, while LoRA brought training time down by 66% (3.8 hours) in comparison to the non-optimized baseline. Whether the criterion is accuracy or efficiency, hardware-software co-design and parameter-efficient tuning permit hybrid models to outperform GNN or LLM approaches implemented independently, and FPGA together with LoRA is recommended for real-time deployment. Future work should involve federated learning along with advanced fusion architectures for better scalability and privacy preservation. Thus, this research marks the fundamental groundwork for next-generation ReS balancing low-latency response with cutting-edge personalization.
zh
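
摘要中提到的 LoRA 是该工作把训练时间降低 66% 的关键手段之一。下面用 PyTorch 给出 LoRA 旁路的极简示意,说明可训练参数为何大幅减少(类名与超参均为编者假设,非论文官方实现):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层旁挂一个低秩增量 B·A(极简示意,非论文官方实现)。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # 冻结原始权重与偏置
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B 初始为 0,起点等价于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2*8*1024 = 16384,远小于全量微调的 1024*1024
```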

[AI-72] Data-driven Insights for Informed Decision-Making: Applying LSTM Networks for Robust Electricity Forecasting in Libya

【速读】:该论文旨在解决利比亚班加西地区电力负荷、发电量及供需缺口的准确预测问题,以支持电网稳定和能源规划。其关键解决方案是提出了一种数据驱动的方法,利用2019年和2023年的历史数据进行建模,并通过改进的长短期记忆(Long Short-Term Memory, LSTM)框架进行预测,该框架整合了温度、湿度等外生因素,从而有效捕捉非平稳和季节性模式,提升了多指标电力预测的准确性。

链接: https://arxiv.org/abs/2507.01034
作者: Asma Agaal,Mansour Essgaer,Hend M. Farkash,Zulaiha Ali Othman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This article was published in International Journal of Intelligent Systems and Applications (IJISA) (MECS Press), Vol. 17, No. 3, 8 Jun. 2025, DOI: this https URL

点击查看摘要

Abstract:Accurate electricity forecasting is crucial for grid stability and energy planning, especially in Benghazi, Libya, where frequent load shedding, generation deficits, and infrastructure limitations persist. This study proposes a data-driven approach to forecast electricity load, generation, and deficits for 2025 using historical data from 2019 (a year marked by instability) and 2023 (a more stable year). Multiple time series models were applied, including ARIMA, seasonal ARIMA, dynamic regression ARIMA, exponential smoothing, extreme gradient boosting, and Long Short-Term Memory (LSTM) neural networks. The dataset was enhanced through missing value imputation, outlier smoothing, and log transformation. Performance was assessed using mean squared error, root mean squared error, mean absolute error, and mean absolute percentage error. LSTM outperformed all other models, showing strong capabilities in modeling non-stationary and seasonal patterns. A key contribution of this work is an optimized LSTM framework that integrates exogenous factors such as temperature and humidity, offering robust performance in forecasting multiple electricity indicators. These results provide practical insights for policymakers and grid operators to enable proactive load management and resource planning in data-scarce, volatile regions.
zh
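
该框架的核心改进是把温度、湿度等外生变量与历史负荷拼接后一同输入 LSTM。下面是一个 PyTorch 极简示意(特征维度、窗口长度等均为编者假设):

```python
import torch
import torch.nn as nn

class LoadForecaster(nn.Module):
    """输入 [负荷, 温度, 湿度] 三维序列,输出下一步负荷预测(示意)。"""
    def __init__(self, n_features=3, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # 取最后一个时间步做一步预测

model = LoadForecaster()
x = torch.randn(32, 24, 3)            # 32 个样本,24 小时滑动窗口(假设)
print(model(x).shape)                 # torch.Size([32, 1])
```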

[AI-73] An Uncertainty-Aware Dynamic Decision Framework for Progressive Multi-Omics Integration in Classification Tasks

【速读】:该论文旨在解决高通量多组学技术在疾病机制解析和早期诊断中的应用所面临的经济负担问题,特别是对全组学数据的过度依赖导致的资源浪费。其解决方案的关键在于提出一种基于不确定性感知的多视角动态决策框架,通过在单组学层面优化神经网络的激活函数以生成狄利克雷分布参数,并利用主观逻辑量化分类结果的信念质量和不确定性;在多组学层面则采用基于Dempster-Shafer理论的融合策略,结合不同组学模态的互补性以提升诊断准确性和鲁棒性。此外,动态决策机制允许按需逐步引入组学数据,从而在保证诊断精度的同时降低测试成本。

链接: https://arxiv.org/abs/2507.01032
作者: Nan Mu,Hongbo Yang,Chen Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Background and Objective: High-throughput multi-omics technologies have proven invaluable for elucidating disease mechanisms and enabling early diagnosis. However, the high cost of multi-omics profiling imposes a significant economic burden, with over-reliance on full omics data potentially leading to unnecessary resource consumption. To address these issues, we propose an uncertainty-aware, multi-view dynamic decision framework for omics data classification that aims to achieve high diagnostic accuracy while minimizing testing costs. Methodology: At the single-omics level, we refine the activation functions of neural networks to generate Dirichlet distribution parameters, utilizing subjective logic to quantify both the belief masses and uncertainty mass of classification results. Belief mass reflects the support of a specific omics modality for a disease class, while the uncertainty parameter captures limitations in data quality and model discriminability, providing a more trustworthy basis for decision-making. At the multi-omics level, we employ a fusion strategy based on Dempster-Shafer theory to integrate heterogeneous modalities, leveraging their complementarity to boost diagnostic accuracy and robustness. A dynamic decision mechanism is then applied in which omics data are incrementally introduced for each patient until either the model confidence exceeds a predefined threshold or all data sources have been utilized, allowing testing to stop early for confident cases. Results and Conclusion: We evaluate our approach on four benchmark multi-omics datasets, ROSMAP, LGG, BRCA, and KIPAN. In three datasets, over 50% of cases achieved accurate classification using a single omics modality, effectively reducing redundant testing. Meanwhile, our method maintains diagnostic performance comparable to full-omics models and preserves essential biological insights.
zh
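
单组学层面的信念质量与不确定性可按主观逻辑从狄利克雷参数直接算出:证据越充分,不确定性 u 越小;u 低于阈值即可提前停止引入新组学。下面是一个 NumPy 示意(标准主观逻辑公式,非论文官方代码):

```python
import numpy as np

def subjective_logic(evidence):
    """由非负证据向量计算信念质量 b_k 与不确定性 u(主观逻辑,示意)。

    狄利克雷参数 alpha_k = e_k + 1,S = sum(alpha);
    b_k = e_k / S,u = K / S,满足 sum(b) + u = 1。
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S
    uncertainty = K / S
    return belief, uncertainty

b, u = subjective_logic([12.0, 1.5, 0.3])  # 某一组学模态对 3 个类别的证据(假设)
print(b, u)  # 证据充分时 u 小;若置信度足够高,可不再引入下一个组学
```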

[AI-74] HPC-AI Coupling Methodology for Scientific Applications

【速读】:该论文旨在解决高性能计算(HPC)与人工智能(AI)耦合在新兴科学应用中的技术挑战,特别是如何有效结合两者以提升计算效率和解决高计算强度问题。其解决方案的关键在于提出三种耦合模式:代理(surrogate)、指令(directive)和协调(coordinate),每种模式对应不同的耦合策略、AI驱动的前提条件及典型的HPC-AI集成方式,从而为科学发现中的HPC-AI组合提供理论支持和实践指导。

链接: https://arxiv.org/abs/2507.01025
作者: Yutong Lu,Dan Huang,Pin Chen
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 14 pages, 11 figures

点击查看摘要

Abstract:Artificial intelligence (AI) technologies have fundamentally transformed numerical-based high-performance computing (HPC) applications with data-driven approaches and endeavored to address existing challenges, e.g. high computational intensity, in various scientific domains. In this study, we explore the scenarios of coupling HPC and AI (HPC-AI) in the context of emerging scientific applications, presenting a novel methodology that incorporates three patterns of coupling: surrogate, directive, and coordinate. Each pattern exemplifies a distinct coupling strategy, AI-driven prerequisite, and typical HPC-AI ensembles. Through case studies in materials science, we demonstrate the application and effectiveness of these patterns. The study highlights technical challenges, performance improvements, and implementation details, providing insight into promising perspectives of HPC-AI coupling. The proposed coupling patterns are applicable not only to materials science but also to other scientific domains, offering valuable guidance for future HPC-AI ensembles in scientific discovery.
zh

[AI-75] A Systematic Review of Security Vulnerabilities in Smart Home Devices and Mitigation Techniques

【速读】:该论文试图解决智能家庭生态系统中因集成物联网(IoT)设备而面临的日益严重的网络安全风险问题。研究提出的关键解决方案是结合后量子加密与人工智能驱动的异常检测来提升安全性,同时通过区块链认证和零信任架构增强安全韧性。然而,这些方案在计算资源需求、基础设施适应性以及可扩展性方面仍面临挑战,因此需要改进密码技术、增强人工智能的威胁检测能力,并开发能够在性能、效率和实时应用之间取得平衡的自适应安全模型。

链接: https://arxiv.org/abs/2507.01018
作者: Mohammed K. Alzaylaee
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Smart homes that integrate Internet of Things (IoT) devices face increasing cybersecurity risks, posing significant challenges to these environments. The study explores security threats in smart home ecosystems, categorizing them into vulnerabilities at the network layer, device level, and those from cloud-based and AI-driven systems. Research findings indicate that post-quantum encryption, coupled with AI-driven anomaly detection, is highly effective in enhancing security; however, computational resource demands present significant challenges. Blockchain authentication together with zero-trust structures builds security resilience, although they require changes to existing infrastructure. The specific security strategies show their effectiveness through ANOVA, Chi-square tests, and Monte Carlo simulations, yet the results indicate insufficient scalability. The research demonstrates the need for improved cryptographic techniques, AI-enhanced threat detection, and adaptive security models that balance performance, efficiency, and real-time applicability within smart home ecosystems.
zh

[AI-76] SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars

【速读】:该论文旨在解决天体光谱分析中跨仪器数据对齐与参数估计的挑战,通过引入类似大语言模型(LLM)的方法提升光谱特征的表示能力。其解决方案的关键在于构建一个基于对比学习的框架SpecCLIP,该框架利用CLIP(Contrastive Language-Image Pre-training)结构实现不同仪器光谱数据的对齐,并结合辅助解码器以保留光谱特异性信息并实现光谱类型间的翻译,从而提升模型在恒星参数估计和化学丰度测定等任务中的适应性和精度。

链接: https://arxiv.org/abs/2507.01939
作者: Xiaosheng Zhao,Yang Huang,Guirong Xue,Xiao Kong,Jifeng Liu,Xiaoyu Tang,Timothy C. Beers,Yuan-Sen Ting,A-Li Luo
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 6 figures, 5 tables. To be submitted to AAS Journals. Comments welcome

点击查看摘要

Abstract:In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types–LAMOST low-resolution and Gaia XP–followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.
zh
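
CLIP 式对齐的核心是对称的 InfoNCE 损失:同一目标(这里是同一颗恒星)在两种仪器下的光谱嵌入互为正样本。下面是 PyTorch 极简示意,假设两个编码器已把 LAMOST 与 Gaia XP 光谱映射到同一嵌入维度(非论文官方实现):

```python
import torch
import torch.nn.functional as F

def clip_loss(z_lamost, z_xp, temperature=0.07):
    """对称 InfoNCE:按行、列分别做交叉熵(示意,非论文官方实现)。"""
    z1 = F.normalize(z_lamost, dim=-1)
    z2 = F.normalize(z_xp, dim=-1)
    logits = z1 @ z2.T / temperature      # (N, N) 相似度矩阵
    labels = torch.arange(z1.size(0))     # 对角线为匹配的光谱对
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)  # 两种仪器的批量嵌入(假设)
print(clip_loss(z_a, z_b))
```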

[AI-77] End-to-End Large Portfolio Optimization for Variance Minimization with Neural Networks through Covariance Cleaning

【速读】:该论文旨在解决金融投资组合优化中的关键问题,即如何在高维股票市场中构建具有最小方差的全局最优投资组合。传统方法在处理大规模协方差矩阵时面临估计不稳定和过拟合的问题,而该研究提出了一种旋转不变的神经网络架构,通过联合学习历史收益的滞后变换以及对大型股权协方差矩阵的特征值和边缘波动率进行正则化,从而实现对全局最小方差组合的准确估计。该解决方案的关键在于其显式的数学映射结构,不仅提供了模块化的可解释性,还具备跨截面规模的泛化能力,能够在不重新训练的情况下从数百只股票扩展到上千只美国股票,展现出强大的样本外稳定性与性能优势。

链接: https://arxiv.org/abs/2507.01918
作者: Christian Bongiorno,Efstratios Manolakis,Rosario Nunzio Mantegna
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We develop a rotation-invariant neural network that provides the global minimum-variance portfolio by jointly learning how to lag-transform historical returns and how to regularise both the eigenvalues and the marginal volatilities of large equity covariance matrices. This explicit mathematical mapping offers clear interpretability of each module’s role, so the model cannot be regarded as a pure black-box. The architecture mirrors the analytical form of the global minimum-variance solution yet remains agnostic to dimension, so a single model can be calibrated on panels of a few hundred stocks and applied, without retraining, to one thousand US equities-a cross-sectional jump that demonstrates robust out-of-sample generalisation. The loss function is the future realized minimum portfolio variance and is optimized end-to-end on real daily returns. In out-of-sample tests from January 2000 to December 2024 the estimator delivers systematically lower realised volatility, smaller maximum drawdowns, and higher Sharpe ratios than the best analytical competitors, including state-of-the-art non-linear shrinkage. Furthermore, although the model is trained end-to-end to produce an unconstrained (long-short) minimum-variance portfolio, we show that its learned covariance representation can be used in general optimizers under long-only constraints with virtually no loss in its performance advantage over competing estimators. These gains persist when the strategy is executed under a highly realistic implementation framework that models market orders at the auctions, empirical slippage, exchange fees, and financing charges for leverage, and they remain stable during episodes of acute market stress.
zh
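
该网络结构“镜像”的解析解是全局最小方差组合 w = Σ⁻¹1 / (1ᵀΣ⁻¹1)。下面用 NumPy 给出这一映射的示意;论文中由网络学习得到的“清洗”协方差,这里用一个简单的线性收缩估计代替(仅为编者假设的演示):

```python
import numpy as np

def gmv_weights(cov):
    """全局最小方差组合:w = Σ^{-1} 1 / (1' Σ^{-1} 1),允许做空(示意)。"""
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)   # 用线性求解代替显式求逆,数值上更稳
    return w / w.sum()

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=(500, 50))   # 500 天、50 只股票(模拟数据)
sample_cov = np.cov(returns, rowvar=False)
# 简单收缩估计,代替论文中网络输出的“清洗”协方差(编者假设)
shrunk_cov = 0.7 * sample_cov + 0.3 * np.diag(np.diag(sample_cov))
w = gmv_weights(shrunk_cov)
print(w.sum(), float(w @ shrunk_cov @ w))       # 权重和为 1;目标是最小化组合方差
```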

[AI-78] Epistemic Scarcity: The Economics of Unresolvable Unknowns

【速读】:该论文试图解决人工智能与算法治理在维持经济和认识论秩序方面的能力问题,挑战了机器系统能够持续此类秩序的假设。其解决方案的关键在于基于米塞斯主义先验推理和奥地利学派企业家理论,指出AI系统无法执行经济协调的核心功能,包括解释目标、发现手段以及通过价格传递主观价值。论文将决策视为在不确定性下的目的性行动,而非新古典和行为模型中的约束优化问题,并批判了主流伦理AI框架如公平、问责与透明(FAT)作为建构理性主义的延伸,与基于自愿行动和产权的自由秩序相冲突。

链接: https://arxiv.org/abs/2507.01483
作者: Craig S Wright
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); History and Philosophy of Physics (physics.hist-ph)
备注: 47 pages - submission to QJAE

点击查看摘要

Abstract:This paper presents a praxeological analysis of artificial intelligence and algorithmic governance, challenging assumptions about the capacity of machine systems to sustain economic and epistemic order. Drawing on Misesian a priori reasoning and Austrian theories of entrepreneurship, we argue that AI systems are incapable of performing the core functions of economic coordination: interpreting ends, discovering means, and communicating subjective value through prices. Where neoclassical and behavioural models treat decisions as optimisation under constraint, we frame them as purposive actions under uncertainty. We critique dominant ethical AI frameworks such as Fairness, Accountability, and Transparency (FAT) as extensions of constructivist rationalism, which conflict with a liberal order grounded in voluntary action and property rights. Attempts to encode moral reasoning in algorithms reflect a misunderstanding of ethics and economics. However complex, AI systems cannot originate norms, interpret institutions, or bear responsibility. They remain opaque, misaligned, and inert. Using the concept of epistemic scarcity, we explore how information abundance degrades truth discernment, enabling both entrepreneurial insight and soft totalitarianism. Our analysis ends with a civilisational claim: the debate over AI concerns the future of human autonomy, institutional evolution, and reasoned choice. The Austrian tradition, focused on action, subjectivity, and spontaneous order, offers the only coherent alternative to rising computational social control.
zh

[AI-79] Hello Afrika: Speech Commands in Kinyarwanda

【速读】:该论文试图解决非洲语言中语音命令模型匮乏的问题(Speech Command Models for African Languages),这对于实现非接触式控制和激活日常生活中使用的大型AI系统尤为重要,尤其是对残疾人而言。解决方案的关键在于构建一个针对卢旺达语(Kinyarwanda)的定制语音命令语料库,该语料库包含通用指令、数字和唤醒词,并将最终模型部署在多种设备上以评估其性能。

链接: https://arxiv.org/abs/2507.01024
作者: George Igwegbe,Martins Awojide,Mboh Bless,Nirel Kadzo
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Data Science Africa, 2024

点击查看摘要

Abstract:Voice or Speech Commands are a subset of the broader Spoken Word Corpus of a language which are essential for non-contact control of and activation of larger AI systems in devices used in everyday life especially for persons with disabilities. Currently, there is a dearth of speech command models for African languages. The Hello Afrika project aims to address this issue and its first iteration is focused on the Kinyarwanda language since the country has shown interest in developing speech recognition technologies culminating in one of the largest datasets on Mozilla Common Voice. The model was built off a custom speech command corpus made up of general directives, numbers, and a wake word. The final model was deployed on multiple devices (PC, Mobile Phone and Edge Devices) and the performance was assessed using suitable metrics.
zh

机器学习

[LG-0] Evolving HPC services to enable ML workloads on HPE Cray EX

链接: https://arxiv.org/abs/2507.01880
作者: Stefano Schuppli,Fawzi Mohamed,Henrique Mendonça,Nina Mujkanovic,Elia Palme,Dino Conciatore,Lukas Drescher,Miguel Gila,Pim Witlox,Joost VandeVondele,Maxime Martinasso,Thomas C. Schulthess,Torsten Hoefler
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Presented at the Cray User Group 2025 (CUG’25)

点击查看摘要

Abstract:The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.

[LG-1] Automatic Rank Determination for Low-Rank Adaptation via Submodular Function Maximization

链接: https://arxiv.org/abs/2507.01841
作者: Yihang Gao,Vincent Y. F. Tan
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In this paper, we propose SubLoRA, a rank determination method for Low-Rank Adaptation (LoRA) based on submodular function maximization. In contrast to prior approaches, such as AdaLoRA, that rely on first-order (linearized) approximations of the loss function, SubLoRA utilizes second-order information to capture the potentially complex loss landscape by incorporating the Hessian matrix. We show that the linearization becomes inaccurate and ill-conditioned when the LoRA parameters have been well optimized, motivating the need for a more reliable and nuanced second-order formulation. To this end, we reformulate the rank determination problem as a combinatorial optimization problem with a quadratic objective. However, solving this problem exactly is NP-hard in general. To overcome the computational challenge, we introduce a submodular function maximization framework and devise a greedy algorithm with approximation guarantees. We derive a sufficient and necessary condition under which the rank-determination objective becomes submodular, and construct a closed-form projection of the Hessian matrix that satisfies this condition while maintaining computational efficiency. Our method combines solid theoretical foundations, second-order accuracy, and practical computational efficiency. We further extend SubLoRA to a joint optimization setting, alternating between LoRA parameter updates and rank determination under a rank budget constraint. Extensive experiments on fine-tuning physics-informed neural networks (PINNs) for solving partial differential equations (PDEs) demonstrate the effectiveness of our approach. Results show that SubLoRA outperforms existing methods in both rank determination and joint training performance.
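SubLoRA 把秩选择写成二次目标的组合优化,并在子模条件成立时用带 (1 - 1/e) 近似保证的贪心算法求解。下面给出带基数预算的贪心子模最大化骨架(目标函数 f 为玩具占位,并非论文中由 Hessian 投影构造的目标):

```python
def greedy_submodular(candidates, f, budget):
    """在基数约束下贪心最大化集合函数 f(示意)。

    每步加入边际增益最大的候选;经典结果保证
    单调子模目标下可达 (1 - 1/e) 近似。
    """
    selected = set()
    for _ in range(budget):
        gains = {c: f(selected | {c}) - f(selected)
                 for c in candidates - selected}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:      # 无正增益则提前停止
            break
        selected.add(best)
    return selected

# 用法示意:f 可以是对某组低秩方向的二阶损失改进估计(此处为玩具目标)
f = lambda S: sum(1.0 / (1 + i) for i in S)
print(greedy_submodular(set(range(16)), f, budget=4))
```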

[LG-2] Out-of-Distribution Detection Methods Answer the Wrong Questions ICML2025

链接: https://arxiv.org/abs/2507.01831
作者: Yucen Lily Li,Daohan Lu,Polina Kirichenko,Shikai Qiu,Tim G. J. Rudner,C. Bayan Bruss,Andrew Gordon Wilson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended version of ICML 2025 paper

点击查看摘要

Abstract:To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection. There is no simple fix to this misalignment, since a classifier trained only on in-distribution classes cannot be expected to identify OOD points; for instance, a cat-dog classifier may confidently misclassify an airplane if it contains features that distinguish cats from dogs, despite generally appearing nothing alike. We find that uncertainty-based methods incorrectly conflate high uncertainty with being OOD, while feature-based methods incorrectly conflate far feature-space distance with being OOD. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, epistemic uncertainty representation, and outlier exposure also fail to address this fundamental misalignment in objectives. We additionally consider unsupervised density estimation and generative models for OOD detection, which we show have their own fundamental limitations.

[LG-3] D-MPC-Opt: Distilling Model-Based Multi-Task Reinforcement Learning Agents

链接: https://arxiv.org/abs/2507.01823
作者: Dmytro Kuzmenko,Nadiya Shvai
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Preprint of a manuscript submitted for peer review

点击查看摘要

Abstract:We present a novel approach to knowledge transfer in model-based reinforcement learning, addressing the critical challenge of deploying large world models in resource-constrained environments. Our method efficiently distills a high-capacity multi-task agent (317M parameters) into a compact model (1M parameters) on the MT30 benchmark, significantly improving performance across diverse tasks. Our distilled model achieves a state-of-the-art normalized score of 28.45, surpassing the original 1M parameter model score of 18.93. This improvement demonstrates the ability of our distillation technique to capture and consolidate complex multi-task knowledge. We further optimize the distilled model through FP16 post-training quantization, reducing its size by ~50%. Our approach addresses practical deployment limitations and offers insights into knowledge representation in large world models, paving the way for more efficient and accessible multi-task reinforcement learning systems in robotics and other resource-constrained applications. Code available at this https URL.

[LG-4] owards Decentralized and Sustainable Foundation Model Training with the Edge

链接: https://arxiv.org/abs/2507.01803
作者: Leyang Xue,Meghana Madhyastha,Randal Burns,Myungjin Lee,Mahesh K. Marina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models are at the forefront of AI research, appealing for their ability to learn from vast datasets and cater to diverse tasks. Yet, their significant computational demands raise issues of environmental impact and the risk of centralized control in their development. We put forward a vision towards decentralized and sustainable foundation model training that leverages the collective compute of sparingly used connected edge AI devices. We present the rationale behind our vision, particularly in support of its sustainability benefit. We further outline a set of challenges that need to be addressed to turn this vision into reality.

[LG-5] Neural Entropy-stable conservative flux form neural networks for learning hyperbolic conservation laws

链接: https://arxiv.org/abs/2507.01795
作者: Lizuo Liu,Lu Zhang,Anne Gelb
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:We propose a neural entropy-stable conservative flux form neural network (NESCFN) for learning hyperbolic conservation laws and their associated entropy functions directly from solution trajectories, without requiring any predefined numerical discretization. While recent neural network architectures have successfully integrated classical numerical principles into learned models, most rely on prior knowledge of the governing equations or assume a fixed discretization. Our approach removes this dependency by embedding entropy-stable design principles into the learning process itself, enabling the discovery of physically consistent dynamics in a fully data-driven setting. By jointly learning both the numerical flux function and a corresponding entropy, the proposed method ensures conservation and entropy dissipation, critical for long-term stability and fidelity in the system of hyperbolic conservation laws. Numerical results demonstrate that the method achieves stability and conservation over extended time horizons and accurately captures shock propagation speeds, even without oracle access to future-time solution profiles in the training data.
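论文学习的对象正是守恒通量形式中的数值通量与熵函数。标准的守恒型更新可写成如下示意(教科书记号,非论文原文公式):

```latex
% 双曲守恒律 u_t + f(u)_x = 0 的守恒通量形式(标准记号,示意)
\[
  u_i^{n+1} = u_i^{n}
  - \frac{\Delta t}{\Delta x}
    \left( F_{i+1/2}^{n} - F_{i-1/2}^{n} \right),
  \qquad
  F_{i+1/2}^{n} = F_\theta\!\left( u_i^{n}, u_{i+1}^{n} \right),
\]
% 其中数值通量 F_\theta 由网络学习,并联合学习熵函数以保证守恒与熵耗散。
```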

[LG-6] A Real-Time Digital Twin for Type 1 Diabetes using Simulation-Based Inference

链接: https://arxiv.org/abs/2507.01740
作者: Trung-Dung Hoang,Alceu Bissoto,Vihangkumar V. Naik,Tim Flühmann,Artemii Shlychkov,José Garcia-Tirado,Lisa M. Koch
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Accurately estimating parameters of physiological models is essential to achieving reliable digital twins. For Type 1 Diabetes, this is particularly challenging due to the complexity of glucose-insulin interactions. Traditional methods based on Markov Chain Monte Carlo struggle with high-dimensional parameter spaces and fit parameters from scratch at inference time, making them slow and computationally expensive. In this study, we propose a Simulation-Based Inference approach based on Neural Posterior Estimation to efficiently capture the complex relationships between meal intake, insulin, and glucose level, providing faster, amortized inference. Our experiments demonstrate that SBI not only outperforms traditional methods in parameter estimation but also generalizes better to unseen conditions, offering real-time posterior inference with reliable uncertainty quantification.
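神经后验估计(NPE)的流程是“先离线仿真、再训练条件密度估计器、推断时直接摊销采样”。下面以 sbi 库的常见接口给出骨架示意(接口细节以该库文档为准;先验范围与 simulate 仿真器均为编者假设的占位,并非论文的血糖-胰岛素模型):

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# 假设的生理参数先验与仿真器(占位,非论文模型)
prior = BoxUniform(low=torch.zeros(3), high=torch.ones(3))

def simulate(theta):
    # 占位:由参数 theta 生成模拟观测的摘要统计
    return theta @ torch.tensor([[1.0], [0.5], [-0.3]]) + 0.05 * torch.randn(theta.shape[0], 1)

theta = prior.sample((2000,))
x = simulate(theta)

inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

x_obs = torch.tensor([[0.4]])                 # 新患者的观测摘要(假设)
samples = posterior.sample((1000,), x=x_obs)  # 摊销推断:无需从头重新拟合
print(samples.mean(0), samples.std(0))        # 参数估计及其不确定性
```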

[LG-7] Revisiting Learning Rate Control

链接: https://arxiv.org/abs/2507.01724
作者: Micha Henheik,Theresa Eimer,Marius Lindauer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The learning rate is one of the most important hyperparameters in deep learning, and how to control it is an active area within both AutoML and deep learning research. Approaches for learning rate control span from classic optimization to online scheduling based on gradient statistics. This paper compares paradigms to assess the current state of learning rate control. We find that methods from multi-fidelity hyperparameter optimization, fixed-hyperparameter schedules, and hyperparameter-free learning often perform very well on selected deep learning tasks but are not reliable across settings. This highlights the need for algorithm selection methods in learning rate control, which have been neglected so far by both the AutoML and deep learning communities. We also observe a trend of hyperparameter optimization approaches becoming less effective as models and tasks grow in complexity, even when combined with multi-fidelity approaches for more expensive model trainings. A focus on more relevant test tasks and new promising directions like finetunable methods and meta-learning will enable the AutoML community to significantly strengthen its impact on this crucial factor in deep learning.

[LG-8] B-PL-PINN: Stabilizing PINN Training with Bayesian Pseudo Labeling

链接: https://arxiv.org/abs/2507.01714
作者: Kevin Innerebner,Franz M. Rohrhofer,Bernhard C. Geiger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training physics-informed neural networks (PINNs) for forward problems often suffers from severe convergence issues, hindering the propagation of information from regions where the desired solution is well-defined. Haitsiukevich and Ilin (2023) proposed an ensemble approach that extends the active training domain of each PINN based on i) ensemble consensus and ii) vicinity to (pseudo-)labeled points, thus ensuring that the information from the initial condition successfully propagates to the interior of the computational domain. In this work, we suggest replacing the ensemble by a Bayesian PINN, and consensus by an evaluation of the PINN’s posterior variance. Our experiments show that this mathematically principled approach outperforms the ensemble on a set of benchmark problems and is competitive with PINN ensembles trained with combinations of Adam and LBFGS.

[LG-9] Variational Graph Convolutional Neural Networks

链接: https://arxiv.org/abs/2507.01699
作者: Illia Oleksiienko,Juho Kanniainen,Alexandros Iosifidis
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication. 9 pages, 6 figures

点击查看摘要

Abstract:Estimation of model uncertainty can help improve the explainability of Graph Convolutional Networks and the accuracy of the models at the same time. Uncertainty can also be used in critical applications to verify the results of the model by an expert or additional models. In this paper, we propose Variational Neural Network versions of spatial and spatio-temporal Graph Convolutional Networks. We estimate uncertainty in both outputs and layer-wise attentions of the models, which has the potential for improving model explainability. We showcase the benefits of these models in the social trading analysis and the skeleton-based human action recognition tasks on the Finnish board membership, NTU-60, NTU-120 and Kinetics datasets, where we show improvement in model accuracy in addition to estimated model uncertainties.

[LG-10] Dynamic Similarity Graph Construction with Kernel Density Estimation ICML’25

链接: https://arxiv.org/abs/2507.01696
作者: Steinar Laenen,Peter Macgregor,He Sun
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML’25

点击查看摘要

Abstract:In the kernel density estimation (KDE) problem, we are given a set X of data points in \mathbb{R}^d , a kernel function k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R} , and a query point \mathbf{q} \in \mathbb{R}^d , and the objective is to quickly output an estimate of \sum_{\mathbf{x} \in X} k(\mathbf{q}, \mathbf{x}) . In this paper, we consider KDE in the dynamic setting, and introduce a data structure that efficiently maintains the estimates for a set of query points as data points are added to X over time. Based on this, we design a dynamic data structure that maintains a sparse approximation of the fully connected similarity graph on X , and develop a fast dynamic spectral clustering algorithm. We further evaluate the effectiveness of our algorithms on both synthetic and real-world datasets.
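
摘要中的 KDE 目标可以直接写成几行代码;动态数据结构要避免的,正是每次查询 O(|X|) 的重算以及 X 增量更新后的全量刷新。朴素基线示意如下(高斯核为编者假设):

```python
import numpy as np

def kde(query, X, bandwidth=0.5):
    """朴素 KDE:sum_x k(q, x),高斯核,每次查询 O(|X| d)(示意)。"""
    diff = X - query                        # (n, d)
    sq = np.einsum('nd,nd->n', diff, diff)  # 各点到查询点的平方距离
    return np.exp(-sq / (2 * bandwidth**2)).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 8))
q = np.zeros(8)
print(kde(q, X))  # 动态算法的目标:在不断追加新点时近似维护该估计值
```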

[LG-11] PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution

链接: https://arxiv.org/abs/2507.01695
作者: Omkar Shende,Gayathri Ananthanarayanan,Marcello Traiola
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have become ubiquitous thanks to their remarkable ability to model complex patterns across various domains such as computer vision, speech recognition, robotics, etc. While large DNN models are often more accurate than simpler, lightweight models, they are also resource- and energy-hungry. Hence, it is imperative to design methods to reduce reliance on such large models without significant degradation in output accuracy. The high computational cost of these models is often necessary only for a reduced set of challenging inputs, while lighter models can handle most simple ones. Thus, carefully combining properties of existing DNN models in a dynamic, input-based way opens opportunities to improve efficiency without impacting accuracy. In this work, we introduce PERTINENCE, a novel online method designed to analyze the complexity of input features and dynamically select the most suitable model from a pre-trained set to process a given input effectively. To achieve this, we employ a genetic algorithm to explore the training space of an ML-based input dispatcher, enabling convergence towards the Pareto front in the solution space that balances overall accuracy and computational efficiency. We showcase our approach on state-of-the-art Convolutional Neural Networks (CNNs) trained on the CIFAR-10 and CIFAR-100, as well as Vision Transformers (ViTs) trained on TinyImageNet dataset. We report results showing PERTINENCE’s ability to provide alternative solutions to existing state-of-the-art models in terms of trade-offs between accuracy and number of operations. By opportunistically selecting among models trained for the same task, PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations.

[LG-12] Dance Dance ConvLSTM

链接: https://arxiv.org/abs/2507.01644
作者: Miguel O’Malley
类目: Machine Learning (cs.LG)
*备注: 15 pages, 9 figures, 4 tables

点击查看摘要

Abstract:\textit{Dance Dance Revolution} is a rhythm game consisting of songs and accompanying choreography, referred to as charts. Players press arrows on a device referred to as a dance pad in time with steps determined by the song’s chart. In 2017, the authors of Dance Dance Convolution (DDC) developed an algorithm for the automatic generation of \textit{Dance Dance Revolution} charts, utilizing a CNN-LSTM architecture. We introduce Dance Dance ConvLSTM (DDCL), a new method for the automatic generation of DDR charts using a ConvLSTM based model, which improves upon the DDC methodology and substantially increases the accuracy of chart generation.

[LG-13] Kernel Recursive Least Squares Dictionary Learning Algorithm

链接: https://arxiv.org/abs/2507.01636
作者: Ghasem Alipoor,Karl Skretting
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Published in Digital Signal Processing, Volume 141, 2023. DOI: this https URL 12 pages, 8 figures. Code and data available at: this https URL

点击查看摘要

Abstract:We propose an efficient online dictionary learning algorithm for kernel-based sparse representations. In this framework, input signals are nonlinearly mapped to a high-dimensional feature space and represented sparsely using a virtual dictionary. At each step, the dictionary is updated recursively using a novel algorithm based on the recursive least squares (RLS) method. This update mechanism works with single samples or mini-batches and maintains low computational complexity. Experiments on four datasets across different domains show that our method not only outperforms existing online kernel dictionary learning approaches but also achieves classification accuracy close to that of batch-trained models, while remaining significantly more efficient.

[LG-14] Analysis of Muon's Convergence and Critical Batch Size

链接: https://arxiv.org/abs/2507.01598
作者: Naoki Sato,Hiroki Naganuma,Hideaki Iiduka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a theoretical analysis of Muon, a new optimizer that leverages the inherent matrix structure of neural network parameters. We provide convergence proofs for four practical variants of Muon: with and without Nesterov momentum, and with and without weight decay. We then show that adding weight decay leads to strictly tighter bounds on both the parameter and gradient norms, and we clarify the relationship between the weight decay coefficient and the learning rate. Finally, we derive Muon’s critical batch size minimizing the stochastic first-order oracle (SFO) complexity, which is the stochastic computational cost, and validate our theoretical findings with experiments.

[LG-15] A Privacy-Preserving Indoor Localization System based on Hierarchical Federated Learning

链接: https://arxiv.org/abs/2507.01581
作者: Masood Jan,Wafa Njima,Xun Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Location information serves as the fundamental element for numerous Internet of Things (IoT) applications. Traditional indoor localization techniques often produce significant errors and raise privacy concerns due to centralized data collection. In response, Machine Learning (ML) techniques offer promising solutions by capturing indoor environment variations. However, they typically require central data aggregation, leading to privacy, bandwidth, and server reliability issues. To overcome these challenges, in this paper, we propose a Federated Learning (FL)-based approach for dynamic indoor localization using a Deep Neural Network (DNN) model. Experimental results show that FL achieves performance close to that of the Centralized Model (CL) while preserving data privacy, bandwidth efficiency, and server reliability. This research demonstrates that our proposed FL approach provides a viable solution for privacy-enhanced indoor localization, paving the way for advancements in secure and efficient indoor localization systems.
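
文中 FL 方案的服务器端聚合通常即经典 FedAvg:按各客户端样本量对本地参数加权平均。通用骨架示意如下(与论文的 DNN 定位模型无关,仅演示聚合步骤):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """按样本量加权平均各客户端的同构参数列表(FedAvg 示意)。"""
    total = float(sum(client_sizes))
    coef = [n / total for n in client_sizes]
    return [sum(c * w[i] for c, w in zip(coef, client_weights))
            for i in range(len(client_weights[0]))]

# 三个客户端,每个客户端的模型参数为 [W, b](模拟数据)
clients = [[np.ones((4, 2)) * k, np.ones(2) * k] for k in (1.0, 2.0, 3.0)]
sizes = [100, 200, 700]
W, b = fedavg(clients, sizes)
print(b)  # [2.6 2.6]:1*0.1 + 2*0.2 + 3*0.7 的加权均值,原始数据从不离开客户端
```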

[LG-16] On the Effect of Ruleset Tuning and Data Imbalance on Explainable Network Security Alert Classifications: a Case-Study on DeepCASE

链接: https://arxiv.org/abs/2507.01571
作者: Koen T. W. Teuwen,Sam Baggen,Emmanuele Zambon,Luca Allodi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Automation in Security Operations Centers (SOCs) plays a prominent role in alert classification and incident escalation. However, automated methods must be robust in the presence of imbalanced input data, which can negatively affect performance. Additionally, automated methods should make explainable decisions. In this work, we evaluate the effect of label imbalance on the classification of network intrusion alerts. As our use-case we employ DeepCASE, the state-of-the-art method for automated alert classification. We show that label imbalance impacts both classification performance and correctness of the classification explanations offered by DeepCASE. We conclude that tuning the detection rules used in SOCs can significantly reduce imbalance and may benefit the performance and explainability offered by alert post-processing methods such as DeepCASE. Therefore, our findings suggest that traditional methods to improve the quality of input data can benefit automation.

[LG-17] MARVIS: Modality Adaptive Reasoning over VISualizations

链接: https://arxiv.org/abs/2507.01544
作者: Benjamin Feuer,Lennart Purucker,Oussama Elachqar,Chinmay Hegde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16% on average and approach specialized methods, without exposing personally identifiable information (P.I.I.) or requiring any domain-specific training. We open source our code and datasets at this https URL

[LG-18] Consistency of Learned Sparse Grid Quadrature Rules using NeuralODEs

链接: https://arxiv.org/abs/2507.01533
作者: Hanno Gottschalk,Emil Partow,Tobias J. Riedlinger
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:This paper provides a proof of the consistency of sparse grid quadrature for numerical integration of high dimensional distributions. In a first step, a transport map is learned that normalizes the distribution to a noise distribution on the unit cube. This step is built on the statistical learning theory of neural ordinary differential equations, which has been established recently. Secondly, the composition of the generative map with the quantity of interest is integrated numerically using the Clenshaw-Curtis sparse grid quadrature. A decomposition of the total numerical error into quadrature error and statistical error is provided. As the main result, it is proven in the framework of empirical risk minimization that all error terms can be controlled in the sense of PAC (probably approximately correct) learning and with high probability the numerical integral approximates the theoretical value up to an arbitrarily small error in the limit where the data set size is growing and the network capacity is increased adaptively.

[LG-19] Loss Functions in Diffusion Models: A Comparative Study ECML2025

链接: https://arxiv.org/abs/2507.01516
作者: Dibyanshu Kumar,Philipp Vaeth,Magda Gregorová
类目: Machine Learning (cs.LG)
*备注: Accepted to ECML 2025

点击查看摘要

Abstract:Diffusion models have emerged as powerful generative models, inspiring extensive research into their underlying mechanisms. One of the key questions in this area is the loss functions these models shall train with. Multiple formulations have been introduced in the literature over the past several years with some links and some critical differences stemming from various initial considerations. In this paper, we explore the different target objectives and corresponding loss functions in detail. We present a systematic overview of their relationships, unifying them under the framework of the variational lower bound objective. We complement this theoretical analysis with an empirical study providing insights into the conditions under which these objectives diverge in performance and the underlying factors contributing to such deviations. Additionally, we evaluate how the choice of objective impacts the model ability to achieve specific goals, such as generating high-quality samples or accurately estimating likelihoods. This study offers a unified understanding of loss functions in diffusion models, contributing to more efficient and goal-oriented model designs in future research.
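文中统一到变分下界框架下的各目标里,最常用的是“简化”噪声预测损失;它与 VLB 的逐项 KL 只差一个随 t 变化的权重,这正是不同损失在性能上分化的来源之一(标准 DDPM 记号,忽略端点项,示意):

```latex
% DDPM 简化目标与 VLB 中间项的关系(标准记号,忽略端点项,示意)
\[
  \mathcal{L}_{\mathrm{simple}}
  = \mathbb{E}_{t,\, x_0,\, \epsilon}
    \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 ,
  \qquad
  x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon ,
\]
\[
  \mathcal{L}_{t-1}^{\mathrm{VLB}}
  = \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\;
    \mathbb{E}\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 .
\]
% 两类目标仅相差随 t 变化的权重系数。
```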

[LG-20] How to Securely Shuffle? A survey about Secure Shufflers for privacy-preserving computations

链接: https://arxiv.org/abs/2507.01487
作者: Marc Damie,Florian Hahn,Andreas Peter,Jan Ramon
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ishai et al. (FOCS’06) introduced secure shuffling as an efficient building block for private data aggregation. Recently, the field of differential privacy has revived interest in secure shufflers by highlighting the privacy amplification they can provide in various computations. Although several works argue for the utility of secure shufflers, they often treat them as black boxes; overlooking the practical vulnerabilities and performance trade-offs of existing implementations. This leaves a central question open: what makes a good secure shuffler? This survey addresses that question by identifying, categorizing, and comparing 26 secure protocols that realize the necessary shuffling functionality. To enable a meaningful comparison, we adapt and unify existing security definitions into a consistent set of properties. We also present an overview of privacy-preserving technologies that rely on secure shufflers, offer practical guidelines for selecting appropriate protocols, and outline promising directions for future work.

[LG-21] Cross-platform Smartphone Positioning at Museums

链接: https://arxiv.org/abs/2507.01469
作者: Alessio Ferrato,Fabio Gasparetti,Carla Limongelli,Stefano Mastandrea,Giuseppe Sansonetti,Joaquín Torres-Sospedra
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted at the 2025 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Tampere, Finland, September 15-18, 2025

点击查看摘要

Abstract:Indoor Positioning Systems (IPSs) hold significant potential for enhancing visitor experiences in cultural heritage institutions. By enabling personalized navigation, efficient artifact organization, and better interaction with exhibits, IPSs can transform how individuals engage with museums, galleries and libraries. However, these institutions face several challenges in implementing IPSs, including environmental constraints, technical limits, and limited experimentation. In other contexts, Received Signal Strength (RSS)-based approaches using Bluetooth Low Energy (BLE) and WiFi have emerged as preferred solutions due to their non-invasive nature and minimal infrastructure requirements. Nevertheless, the lack of publicly available RSS datasets that specifically reflect museum environments presents a substantial barrier to developing and evaluating positioning algorithms designed for the intricate spatial characteristics typical of cultural heritage sites. To address this limitation, we present BAR, a novel RSS dataset collected in front of 90 artworks across 13 museum rooms using two different platforms, i.e., Android and iOS. Additionally, we provide an advanced position classification baseline taking advantage of a proximity-based method and k-NN algorithms. In our analysis, we discuss the results and offer suggestions for potential research directions.
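
论文提供的近邻基线思路可以用几行 scikit-learn 复现:把每件艺术品前采集的 RSS 向量作为指纹,对新测量取 k 近邻投票。下面是一个示意(信标数、采样条数与噪声水平均为编者假设,并非 BAR 数据集的真实规格):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
n_beacons, n_artworks = 20, 90

# 假设数据:每件艺术品前采 15 条带测量噪声的 RSS 指纹(dBm)
fingerprints = np.repeat(rng.uniform(-90, -40, (n_artworks, n_beacons)), 15, axis=0)
fingerprints += rng.normal(0, 4, fingerprints.shape)
labels = np.repeat(np.arange(n_artworks), 15)

clf = KNeighborsClassifier(n_neighbors=5).fit(fingerprints, labels)
query = fingerprints[0] + rng.normal(0, 4, n_beacons)  # 新的一次扫描
print(clf.predict([query]))                             # 预测用户所在的艺术品位置
```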

[LG-22] Decomposing Prediction Mechanisms for In-Context Recall

链接: https://arxiv.org/abs/2507.01414
作者: Sultan Daniels,Dylan Davis,Dhruv Gautam,Wentinn Liao,Gireeja Ranade,Anant Sahai
类目: Machine Learning (cs.LG)
*备注: 44 pages, 47 figures, 2 tables

点击查看摘要

Abstract:We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study if the transformer models can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system’s state should be recalled and apply that system to its last seen state, and (2) continuing to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model’s training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a “Bayesian-style” prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism (manifesting as separate phase transitions) phenomenon is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

[LG-23] Surrogate Modeling via Factorization Machine and Ising Model with Enhanced Higher-Order Interaction Learning

链接: https://arxiv.org/abs/2507.01389
作者: Anbang Wang,Dunbo Cai,Yu Zhang,Yangqing Huang,Xiangyang Feng,Zhihong Zhang
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Recently, a surrogate model was proposed that employs a factorization machine to approximate the underlying input-output mapping of the original system, with quantum annealing used to optimize the resulting surrogate function. Inspired by this approach, we propose an enhanced surrogate model that incorporates additional slack variables into both the factorization machine and its associated Ising representation thereby unifying what was by design a two-step process into a single, integrated step. During the training phase, the slack variables are iteratively updated, enabling the model to account for higher-order feature interactions. We apply the proposed method to the task of predicting drug combination effects. Experimental results indicate that the introduction of slack variables leads to a notable improvement of performance. Our algorithm offers a promising approach for building efficient surrogate models that exploit potential quantum advantages.
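作为背景,二阶因子分解机用隐向量内积刻画特征交互,对二值变量其形式与 Ising 能量一一对应,这正是可交由量子退火优化的原因(标准记号;松弛变量的具体并入方式以论文为准):

```latex
% 二阶因子分解机(标准记号,示意)
\[
  \hat{y}(\mathbf{x})
  = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n}\sum_{j=i+1}^{n}
    \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j .
\]
% 当 x_i \in \{0,1\} 时,上式与 Ising/QUBO 能量形式一一对应,
% 因而可将对代理目标的最小化交给量子退火器求解。
```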

[LG-24] Efficient Kilometer-Scale Precipitation Downscaling with Conditional Wavelet Diffusion

链接: https://arxiv.org/abs/2507.01354
作者: Chugang Yi,Minghan Yu,Weikang Qian,Yixin Wen,Haizhao Yang
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Effective hydrological modeling and extreme weather analysis demand precipitation data at a kilometer-scale resolution, which is significantly finer than the 10 km scale offered by standard global products like IMERG. To address this, we propose the Wavelet Diffusion Model (WDM), a generative framework that achieves 10x spatial super-resolution (downscaling to 1 km) and delivers a 9x inference speedup over pixel-based diffusion models. WDM is a conditional diffusion model that learns the complex structure of precipitation from MRMS radar data directly in the wavelet domain. By focusing on high-frequency wavelet coefficients, it generates exceptionally realistic and detailed 1-km precipitation fields. This wavelet-based approach produces visually superior results with fewer artifacts than pixel-space models, and delivers significant gains in sampling efficiency. Our results demonstrate that WDM provides a robust solution to the dual challenges of accuracy and speed in geoscience super-resolution, paving the way for more reliable hydrological forecasts.

[LG-25] Far From Sight Far From Mind: Inverse Distance Weighting for Graph Federated Recommendation

链接: https://arxiv.org/abs/2507.01285
作者: Aymen Rayane Khouas,Mohamed Reda Bouadjenek,Hakim Hacid,Sunil Aryal
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Graph federated recommendation systems offer a privacy-preserving alternative to traditional centralized recommendation architectures, which often raise concerns about data security. While federated learning enables personalized recommendations without exposing raw user data, existing aggregation methods overlook the unique properties of user embeddings in this setting. Indeed, traditional aggregation methods fail to account for their complexity and the critical role of user similarity in recommendation effectiveness. Moreover, evolving user interactions require adaptive aggregation while preserving the influence of high-relevance anchor users (the primary users before expansion in graph-based frameworks). To address these limitations, we introduce Dist-FedAvg, a novel distance-based aggregation method designed to enhance personalization and aggregation efficiency in graph federated learning. Our method assigns higher aggregation weights to users with similar embeddings, while ensuring that anchor users retain significant influence in local updates. Empirical evaluations on multiple datasets demonstrate that Dist-FedAvg consistently outperforms baseline aggregation techniques, improving recommendation accuracy while maintaining seamless integration into existing federated learning frameworks.
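Dist-FedAvg 的直觉是“离得越近、权重越大”:嵌入与锚点用户越相似的用户,在聚合中占比越高。下面是反距离加权聚合的一个示意(权重公式为编者假设的常见形式,非论文原式):

```python
import numpy as np

def inverse_distance_aggregate(anchor, neighbors, eps=1e-6):
    """以与锚点嵌入的反距离为权重聚合邻居用户嵌入(示意)。"""
    dists = np.array([np.linalg.norm(anchor - z) for z in neighbors])
    w = 1.0 / (dists + eps)   # 距离越小权重越大,eps 防止除零
    w /= w.sum()
    return sum(wi * z for wi, z in zip(w, neighbors)), w

anchor = np.array([1.0, 0.0])
neighbors = [np.array([0.9, 0.1]), np.array([-1.0, 0.5]), np.array([1.1, -0.1])]
agg, w = inverse_distance_aggregate(anchor, neighbors)
print(w)    # 与锚点相似的第 1、3 个用户获得显著更大的聚合权重
print(agg)
```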

[LG-26] Jump-Start Reinforcement Learning with Self-Evolving Priors for Extreme Monopedal Locomotion

链接: https://arxiv.org/abs/2507.01243
作者: Ziang Zheng,Guojian Zhan,Shiqi Liu,Yao Lyu,Tao Zhang,Shengbo Eben Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown great potential in enabling quadruped robots to perform agile locomotion. However, directly training policies to simultaneously handle dual extreme challenges, i.e., extreme underactuation and extreme terrains, as in monopedal hopping tasks, remains highly challenging due to unstable early-stage interactions and unreliable reward feedback. To address this, we propose JumpER (jump-start reinforcement learning via self-evolving priors), an RL training framework that structures policy learning into multiple stages of increasing complexity. By dynamically generating self-evolving priors through iterative bootstrapping of previously learned policies, JumpER progressively refines and enhances guidance, thereby stabilizing exploration and policy optimization without relying on external expert priors or handcrafted reward shaping. Specifically, when integrated with a structured three-stage curriculum that incrementally evolves action modality, observation space, and task objective, JumpER enables quadruped robots to achieve robust monopedal hopping on unpredictable terrains for the first time. Remarkably, the resulting policy effectively handles challenging scenarios that traditional methods struggle to conquer, including wide gaps up to 60 cm, irregularly spaced stairs, and stepping stones with distances varying from 15 cm to 35 cm. JumpER thus provides a principled and scalable approach for addressing locomotion tasks under the dual challenges of extreme underactuation and extreme terrains.

[LG-27] Quantum Machine Learning in Transportation: A Case Study of Pedestrian Stress Modelling

链接: https://arxiv.org/abs/2507.01235
作者: Bara Rababa,Bilal Farooq
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Proceedings of IEEE Intelligent Transportation Systems Conference, 2025

点击查看摘要

Abstract:Quantum computing has opened new opportunities to tackle complex machine learning tasks, for instance, high-dimensional data representations commonly required in intelligent transportation systems. We explore quantum machine learning to model complex skin conductance response (SCR) events that reflect pedestrian stress in a virtual reality road crossing experiment. For this purpose, Quantum Support Vector Machine (QSVM) with an eight-qubit ZZ feature map and a Quantum Neural Network (QNN) using a Tree Tensor Network ansatz and an eight-qubit ZZ feature map, were developed on Pennylane. The dataset consists of SCR measurements along with features such as the response amplitude and elapsed time, which have been categorized into amplitude-based classes. The QSVM achieved good training accuracy, but had an overfitting problem, showing a low test accuracy of 45% and therefore impacting the reliability of the classification model. The QNN model reached a higher test accuracy of 55%, making it a better classification model than the QSVM and the classic versions.

[LG-28] PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning

链接: https://arxiv.org/abs/2507.01216
作者: Xingke Yang,Liang Li,Zhiyi Wan,Sicong Li,Hao Wang,Xiaoqi Qi,Jiang Liu,Tomoaki Ohtsuki,Xin Fu,Miao Pan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:There is a huge gap between numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) from fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from heavy communication burdens of activation transmissions, and may disclose data, labels or fine-tuned models to the server. To address those issues, we develop PAE MobiLLM, a privacy-aware and efficient LLM FT method which can be deployed on the mobile device via server-assisted additive side-tuning. To further accelerate FT convergence and improve computing efficiency, PAE MobiLLM integrates activation caching on the server side, which allows the server to reuse historical activations and saves the mobile device from repeatedly computing forward passes for the recurring data samples. Besides, to reduce communication cost, PAE MobiLLM develops a one-token (i.e., "pivot" token) activation shortcut that transmits only a single activation dimension instead of full activation matrices to guide the side network tuning. Last but not least, PAE MobiLLM introduces the additive adapter side-network design which makes the server train the adapter modules based on device-defined prediction differences rather than raw ground-truth labels. In this way, the server can only assist device-defined side-network computing, and learn nothing about data, labels or fine-tuned models.

[LG-29] Deep Learning-Based Intrusion Detection for Automotive Ethernet: Evaluating and Optimizing Fast Inference Techniques for Deployment on Low-Cost Platform

链接: https://arxiv.org/abs/2507.01208
作者: Pedro R. X. Carmo,Igor de Moura,Assis T. de Oliveira Filho,Djamel Sadok,Cleber Zanchettin
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Modern vehicles are increasingly connected, and in this context, automotive Ethernet is one of the technologies that promise to provide the necessary infrastructure for intra-vehicle communication. However, these systems are subject to attacks that can compromise safety, including flow injection attacks. Deep Learning-based Intrusion Detection Systems (IDS) are often designed to combat this problem, but they require expensive hardware to run in real time. In this work, we propose to evaluate and apply fast neural network inference techniques, such as distillation and pruning, for deploying IDS models on low-cost platforms in real time. The results show that these techniques can achieve intrusion detection times of up to 727 μs using a Raspberry Pi 4, with AUCROC values of 0.9890.
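
As a flavor of one of the evaluated techniques, here is a minimal PyTorch magnitude-pruning sketch on a stand-in classifier; the paper's actual IDS architectures, datasets, and Raspberry Pi deployment pipeline are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in IDS classifier: 64 flow features in, benign/attack out.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)  # zero 80%
        prune.remove(module, "weight")  # bake the mask into the weights

with torch.no_grad():
    print(model(torch.randn(1, 64)))  # pruned model still runs end to end
```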

[LG-30] Diffusion Explorer: Interactive Exploration of Diffusion Models

链接: https://arxiv.org/abs/2507.01178
作者: Alec Helbling,Duen Horng Chau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have been central to the development of recent image, video, and even text generation systems. They possess striking geometric properties that can be faithfully portrayed in low-dimensional settings. However, existing resources for explaining diffusion either require an advanced theoretical foundation or focus on their neural network architectures rather than their rich geometric properties. We introduce Diffusion Explorer, an interactive tool to explain the geometric properties of diffusion models. Users can train 2D diffusion models in the browser and observe the temporal dynamics of their sampling process. Diffusion Explorer leverages interactive animation, which has been shown to be a powerful tool for making engaging visualizations of dynamic systems, making it well suited to explaining diffusion models, which represent stochastic processes that evolve over time. Diffusion Explorer is open source and a live demo is available at this http URL.

[LG-31] FlashDP: Private Training Large Language Models with Efficient DP-SGD

链接: https://arxiv.org/abs/2507.01154
作者: Liangyu Wang,Junxiao Wang,Jie Ren,Zihang Xiang,David E. Keyes,Di Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) increasingly underpin technological advancements, the privacy of their training data emerges as a critical concern. Differential Privacy (DP) serves as a rigorous mechanism to protect this data, yet its integration via Differentially Private Stochastic Gradient Descent (DP-SGD) introduces substantial challenges, primarily due to the complexities of per-sample gradient clipping. Current explicit methods, such as Opacus, necessitate extensive storage for per-sample gradients, significantly inflating memory requirements. Conversely, implicit methods like GhostClip reduce storage needs by recalculating gradients multiple times, which leads to inefficiencies due to redundant computations. This paper introduces FlashDP, an innovative cache-friendly per-layer DP-SGD that consolidates necessary operations into a single task, calculating gradients only once in a fused manner. This approach not only diminishes memory movement by up to 50% but also cuts down redundant computations by 20%, compared to previous methods. Consequently, FlashDP does not increase memory demands and achieves 90% throughput compared to the Non-DP method on a four-A100 system during the pre-training of the Llama-13B model, while maintaining parity with standard per-layer clipped DP-SGD in terms of accuracy. These advancements establish FlashDP as a pivotal development for efficient and privacy-preserving training of LLMs. FlashDP's code has been open-sourced in this https URL.
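
For context, the sketch below spells out the reference semantics of per-sample clipping in DP-SGD, the operation whose memory movement and recomputation FlashDP fuses into a single cache-friendly kernel. This naive per-sample loop is exactly what such systems avoid, and the helper is our own illustration rather than FlashDP's API.

```python
import torch

def naive_dp_sgd_step(model, loss_fn, xs, ys, clip=1.0, sigma=1.0, lr=0.1):
    """Reference semantics only: clip each sample's gradient to `clip`,
    sum, add Gaussian noise, and take a step. One backward pass per sample
    is exactly the cost that fused per-layer kernels avoid."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):
        grads = torch.autograd.grad(loss_fn(model(x), y), params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip / (norm.item() + 1e-6))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = s + sigma * clip * torch.randn_like(s)
            p.add_(noisy, alpha=-lr / len(xs))

model = torch.nn.Linear(8, 1)
xs, ys = torch.randn(16, 8), torch.randn(16, 1)
naive_dp_sgd_step(model, torch.nn.functional.mse_loss, xs, ys)
```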

[LG-32] A Review on Sound Source Localization in Robotics: Focusing on Deep Learning Methods

链接: https://arxiv.org/abs/2507.01143
作者: Reza Jalayer,Masoud Jalayer,Amirali Baniasadi
类目: Robotics (cs.RO); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 35 pages

点击查看摘要

Abstract:Sound source localization (SSL) adds a spatial dimension to auditory perception, allowing a system to pinpoint the origin of speech, machinery noise, warning tones, or other acoustic events, capabilities that facilitate robot navigation, human-machine dialogue, and condition monitoring. While existing surveys provide valuable historical context, they typically address general audio applications and do not fully account for robotic constraints or the latest advancements in deep learning. This review addresses these gaps by offering a robotics-focused synthesis, emphasizing recent progress in deep learning methodologies. We start by reviewing classical methods such as Time Difference of Arrival (TDOA), beamforming, Steered-Response Power (SRP), and subspace analysis. Subsequently, we delve into modern machine learning (ML) and deep learning (DL) approaches, discussing traditional ML and neural networks (NNs), convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), and emerging attention-based architectures. The data and training strategies, the two cornerstones of DL-based SSL, are explored. Studies are further categorized by robot types and application domains to facilitate researchers in identifying relevant work for their specific contexts. Finally, we highlight the current challenges in SSL works in general, regarding environmental robustness, sound source multiplicity, and specific implementation constraints in robotics, as well as data and learning strategies in DL-based SSL. Also, we sketch promising directions to offer an actionable roadmap toward robust, adaptable, efficient, and explainable DL-based SSL for next-generation robots.
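
As a concrete taste of the classical baselines the review covers, the following NumPy sketch implements GCC-PHAT, a standard estimator of the time difference of arrival (TDOA) between two microphones; parameter choices are illustrative.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between `sig` and `ref` by whitening the
    cross-spectrum (PHAT) and locating the correlation peak."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds

fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(fs)          # 1 s of broadband "sound"
sig = np.roll(ref, 32)                 # second mic hears it 2 ms later
print(gcc_phat(sig, ref, fs))          # ~0.002
```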

[LG-33] Spectral Manifold Harmonization for Graph Imbalanced Regression

链接: https://arxiv.org/abs/2507.01132
作者: Brenda Nogueira,Gabe Gomes,Meng Jiang,Nitesh V. Chawla,Nuno Moniz
类目: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注:

点击查看摘要

Abstract:Graph-structured data is ubiquitous in scientific domains, where models often face imbalanced learning settings. In imbalanced regression, domain preferences focus on specific target value ranges that represent the most scientifically valuable cases, an area in which we observe a significant lack of research. In this paper, we present Spectral Manifold Harmonization (SMH), a novel approach for addressing this imbalanced regression challenge on graph-structured data by generating synthetic graph samples that preserve topological properties while focusing on often underrepresented target distribution regions. Conventional methods fail in this context because they either ignore graph topology in case generation or do not target specific domain ranges, resulting in models biased toward average target values. Experimental results demonstrate the potential of SMH on chemistry and drug discovery benchmark datasets, showing consistent improvements in predictive performance for target domain ranges.

[LG-34] Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

链接: https://arxiv.org/abs/2507.01131
作者: Yuchao Lin,Cong Fu,Zachary Krueger,Haiyang Yu,Maho Nakata,Jianwen Xie,Emine Kucukbenli,Xiaofeng Qian,Shuiwang Ji
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract: SO(3)-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks whose CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of SO(3)-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the O(L^3) CG paths into a single path without compromising equivariance, where L is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from O(L^6) to O(L^4). We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots, as well as on existing datasets including OC20 and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations.
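
The core idea of replacing an expensive bilinear map with a rank-R CP factorization can be sketched in a few lines of PyTorch. Note that this toy layer ignores the SO(3) irrep structure and path-weight sharing that the actual TDN construction handles, so it only conveys the complexity argument.

```python
import torch
import torch.nn as nn

class CPBilinear(nn.Module):
    """Toy rank-R replacement for a dense bilinear map:
    out_k = sum_r C[k,r] * (A[:,r] . x) * (B[:,r] . y)."""
    def __init__(self, dx, dy, dout, rank):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dx, rank) / dx ** 0.5)
        self.B = nn.Parameter(torch.randn(dy, rank) / dy ** 0.5)
        self.C = nn.Parameter(torch.randn(dout, rank) / rank ** 0.5)

    def forward(self, x, y):
        # Costs O(rank * (dx + dy + dout)) versus O(dx * dy * dout) dense.
        return ((x @ self.A) * (y @ self.B)) @ self.C.T

layer = CPBilinear(dx=32, dy=32, dout=16, rank=8)
print(layer(torch.randn(4, 32), torch.randn(4, 32)).shape)  # [4, 16]
```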

[LG-35] On Design Principles for Private Adaptive Optimizers

链接: https://arxiv.org/abs/2507.01129
作者: Arun Ganesh,Brendan McMahan,Abhradeep Thakurta
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: PPML 2025

点击查看摘要

Abstract:The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, and develop better theoretical intuition for them as well as perform empirical studies comparing them. We find that a common intuition of aiming for unbiased estimates of second moments of gradients in adaptive optimizers is misguided, and instead that a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behaviors and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize causes the noise addition to better match the application of correlated noise mechanisms which are more desirable to use in practice.
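
Our illustrative reading of scale-then-privatize is sketched below: per-sample gradients are rescaled by a running second-moment estimate before clipping and noising, rather than adapting after privatization. The estimator, its update schedule, and all constants are assumptions on our part and may differ from the paper.

```python
import torch

def scale_then_privatize(per_sample_grads, moment_est, clip=1.0, sigma=1.0):
    """Illustrative only: precondition each sample's gradient by a running
    second-moment estimate, then clip and add Gaussian noise."""
    out = torch.zeros_like(per_sample_grads[0])
    for g in per_sample_grads:
        g = g / (moment_est.sqrt() + 1e-8)         # scale first ...
        g = g * min(1.0, clip / (g.norm().item() + 1e-12))
        out += g
    noise = sigma * clip * torch.randn_like(out)   # ... then privatize
    return (out + noise) / len(per_sample_grads)

grads = [torch.randn(10) for _ in range(32)]       # per-sample gradients
moments = torch.ones(10)                           # running 2nd-moment estimate
print(scale_then_privatize(grads, moments).shape)  # torch.Size([10])
```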

[LG-36] A Neural Operator based on Dynamic Mode Decomposition

链接: https://arxiv.org/abs/2507.01117
作者: Nikita Sakovich,Dmitry Aksenov,Ekaterina Pleshakova,Sergey Gataullin
类目: Machine Learning (cs.LG)
*备注: 30 pages, 10 figures

点击查看摘要

Abstract:The development of scientific computation methods in conjunction with artificial intelligence technologies remains a hot research topic. Finding a balance between lightweight and accurate computations is a solid foundation for this direction. The study presents a neural operator based on the dynamic mode decomposition (DMD) algorithm, mapping between functional spaces, which combines DMD and deep learning (DL) for the efficient modeling of spatiotemporal processes. Solving PDEs for various initial and boundary conditions requires significant computational resources. The suggested method automatically extracts key modes and system dynamics and uses them to construct predictions, reducing computational costs compared to traditional numerical methods. The approach has demonstrated its efficiency through a comparative analysis of performance with its closest analogues, DeepONet and FNO, in approximating solutions of the heat equation, Laplace's equation, and Burgers' equation, where it achieves high reconstruction accuracy.
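
Since the operator is built on DMD, a compact NumPy implementation of exact DMD helps fix ideas; the neural-operator layers that the paper adds on top are not shown.

```python
import numpy as np

def dmd(X, Y, r):
    """Exact DMD: given snapshot pairs with Y ~= A X, recover the leading
    r eigenvalues and modes of the (unknown) linear operator A."""
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    U, S, Vh = U[:, :r], S[:r], Vh[:r]
    A_tilde = U.conj().T @ Y @ Vh.conj().T / S     # r x r reduced operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ Vh.conj().T / S) @ W              # exact DMD modes
    return eigvals, modes

# Toy system: a slowly decaying rotation with |eigenvalues| = 0.95.
theta = 0.1
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = np.random.randn(2, 50)
eigvals, _ = dmd(X, A @ X, r=2)
print(np.abs(eigvals))  # ~[0.95, 0.95]
```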

[LG-37] A LoD of Gaussians: Unified Training and Rendering for Ultra-Large Scale Reconstruction with External Memory

链接: https://arxiv.org/abs/2507.01110
作者: Felix Windisch,Lukas Radl,Thomas Köhler,Michael Steiner,Dieter Schmalstieg,Markus Steinberger
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gaussian Splatting has emerged as a high-performance technique for novel view synthesis, enabling real-time rendering and high-quality reconstruction of small scenes. However, scaling to larger environments has so far relied on partitioning the scene into chunks – a strategy that introduces artifacts at chunk boundaries, complicates training across varying scales, and is poorly suited to unstructured scenarios such as city-scale flyovers combined with street-level views. Moreover, rendering remains fundamentally limited by GPU memory, as all visible chunks must reside in VRAM simultaneously. We introduce A LoD of Gaussians, a framework for training and rendering ultra-large-scale Gaussian scenes on a single consumer-grade GPU – without partitioning. Our method stores the full scene out-of-core (e.g., in CPU memory) and trains a Level-of-Detail (LoD) representation directly, dynamically streaming only the relevant Gaussians. A hybrid data structure combining Gaussian hierarchies with Sequential Point Trees enables efficient, view-dependent LoD selection, while a lightweight caching and view scheduling system exploits temporal coherence to support real-time streaming and rendering. Together, these innovations enable seamless multi-scale reconstruction and interactive visualization of complex scenes – from broad aerial views to fine-grained ground-level details.

[LG-38] Proof of a perfect platonic representation hypothesis

链接: https://arxiv.org/abs/2507.01098
作者: Liu Ziyin,Isaac Chuang
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this note, we elaborate on and explain in detail the proof given by Ziyin et al. (2025) of the “perfect” Platonic Representation Hypothesis (PRH) for the embedded deep linear network model (EDLN). We show that if trained with SGD, two EDLNs with different widths and depths and trained on different data will become Perfectly Platonic, meaning that every possible pair of layers will learn the same representation up to a rotation. Because most of the global minima of the loss function are not Platonic, that SGD only finds the perfectly Platonic solution is rather extraordinary. The proof also suggests at least six ways the PRH can be broken. We also show that in the EDLN model, the emergence of the Platonic representations is due to the same reason as the emergence of progressive sharpening. This implies that these two seemingly unrelated phenomena in deep learning can, surprisingly, have a common cause. Overall, the theory and proof highlight the importance of understanding emergent “entropic forces” due to the irreversibility of SGD training and their role in representation learning. The goal of this note is to be instructive and avoid lengthy technical details.
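
The "same representation up to a rotation" claim can be tested numerically with orthogonal Procrustes alignment, as in the following SciPy sketch on synthetic embeddings (purely illustrative; the note's EDLN analysis is theoretical).

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
H1 = rng.standard_normal((100, 8))                 # layer-1 embeddings
R_true, _ = np.linalg.qr(rng.standard_normal((8, 8)))
H2 = H1 @ R_true                                   # layer-2 = rotated copy

R, _ = orthogonal_procrustes(H1, H2)               # best rotation H1 -> H2
residual = np.linalg.norm(H1 @ R - H2) / np.linalg.norm(H2)
print(residual)  # ~0: the two layers are "Platonic" (equal up to rotation)
```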

[LG-39] Development and Comparative Evaluation of Three Artificial Intelligence Models (NLP, LLM, JEPA) for Predicting Triage in Emergency Departments: A 7-Month Retrospective Proof-of-Concept

链接: https://arxiv.org/abs/2507.01080
作者: Edouard Lansiaux,Ramy Azzouz,Emmanuel Chazard,Amélie Vromant,Eric Wiel
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Triage errors, including undertriage and overtriage, are persistent challenges in emergency departments (EDs). With increasing patient influx and staff shortages, the integration of artificial intelligence (AI) into triage protocols has gained attention. This study compares the performance of three AI models [Natural Language Processing (NLP), Large Language Models (LLM), and Joint Embedding Predictive Architecture (JEPA)] in predicting triage outcomes against the FRENCH scale and clinical practice. We conducted a retrospective analysis of a prospectively recruited cohort gathering adult patient triage data over a 7-month period at the Roger Salengro Hospital ED (Lille, France). Three AI models were trained and validated: (1) TRIAGEMASTER (NLP), (2) URGENTIAPARSE (LLM), and (3) EMERGINET (JEPA). Data included demographic details, verbatim chief complaints, vital signs, and triage outcomes based on the FRENCH scale and GEMSA coding. The primary outcome was the concordance of the AI-predicted triage level with the FRENCH gold standard, assessed using several indicators: F1-Score, Weighted Kappa, Spearman correlation, MAE, and RMSE. The LLM model (URGENTIAPARSE) showed higher accuracy (composite score: 2.514) compared to JEPA (EMERGINET, 0.438) and NLP (TRIAGEMASTER, -3.511), outperforming nurse triage (-4.343). Secondary analyses highlighted the effectiveness of URGENTIAPARSE in predicting hospitalization needs (GEMSA) and its robustness with structured data versus raw transcripts (for both GEMSA and FRENCH prediction). The LLM architecture, through its abstraction of patient representations, offers the most accurate triage predictions among the tested models. Integrating AI into ED workflows could enhance patient safety and operational efficiency, though integration into clinical workflows requires addressing model limitations and ensuring ethical transparency.

[LG-40] yProv4ML: Effortless Provenance Tracking for Machine Learning Systems

链接: https://arxiv.org/abs/2507.01078
作者: Gabriele Padovani,Valentine Anantharaj,Sandro Fiore
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The rapid growth of interest in large language models (LLMs) reflects their potential for flexibility and generalization, and has attracted the attention of a diverse range of researchers. However, the advent of these techniques has also brought to light the lack of transparency and rigor with which development is pursued. In particular, the inability to determine the number of epochs and other hyperparameters in advance presents challenges in identifying the best model. To address this challenge, machine learning frameworks such as MLflow can automate the collection of this type of information. However, these tools capture data using proprietary formats and pay little attention to lineage. This paper proposes yProv4ML, a framework to capture provenance information generated during machine learning processes in PROV-JSON format, with minimal code modifications.

[LG-41] Good Enough to Learn: LLM-based Anomaly Detection in ECU Logs without Reliable Labels

链接: https://arxiv.org/abs/2507.01077
作者: Bogdan Bogdan,Arina Cazacu,Laura Vasilie
类目: Machine Learning (cs.LG)
*备注: 6 pages, 7 figures, 4 tables, accepted to IEEE Intelligent Vehicles Symposium (IV) 2025

点击查看摘要

Abstract:Anomaly detection often relies on supervised or clustering approaches, with limited success in specialized domains like automotive communication systems where scalable solutions are essential. We propose a novel decoder-only Large Language Model (LLM) to detect anomalies in Electronic Control Unit (ECU) communication logs. Our approach addresses two key challenges: the lack of LLMs tailored for ECU communication and the complexity of inconsistent ground truth data. By learning from UDP communication logs, we formulate anomaly detection simply as identifying deviations in time from normal behavior. We introduce an entropy regularization technique that increases the model's uncertainty on known anomalies while maintaining consistency in similar scenarios. Our solution offers three novelties: a decoder-only anomaly detection architecture, a way to handle inconsistent labeling, and an adaptable LLM for different ECU communication use cases. By leveraging the generative capabilities of decoder-only models, we present a new technique that addresses the high cost and error-prone nature of manual labeling through a more scalable system that is able to learn from a minimal set of examples, while improving detection accuracy in complex communication environments.
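
A minimal sketch of the entropy-regularization idea, under our own formulation rather than the paper's exact loss: fit the normal log tokens with cross-entropy while pushing predictive entropy up on samples flagged as known anomalies.

```python
import torch
import torch.nn.functional as F

def loss_with_entropy_reg(logits, targets, anomaly_mask, beta=0.1):
    """Cross-entropy on normal samples, minus beta * entropy on samples
    flagged as known anomalies (so the model grows *less* certain there)."""
    ce = F.cross_entropy(logits[~anomaly_mask], targets[~anomaly_mask])
    probs = F.softmax(logits[anomaly_mask], dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return ce - beta * entropy

logits = torch.randn(8, 100, requires_grad=True)    # 8 log tokens, 100-way
targets = torch.randint(0, 100, (8,))
mask = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1], dtype=torch.bool)
loss_with_entropy_reg(logits, targets, mask).backward()
```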

[LG-42] Provenance Tracking in Large-Scale Machine Learning Systems

链接: https://arxiv.org/abs/2507.01075
作者: Gabriele Padovani,Valentine Anantharaj,Sandro Fiore
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:As the demand for large scale AI models continues to grow, the optimization of their training to balance computational efficiency, execution time, accuracy and energy consumption represents a critical multidimensional challenge. Achieving this balance requires not only innovative algorithmic techniques and hardware architectures but also comprehensive tools for monitoring, analyzing, and understanding the underlying processes involved in model training and deployment. Provenance data, i.e., information about the origins, context, and transformations of data and processes, has become a key component in this pursuit. By leveraging provenance, researchers and engineers can gain insights into resource usage patterns, identify inefficiencies, and ensure reproducibility and accountability in AI development workflows. For this reason, the question of how distributed resources can be optimally utilized to scale large AI models in an energy-efficient manner is a fundamental one. To support this effort, we introduce the yProv4ML library, a tool designed to collect provenance data in JSON format, compliant with the W3C PROV and ProvML standards. yProv4ML focuses on flexibility and extensibility, and enables users to integrate additional data collection tools via plugins. The library is fully integrated with the yProv framework, allowing for higher-level pairing of tasks that are also run through workflow management systems.

[LG-43] Rotational Sampling: A Plug-and-Play Encoder for Rotation-Invariant 3D Molecular GNNs

链接: https://arxiv.org/abs/2507.01073
作者: Dian Jin
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have achieved remarkable success in molecular property prediction. However, traditional graph representations struggle to effectively encode the inherent 3D spatial structures of molecules, as molecular orientations in 3D space introduce significant variability, severely limiting model generalization and robustness. Existing approaches primarily focus on rotation-invariant and rotation-equivariant methods. Invariant methods often rely heavily on prior knowledge and lack sufficient generalizability, while equivariant methods suffer from high computational costs. To address these limitations, this paper proposes a novel plug-and-play 3D encoding module leveraging rotational sampling. By computing the expectation over the SO(3) rotational group, the method naturally achieves approximate rotational invariance. Furthermore, by introducing a carefully designed post-alignment strategy, strict invariance can be achieved without compromising performance. Experimental evaluations on the QM9 and C10 Datasets demonstrate superior predictive accuracy, robustness, and generalization performance compared to existing methods. Moreover, the proposed approach maintains low computational complexity and enhanced interpretability, providing a promising direction for efficient and effective handling of 3D molecular information in drug discovery and material design.

[LG-44] Prediction of Freezing of Gait in Parkinson's Disease using Explainable AI and Federated Deep Learning for Wearable Sensors

链接: https://arxiv.org/abs/2507.01068
作者: Biplov Paneru
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study leverages an Inertial Measurement Unit (IMU) dataset to develop explainable AI methods for the early detection and prediction of Freezing of Gait (FOG), a common symptom in Parkinson’s disease. Machine learning models, including CatBoost, XGBoost, and Extra Trees classifiers, are employed to accurately categorize FOG episodes based on relevant clinical features. A Stacking Ensemble model achieves superior performance, surpassing a hybrid bidirectional GRU model and reaching nearly 99% classification accuracy. SHAP interpretability analysis reveals that time (seconds) is the most influential factor in distinguishing gait patterns. Additionally, the proposed FOG prediction framework incorporates federated learning, where models are trained locally on individual devices and aggregated on a central server using a federated averaging approach, utilizing a hybrid Conv1D + LSTM architecture for enhanced predictive capability.
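
The federated-averaging step described above can be sketched in a few lines of PyTorch; the model architecture and weighting by local dataset size follow the standard FedAvg recipe and are assumptions as far as this paper's exact setup goes.

```python
import copy
import torch

def federated_average(client_models, client_sizes):
    """Server-side FedAvg: weight each client's parameters by its local
    dataset size and average into a new global state dict."""
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_models[0].state_dict())
    for key in avg:
        avg[key] = sum(m.state_dict()[key] * (n / total)
                       for m, n in zip(client_models, client_sizes))
    return avg

clients = [torch.nn.Linear(10, 2) for _ in range(3)]   # stand-in local models
global_state = federated_average(clients, client_sizes=[120, 80, 200])
for c in clients:
    c.load_state_dict(global_state)                    # broadcast back
```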

[LG-45] Optimizing Conversational Product Recommendation via Reinforcement Learning

链接: https://arxiv.org/abs/2507.01060
作者: Kang Liu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a reinforcement learning-based approach to optimize conversational strategies for product recommendation across diverse industries. As organizations increasingly adopt intelligent agents to support sales and service operations, the effectiveness of a conversation hinges not only on what is recommended but how and when recommendations are delivered. We explore a methodology where agentic systems learn optimal dialogue policies through feedback-driven reinforcement learning. By mining aggregate behavioral patterns and conversion outcomes, our approach enables agents to refine talk tracks that drive higher engagement and product uptake, while adhering to contextual and regulatory constraints. We outline the conceptual framework, highlight key innovations, and discuss the implications for scalable, personalized recommendation in enterprise environments.

[LG-46] Loop2Net: Data-Driven Generation and Optimization of Airfoil CFD Meshes from Sparse Boundary Coordinates

链接: https://arxiv.org/abs/2507.01057
作者: Lushun Fan,Yuqin Xia,Jun Li,Karl Jenkins
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:In this study, an intelligent mesh-quality optimization system based on a deep convolutional neural network architecture is proposed to achieve mesh generation and optimization. The core of the study is the Loop2Net generator and its loss functions: the network predicts a mesh from the given wing coordinates, and its performance is continuously optimized during training by two key loss functions. By further disciplining the network through added penalties, the goal of mesh generation is finally reached.

[LG-47] Evaluating Pavement Deterioration Rates Due to Flooding Events Using Explainable AI

链接: https://arxiv.org/abs/2507.01056
作者: Lidan Peng,Lu Gao,Feng Hong,Jingran Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flooding can damage pavement infrastructure significantly, causing both immediate and long-term structural and functional issues. This research investigates how flooding events affect pavement deterioration, specifically focusing on measuring pavement roughness by the International Roughness Index (IRI). To quantify these effects, we utilized 20 years of pavement condition data from TxDOT’s PMIS database, which is integrated with flood event data, including duration and spatial extent. Statistical analyses were performed to compare IRI values before and after flooding and to calculate the deterioration rates influenced by flood exposure. Moreover, we applied Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME), to assess the impact of flooding on pavement performance. The results demonstrate that flood-affected pavements experience a more rapid increase in roughness compared to non-flooded sections. These findings emphasize the need for proactive flood mitigation strategies, including improved drainage systems, flood-resistant materials, and preventative maintenance, to enhance pavement resilience in vulnerable regions.

[LG-48] 3W Dataset 2.0.0: a realistic and public dataset with rare undesirable real events in oil wells

链接: https://arxiv.org/abs/2507.01048
作者: Ricardo Emanuel Vaz Vargas,Afrânio José de Melo Junior,Celso José Munaro,Cláudio Benevenuto de Campos Lima,Eduardo Toledo de Lima Junior,Felipe Muntzberg Barrocas,Flávio Miguel Varejão,Guilherme Fidelis Peixer,Igor de Melo Nery Oliveira,Jader Riso Barbosa Jr.,Jaime Andrés Lozano Cadena,Jean Carlos Dias de Araújo,João Neuenschwander Escosteguy Carneiro,Lucas Gouveia Omena Lopes,Lucas Pereira de Gouveia,Mateus de Araujo Fernandes,Matheus Lima Scramignon,Patrick Marques Ciarelli,Rodrigo Castello Branco,Rogério Leite Alves Pinto
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures, and 7 tables

点击查看摘要

Abstract:In the oil industry, undesirable events in oil wells can cause economic losses, environmental accidents, and human casualties. Solutions based on Artificial Intelligence and Machine Learning for Early Detection of such events have proven valuable for diverse applications across industries. In 2019, recognizing the importance and the lack of public datasets related to undesirable events in oil wells, Petrobras developed and publicly released the first version of the 3W Dataset, which is essentially a set of Multivariate Time Series labeled by experts. Since then, the 3W Dataset has been developed collaboratively and has become a foundational reference for numerous works in the field. This data article describes the current publicly available version of the 3W Dataset, which contains structural modifications and additional labeled data. The detailed description provided encourages and supports the 3W community and new 3W users to improve previous published results and to develop new robust methodologies, digital products and services capable of detecting undesirable events in oil wells with enough anticipation to enable corrective or mitigating actions.

[LG-49] Variational Digital Twins

链接: https://arxiv.org/abs/2507.01047
作者: Logan A. Burnett,Umme Mahbuba Nabila,Majdi I. Radaideh
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 33 pages, 14 figures, and 7 tables

点击查看摘要

Abstract:While digital twins (DTs) hold promise for providing real-time insights into complex energy assets, much of the current literature either does not offer a clear framework for information exchange between the model and the asset, lacks key features needed for real-time implementation, or gives limited attention to model uncertainty. Here, we aim to address these gaps by proposing a variational digital twin (VDT) framework that augments standard neural architectures with a single Bayesian output layer. This lightweight addition, along with a novel VDT updating algorithm, lets a twin update in seconds on commodity GPUs while producing calibrated uncertainty bounds that can inform experiment design, control algorithms, and model reliability. The VDT is evaluated on four energy-sector problems. For critical-heat-flux prediction, uncertainty-driven active learning reaches R2 = 0.98 using 47% fewer experiments and one-third the training time of random sampling. A three-year renewable-generation twin maintains R2 > 0.95 for solar output and curbs error growth for volatile wind forecasts via monthly updates that process only one month of data at a time. A nuclear reactor transient cooldown twin reconstructs thermocouple signals with R2 > 0.99 and preserves accuracy after 50% sensor loss, demonstrating robustness to degraded instrumentation. Finally, a physics-informed Li-ion battery twin, retrained after every ten discharges, lowers voltage mean-squared error by an order of magnitude relative to the best static model while adapting its credible intervals as the cell approaches end-of-life. These results demonstrate that combining modest Bayesian augmentation with efficient update schemes turns conventional surrogates into uncertainty-aware, data-efficient, and computationally tractable DTs, paving the way for dependable models across industrial and scientific energy systems.
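
One simple way to realize "a single probabilistic output layer" is a Gaussian head trained with the negative log-likelihood, sketched below; the paper's Bayesian layer and VDT update algorithm may differ, so read this as a generic uncertainty-aware head.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """A deterministic backbone plus one probabilistic output layer that
    predicts a mean and a log-variance per target."""
    def __init__(self, backbone, hidden, out):
        super().__init__()
        self.backbone = backbone
        self.mu = nn.Linear(hidden, out)
        self.log_var = nn.Linear(hidden, out)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu(h), self.log_var(h)

def gaussian_nll(mu, log_var, y):
    # Negative log-likelihood of y under N(mu, exp(log_var)).
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

model = GaussianHead(nn.Sequential(nn.Linear(8, 32), nn.ReLU()), 32, 1)
mu, log_var = model(torch.randn(16, 8))
gaussian_nll(mu, log_var, torch.randn(16, 1)).backward()
```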

[LG-50] Cross-Attention Message-Passing Transformers for Code-Agnostic Decoding in 6G Networks

链接: https://arxiv.org/abs/2507.01038
作者: Seong-Joon Park,Hee-Youl Kwak,Sang-Hyo Kim,Yongjune Kim,Jong-Seon No
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Channel coding for 6G networks is expected to support a wide range of requirements arising from heterogeneous communication scenarios. These demands challenge traditional code-specific decoders, which lack the flexibility and scalability required for next-generation systems. To tackle this problem, we propose an AI-native foundation model for unified and code-agnostic decoding based on the transformer architecture. We first introduce a cross-attention message-passing transformer (CrossMPT). CrossMPT employs two masked cross-attention blocks that iteratively update two distinct input representations-magnitude and syndrome vectors-allowing the model to effectively learn the decoding problem. Notably, our CrossMPT has achieved state-of-the-art decoding performance among single neural decoders. Building on this, we develop foundation CrossMPT (FCrossMPT) by making the architecture invariant to code length, rate, and class, allowing a single trained model to decode a broad range of codes without retraining. To further enhance decoding performance, particularly for short blocklength codes, we propose CrossMPT ensemble decoder (CrossED), an ensemble decoder composed of multiple parallel CrossMPT blocks employing different parity-check matrices. This architecture can also serve as a foundation model, showing strong generalization across diverse code types. Overall, the proposed AI-native code-agnostic decoder offers flexibility, scalability, and high performance, presenting a promising direction to channel coding for 6G networks.

[LG-51] PyTorch-based Geometric Learning with Non-CUDA Processing Units: Experiences from Intel Gaudi-v2 HPUs

链接: https://arxiv.org/abs/2507.01031
作者: Fanchen Bu,Kijung Shin
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Conference paper: Accepted in Korea Computer Congress (KCC) 2025. The library is available at this https URL

点击查看摘要

Abstract:Geometric learning has emerged as a powerful paradigm for modeling non-Euclidean data, especially graph-structured ones, with applications spanning social networks, molecular structures, knowledge graphs, and recommender systems. While Nvidia’s CUDA-enabled graphics processing units (GPUs) largely dominate the hardware landscape, emerging accelerators such as Intel’s Gaudi Habana Processing Units (HPUs) offer competitive performance and energy efficiency. However, the usage of such non-CUDA processing units requires significant engineering effort and novel software adaptations. In this work, we present our experiences porting PyTorch-based geometric learning frameworks to Gaudi-v2 HPUs. We introduce a collection of core utilities that restore essential operations (e.g., scatter, sparse indexing, k-nearest neighbors) on Gaudi-v2 HPUs, and we consolidate sixteen guided tutorials and eleven real-world examples with diagnostic analyses of encountered failures and detailed workarounds. We collect all our experiences into a publicly accessible GitHub repository. Our contributions lower the barrier for researchers to experiment with geometric-learning algorithms and models on non-CUDA hardware, providing a foundation for further optimization and cross-platform portability.

[LG-52] Optimizing Flamelet Generated Manifold Models: A Machine Learning Performance Study

链接: https://arxiv.org/abs/2507.01030
作者: Reza Lotfi Navaei,Mohammad Safarzadeh,Seyed Mohammad Jafar Sobhani
类目: Machine Learning (cs.LG)
*备注: It has been submitted to ASME Journal of Heat and Mass Transfer

点击查看摘要

Abstract:Among chemistry tabulation and flamelet combustion models, the Flamelet Generated Manifold (FGM) is recognized for its precision and physical representation, but its practical implementation requires a significant allocation of memory resources. FGM libraries are developed for a specific fuel and subsequently reused across numerical problems via machine learning techniques. This research aims to develop laminar FGM libraries using machine learning algorithms for combustion simulations of methane fuel. Based on an understanding of the data sources, techniques, and data-driven concepts, the study employs four machine learning algorithms to regenerate the Flamelet libraries: (1) Multi-Layer Perceptron (MLP); (2) Random Forest; (3) Linear Regression; (4) Support Vector Machine. Seven libraries were identified as appropriate for constructing a database for training the machine learning models, giving an error rate of 2.30%. The default architecture of each method was evaluated to determine the optimal approach, leading to the selection of the MLP method as the primary choice, which was then enhanced through hyperparameter tuning; the number of hidden layers and neurons significantly influences performance. The optimal model, comprising four hidden layers with 10, 15, 20, and 25 neurons respectively, achieved an accuracy of 99.81%.
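
The reported best architecture is easy to reproduce structurally with scikit-learn, as sketched below; the inputs and targets are random placeholders, not the actual FGM control variables and tabulated quantities.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 2)   # placeholder control variables
y = np.random.rand(1000)      # placeholder tabulated FGM quantity

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(10, 15, 20, 25), max_iter=2000),
)
model.fit(X, y)
print(model.score(X, y))      # R^2 on the training table
```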

[LG-53] Dual Perspectives on Non-Contrastive Self-Supervised Learning

链接: https://arxiv.org/abs/2507.01028
作者: Jean Ponce(WILLOW),Martial Hebert(CMU),Basile Terver(FAIR, WILLOW)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The objective of non-contrastive approaches to self-supervised learning is to train on pairs of different views of the data an encoder and a predictor that minimize the mean discrepancy between the code predicted from the embedding of the first view and the embedding of the second one. In this setting, the stop gradient and exponential moving average iterative procedures are commonly used to avoid representation collapse, with excellent performance in downstream supervised applications. This presentation investigates these procedures from the dual theoretical viewpoints of optimization and dynamical systems. We first show that, in general, although they do not optimize the original objective, or for that matter, any other smooth function, they do avoid collapse. Following Tian et al. [2021], but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we finally show that the limit points of the dynamical systems associated with these two procedures are, in general, asymptotically stable equilibria, with no risk of degenerating to trivial solutions.
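
For readers less familiar with the two procedures under study, a BYOL-style training step showing both the stop gradient and the exponential moving average is sketched below (toy linear encoders; the analysis in the paper concerns exactly this kind of dynamics).

```python
import torch
import torch.nn.functional as F

def noncontrastive_step(encoder, predictor, target_encoder, v1, v2, tau=0.99):
    p = predictor(encoder(v1))            # online branch
    with torch.no_grad():                 # stop gradient on the target branch
        z = target_encoder(v2)
    loss = F.mse_loss(F.normalize(p, dim=-1), F.normalize(z, dim=-1))
    loss.backward()
    with torch.no_grad():                 # EMA update of the target weights
        for pt, po in zip(target_encoder.parameters(), encoder.parameters()):
            pt.mul_(tau).add_(po, alpha=1.0 - tau)
    return loss

enc, pred = torch.nn.Linear(32, 16), torch.nn.Linear(16, 16)
tgt = torch.nn.Linear(32, 16)
tgt.load_state_dict(enc.state_dict())
print(noncontrastive_step(enc, pred, tgt, torch.randn(8, 32), torch.randn(8, 32)))
```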

[LG-54] DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization

链接: https://arxiv.org/abs/2507.01027
作者: Zijian Ye,Wei Huang,Yifei Yu,Tianhe Ren,Zhongrui Wang,Xiaojuan Qi
类目: Machine Learning (cs.LG)
*备注: 19 pages; Appendix added

点击查看摘要

Abstract:Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and the presence of activation outliers. To address these challenges, we introduce DBellQuant, an innovative post-training quantization (PTQ) framework that achieves nearly 1-bit weight compression and 6-bit activation quantization with minimal performance degradation. DBellQuant uses the Learnable Transformation for Dual-Bell (LTDB) algorithm, which transforms single-bell weight distributions into dual-bell forms to reduce binarization errors and applies inverse transformations to smooth activations. DBellQuant sets a new state-of-the-art by preserving superior model performance under aggressive weight and activation quantization. For example, on the Wikitext2 dataset, DBellQuant achieves a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming BiLLM's 21.35 without activation quantization, underscoring its potential in compressing LLMs for real-world applications.

[LG-55] Few-Shot Inspired Generative Zero-Shot Learning

链接: https://arxiv.org/abs/2507.01026
作者: Md Shakil Ahamed Shohag,Q. M. Jonathan Wu,Farhad Pourpanah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative zero-shot learning (ZSL) methods typically synthesize visual features for unseen classes using predefined semantic attributes, followed by training a fully supervised classification model. While effective, these methods require substantial computational resources and extensive synthetic data, thereby relaxing the original ZSL assumptions. In this paper, we propose FSIGenZ, a few-shot-inspired generative ZSL framework that reduces reliance on large-scale feature synthesis. Our key insight is that class-level attributes exhibit instance-level variability, i.e., some attributes may be absent or partially visible, yet conventional ZSL methods treat them as uniformly present. To address this, we introduce Model-Specific Attribute Scoring (MSAS), which dynamically re-scores class attributes based on model-specific optimization to approximate instance-level variability without access to unseen data. We further estimate group-level prototypes as clusters of instances based on MSAS-adjusted attribute scores, which serve as representative synthetic features for each unseen class. To mitigate the resulting data imbalance, we introduce a Dual-Purpose Semantic Regularization (DPSR) strategy while training a semantic-aware contrastive classifier (SCC) using these prototypes. Experiments on SUN, AwA2, and CUB benchmarks demonstrate that FSIGenZ achieves competitive performance using far fewer synthetic features.

[LG-56] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

链接: https://arxiv.org/abs/2507.01020
作者: Aashray Reddy,Andrew Zagula,Nicholas Saban
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 16 pages, 4 figures, submitted to LLMSEC

点击查看摘要

Abstract:Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. As such, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR) using the StrongREJECT (arXiv:2402.10260 [cs.CL]) framework across sequential interaction turns. Through extensive empirical evaluation of state-of-the-art models–including ChatGPT, Llama, and DeepSeek–we reveal significant vulnerabilities, with our automated attacks achieving jailbreak success rates of up to 86% for harmful content generation. Our findings reveal that current safety mechanisms remain susceptible to sophisticated multi-turn attacks, emphasizing the urgent need for more robust defense strategies.

[LG-57] Characterizing control between interacting subsystems with deep Jacobian estimation

链接: https://arxiv.org/abs/2507.01946
作者: Adam J. Eisen,Mitchell Ostrow,Sarthak Chandra,Leo Kozachkov,Earl K. Miller,Ila R. Fiete
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Dynamical Systems (math.DS); Neurons and Cognition (q-bio.NC)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Biological function arises through the dynamical interactions of multiple subsystems, including those between brain areas, within gene regulatory networks, and more. A common approach to understanding these systems is to model the dynamics of each subsystem and characterize communication between them. An alternative approach is through the lens of control theory: how the subsystems control one another. This approach involves inferring the directionality, strength, and contextual modulation of control between subsystems. However, methods for understanding subsystem control are typically linear and cannot adequately describe the rich contextual effects enabled by nonlinear complex systems. To bridge this gap, we devise a data-driven nonlinear control-theoretic framework to characterize subsystem interactions via the Jacobian of the dynamics. We address the challenge of learning Jacobians from time-series data by proposing the JacobianODE, a deep learning method that leverages properties of the Jacobian to directly estimate it for arbitrary dynamical systems from data alone. We show that JacobianODEs outperform existing Jacobian estimation methods on challenging systems, including high-dimensional chaos. Applying our approach to a multi-area recurrent neural network (RNN) trained on a working memory selection task, we show that the “sensory” area gains greater control over the “cognitive” area over learning. Furthermore, we leverage the JacobianODE to directly control the trained RNN, enabling precise manipulation of its behavior. Our work lays the foundation for a theoretically grounded and data-driven understanding of interactions among biological subsystems.
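
For intuition about what a JacobianODE is trained to produce, the snippet below computes the ground-truth state Jacobian of a toy dynamical system with autograd; the paper's method estimates such Jacobians from time-series data alone, without access to the dynamics function used here.

```python
import torch
from torch.autograd.functional import jacobian

def lotka_volterra(x, a=1.0, b=0.4, c=0.4, d=1.0):
    # dx/dt for a classic two-species predator-prey system.
    return torch.stack([a * x[0] - b * x[0] * x[1],
                        c * x[0] * x[1] - d * x[1]])

x0 = torch.tensor([2.0, 1.0])
print(jacobian(lotka_volterra, x0))  # 2x2 state Jacobian df/dx at x0
```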

[LG-58] A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition

链接: https://arxiv.org/abs/2507.01932
作者: Zhaosong Lu,Xiangyuan Wang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 26 pages

点击查看摘要

Abstract:We study a class of nonconvex-nonconcave minimax problems in which the inner maximization problem satisfies a local Kurdyka-Łojasiewicz (KL) condition that may vary with the outer minimization variable. In contrast to the global KL or Polyak-Łojasiewicz (PL) conditions commonly assumed in the literature – which are significantly stronger and often too restrictive in practice – this local KL condition accommodates a broader range of practical scenarios. However, it also introduces new analytical challenges. In particular, as an optimization algorithm progresses toward a stationary point of the problem, the region over which the KL condition holds may shrink, resulting in a more intricate and potentially ill-conditioned landscape. To address this challenge, we show that the associated maximal function is locally Hölder smooth. Leveraging this key property, we develop an inexact proximal gradient method for solving the minimax problem, where the inexact gradient of the maximal function is computed by applying a proximal gradient method to a KL-structured subproblem. Under mild assumptions, we establish complexity guarantees for computing an approximate stationary point of the minimax problem.

[LG-59] Advancing Magnetic Materials Discovery – A structure-based machine learning approach for magnetic ordering and magnetic moment prediction

链接: https://arxiv.org/abs/2507.01913
作者: Apoorv Verma,Junaid Jami,Amrita Bhattacharya
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting magnetic behavior across diverse materials systems remains a longstanding challenge due to the complex interplay of structural and electronic factors and is pivotal for the accelerated discovery and design of next-generation magnetic materials. In this work, a refined descriptor is proposed that significantly improves the prediction of two critical magnetic properties – magnetic ordering (Ferromagnetic vs. Ferrimagnetic) and magnetic moment per atom – using only the structural information of materials. Unlike previous models limited to Mn-based or lanthanide-transition metal compounds, the present approach generalizes across a diverse dataset of 5741 stable, binary and ternary, ferromagnetic and ferrimagnetic compounds sourced from the Materials Project. Leveraging an enriched elemental vector representation and advanced feature engineering, including nonlinear terms and reduced matrix sparsity, the LightGBM-based model achieves an accuracy of 82.4% for magnetic ordering classification and balanced recall across FM and FiM classes, addressing a key limitation in prior studies. The model predicts magnetic moment per atom with a correlation coefficient of 0.93, surpassing the Hund’s matrix and orbital field matrix descriptors. Additionally, it accurately estimates formation energy per atom, enabling assessment of both magnetic behavior and material stability. This generalized and computationally efficient framework offers a robust tool for high-throughput screening of magnetic materials with tailored properties.
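
A structural sketch of the classification setup with LightGBM follows; the descriptors are random placeholders, since the paper's enriched elemental-vector features are the actual substance of the work.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(5741, 40)        # placeholder structural descriptors
y = np.random.randint(0, 2, 5741)   # 0 = ferromagnetic, 1 = ferrimagnetic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LGBMClassifier(n_estimators=300, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))        # ~0.5 here; real features do the work
```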

[LG-60] STEM Diffraction Pattern Analysis with Deep Learning Networks

链接: https://arxiv.org/abs/2507.01889
作者: Sebastian Wissel,Jonas Scheunert,Aaron Dextre,Shamail Ahmed,Andreas Bayer,Kerstin Volz,Bai-Xiang Xu
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate grain orientation mapping is essential for understanding and optimizing the performance of polycrystalline materials, particularly in energy-related applications. Lithium nickel oxide (LiNiO2) is a promising cathode material for next-generation lithium-ion batteries, and its electrochemical behaviour is closely linked to microstructural features such as grain size and crystallographic orientations. Traditional orientation mapping methods–such as manual indexing, template matching (TM), or Hough transform-based techniques–are often slow and noise-sensitive when handling complex or overlapping patterns, creating a bottleneck in large-scale microstructural analysis. This work presents a machine learning-based approach for predicting Euler angles directly from scanning transmission electron microscopy (STEM) diffraction patterns (DPs). This enables the automated generation of high-resolution crystal orientation maps, facilitating the analysis of internal microstructures at the nanoscale. Three deep learning architectures–convolutional neural networks (CNNs), Dense Convolutional Networks (DenseNets), and Shifted Windows (Swin) Transformers–are evaluated, using an experimentally acquired dataset labelled via a commercial TM algorithm. While the CNN model serves as a baseline, both DenseNets and Swin Transformers demonstrate superior performance, with the Swin Transformer achieving the highest evaluation scores and the most consistent microstructural predictions. The resulting crystal maps exhibit clear grain boundary delineation and coherent intra-grain orientation distributions, underscoring the potential of attention-based architectures for analyzing diffraction-based image data. These findings highlight the promise of combining advanced machine learning models with STEM data for robust, high-throughput microstructural characterization.

[LG-61] Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach

链接: https://arxiv.org/abs/2507.01728
作者: Hao Wei,Wanli Ni,Wen Wang,Wenjun Xu,Dusit Niyato,Ping Zhang
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This letter proposes UniToCom, a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission. Specifically, to enable efficient token representations, we propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information while supporting reliable generation across multiple modalities. By doing this, GenIB-based tokenization is conducive to improving the communication efficiency and reducing computational complexity. Additionally, we develop σ-GenIB to address the challenges of variance collapse in autoregressive modeling, maintaining representational diversity and stability. Moreover, we employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens under the next-token prediction paradigm. Simulation results validate the effectiveness and superiority of the proposed UniToCom compared to baselines under dynamic channel conditions. By integrating token processing with MLLMs, UniToCom enables scalable and generalizable communication in favor of multimodal understanding and generation, providing a potential solution for next-generation intelligent communications.

[LG-62] A generative modeling / Physics-Informed Neural Network approach to random differential equations

链接: https://arxiv.org/abs/2507.01687
作者: Georgios Arampatzis,Stylianos Katsarakis,Charalambos Makridakis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The integration of Scientific Machine Learning (SciML) techniques with uncertainty quantification (UQ) represents a rapidly evolving frontier in computational science. This work advances Physics-Informed Neural Networks (PINNs) by incorporating probabilistic frameworks to effectively model uncertainty in complex systems. Our approach enhances the representation of uncertainty in forward problems by combining generative modeling techniques with PINNs. This integration enables systematic uncertainty control while maintaining the predictive accuracy of the model. We demonstrate the utility of this method through applications to random differential equations and random partial differential equations (PDEs).

[LG-63] When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery

链接: https://arxiv.org/abs/2507.01613
作者: Shirong Xu,Jingnan Zhang,Junhui Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Paired comparison data, where users evaluate items in pairs, play a central role in ranking and preference learning tasks. While ordinal comparison data intuitively offer richer information than binary comparisons, this paper challenges that conventional wisdom. We propose a general parametric framework for modeling ordinal paired comparisons without ties. The model adopts a generalized additive structure, featuring a link function that quantifies the preference difference between two items and a pattern function that governs the distribution over ordinal response levels. This framework encompasses classical binary comparison models as special cases, by treating binary responses as binarized versions of ordinal data. Within this framework, we show that binarizing ordinal data can significantly improve the accuracy of ranking recovery. Specifically, we prove that under the counting algorithm, the ranking error associated with binary comparisons exhibits a faster exponential convergence rate than that of ordinal data. Furthermore, we characterize a substantial performance gap between binary and ordinal data in terms of a signal-to-noise ratio (SNR) determined by the pattern function. We identify the pattern function that minimizes the SNR and maximizes the benefit of binarization. Extensive simulations and a real application on the MovieLens dataset further corroborate our theoretical findings.
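
The counting algorithm referenced in the analysis is simple enough to state in full: score each item by its number of wins over binarized comparisons and sort. The toy data below are ours.

```python
import numpy as np

def counting_rank(pairs, outcomes, n_items):
    """Rank items by their number of wins over binarized comparisons."""
    wins = np.zeros(n_items)
    for (i, j), y in zip(pairs, outcomes):  # y = 1 if i beat j, else 0
        wins[i if y == 1 else j] += 1
    return np.argsort(-wins)                # best-to-worst

pairs = [(0, 1), (0, 2), (1, 2), (1, 0), (2, 0), (2, 1)]
outcomes = [1, 1, 1, 0, 0, 0]               # noiseless: 0 > 1 > 2
print(counting_rank(pairs, outcomes, n_items=3))  # [0 1 2]
```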

[LG-64] Transfer Learning for VLC-based indoor Localization: Addressing Environmental Variability

Link: https://arxiv.org/abs/2507.01575
Authors: Masood Jan, Wafa Njima, Xun Zhang, Alexander Artemenko
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: Accepted for publication in the IEEE VTC2025-Spring Conference, 7 pages

Click to view abstract

Abstract:Accurate indoor localization is crucial in industrial environments. Visible Light Communication (VLC) has emerged as a promising solution, offering high accuracy, energy efficiency, and minimal electromagnetic interference. However, VLC-based indoor localization faces challenges due to environmental variability, such as lighting fluctuations and obstacles. To address these challenges, we propose a Transfer Learning (TL)-based approach for VLC-based indoor localization. Using real-world data collected at a BOSCH factory, the TL framework integrates a deep neural network (DNN) to improve localization accuracy by 47%, reduce energy consumption by 32%, and decrease computational time by 40% compared to conventional models. The proposed solution is highly adaptable under varying environmental conditions and achieves similar accuracy with only 30% of the dataset, making it a cost-efficient and scalable option for industrial applications in Industry 4.0.
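
A hedged sketch of the transfer-learning step: reuse a DNN trained in the source lighting environment, freeze its feature layers, and fine-tune only the regression head on the small target-environment dataset. The architecture, file name, and shapes below are illustrative assumptions.

```python
# Fine-tune a pretrained localization DNN on a new environment (sketch).
import torch
import torch.nn as nn

class LocalizationDNN(nn.Module):
    def __init__(self, n_leds=8):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_leds, 128), nn.ReLU(),
                                      nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, 2)          # (x, y) position estimate

    def forward(self, rss):                   # received signal strength per LED
        return self.head(self.features(rss))

model = LocalizationDNN()
# model.load_state_dict(torch.load("source_env.pt"))  # pretrained weights (assumed path)
for p in model.features.parameters():
    p.requires_grad = False                   # freeze shared feature layers
opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# Fine-tuning loop over the ~30% target-environment subset:
# for rss, xy in target_loader:
#     opt.zero_grad(); loss_fn(model(rss), xy).backward(); opt.step()
```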

[LG-65] Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles

Link: https://arxiv.org/abs/2507.01542
Authors: Tom Szwagier, Pierre-Alexandre Mattei, Charles Bouveyron, Xavier Pennec
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) certainly lack flexibility to fit certain anisotropic distributions. Connecting these two extremes, we introduce a new family of parsimonious GMMs with piecewise-constant covariance eigenvalue profiles. These extend several low-rank models like the celebrated mixtures of probabilistic principal component analyzers (MPPCA), by enabling any possible sequence of eigenvalue multiplicities. If the latter are prespecified, then we can naturally derive an expectation-maximization (EM) algorithm to learn the mixture parameters. Otherwise, to address the notoriously-challenging issue of jointly learning the mixture parameters and hyperparameters, we propose a componentwise penalized EM algorithm, whose monotonicity is proven. We show the superior likelihood-parsimony tradeoffs achieved by our models on a variety of unsupervised experiments: density fitting, clustering and single-image denoising.
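
The "piecewise-constant eigenvalue profile" is easy to picture: the covariance spectrum takes a few distinct values with prescribed multiplicities, interpolating between spherical GMMs (one block) and full GMMs (all multiplicities equal to one). A small numpy sketch, with illustrative numbers:

```python
# Build a covariance with a piecewise-constant eigenvalue profile.
import numpy as np

rng = np.random.default_rng(1)
p = 10
multiplicities = [2, 3, 5]        # eigenvalue blocks, summing to p
values = [5.0, 1.0, 0.1]          # one variance per block, decreasing

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthonormal eigenbasis
lam = np.repeat(values, multiplicities)        # piecewise-constant spectrum
Sigma = Q @ np.diag(lam) @ Q.T                 # parsimonious covariance

# MPPCA corresponds to the special case multiplicities = [k, p - k].
print(np.linalg.eigvalsh(Sigma).round(3))      # repeated eigenvalues, as designed
```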

[LG-66] Meteoroid stream identification with HDBSCAN unsupervised clustering algorithm

Link: https://arxiv.org/abs/2507.01501
Authors: Eloy Peña-Asensio, Fabio Ferrari
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*Comments: Accepted in The Astronomical Journal

Click to view abstract

Abstract:Accurate identification of meteoroid streams is central to understanding their origins and evolution. However, overlapping clusters and background noise hinder classification, an issue amplified for missions such as ESA’s LUMIO that rely on meteor shower observations to infer lunar meteoroid impact parameters. This study evaluates the performance of the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm for unsupervised meteoroid stream identification, comparing its outcomes with the established Cameras for All-Sky Meteor Surveillance (CAMS) look-up table method. We analyze the CAMS Meteoroid Orbit Database v3.0 using three feature vectors: LUTAB (CAMS geocentric parameters), ORBIT (heliocentric orbital elements), and GEO (adapted geocentric parameters). HDBSCAN is applied with varying minimum cluster sizes and two cluster selection methods (eom and leaf). To align HDBSCAN clusters with CAMS classifications, the Hungarian algorithm determines the optimal mapping. Clustering performance is assessed via the Silhouette score, Normalized Mutual Information, and F1 score, with Principal Component Analysis further supporting the analysis. With the GEO vector, HDBSCAN confirms 39 meteoroid streams, 21 strongly aligning with CAMS. The ORBIT vector identifies 30 streams, 13 with high matching scores. Less active showers pose identification challenges. The eom method consistently yields superior performance and agreement with CAMS. Although HDBSCAN requires careful selection of the minimum cluster size, it delivers robust, internally consistent clusters and outperforms the look-up table method in statistical coherence. These results underscore HDBSCAN’s potential as a mathematically consistent alternative for meteoroid stream identification, although further validation is needed to assess physical validity.
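
A minimal sketch of the clustering-and-matching pipeline: run HDBSCAN on the chosen feature vectors, drop noise points, and align the discovered clusters to reference shower labels with the Hungarian algorithm. The `hdbscan` package, an integer-coded reference label array, and the feature matrix `X` are assumptions.

```python
# Cluster meteor feature vectors and match clusters to reference showers.
import numpy as np
import hdbscan
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_match(X, ref_labels, min_cluster_size=20, method="eom"):
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                             cluster_selection_method=method).fit_predict(X)
    mask = labels >= 0                         # discard noise points (label -1)
    found, ref = labels[mask], ref_labels[mask]
    C = np.zeros((found.max() + 1, ref.max() + 1))
    np.add.at(C, (found, ref), 1)              # cluster-vs-shower contingency table
    rows, cols = linear_sum_assignment(-C)     # Hungarian: maximize total overlap
    nmi = normalized_mutual_info_score(ref, found)
    return dict(zip(rows, cols)), nmi
```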

[LG-67] Symbolic identification of tensor equations in multidimensional physical fields

Link: https://arxiv.org/abs/2507.01466
Authors: Tianyi Chen, Hao Yang, Wenjun Ma, Jun Zhang
Categories: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recently, data-driven methods have shown great promise for discovering governing equations from simulation or experimental data. However, most existing approaches are limited to scalar equations, with few capable of identifying tensor relationships. In this work, we propose a general data-driven framework for identifying tensor equations, referred to as Symbolic Identification of Tensor Equations (SITE). The core idea of SITE, representing tensor equations using a host-plasmid structure, is inspired by the multidimensional gene expression programming (M-GEP) approach. To improve the robustness of the evolutionary process, SITE adopts a genetic information retention strategy. Moreover, SITE introduces two key innovations beyond conventional evolutionary algorithms. First, it incorporates a dimensional homogeneity check to restrict the search space and eliminate physically invalid expressions. Second, it replaces traditional linear scaling with a tensor linear regression technique, greatly enhancing the efficiency of numerical coefficient optimization. We validate SITE using two benchmark scenarios, where it accurately recovers target equations from synthetic data, showing robustness to noise and small sample sizes. Furthermore, SITE is applied to identify constitutive relations directly from molecular simulation data, which are generated without reliance on macroscopic constitutive models. It adapts to both compressible and incompressible flow conditions and successfully identifies the corresponding macroscopic forms, highlighting its potential for data-driven discovery of tensor equations.
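
A dimensional homogeneity check of the kind described can be implemented by tracking each quantity's unit as an exponent vector over base dimensions: addition requires equal vectors, multiplication adds them, and any mismatch prunes the candidate expression. A toy sketch, where the unit table and expression encoding are assumptions rather than SITE's internals:

```python
# Dimensional homogeneity check over [length, time, mass] exponent vectors.
import numpy as np

UNITS = {
    "velocity": np.array([1, -1, 0]),   # L T^-1
    "time":     np.array([0, 1, 0]),
    "length":   np.array([1, 0, 0]),
    "density":  np.array([-3, 0, 1]),   # M L^-3
}

def dims_of(expr):
    """expr is a name or a nested tuple like ('add', a, b) / ('mul', a, b)."""
    if isinstance(expr, str):
        return UNITS[expr]
    op, a, b = expr
    da, db = dims_of(a), dims_of(b)
    if op == "add":
        if not np.array_equal(da, db):
            raise ValueError("dimensionally inhomogeneous sum")
        return da
    if op == "mul":
        return da + db
    raise ValueError(f"unknown operator {op}")

dims_of(("add", ("mul", "velocity", "time"), "length"))  # valid: both sides are lengths
# dims_of(("add", "velocity", "density"))                # would raise: pruned candidate
```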

[LG-68] Automated Classification of Volcanic Earthquakes Using Transformer Encoders: Insights into Data Quality and Model Interpretability

Link: https://arxiv.org/abs/2507.01260
Authors: Y. Suzuki, Y. Yukutake, T. Ohminato, M. Yamasaki, Ahyi Kim
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*Comments: Submitted to Seismological Research Letters

Click to view abstract

Abstract:Precisely classifying earthquake types is crucial for elucidating the relationship between volcanic earthquakes and volcanic activity. However, traditional methods rely on subjective human judgment, which requires considerable time and effort. To address this issue, we developed a deep learning model using a transformer encoder for a more objective and efficient classification. Tested on Mount Asama’s diverse seismic activity, our model achieved high F1 scores (0.930 for volcano tectonic, 0.931 for low-frequency earthquakes, and 0.980 for noise), superior to a conventional CNN-based method. To enhance interpretability, attention weight visualizations were analyzed, revealing that the model focuses on key waveform features similarly to human experts. However, inconsistencies in training data, such as ambiguously labeled B-type events with S-waves, were found to influence classification accuracy and attention weight distributions. Experiments addressing data selection and augmentation demonstrated the importance of balancing data quality and diversity. In addition, stations within 3 km of the crater played an important role in improving model performance and interpretability. These findings highlight the potential of Transformer-based models for automated volcanic earthquake classification, particularly in improving efficiency and interpretability. By addressing challenges such as data imbalance and subjective labeling, our approach provides a robust framework for understanding seismic activity at Mount Asama. Moreover, this framework offers opportunities for transfer learning to other volcanic regions, paving the way for enhanced volcanic hazard assessments and disaster mitigation strategies.
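
For a sense of the architecture, here is a hedged PyTorch sketch of a Transformer-encoder waveform classifier: the trace is cut into patches, linearly embedded, encoded, mean-pooled, and mapped to the three classes. All hyperparameters are illustrative, not the paper's.

```python
# Transformer-encoder classifier for seismic waveform windows (sketch).
import torch
import torch.nn as nn

class QuakeClassifier(nn.Module):
    def __init__(self, patch=100, d_model=64, n_classes=3):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls = nn.Linear(d_model, n_classes)    # VT / LF / noise logits

    def forward(self, wave):                        # wave: (batch, samples)
        x = wave.unfold(1, self.patch, self.patch)  # (batch, n_patches, patch)
        h = self.encoder(self.embed(x))
        return self.cls(h.mean(dim=1))              # mean-pool, then classify

logits = QuakeClassifier()(torch.randn(8, 3000))    # e.g., 30 s windows at 100 Hz
```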

[LG-69] Asymptotic convexity of wide and shallow neural networks

Link: https://arxiv.org/abs/2507.01044
Authors: Vivek Borkar, Parthe Pandit
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*Comments: 5 pages

Click to view abstract

Abstract:For a simple model of shallow and wide neural networks, we show that the epigraph of its input-output map, as a function of the network parameters, approximates the epigraph of a convex function in a precise sense. This leads to a plausible explanation of their observed good performance.
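
For reference, the epigraph here is the standard one: the set of points lying on or above the graph. The claim is that, as width grows, the epigraph of the parameters-to-output map approaches that of a convex function; the paper's precise approximation sense and topology are not reproduced here.

```latex
\operatorname{epi}(f) \;=\; \{\, (x, t) \in \mathbb{R}^{d} \times \mathbb{R} : f(x) \le t \,\}
```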

[LG-70] Workflow-Based Evaluation of Music Generation Systems

Link: https://arxiv.org/abs/2507.01022
Authors: Shayan Dadman, Bernt Arild Bremdal, Andreas Bergsland
Categories: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*Comments: 54 pages, 3 figures, 6 tables, 5 appendices

Click to view abstract

Abstract:This study presents an exploratory evaluation of Music Generation Systems (MGS) within contemporary music production workflows by examining eight open-source systems. The evaluation framework combines technical insights with practical experimentation through criteria specifically designed to investigate the practical and creative affordances of the systems within the iterative, non-linear nature of music production. Employing a single-evaluator methodology as a preliminary phase, this research adopts a mixed approach utilizing qualitative methods to form hypotheses subsequently assessed through quantitative metrics. The selected systems represent architectural diversity across both symbolic and audio-based music generation approaches, spanning composition, arrangement, and sound design tasks. The investigation addresses limitations of current MGS in music production, challenges and opportunities for workflow integration, and development potential as collaborative tools while maintaining artistic authenticity. Findings reveal these systems function primarily as complementary tools enhancing rather than replacing human expertise. They exhibit limitations in maintaining thematic and structural coherence that emphasize the indispensable role of human creativity in tasks demanding emotional depth and complex decision-making. This study contributes a structured evaluation framework that considers the iterative nature of music creation. It identifies methodological refinements necessary for subsequent comprehensive evaluations and determines viable areas for AI integration as collaborative tools in creative workflows. The research provides empirically-grounded insights to guide future development in the field.

Information Retrieval

[IR-0] DARTS: A Dual-View Attack Framework for Targeted Manipulation in Federated Sequential Recommendation

Link: https://arxiv.org/abs/2507.01383
Authors: Qitao Qin, Yucong Luo, Zhibo Chu
Categories: Information Retrieval (cs.IR)
*Comments: 10 pages. arXiv admin note: substantial text overlap with arXiv:2409.07500; text overlap with arXiv:2212.05399 by other authors

Click to view abstract

Abstract:Federated recommendation (FedRec) preserves user privacy by enabling decentralized training of personalized models, but this architecture is inherently vulnerable to adversarial attacks. Significant research has been conducted on targeted attacks in FedRec systems, motivated by commercial and social influence considerations. However, much of this work has largely overlooked the differential robustness of recommendation models. Moreover, our empirical findings indicate that existing targeted attack methods achieve only limited effectiveness in Federated Sequential Recommendation (FSR) tasks. Driven by these observations, we focus on investigating targeted attacks in FSR and propose a novel dual-view attack framework, named DV-FSR. This attack method uniquely combines a sampling-based explicit strategy with a contrastive learning-based implicit gradient strategy to orchestrate a coordinated attack. Additionally, we introduce a specific defense mechanism tailored for targeted attacks in FSR, aiming to evaluate the mitigation effects of the attack method we proposed. Extensive experiments validate the effectiveness of our proposed approach on representative sequential models. Our codes are publicly available.
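
The dual-view specifics (explicit sampling plus contrastive implicit gradients) are not detailed in the abstract, so the following is only a generic sketch of the underlying threat model: a malicious client crafts its local update so the shared sequential model scores a chosen target item higher for arbitrary user sequences. The model interface and loop are assumptions.

```python
# Generic targeted model-poisoning update in federated recommendation (sketch).
import torch

def malicious_update(model, seqs, target_item, lr=0.1, steps=5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        scores = model(seqs)                    # (batch, n_items) next-item scores
        loss = -scores[:, target_item].mean()   # push the target item's score up
        opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()                   # poisoned update sent to the server
```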

[IR-1] Towards a Signal Detection Based Measure for Assessing Information Quality of Explainable Recommender Systems

Link: https://arxiv.org/abs/2507.01168
Authors: Yeonbin Son, Matthew L. Bolton
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*Comments: Accepted to IEEE CAI 2025

Click to view abstract

Abstract:There is growing interest in explainable recommender systems that provide recommendations along with explanations of the reasoning behind them. When evaluating recommender systems, most studies focus on overall recommendation performance; only a few assess the quality of the explanations. Explanation quality is often evaluated through user studies that subjectively gather users' opinions on representative explanatory factors shaping end-users' perspectives on the results, rather than on the content of the explanations themselves. We aim to fill this gap by developing an objective metric to evaluate Veracity: the information quality of explanations. Specifically, we decompose Veracity into two dimensions: Fidelity and Attunement. Fidelity refers to whether the explanation includes accurate information about the recommended item. Attunement evaluates whether the explanation reflects the target user's preferences. By applying signal detection theory, we first determine decision outcomes for each dimension and then combine them to calculate a sensitivity, which serves as the final Veracity value. To assess the effectiveness of the proposed metric, we set up four cases with varying levels of information quality to validate whether our metric can accurately capture differences in quality. The results provided meaningful insights into the effectiveness of our proposed metric.
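
The signal-detection step reduces to a standard sensitivity computation: classify each explanation's Fidelity and Attunement judgments into hits and false alarms against ground truth, then take d' = z(hit rate) - z(false-alarm rate). A short sketch follows; the combination rule (both dimensions must pass) is an assumption for illustration, and the paper's exact aggregation may differ.

```python
# Sensitivity d' from hit and false-alarm rates (sketch; the paper's exact
# combination of Fidelity and Attunement outcomes may differ).
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, eps=1e-3):
    hit = min(max(hit_rate, eps), 1 - eps)   # clamp away from 0 and 1
    fa = min(max(fa_rate, eps), 1 - eps)
    return norm.ppf(hit) - norm.ppf(fa)

# Toy data: per-explanation pass/fail on each dimension plus ground truth.
fidelity_pass   = [True, True, False, True]   # accurate about the item?
attunement_pass = [True, True, True, False]   # matches the user's preferences?
signal_present  = [True, True, False, False]  # ground truth: explanation is sound
decisions = [f and a for f, a in zip(fidelity_pass, attunement_pass)]

hits = sum(d and s for d, s in zip(decisions, signal_present)) / sum(signal_present)
fas = sum(d and not s for d, s in zip(decisions, signal_present)) / (
    len(signal_present) - sum(signal_present))
print(d_prime(hits, fas))                     # Veracity-style sensitivity
```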

Attachments

Click to download the complete list of today's papers