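This post lists the latest papers retrieved from arXiv.org on 2025-07-28. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 every day.
Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.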
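Overview (2025-07-28)
429 papers were updated today, including:
- Natural Language Processing: 52 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 94 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 118 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 103 papers (Machine Learning (cs.LG))
Natural Language Processing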
[NLP-0] MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
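[Quick read]: This paper addresses the lack of evaluation standards for GUI automation agents across platforms (Windows, macOS, Linux, iOS, Android, and Web) and the difficulty of balancing efficiency against quality. The authors propose MMBench-GUI, a hierarchical benchmark covering four levels (GUI content understanding, element grounding, task automation, and task collaboration) that systematically evaluates an agent's core capabilities. The key contribution is the Efficiency-Quality Area (EQA) metric, which quantifies execution efficiency in online automation scenarios. Empirically, accurate visual grounding is a critical determinant of task success, and modular architectures that integrate specialized grounding modules improve performance substantially. Reliable GUI automation also depends on strong task planning and cross-platform generalization, supported by long-context memory, a broad action space, and long-term reasoning. Finally, the paper notes that task efficiency is a badly neglected dimension: existing models commonly take redundant steps, so combining precise localization, effective planning, and early stopping strategies is essential for truly efficient and scalable GUI automation.
Link: https://arxiv.org/abs/2507.19478
Authors: Xuehui Wang,Zhenyu Wu,JingJing Xie,Zichen Ding,Bowen Yang,Zehao Li,Zhaoyang Liu,Qingyun Li,Xuan Dong,Zhe Chen,Weiyun Wang,Xiangyu Zhao,Jixuan Chen,Haodong Duan,Tianbao Xie,Chenyu Yang,Shiqian Su,Yue Yu,Yuan Huang,Yiqian Liu,Xiao Zhang,Yanting Zhang,Xiangyu Yue,Weijie Su,Xizhou Zhu,Wei Shen,Jifeng Dai,Wenhai Wang
Institutions: Shanghai AI Laboratory; Shanghai Jiao Tong University; Xiamen University; University of Science and Technology of China; The Hong Kong University of Science and Technology; Harbin Institute of Technology; Tsinghua University; Nanjing University; Fudan University; University of Hong Kong; Donghua University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: in progress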
Abstract:We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More important, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at this https URL.
[NLP-1] Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts
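[Quick read]: This paper asks how large-scale training can produce event-forecasting large language models (LLMs) at superforecaster level. The core challenge is that training faces three obstacles: noisiness-sparsity, knowledge cut-off, and a simple reward structure. The proposed remedies are (1) hypothetical event Bayesian networks to mitigate noise and sparsity, (2) poorly recalled and counterfactual events to diversify the training data, and (3) auxiliary reward signals to improve learning. On the data side, the authors advocate aggressive use of market data, public datasets, and crawled data to support large-scale training and evaluation. Together, these directions aim to let AI provide predictive intelligence to broader areas of society.
Link: https://arxiv.org/abs/2507.19477
Authors: Sang-Woo Lee,Sohee Yang,Donghyun Kwak,Noah Y. Siegel
Institutions: Google DeepMind; NAVER Cloud; Independent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: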
Abstract:Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers’ interest in these directions.
[NLP-2] Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models
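[Quick read]: This paper addresses the lack of foresight in automated systems that assist human-human interaction, focusing on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. The key contribution is the first uniform evaluation framework, which enables direct and reliable comparisons between model architectures and introduces a new metric that quantifies a model's ability to revise its forecast as the conversation progresses, giving a fuller picture of predictive performance and adaptability.
Link: https://arxiv.org/abs/2507.19470
Authors: Son Quoc Tran,Tushaar Gangavarapu,Nicholas Chernogor,Jonathan P. Chang,Cristian Danescu-Niculescu-Mizil
Institutions: Cornell University; The University of Texas at Austin; Harvey Mudd College
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Code and data available as part of ConvoKit: this https URL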
Abstract:We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model’s ability to revise its forecast as the conversation progresses.
[NLP-3] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
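[Quick read]: This paper targets the poor sample efficiency of adapting large language models (LLMs) to downstream tasks with reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO), which often need thousands of rollouts to learn a new task. It proposes GEPA (Genetic-Pareto), a prompt optimizer built on natural-language reflection. The core idea is to treat natural language as an interpretable learning medium so that the model can extract high-level rules from a few rollouts: GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs), diagnoses problems in natural language, proposes and tests prompt updates, and combines complementary lessons from the Pareto frontier of its own attempts. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20% while using up to 35x fewer rollouts. It also outperforms the leading prompt optimizer, MIPROv2, by over 10% and shows promise as an inference-time search strategy for code optimization.
Link: https://arxiv.org/abs/2507.19457
Authors: Lakshya A Agrawal,Shangyin Tan,Dilara Soylu,Noah Ziems,Rishi Khare,Krista Opsahl-Ong,Arnav Singhvi,Herumb Shandilya,Michael J Ryan,Meng Jiang,Christopher Potts,Koushik Sen,Alexandros G. Dimakis,Ion Stoica,Dan Klein,Matei Zaharia,Omar Khattab
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: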
Abstract:Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
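The phrase "Pareto frontier of its own attempts" can be illustrated with a small sketch. It assumes each candidate prompt has one score per evaluation example and keeps every candidate that no other candidate dominates. The candidate names and scores are hypothetical, and this is an illustration of Pareto selection in general, not GEPA's actual implementation.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other, vec)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example scores for three candidate prompts.
candidate_scores = {
    "prompt_a": [0.9, 0.2, 0.5],  # best on example 0
    "prompt_b": [0.4, 0.8, 0.5],  # best on example 1
    "prompt_c": [0.3, 0.1, 0.5],  # never better than prompt_a or prompt_b
}
frontier = pareto_frontier(candidate_scores)
print(frontier)
```

Keeping the whole frontier rather than the single best average preserves candidates that win on different examples, which is what makes it possible to combine complementary lessons later.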
[NLP-4] TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
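[Quick read]: This paper addresses the difficulty of understanding the relationship between training data and model behavior during large language model (LLM) pretraining. Current workflows are cumbersome, fragmented, and unfriendly to researchers, which limits effective analysis and debugging of pretraining data. The solution is TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used by Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. It supports searching, viewing, ingesting, exporting, sampling, and structured editing, and it enables data debugging and experimentation without changes to training code, which simplifies dataset management and democratizes access to production-grade dataset tooling.
Link: https://arxiv.org/abs/2507.19419
Authors: Mohammad Aflah Khan,Ameya Godbole,Johnny Tian-Zheng Wei,Ryan Wang,James Flemings,Krishna Gummadi,Willie Neiswanger,Robin Jia
Institutions: Max Planck Institute for Software Systems; University of Southern California
Categories: Computation and Language (cs.CL)
Comments: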
Abstract:Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug and play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.
[NLP-5] Towards Domain Specification of Embedding Models in Medicine
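[Quick read]: This paper addresses two critical shortcomings of current medical text embedding models. First, existing models are trained on a narrow range of data with dated methodology, so they struggle to capture the diversity of terminology and semantics found in practice. Second, existing evaluations generalize inadequately across real-world medical tasks, so model performance is assessed incompletely. The solution is MEDTE, a medical text embedding model extensively fine-tuned with self-supervised contrastive learning across multiple data sources, together with a comprehensive benchmark suite of 51 tasks covering classification, clustering, pair classification, and retrieval, tailored to medical text. Experiments show that the combination establishes a more reliable evaluation framework and yields embeddings that consistently outperform state-of-the-art alternatives across tasks.
Link: https://arxiv.org/abs/2507.19407
Authors: Mohammad Khodadad,Ali Shiraee,Mahdi Astaraki,Hamidreza Mahyar
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: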
Abstract:Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data, beside not being up to date in terms of methodology, making them ill suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state of the art alternatives in different tasks.
[NLP-6] Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study
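[Quick read]: This paper establishes a benchmark for detecting adverse drug events (ADEs) in Dutch clinical free text. The core challenge is building natural language processing models that are both accurate and clinically useful when annotated data is limited and ADE classes are severely imbalanced. The approach has three parts. First, a Bi-LSTM model and four transformer-based encoder models (BERTje, RobBERT, this http URL, and NuNER) are trained for named entity recognition (NER) and relation classification (RC). Second, the RC models are evaluated in two settings, with gold-standard entities (two-step task) and with predicted entities (end-to-end task), and both micro- and macro-averaged F1 scores are reported to account for class imbalance. Third, the models are validated internally and externally at the document level on discharge letters. The best model (this http URL) reaches a macro-averaged F1 of 0.63 with gold-standard entities and 0.62 with predicted entities, and a recall of 0.67 to 0.74 in external validation, which suggests potential for clinical use.
Link: https://arxiv.org/abs/2507.19396
Authors: Rachel M. Murphy(1),Nishant Mishra(1),Nicolette F. de Keizer(1),Dave A. Dongelmans(2),Kitty J. Jager(1),Ameen Abu-Hanna(1),Joanna E. Klopotowska(1),Iacer Calixto(1) ((1) Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Amsterdam, The Netherlands, (2) Amsterdam UMC location University of Amsterdam, Department of Intensive Care Medicine, Amsterdam, the Netherlands)
Institutions: Amsterdam UMC location University of Amsterdam, Department of Medical Informatics; Amsterdam UMC location University of Amsterdam, Department of Intensive Care Medicine
Categories: Computation and Language (cs.CL)
Comments: 30 Pages, 5 Figures (Main Paper), 19 Pages, 2 Figures(Supplements). Rachel M. Murphy and Nishant Mishra are shared first authors. Joanna E. Klopotowska and Iacer Calixto are shared last authors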
Abstract:In this study, we set a benchmark for adverse drug event (ADE) detection in Dutch clinical free text documents using several transformer models, clinical scenarios and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, this http URL, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free text clinical progress notes of patients admitted to intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the imbalance of ADEs in the datasets. Although differences for the ADE RC task between the models were small, this http URL was the best performing model with macro-averaged F1 score of 0.63 using gold standard and 0.62 using predicted entities. The this http URL models also performed the best in our external validation and achieved recall of between 0.67 to 0.74 using predicted entities, meaning between 67 to 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.
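The abstract reports both micro- and macro-averaged F1 because ADEs are rare. A small sketch of why the two averages diverge under class imbalance is given here. The class names and counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Average the per-class F1 scores, so every class weighs the same."""
    return sum(f1(*c) for c in per_class.values()) / len(per_class)


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts first, so frequent classes dominate the score."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


# Hypothetical counts: one frequent class and one rare ADE class.
counts = {"no_relation": (90, 5, 5), "ade": (2, 4, 6)}
macro = macro_f1(counts)
micro = micro_f1(counts)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

With these counts the micro average stays high because the frequent class dominates the pooled counts, while the macro average drops because the rare class is detected poorly. That gap is the reason to report both.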
[NLP-7] Data Augmentation for Spoken Grammatical Error Correction ISCA
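[Quick read]: This paper addresses the shortage of high-quality annotated speech data for Spoken Grammatical Error Correction (SGEC). Strong benchmark datasets exist for written grammatical error correction (GEC), but audio-text pairs that contain grammatical errors and disfluencies remain scarce. The solution is a fully automated method for generating such audio-text pairs, together with a series of objective metrics for evaluating the generated data and choosing the dataset best suited to SGEC. The method preserves the textual and acoustic characteristics of the original data while introducing new error types, enriching the original corpus without altering the language assessment scores of second language (L2) learners. Experiments on the S&I Corpus, the first publicly available speech dataset with grammar error annotations, evaluate the augmented corpus for both written GEC and SGEC.
Link: https://arxiv.org/abs/2507.19374
Authors: Penny Karanasou,Mengjie Qian,Stefano Bannò,Mark J.F. Gales,Kate M. Knill
Institutions: ALTA Institute, Department of Engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This work has been accepted by ISCA SLaTE 2025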
Abstract:While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.
[NLP-8] LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences ACL2025
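[Quick read]: This paper addresses three gaps in how detailed image captions from Large Vision-Language Models (LVLMs) are evaluated: the lack of standardized criteria, the neglect of risks such as hallucination and societal bias, and the absence of user preference considerations. It proposes LOTUS, a leaderboard for evaluating detailed captions that covers caption quality (e.g., alignment and descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias), and that supports preference-oriented evaluation by tailoring criteria to different user preferences, giving a more comprehensive and fair measure of model performance.
Link: https://arxiv.org/abs/2507.19362
Authors: Yusuke Hirota,Boyi Li,Ryo Hachiuma,Yueh-Hua Wu,Boris Ivanovic,Yuta Nakashima,Marco Pavone,Yejin Choi,Yu-Chiang Frank Wang,Chao-Han Huck Yang
Institutions: NVIDIA Research; Osaka University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025. Leaderboard: this http URL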
Abstract:Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
[NLP-9] SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models ACL2025
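[Quick read]: This paper addresses shortcomings in how voice understanding large language models (LLM Voice) are evaluated: current evaluations lack a cognitive hierarchy, make it hard to compare architectures (such as cascaded ASR+LLM pipelines and end-to-end models) on equal terms, and do not effectively expose annotation errors or hallucinations. The solution is Speech-based Intelligence Quotient (SIQ), an evaluation framework motivated by Bloom's Taxonomy that quantifies voice understanding at three cognitive levels: Remembering (verbatim accuracy, measured by word error rate, WER), Understanding (similarity of the LLM's interpretations), and Application (QA accuracy on simulated downstream tasks). This gives a more comprehensive and comparable evaluation grounded in cognitive principles.
Link: https://arxiv.org/abs/2507.19361
Authors: Zhen Wan,Chao-Han Huck Yang,Yahan Yu,Jinchuan Tian,Sheng Li,Ke Hu,Zhehuai Chen,Shinji Watanabe,Fei Cheng,Chenhui Chu,Sadao Kurohashi
Institutions: Kyoto University; NVIDIA; Carnegie Mellon University; Institute of Science Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Our Speech-IQ leaderboard will be hosted at this http URL . ACL 2025 main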
Abstract:We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR+LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
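The Remembering level relies on word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The example sentences are made up, and the paper's own scoring code may normalize text differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one deleted word ("on") and one substituted word ("light" vs "lights").
wer = word_error_rate("turn on the kitchen light", "turn the kitchen lights")
print(wer)
```

WER only checks verbatim agreement, which is why SIQ adds the Understanding and Application levels on top of it.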
[NLP-10] Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
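[Quick read]: This paper addresses timestamp misalignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs, which reduces the reliability of multimodal Speech Emotion Recognition (SER) systems, particularly in conversational settings. The solution is an alignment pipeline built on pre-trained ASR and speaker diarization models that systematically synchronizes timestamps to produce accurately labeled speaker segments. On top of these segments, RoBERTa text embeddings and Wav2Vec audio embeddings are fused with cross-attention enhanced by a gating mechanism, which improves SER accuracy.
Link: https://arxiv.org/abs/2507.19356
Authors: Hsuan-Yu Wang,Pei-Ying Lee,Berlin Chen
Institutions: National Taiwan Normal University
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, to appear in the Proceedings of the 2025 International Conference on Asian Language Processing (IALP)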
Abstract:In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
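The abstract does not spell out the alignment rule, so the following is only a sketch of one common approach: give each ASR word the speaker whose diarization turn overlaps it the most in time. The `Word` and `Turn` types, the maximum-overlap rule, and the sample timestamps are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Optional[str]]:
    """Label each word with the speaker of the turn it overlaps most, or None."""
    labels: List[Optional[str]] = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labels.append(None)  # word falls outside every diarized turn
        else:
            labels.append(best.speaker)
    return labels


# Hypothetical ASR word timestamps and diarization turns.
words = [Word("hello", 0.0, 0.4), Word("there", 0.4, 0.9), Word("hi", 1.0, 1.3)]
turns = [Turn("A", 0.0, 0.95), Turn("B", 0.95, 2.0)]
labels = assign_speakers(words, turns)
print(list(zip((w.text for w in words), labels)))
```

Once every word carries a speaker label, consecutive words with the same label can be merged into speaker segments for the text and audio encoders.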
[NLP-11] Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks
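[Quick read]: This paper addresses the gap between recurrent large language models (Recurrent LLMs) and self-attention-based LLMs (Self-Attention LLMs) on long-context tasks. The bottleneck is the fixed-size memory of recurrent models, which makes long contexts hard to handle. The solution is Smooth Reading, a chunk-wise inference method inspired by human reading strategies: the context is processed in chunks and the contextual information is summarized iteratively, which reduces memory demands and suits recurrent architectures better. Experiments show that the method substantially narrows the gap on long-context tasks such as LongBench, moving SWA-3B-4k from 5.68% below Self-Attention LLMs to 3.61% above, while retaining the efficiency advantages of recurrent models (3x faster training and 2x faster inference at 64k context).
Link: https://arxiv.org/abs/2507.19353
Authors: Kai Liu,Zhan Su,Peijie Dong,Fengran Mo,Jianfei Gao,ShaoTing Zhang,Kai Chen
Institutions: Shanghai AI Laboratory; Tongji University; University of Montreal; HKUST(GZ)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: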
Abstract:Recently, recurrent large language models (Recurrent LLMs) with linear computational complexity have re-emerged as efficient alternatives to self-attention-based LLMs (Self-Attention LLMs), which have quadratic complexity. However, Recurrent LLMs often underperform on long-context tasks due to their limited fixed-size memory. Previous research has primarily focused on enhancing the memory capacity of Recurrent LLMs through architectural innovations, but these approaches have not yet enabled Recurrent LLMs to match the performance of Self-Attention LLMs on long-context tasks. We argue that this limitation arises because processing the entire context at once is not well-suited for Recurrent LLMs. In this paper, we propose Smooth Reading, a chunk-wise inference method inspired by human reading strategies. Smooth Reading processes context in chunks and iteratively summarizes the contextual information, thereby reducing memory demands and making the approach more compatible with Recurrent LLMs. Our experimental results show that this method substantially narrows the performance gap between Recurrent and Self-Attention LLMs on long-context tasks, while preserving the efficiency advantages of Recurrent LLMs. Our Smooth Reading boosts SWA-3B-4k (a Recurrent LLM) from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench. Besides, our method maintains the high efficiency, training 3x faster and inferring 2x faster at 64k context compared to Self-Attention LLMs. To our knowledge, this is the first work to achieve comparable performance using Recurrent LLMs compared with Self-Attention LLMs on long-context tasks. We hope our method will inspire future research in this area. To facilitate further progress, we will release code and dataset.
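The chunk-wise idea can be sketched as a loop that carries a running summary from one chunk to the next. This is only an illustration of the control flow described in the abstract: `toy_summarize` is a hypothetical stand-in for a model call, and the chunk size and token list are made up.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Cut the context into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def read_in_chunks(
    tokens: List[str],
    chunk_size: int,
    summarize: Callable[[str, List[str]], str],
) -> str:
    """Process one chunk at a time, carrying only the running summary forward."""
    summary = ""
    for chunk in split_into_chunks(tokens, chunk_size):
        summary = summarize(summary, chunk)
    return summary


def toy_summarize(summary: str, chunk: List[str]) -> str:
    """Hypothetical stand-in for a model call: keep capitalized tokens as 'key facts'."""
    kept = [t for t in chunk if t[:1].isupper()]
    return " ".join(filter(None, [summary] + kept))


context = "Alice met Bob in Paris and later flew to Rome".split()
final_summary = read_in_chunks(context, chunk_size=4, summarize=toy_summarize)
print(final_summary)
```

At each step the model only has to hold the current chunk plus a short summary, which is the property that makes this style of inference a better fit for fixed-size recurrent memory.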
[NLP-12] Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation
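[Quick read]: This paper addresses the performance drop of Retrieval-Augmented Generation (RAG) systems when retrieved passages are noisy, that is, when low-quality or irrelevant retrieval results undermine the reliability and accuracy of large language models (LLMs). The solution is Passage Injection, a method that explicitly incorporates the retrieved passages into the LLM's reasoning process so that the model can better recognize and resist noisy passages while still making effective use of helpful ones. Experiments across several benchmark datasets and noise settings show that the method improves the robustness and overall performance of RAG systems.
Link: https://arxiv.org/abs/2507.19333
Authors: Minghao Tang,Shiyu Ni,Jiafeng Guo,Keping Bi
Institutions: CAS Key Lab of Network Data Science and Technology, ICT, CAS; State Key Laboratory of AI Safety; University of Chinese Academy of Sciences
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: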
Abstract:Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs’ robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs’ reasoning process, aiming to enhance the model’s ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs’ reasoning process is a promising direction for building more robust RAG systems. The code can be found at this https URL.
[NLP-13] AutoPCR: Automated Phenotype Concept Recognition by Prompting
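[Quick read]: This paper addresses the limited generalization of phenotype concept recognition (CR) in biomedical text mining: existing methods usually require ontology-specific training and struggle to adapt to diverse text types and evolving biomedical terminology. The solution is AutoPCR, a prompt-based method that needs no ontology-specific training and works in three stages: entity extraction with a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking by prompting a large language model (LLM). On four benchmark datasets it achieves the best average and most robust performance in both mention-level and document-level evaluations, and ablation and transfer studies further demonstrate its inductive capability and generalizability to new ontologies.
Link: https://arxiv.org/abs/2507.19315
Authors: Yicheng Tao,Yuanhao Huang,Jie Liu
Institutions: University of Michigan
Categories: Computation and Language (cs.CL)
Comments: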
Abstract:Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.
[NLP-14] Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump's Presidential Campaigns
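[Quick read]: This paper examines how well Large Language Models (LLMs) can identify and classify fine-grained forms of populism in political discourse. LLMs perform strongly on instruction-following tasks, but their grasp of complex and contested social science concepts such as populism remains underexplored. The authors curate new datasets designed to capture populist discourse and systematically evaluate a range of pre-trained language models, both open-weight and proprietary, across multiple prompting paradigms. Instruction-tuned LLMs that have not been fine-tuned perform poorly at detecting populism, and a fine-tuned RoBERTa classifier vastly outperforms them. The best-performing model is then applied to Donald Trump's campaign speeches, and generalizability is assessed on campaign speeches by European politicians, where instruction-tuned LLMs prove more robust on out-of-domain data.
Link: https://arxiv.org/abs/2507.19303
Authors: Ilias Chalkidis,Stephanie Brandl,Paris Aslanidis
Institutions: University of Copenhagen; National Centre for Social Research (EKKE)
Categories: Computation and Language (cs.CL)
Comments: Pre-print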
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.
[NLP-15] A Markov Categorical Framework for Language Modeling
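[Quick Read]: This paper asks why the simple negative log-likelihood (NLL) objective used to train auto-regressive language models yields representations that generalize so well, a question existing theory does not fully answer. The key idea is to use Markov Categories (MCs) as the analytical framework: single-step generation is modeled as a composition of Markov kernels in the category Stoch, enriched with statistical divergences to describe information flow and learned geometry. The framework yields three results. First, it quantifies the information surplus in hidden states that speculative decoding methods such as EAGLE exploit. Second, it formalizes how NLL minimization forces the model to learn not only the next token but also the data's intrinsic conditional stochasticity, analyzed through categorical entropy. Third, it proves that NLL training acts as an implicit form of spectral contrastive learning: through the information geometry of the prediction head, the representation space aligns with the eigenspectrum of a predictive similarity operator, so a geometrically structured space is learned without explicit contrastive pairs.
Link: https://arxiv.org/abs/2507.19247
Authors: Yifan Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL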
Abstract:Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. While empirically powerful, a deep theoretical understanding of why this simple objective yields such versatile representations remains elusive. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. We model the single-step generation map as a composition of Markov kernels in the category Stoch. This compositional view, when enriched with statistical divergences, allows us to dissect information flow and learned geometry. Our framework makes three main contributions. First, we provide a formal, information-theoretic rationale for the success of modern speculative decoding methods like EAGLE, quantifying the information surplus in hidden states that these methods exploit. Second, we formalize how NLL minimization forces the model to learn not just the next token, but the data’s intrinsic conditional stochasticity, a process we analyze using categorical entropy. Third, and most centrally, we prove that NLL training acts as an implicit form of spectral contrastive learning. By analyzing the information geometry of the model’s prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a predictive similarity operator, thereby learning a geometrically structured space without explicit contrastive pairs. This compositional and information-geometric perspective reveals the deep structural principles underlying the effectiveness of modern LMs. Project Page: this https URL
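The compositional view described above can be made concrete on finite sets, where a Markov kernel is a row-stochastic matrix and composing kernels reduces to matrix multiplication. The following is a minimal sketch under that assumption, not the paper's construction: the `encoder` and `head` matrices, their sizes, and their values are made up for illustration.

```python
import math


def compose(k1, k2):
    """Compose two finite Markov kernels given as row-stochastic matrices.

    k1[x][y] = P(y | x) and k2[y][z] = P(z | y); the result is P(z | x).
    """
    return [
        [sum(k1[x][y] * k2[y][z] for y in range(len(k2))) for z in range(len(k2[0]))]
        for x in range(len(k1))
    ]


def nll(kernel, pairs):
    """Average negative log-likelihood of (context, next_token) index pairs."""
    return -sum(math.log(kernel[x][t]) for x, t in pairs) / len(pairs)


# Toy example (all numbers invented): 2 contexts -> 3 hidden states -> 2 tokens.
encoder = [[0.9, 0.1, 0.0],
           [0.0, 0.2, 0.8]]
head = [[0.7, 0.3],
        [0.5, 0.5],
        [0.1, 0.9]]

step = compose(encoder, head)        # single-step kernel P(token | context)
loss = nll(step, [(0, 0), (1, 1)])   # NLL of two observed (context, token) pairs
```

Each row of `step` is again a probability distribution, which is the finite-case reason the composite of two kernels is itself a kernel.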
[NLP-16] Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation
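[Quick Read]: This paper addresses safety flaws in diffusion-based language models (Large Language Diffusion Models, LLDMs). Jailbreak attacks designed for autoregressive large language models (LLMs) have limited effect on LLDMs, which leaves open whether LLDMs are actually robust or whether the attacks are simply incompatible with diffusion-based architectures. The key contribution is PArallel Decoding jailbreak (PAD), a new attack built on a Multi-Point Attention Attack that steers the parallel generation process toward harmful outputs by imitating the affirmative response patterns seen in LLMs. Across four LLDMs, PAD reaches an attack success rate of 97%, and LLDMs generate harmful content 2x faster than autoregressive LLMs of the same size, which underlines the safety risk.
Link: https://arxiv.org/abs/2507.19227
Authors: Yuanhe Zhang,Fangzhou Xie,Zhenhong Zhou,Zherui Li,Hao Chen,Kun Wang,Yufei Guo
Affiliations: Tsinghua University; Baidu; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments: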
Abstract:Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning. The precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) prove of limited effectiveness against LLDMs and fail to expose safety vulnerabilities. Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which guides parallel generative processes toward harmful outputs, inspired by affirmative response patterns in LLMs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates of 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, significantly highlighting the risk of uncontrolled harmful generation. Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.
[NLP-17] How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
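[Quick Read]: This paper addresses overestimation in the evaluation of large language models (LLMs): contamination of public benchmarks or imbalanced training can give models inflated scores, which leads to unfair comparisons and undermines assessment of their real capabilities. Existing remedies, such as keeping test cases secret, relying on human evaluation, or repeatedly collecting new samples, do not deliver reproducibility, transparency, and efficiency at the same time. The authors propose ArxivRoll, a dynamic evaluation framework with two key components: SCP (Sequencing, Cloze, and Prediction), an automated generator of private test cases built from recent ArXiv papers, and Rugged Scores (RS), metrics that measure the proportion of public benchmark contamination and training bias. Every six months ArxivRoll builds a fresh benchmark from new articles and uses it for a one-time evaluation.
Link: https://arxiv.org/abs/2507.19219
Authors: Zi Liang,Liantong Yu,Shiyu Zhang,Qingqing Ye,Haibo Hu
Affiliations: The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Source code: this https URL Website: this https URL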
Abstract:Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: (i) SCP (Sequencing, Cloze, and Prediction), an automated generator for private test cases, and (ii) Rugged Scores (RS), metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at this https URL.
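The abstract names the three SCP task types but does not spell out how the items are built. The sketch below is one plausible reading, assuming sentence-level tasks; the function `make_scp_items` and the item layout are hypothetical and are not taken from ArxivRoll.

```python
import random


def make_scp_items(sentences, seed=0):
    """Build one sequencing, one cloze, and one prediction item from a passage."""
    rng = random.Random(seed)  # seeded so the same passage yields the same items

    # Sequencing: shuffle the sentences; the answer records the original positions.
    order = list(range(len(sentences)))
    rng.shuffle(order)
    sequencing = {
        "shuffled": [sentences[i] for i in order],
        "answer": order,  # shuffled[j] was originally sentence number order[j]
    }

    # Cloze: hide one sentence and ask the model to restore it.
    hole = rng.randrange(len(sentences))
    cloze = {
        "context": [s if i != hole else "[MASK]" for i, s in enumerate(sentences)],
        "answer": sentences[hole],
    }

    # Prediction: give everything but the last sentence and ask for the last one.
    prediction = {"context": sentences[:-1], "answer": sentences[-1]}

    return {"sequencing": sequencing, "cloze": cloze, "prediction": prediction}


passage = ["We propose X.", "X has two parts.", "We evaluate X.", "X works well."]
items = make_scp_items(passage)
```

Because the items are derived mechanically from text published after a model's training cutoff, a generator of this kind can produce test cases without manual annotation.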
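[NLP-18] Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
[Quick Read]: This paper addresses the lack of multimodal capability in social robots that hold open-domain conversations: they do not yet use the multiple modalities, such as vision alongside language, that carry social interaction. The key proposal is to adopt vision-language models (VLMs) and adapt them to process the wide range of visual information in social settings in a sufficiently general way for autonomous social robots. The paper also outlines the technical challenges that remain and briefly discusses evaluation practices.
Link: https://arxiv.org/abs/2507.19196
Authors: Ruben Janssens,Tony Belpaeme
Affiliations: IDLab-AIRO; Ghent University–imec
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Submitted to the workshop “Human - Foundation Models Interaction: A Focus On Multimodal Information” (FoMo-HRI) at IEEE RO-MAN 2025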
Abstract:Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.
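[NLP-19] Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?
[Quick Read]: This paper studies how data poisoning interacts with dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), to increase toxicity in the outputs of large language models (LLMs). The key findings are that even a small amount of poisoned data significantly raises toxicity for AAVE inputs while SAE inputs remain comparatively unaffected, and that the amplification grows with model size. Using GPT-4o as a fairness auditor, the authors identify harmful stereotypes disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. They call for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols.
Link: https://arxiv.org/abs/2507.19195
Authors: Chaymaa Abbas,Mariette Awad,Razane Tajeddine
Affiliations: Maroun Semaan Faculty of Engineering and Architecture, American University of Beirut
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: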
Abstract:Despite the ongoing improvements in the design of large language models (LLMs) to foster inclusion and balanced responses, these systems remain susceptible to encoding and amplifying social biases. This study examines how dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), interacts with data poisoning to influence toxicity in outputs. Using both small- and medium-scale LLaMA models, we show that even minimal exposure to poisoned data significantly increases toxicity for AAVE inputs, while it remains comparatively unaffected for SAE. Larger models exhibit a more significant amplification effect which suggests heightened susceptibility with scale. To further assess these disparities, we employed GPT-4o as a fairness auditor, which identified harmful stereotypical patterns disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. These findings underscore the compounding impact of data poisoning and dialectal bias and emphasize the need for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols during development.
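[NLP-20] An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case KDD2025 ECML
[Quick Read]: This paper examines how large language models (LLMs) can reinforce gender and professional stereotypes when responding to ungendered prompts. Using a structured experimental method, the authors prompt two popular chatbots, OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash), in Italian, a language with extensive grammatical gender, with three pairs of professions in a hierarchical relationship, and collect 3600 responses through the APIs. Gemini associated 100% of 'she' pronouns with the 'assistant' rather than the 'manager', and ChatGPT 97%, which shows how hard it remains for these models to produce objective text in non-English languages. The authors argue that understanding these risks is the basis for mitigation strategies, and suggest extending the study to more chatbots or languages, refining prompt engineering, and using a larger experimental base.
Link: https://arxiv.org/abs/2507.19156
Authors: Gioele Giachino,Marco Rondina,Antonio Vetrò,Riccardo Coppola,Juan Carlos De Martin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 16 pages, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025) - 5th Workshop on Bias and Fairness in AI (BIAS25)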
Abstract:The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines in which manner LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs’ ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a range of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of ‘she’ pronouns to the ‘assistant’ rather than the ‘manager’. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplaces or in job selections, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods or further exploiting a larger experimental base.
[NLP-21] OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
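[Quick Read]: This paper addresses the failure of existing benchmarks to capture the heterogeneity of capabilities that computer-using agents need and how well those capabilities match actual user demands, which hinders both targeted capability development and the transfer of research progress into deployment. The key contribution is OS-MAP, a benchmark of 416 realistic tasks across 15 applications, evaluated along two dimensions: automation level, based on a five-level taxonomy, and generalization scope, based on a hierarchy of real-world user demands. Together they form a performance-generalization evaluation matrix that supports fine-grained, structured assessment.
Link: https://arxiv.org/abs/2507.19132
Authors: Xuetian Chen,Yinghao Chen,Xinfeng Yuan,Zhuo Peng,Lu Chen,Yuekeng Li,Zhoujia Zhang,Yingqian Huang,Leyan Huang,Jiaqing Liang,Tianbao Xie,Zhiyong Wu,Qiushi Sun,Biqing Qi,Bowen Zhou
Affiliations: Fudan University; Shanghai AI Lab; Tsinghua University; The University of Hong Kong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Work in progress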
Abstract:Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands, hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even State-of-the-Art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive the future progress in computer-using agents research and deployment. All code, environments, baselines, and data are publicly available at this https URL.
[NLP-22] Objectifying the Subjective: Cognitive Biases in Topic Interpretations ACL
[Quick Read]: This paper addresses the fact that current measures of topic quality, such as coherence and word intrusion, do not measure how much a topic helps a user explore a corpus. Through user studies, the authors find that users interpret topics using availability and representativeness heuristics rather than probability. They propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to reach an interpretation. Viewed through ecological rationality, topic interpretation is a judgment under uncertainty, so user models and evaluation frameworks need to be aware of cognitive biases.
Link: https://arxiv.org/abs/2507.19117
Authors: Swapnil Hingmire,Ze Shi Li,Shiyu(Vivienne)Zeng,Ahmed Musa Awon,Luiz Franciscatto Guerra,Neil Ernst
Affiliations: Mehta Family School of Data Science and Artificial Intelligence, Department of Data Science, Indian Institute of Technology (IIT) Palakkad, Kerala, India; Department of Computer Science, University of Victoria, Victoria, Canada
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication at the Transactions of ACL (TACL) (pre-MIT Press publication version)
Abstract:Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality such as coherence and word intrusion do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded on a task, and a population of users, we do user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and provide rationale behind evaluations. We use reflexive thematic analysis to identify themes of topic interpretations from rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as making a judgment under uncertainty by an ecologically rational user, and hence cognitive biases aware user models and evaluation frameworks are needed.
[NLP-23] Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation
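[Quick Read]: This paper addresses the high computational cost of using large language models (LLMs) for utility judgments in retrieval-augmented generation (RAG), which limits how many candidate passages can be evaluated and hurts complex queries that need extensive information. The key idea is to distill the utility judgment capability of LLMs into lightweight student models, trained jointly on pseudo-answer generation and utility judgment. Selection is utility-based rather than a relevance ranking with a fixed cutoff: a sliding window method picks the useful passages for each query dynamically, without a preset threshold, reducing computational cost while improving answer quality.
Link: https://arxiv.org/abs/2507.19102
Authors: Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Shuaiqiang Wang,Dawei Yin,Xueqi Cheng
Affiliations: Key Laboratory of Network Data Science and Technology, ICT, CAS; State Key Laboratory of AI Safety; University of Chinese Academy of Sciences; Baidu Inc
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures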
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval process prioritized relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
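The difference between selecting by utility and cutting a ranking at a fixed depth can be sketched as follows. This is an illustrative outline only: `select_useful`, the window and stride values, and the keyword-based `keyword_judge` (a toy stand-in for the distilled student model) are all assumptions, not the paper's implementation.

```python
def select_useful(query, passages, judge, window=4, stride=2):
    """Slide a window over ranked passages and keep whatever the judge marks useful."""
    selected = []
    for start in range(0, len(passages), stride):
        chunk = passages[start:start + window]
        for passage in judge(query, chunk):
            if passage not in selected:  # overlapping windows can revisit a passage
                selected.append(passage)
        if start + window >= len(passages):
            break  # the window has reached the end of the list
    return selected


def keyword_judge(query, chunk):
    """Toy judge: a passage is 'useful' if it shares a content word with the query."""
    terms = {w for w in query.lower().split() if len(w) > 3}
    return [p for p in chunk if any(t in p.lower() for t in terms)]


passages = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "France borders Spain.",
    "The moon orbits Earth.",
    "Berlin is a capital city.",
]
picked = select_useful("capital of France", passages, keyword_judge)
```

The number of passages returned depends on the query rather than on a preset top-k, which is the property the abstract attributes to utility-based selection.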
[NLP-24] Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents
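[Quick Read]: This paper addresses the weakness of single-LLM methods on complex claim verification that involves multi-faceted evidence. The key contribution is DebateCV, the first claim verification framework to adopt a debate-driven approach with multiple LLM agents: two Debaters take opposing stances on a claim and argue over several rounds, and a Moderator evaluates the arguments and gives a verdict with justifications. To improve the Moderator, the authors add a post-training strategy that uses synthetic debate data generated by zero-shot DebateCV, which addresses the scarcity of real-world debate-driven verification data. Experiments show the method outperforms existing approaches under varying levels of evidence quality.
Link: https://arxiv.org/abs/2507.19090
Authors: Haorui He,Yupeng Li,Dacheng Wen,Reynold Cheng,Francis C. M. Lau
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: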
Abstract:Claim verification is critical for enhancing digital literacy. However, the state-of-the-art single-LLM methods struggle with complex claim verification that involves multi-faceted evidences. Inspired by real-world fact-checking practices, we propose DebateCV, the first claim verification framework that adopts a debate-driven methodology using multiple LLM agents. In our framework, two Debaters take opposing stances on a claim and engage in multi-round argumentation, while a Moderator evaluates the arguments and renders a verdict with justifications. To further improve the performance of the Moderator, we introduce a novel post-training strategy that leverages synthetic debate data generated by the zero-shot DebateCV, effectively addressing the scarcity of real-world debate-driven claim verification data. Experimental results show that our method outperforms existing claim verification methods under varying levels of evidence quality. Our code and dataset are publicly available at this https URL.
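The control flow of a debate between two Debaters and a Moderator can be outlined in a few lines. The sketch below only illustrates that flow: `run_debate` and the three stub agents are hypothetical placeholders for LLM calls, and the moderator's rule is a toy, not DebateCV's.

```python
def run_debate(claim, evidence, affirmative, negative, moderator, rounds=2):
    """Run a fixed number of debate rounds, then ask the moderator for a verdict."""
    transcript = []  # list of (role, argument) tuples in speaking order
    for _ in range(rounds):
        transcript.append(("affirmative", affirmative(claim, evidence, transcript)))
        transcript.append(("negative", negative(claim, evidence, transcript)))
    verdict, justification = moderator(claim, evidence, transcript)
    return verdict, justification, transcript


# Stub agents standing in for LLM calls (purely illustrative).
def affirmative(claim, evidence, transcript):
    return f"Turn {len(transcript) + 1}: the evidence supports '{claim}'."


def negative(claim, evidence, transcript):
    return f"Turn {len(transcript) + 1}: the evidence does not settle '{claim}'."


def moderator(claim, evidence, transcript):
    # Toy rule: accept the claim only if some evidence states it verbatim.
    supported = any(claim.lower() in e.lower() for e in evidence)
    verdict = "SUPPORTED" if supported else "NOT ENOUGH EVIDENCE"
    return verdict, f"Decided after {len(transcript)} turns."


verdict, justification, transcript = run_debate(
    "water boils at 100 C at sea level",
    ["Textbooks state that water boils at 100 C at sea level."],
    affirmative, negative, moderator,
)
```

Each debater sees the transcript so far, so later turns can respond to earlier arguments; the moderator sees the whole exchange before deciding.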
[NLP-25] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
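[Quick Read]: This paper addresses the underexplored generation stage of argument summarization, where existing methods rely on single-pass generation and offer little support for factual correction or structural refinement. The key contribution is Arg-LLaDA, a large language diffusion framework that improves summaries iteratively through sufficiency-guided remasking and regeneration. It combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, which improves faithfulness, conciseness, and coherence.
Link: https://arxiv.org/abs/2507.19081
Authors: Hao Li,Yizheng Sun,Viktor Schlegel,Kailai Yang,Riza Batista-Navarro,Goran Nenadic
Affiliations: The University of Manchester; Imperial Global Singapore; Imperial College London
Subjects: Computation and Language (cs.CL)
Comments: Preprint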
Abstract:Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
[NLP-26] PurpCode: Reasoning for Safer Code Generation
[Quick Read]: This paper addresses the safety risks of code generation models, which can produce vulnerable code or assist malicious cyberactivities. The authors propose PurpCode, a post-training recipe with two stages: Rule Learning, which explicitly teaches the model to reference cybersafety rules so that it generates vulnerability-free code and avoids facilitating malicious activity, and Reinforcement Learning, which optimizes safety while preserving utility through multi-objective reward mechanisms. A key ingredient is internal red-teaming, used to synthesize high-coverage prompts that induce unsafe behavior for alignment training. The resulting reasoning-based coding model is PurpCode-32B.
Link: https://arxiv.org/abs/2507.19060
Authors: Jiawei Liu,Nirav Diwan,Zhe Wang,Haoyu Zhai,Xiaona Zhou,Kiet A. Nguyen,Tianjiao Yu,Muntasir Wahed,Yinlin Deng,Hadjer Benkraouda,Yuxiang Wei,Lingming Zhang,Ismini Lourentzou,Gang Wang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.
[NLP-27] Closing the Modality Gap for Mixed Modality Search
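[Quick Read]: This paper addresses mixed modality search, where contrastive vision-language models such as CLIP show a pronounced modality gap in the embedding space that causes intra-modal ranking bias and inter-modal fusion failure. The key contribution is GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. On the MixBench benchmark, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75x less compute.
Link: https://arxiv.org/abs/2507.19054
Authors: Binxu Li,Yuhui Zhang,Xiaohan Wang,Weixin Liang,Ludwig Schmidt,Serena Yeung-Levy
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Project page: this https URL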
Abstract:Mixed modality search – retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents – is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench – the first benchmark specifically designed for mixed modality search – GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
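To make the notion of a modality gap concrete: if image and text embeddings form two clusters, the distance between the cluster centroids is one simple way to measure the gap, and subtracting each modality's mean embedding removes that constant offset. The sketch below shows this per-modality mean subtraction on made-up 2-D vectors. Whether GR-CLIP's calibration works exactly this way is not stated in the abstract, so treat it as an assumption; all names and numbers are illustrative.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def remove_mean(vectors):
    """Shift a set of embeddings so that its centroid sits at the origin."""
    c = centroid(vectors)
    return [[x - m for x, m in zip(v, c)] for v in vectors]


def gap(a, b):
    """Euclidean distance between the centroids of two embedding sets."""
    ca, cb = centroid(a), centroid(b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5


# Made-up embeddings: images cluster on the right, texts on the left.
image_emb = [[1.0, 0.2], [0.8, 0.4]]
text_emb = [[-0.9, 0.1], [-0.7, 0.3]]

gap_before = gap(image_emb, text_emb)
gap_after = gap(remove_mean(image_emb), remove_mean(text_emb))
```

After centering, similarity scores between an image and a text are no longer dominated by which cluster each embedding came from, which is the kind of bias the abstract describes.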
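[NLP-28] MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
[Quick Read]: This paper studies when additional input modalities improve automatic speech recognition (ASR) accuracy in noisy environments. The key findings are that modalities provide complementary information: synchronized modalities (e.g., lip movements) help most at high noise levels, while unsynchronized modalities (e.g., image context) help most at moderate noise levels. Higher-quality visual representations, the input order of modalities, and their weights in the loss function all matter for accuracy. The study also shows that Mamba benefits from multimodality in much the same way as Transformers.
Link: https://arxiv.org/abs/2507.19037
Authors: Yiwen Guan,Viet Anh Trinh,Vivek Voleti,Jacob Whitehill
Affiliations: Worcester Polytechnic Institute
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: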
Abstract:Recent advances in multi-modal large language models (MLLMs) have opened new possibilities for unified modeling of speech, text, images, and other modalities. Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. Through experiments on synthetic and real-world data, we find that (1) harnessing more modalities usually improves ASR accuracy, as each modality provides complementary information, but the improvement depends on the amount of auditory noise. (2) Synchronized modalities (e.g., lip movements) are more useful at high noise levels whereas unsynchronized modalities (e.g., image context) are most helpful at moderate noise levels. (3) Higher-quality visual representations consistently improve ASR accuracy, highlighting the importance of developing more powerful visual encoders. (4) Mamba exhibits similar trends regarding the benefits of multimodality as do Transformers. (5) The input order of modalities as well as their weights in the loss function can significantly impact accuracy. These findings both offer practical insights and help to deepen our understanding of multi-modal speech recognition under challenging conditions.
[NLP-29] A Toolbox Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation
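[Quick Read]: This paper addresses the limits of single-tool augmentation of large language models (LLMs) on complex mathematical reasoning that requires precise multi-step reasoning. Prior methods usually finetune an LLM to select and invoke one tool per step; they do well on simpler benchmarks such as GSM8K but struggle on competition-level problems (e.g., MATH500, AIME). The key contribution is Multi-TAG, a finetuning-free, inference-only multi-tool aggregation framework: at each reasoning step the LLM invokes multiple tools concurrently and aggregates their diverse outputs to verify and refine the reasoning. Because no finetuning is needed, Multi-TAG applies directly to any LLM backbone, including large open-weight models that are expensive to finetune and proprietary frontier models that cannot be finetuned.
Link: https://arxiv.org/abs/2507.18973
Authors: Bohan Yao,Vikas Yadav
Affiliations: ServiceNow AI; University of Washington
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 3 figures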
Abstract:Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5% over state-of-the-art baselines.
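The idea of invoking several tools on the same step and aggregating their outputs can be illustrated with a simple majority vote. The abstract does not say how Multi-TAG aggregates, so the voting rule here is an assumption, and `aggregate_step` and the three toy tools are hypothetical.

```python
from collections import Counter


def aggregate_step(subproblem, tools):
    """Call every tool on the same sub-problem and return the majority answer."""
    candidates = []
    for tool in tools:
        try:
            candidates.append(tool(subproblem))
        except Exception:
            continue  # a failing tool simply contributes no vote
    if not candidates:
        raise ValueError("no tool produced an answer")
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / len(candidates)  # answer plus agreement ratio


# Three toy "tools" computing 1^2 + 2^2 + ... + n^2; the last one is deliberately buggy.
def closed_form(n):
    return n * (n + 1) * (2 * n + 1) // 6


def explicit_loop(n):
    return sum(i * i for i in range(1, n + 1))


def off_by_one(n):
    return sum(i * i for i in range(n))  # bug: stops at n - 1


answer, agreement = aggregate_step(10, [closed_form, explicit_loop, off_by_one])
```

Two independent tools agreeing outvotes the faulty one, and the agreement ratio gives a rough signal of how much to trust the step.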
[NLP-30] A Similarity Measure for Comparing Conversational Dynamics
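[Quick Read]: This paper addresses the lack of a robust automated method for comparing conversations in terms of their overall interactional dynamics, a capability that would improve the analysis of conversational data and allow more holistic evaluation of conversational agents. The key contribution is a similarity measure over conversational dynamics, together with a validation framework that tests how robustly the measure captures differences in dynamics and how sensitive it is to conversation topic. Applied to a large online community, the measure gives new insights into the role of situational power in conversations.
Link: https://arxiv.org/abs/2507.18956
Authors: Sang Min Jung,Kaixiang Zhang,Cristian Danescu-Niculescu-Mizil
Affiliations: Cornell University
Subjects: Computation and Language (cs.CL)
Comments: Code and demos available in ConvoKit (this https URL)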
Abstract:The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall “shape”. However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure’s utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.
[NLP-31] Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection
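【Quick Read】: This paper addresses the inefficiency of extracting information from legal documents, using automation to improve how key information is identified and summarized in judicial workflows. The key to its solution is the use of advanced natural language processing (NLP) techniques and machine learning algorithms to identify and extract core elements from large volumes of legal text and to generate high-quality summaries. This reduces the manual review burden on legal practitioners, lowers the risk of overlooking information, and substantially improves processing efficiency.
Link: https://arxiv.org/abs/2507.18952
Authors: Yongjie Li,Ruilin Nong,Jianan Liu,Lucas Evans
Affiliations: University of Utah; Tianjin University; University of Pennsylvania; Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments: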
Abstract:Legal document summarization represents a significant advancement towards improving judicial efficiency through the automation of key information detection. Our approach leverages state-of-the-art natural language processing techniques to meticulously identify and extract essential data from extensive legal texts, which facilitates a more efficient review process. By employing advanced machine learning algorithms, the framework recognizes underlying patterns within judicial documents to create precise summaries that encapsulate the crucial elements. This automation alleviates the burden on legal professionals, concurrently reducing the likelihood of overlooking vital information that could lead to errors. Through comprehensive experiments conducted with actual legal datasets, we demonstrate the capability of our method to generate high-quality summaries while preserving the integrity of the original content and enhancing processing times considerably. The results reveal marked improvements in operational efficiency, allowing legal practitioners to direct their efforts toward critical analytical and decision-making activities instead of manual reviews. This research highlights promising technology-driven strategies that can significantly alter workflow dynamics within the legal sector, emphasizing the role of automation in refining judicial processes.
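[NLP-32] Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics
【Quick Read】: This paper addresses the limited personalization and rigid learning paths of traditional education, which struggle to meet students' individual learning needs. The key to its solution is an Adaptive Learning System built on large language models (LLMs): LLM-powered analytics assess real-time student data and use the results to tailor curriculum design and resource recommendations, improving relevance and engagement. The framework's main strength is that it continuously adjusts instructional strategies so that content tracks each student's progress. Experimental results indicate marked gains in learner engagement and knowledge retention.
Link: https://arxiv.org/abs/2507.18949
Authors: Yongjie Li,Ruilin Nong,Jianan Liu,Lucas Evans
Affiliations: University of Utah; Tianjin University; University of Pennsylvania; Pennsylvania State University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: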
Abstract:Large language models (LLMs) are revolutionizing the field of education by enabling personalized learning experiences tailored to individual student needs. In this paper, we introduce a framework for Adaptive Learning Systems that leverages LLM-powered analytics for personalized curriculum design. This innovative approach uses advanced machine learning to analyze real-time data, allowing the system to adapt learning pathways and recommend resources that align with each learner’s progress. By continuously assessing students, our framework enhances instructional strategies, ensuring that the materials presented are relevant and engaging. Experimental results indicate a marked improvement in both learner engagement and knowledge retention when using a customized curriculum. Evaluations conducted across varied educational environments demonstrate the framework’s flexibility and positive influence on learning outcomes, potentially reshaping conventional educational practices into a more adaptive and student-centered model.
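[NLP-33] TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models
【Quick Read】: This paper addresses the cognitive overload caused by reading academic papers in traditional linear formats such as PDF and HTML, as well as the shortcomings of LLM-based chatbots for summarization: they lack nuanced understanding of specific sections, may produce unreliable information, and discard the document's navigational structure. The key to its solution is TreeReader, an LLM-augmented paper reader that decomposes a paper into an interactive tree. Each section is first shown as a concise LLM-generated summary, and the underlying details are available on demand. Users can therefore grasp core ideas quickly, explore sections of interest selectively, and verify summaries against the source text. By combining hierarchical summarization with interactive exploration, the tool aims to improve reading efficiency and comprehension of complex academic literature.
Link: https://arxiv.org/abs/2507.18945
Authors: Zijian Zhang,Pan Chen,Fangshi Du,Runlong Ye,Oliver Huang,Michael Liut,Alán Aspuru-Guzik
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: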
Abstract:Efficiently navigating and understanding academic papers is crucial for scientific progress. Traditional linear formats like PDF and HTML can cause cognitive overload and obscure a paper’s hierarchical structure, making it difficult to locate key information. While LLM-based chatbots offer summarization, they often lack nuanced understanding of specific sections, may produce unreliable information, and typically discard the document’s navigational structure. Drawing insights from a formative study on academic reading practices, we introduce TreeReader, a novel language model-augmented paper reader. TreeReader decomposes papers into an interactive tree structure where each section is initially represented by an LLM-generated concise summary, with underlying details accessible on demand. This design allows users to quickly grasp core ideas, selectively explore sections of interest, and verify summaries against the source text. A user study was conducted to evaluate TreeReader’s impact on reading efficiency and comprehension. TreeReader provides a more focused and efficient way to navigate and understand complex academic literature by bridging hierarchical summarization with interactive exploration.
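The tree structure described in the abstract, with a summary shown first and details on demand, might be modeled roughly as follows. The `Section` class, its field names, and the `render` helper are hypothetical and are not taken from TreeReader. The summaries here are hand-written placeholders, whereas the paper generates them with an LLM.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Section:
    title: str
    summary: str  # placeholder; TreeReader would generate this with an LLM
    text: str     # full source text, shown only when the node is expanded
    children: List["Section"] = field(default_factory=list)

def render(node: Section, expanded: Set[str], depth: int = 0) -> List[str]:
    """Show each section's summary unless the reader has expanded it."""
    body = node.text if node.title in expanded else node.summary
    lines = [f"{'  ' * depth}{node.title}: {body}"]
    for child in node.children:
        lines.extend(render(child, expanded, depth + 1))
    return lines

paper = Section("Paper", "A tree-based reader.", "Full abstract text.", [
    Section("Method", "Tree of summaries.", "Full method text."),
    Section("Study", "User study.", "Full study text."),
])
print("\n".join(render(paper, expanded={"Method"})))
```

Keeping `summary` and `text` on the same node is what lets a reader check a summary against its source, one of the design goals the abstract names.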
[NLP-34] LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation
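【Quick Read】: This paper addresses cross-lingual interference and ineffective parameter-sharing strategies in multilingual settings, with the aim of improving the quality of multilingual multimodal machine translation (MMT). The key to its solution is LLaVA-NeuMT, a framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. It introduces a layer selection mechanism that identifies the most informative layers for each language pair, and a neuron-level adaptation strategy that dynamically selects language-specific and language-agnostic neurons. While fine-tuning only 40% of the model parameters, it surpasses full fine-tuning and achieves state-of-the-art (SOTA) results.
Link: https://arxiv.org/abs/2507.18940
Authors: Jingxuan Wei,Caijun Jia,Qi Chen,Yujun Cai,Linzhuang Sun,Xiangxiang Zhang,Gaowei Wu,Bihui Yu
Affiliations: University of Chinese Academy of Sciences; The University of Queensland
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: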
Abstract:Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.
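[NLP-35] Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders
【Quick Read】: This paper addresses the markedly lower performance of multilingual large language models (LLMs) on medium- to low-resource languages compared with high-resource languages, in particular on benchmarks such as ARC-Challenge, MMLU, and HellaSwag. Its central finding is that these languages receive substantially lower activations in both early and deep layers (up to 26.27% lower in early layers, with a persistent 19.89% gap in deeper layers). The key to its solution is activation-aware fine-tuning via Low-Rank Adaptation (LoRA), which raises activation levels for medium- to low-resource languages (for example, gains of 87.69% for Malayalam and 86.32% for Hindi) while keeping English retention at about 91%. Benchmark improvements after fine-tuning are modest but consistent, which points to activation alignment as a key factor in multilingual LLM performance.
Link: https://arxiv.org/abs/2507.18918
Authors: Richmond Sin Jing Xuan,Jalil Huseynov,Yang Zhang
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: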
Abstract:Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium to low resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), Italian (it), medium to low resource languages including Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi), with English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium to low resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.
[NLP-36] Mining Contextualized Visual Associations from Images for Creativity Understanding
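【Quick Read】: This paper addresses a limitation of vision-language models such as CLIP: they are trained on short, predominantly literal alt-text, so they struggle to capture the implicit associations that underlie human creative understanding of images. The key to its solution is a method, scalable to any unlabeled dataset, for mining contextualized associations for the salient visual elements of an image and then using those associations to generate high-quality creative captions at increasing degrees of abstraction. With this method the authors build a dataset of 1.7 million creative captions for MSCOCO images. Human evaluation confirms that the captions remain visually grounded while becoming recognizably more abstract. Fine-tuning a visual encoder on the dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization.
Link: https://arxiv.org/abs/2507.18915
Authors: Ananya Sahu,Amith Ananthram,Kathleen McKeown
Affiliations: Columbia University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: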
Abstract:Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.
[NLP-37] A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions
【Quick Read】: This paper addresses hallucination and outdated knowledge, two problems that are common when large language models (LLMs) are used in practice, and reviews Retrieval-Augmented Generation (RAG) as the core remedy. RAG couples LLMs with information retrieval systems so that generation can draw on dynamic, external knowledge sources, which improves factual accuracy, contextual relevance, and timeliness. The key technical components are retrieval mechanisms (for example, vector similarity matching), sequence-to-sequence generation models, and fusion strategies. Together they form a scalable framework for knowledge-intensive NLP that can be deployed in enterprise systems.
Link: https://arxiv.org/abs/2507.18910
Authors: Agada Joseph Oche,Ademola Glory Folashade,Tirthankar Ghosal,Arpan Biswas
Affiliations: Bredesen Center for Interdisciplinary Research, University of Tennessee, Knoxville, USA, 37996; National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, USA, 37830; University of Tennessee-Oak Ridge Innovation Institute, University of Tennessee, Knoxville, USA, 37996
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages, 2 figures
Abstract:Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open-domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components (retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies) are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG’s rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.
[NLP-38] Large language models provide unsafe answers to patient-posed medical questions
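【Quick Read】: This paper addresses the patient-safety risks that arise from the public's widespread use of large language model (LLM) chatbots for medical advice. In a physician-led red-teaming study, it evaluates the safety of four mainstream public chatbots (Claude, Gemini, GPT-4o, and Llama3-70B) on realistic medical questions, both quantitatively and qualitatively. The key to its approach is a new dataset, HealthAdvice, together with an evaluation framework that combines quantitative metrics with qualitative analysis. The study finds that the rate of problematic responses ranges from 21.6% to 43.2% across models and identifies response patterns that could lead to serious patient harm, providing empirical evidence and directions for improving the clinical safety of generative AI.
Link: https://arxiv.org/abs/2507.18905
Authors: Rachel L. Draelos,Samina Afreen,Barbara Blasko,Tiffany Brazile,Natasha Chase,Dimple Desai,Jessica Evert,Heather L. Gardner,Lauren Herrmann,Aswathy Vaikom House,Stephanie Kass,Marianne Kavan,Kirshma Khemani,Amanda Koire,Lauren M. McDonald,Zahraa Rabeeah,Amy Shah
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 20 pages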
Abstract:Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots–Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta–on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
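As a quick consistency check on the figures in the abstract, the percentages can be converted back into approximate response counts. The counts below are back-of-the-envelope inferences, not numbers reported in the abstract, and they assume that each of the four chatbots answered all 222 questions. The `implied_count` helper is introduced here only for illustration.

```python
def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of responses implied by a reported percentage."""
    return round(rate_percent / 100 * total)

QUESTIONS = 222  # patient-posed questions, from the abstract
CHATBOTS = 4     # Claude, Gemini, GPT-4o, Llama3-70B

print(QUESTIONS * CHATBOTS)            # 888 responses, matching the abstract
print(implied_count(21.6, QUESTIONS))  # about 48 problematic responses (Claude)
print(implied_count(43.2, QUESTIONS))  # about 96 problematic responses (Llama)
```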
[NLP-39] SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
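【Quick Read】: This paper addresses the limited number of languages that current large language models (LLMs) support for translation, and in particular the high token cost of dictionary-based prompting methods that use every available dictionary. The central challenge is to reduce token usage while preserving translation quality, that is, to offer a flexible trade-off between the two. The key to its solution is a new task, Automatic Dictionary Selection (ADS), and a method called SLoW (Select Low-frequency Words), which reduces token usage by preferring dictionaries for low-frequency words. Frequency estimation needs no access to the training data (estimates from public resources are effective), and the method keeps the main advantage of dictionary-based approaches: no additional tuning of the LLM. Experiments on 100 languages show that SLoW surpasses strong baselines and saves tokens, and for many languages it even exceeds the translation quality of the full-dictionary baseline.
Link: https://arxiv.org/abs/2507.18902
Authors: Hongyuan Lu,Zixuan Li,Zefan Zhang,Wai Lam
Affiliations: The Chinese University of Hong Kong; Southeast University; Jilin University
Categories: Computation and Language (cs.CL)
Comments: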
Abstract:There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call Select Low-frequency Words! (SLoW) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline. (Footnote: A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT, Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
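The selection rule in the abstract, which prefers dictionary entries for lower-frequency words, can be sketched as below. The function name, the `budget` parameter, and the toy frequency table are assumptions for illustration. The abstract does not specify how SLoW sets its cutoff or where its frequency estimates come from beyond "public resources".

```python
from typing import Dict, List

def select_low_frequency_entries(
    source_words: List[str],
    dictionary: Dict[str, str],
    frequency: Dict[str, int],
    budget: int,
) -> Dict[str, str]:
    """Keep dictionary entries for the `budget` rarest words in the source."""
    candidates = {w for w in source_words if w in dictionary}
    # Unseen words count as frequency 0, so they are treated as the rarest.
    ranked = sorted(candidates, key=lambda w: (frequency.get(w, 0), w))
    return {w: dictionary[w] for w in ranked[:budget]}

# Toy data: the "translations" are placeholders, not real dictionary entries.
words = ["the", "cat", "ate", "quinoa"]
dictionary = {w: f"<translation of {w}>" for w in words}
frequency = {"the": 1000, "cat": 50, "ate": 40, "quinoa": 2}
print(select_low_frequency_entries(words, dictionary, frequency, budget=2))
```

A fixed budget makes the token cost predictable, which is one simple way to expose the trade-off between token consumption and translation quality that the abstract describes.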
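[NLP-40] REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? ACL2025
【Quick Read】: This paper addresses the lack of automation in assessing the reproducibility of social science research, where manual assessment is costly and slow. Existing benchmarks only check whether results can be reproduced from the provided code and data, ignore consistency with the paper itself, oversimplify real-world scenarios, and lack diversity. The key to its solution is REPRO-Bench, a benchmark of 112 task instances, each corresponding to a social science paper with a publicly available reproduction report. AI agents must make an end-to-end reproducibility judgment from the original paper PDF and the reproduction package, with complexity comparable to real-world assessments. The authors also develop REPRO-Agent, which improves on the highest accuracy achieved by existing agents by 71%, although the results indicate that more advanced agents are still needed.
Link: https://arxiv.org/abs/2507.18901
Authors: Chuxuan Hu,Liyun Zhang,Yeji Lim,Aum Wadhwani,Austin Peters,Daniel Kang
Affiliations: University of Illinois Urbana-Champaign; Shanghai Jiao Tong University; University of Chicago
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Findings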
Abstract:Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at this https URL.
[NLP-41] NUTMEG: Separating Signal From Noise in Annotator Disagreement
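【Quick Read】: This paper addresses conflicting annotations in human-labeled data that arise from differences among annotators. Traditional aggregation methods treat such conflicts as noise to be removed and overlook the possibility that systematic disagreement between annotators is itself a signal. The key to its solution is NUTMEG, a new Bayesian model that uses information about annotator backgrounds to separate noise from genuine disagreement: it identifies and removes low-quality or spam annotations while preserving systematic disagreements, and so recovers ground-truth labels more accurately. Experiments on synthetic data and on downstream tasks show that NUTMEG outperforms traditional aggregation methods, which underlines the importance of modeling both annotator competence and systematic disagreement.
Link: https://arxiv.org/abs/2507.18890
Authors: Jonathan Ivey,Susan Gauch,David Jurgens
Affiliations: University of Arkansas; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: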
Abstract:NLP models often rely on human-labeled data for training and evaluation. Many approaches crowdsource this data from a large number of annotators with varying skills, backgrounds, and motivations, resulting in conflicting annotations. These conflicts have traditionally been resolved by aggregation methods that assume disagreements are errors. Recent work has argued that for many tasks annotators may have genuine disagreements and that variation should be treated as signal rather than noise. However, few models separate signal and noise in annotator disagreement. In this work, we introduce NUTMEG, a new Bayesian model that incorporates information about annotator backgrounds to remove noisy annotations from human-labeled training data while preserving systematic disagreements. Using synthetic data, we show that NUTMEG is more effective at recovering ground-truth from annotations with systematic disagreement than traditional aggregation methods. We provide further analysis characterizing how differences in subpopulation sizes, rates of disagreement, and rates of spam affect the performance of our model. Finally, we demonstrate that downstream models trained on NUTMEG-aggregated data significantly outperform models trained on data from traditional aggregation methods. Our results highlight the importance of accounting for both annotator competence and systematic disagreements when training on human-labeled data.
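[NLP-42] MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service
【Quick Read】: This paper addresses the difficulty that traditional intent-based dialogue systems have with dynamic, multi-turn interactions in e-commerce customer service. Its core solution is MindFlow+, a self-evolving dialogue agent that combines large language models (LLMs) with imitation learning and offline reinforcement learning (RL). It makes two key contributions. The first is tool-augmented demonstration construction, which lets the model learn effective tool use from knowledge-enhanced, agentic (ReAct-style) interactions. The second is reward-conditioned data modeling, which uses reward signals to align responses with task-specific goals. Experiments show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy, which demonstrates the potential of combining LLMs, tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.
Link: https://arxiv.org/abs/2507.18884
Authors: Ming Gong,Xucheng Huang,Ziheng Xu,Vijayan K. Asari
Affiliations: Xiaoduo AI Lab; University of Dayton
Categories: Computation and Language (cs.CL)
Comments: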
Abstract:High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model’s role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs, tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.
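[NLP-43] PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning
【Quick Read】: This paper addresses the drop in performance of retrieval-augmented generation (RAG) when the retrieved context contains confusing semi-relevant passages or when a question requires deep contextual understanding and reasoning. The key to its solution is PrismRAG, an efficient fine-tuning framework built on two mechanisms. First, it trains the model on question-answer pairs that mix gold evidence with subtle distractor passages, so that the model learns to recognize and discount distracting information. Second, it instills reasoning-centric habits that lead the large language model (LLM) to plan, rationalize, and synthesize on its own, without relying on extensive human-engineered instructions.
Link: https://arxiv.org/abs/2507.18857
Authors: Mohammad Kachuee,Teja Gollapudi,Minseok Kim,Yin Huang,Kai Sun,Xiao Yang,Jiaqi Wang,Nirav Shah,Yue Liu,Aaron Colak,Anuj Kumar,Wen-tau Yih,Xin Luna Dong
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: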
Abstract:Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions requires deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human-engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.
[NLP-44] CueBuddy: helping non-native English speakers navigate English-centric STEM education
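【Quick Read】: This paper addresses an educational inequity in STEM courses, especially in the Global South: students who are not fluent in English fall behind their English-fluent peers because they struggle with technical English terminology, even though their scientific preparation is comparable. Most of these students completed their prerequisite courses in a lower-resource language. To address this, the author proposes CueBuddy, whose core idea is to provide real-time "lexical cues" by combining technical keyword spotting with multilingual glossary lookup, so that students can follow complex English jargon without losing focus on the lecture. The approach is motivated by the fact that live speech translation models can be too expensive to deploy at scale and often struggle with technical content.
Link: https://arxiv.org/abs/2507.18827
Authors: Pranav Gupta
Affiliations: Lowe’s
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: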
Abstract:Students across the world in STEM classes, especially in the Global South, fall behind their peers who are more fluent in English, despite being at par with them in terms of scientific prerequisites. While many of them are able to follow everyday English with ease, key terms in English stay challenging. In most cases, such students have had most of their course prerequisites in a lower resource language. Live speech translation to lower resource languages is a promising area of research; however, models for speech translation can be too expensive on a large scale and often struggle with technical content. In this paper, we describe CueBuddy, which aims to remediate these issues by providing real-time “lexical cues” through technical keyword spotting along with real-time multilingual glossary lookup to help students stay up to speed with complex English jargon without disrupting their concentration on the lecture. We also describe the limitations and future extensions of our approach.
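The "lexical cue" idea, spotting technical terms and looking them up in a glossary, might look roughly like this on a chunk of transcribed lecture text. This is a text-only sketch under assumptions: the `lexical_cues` function is hypothetical, the glossary values are placeholders rather than real translations, and CueBuddy's actual speech pipeline is not described in the abstract.

```python
import re
from typing import Dict, List, Tuple

def lexical_cues(transcript_chunk: str, glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (term, gloss) pairs for glossary terms found in the chunk."""
    lowered = transcript_chunk.lower()
    cues = []
    for term, gloss in glossary.items():
        # Whole-word match so that "grad" does not fire inside "gradient".
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
            cues.append((term, gloss))
    return cues

# Placeholder glosses stand in for entries in the student's own language.
glossary = {"eigenvalue": "<gloss: eigenvalue>", "gradient": "<gloss: gradient>"}
print(lexical_cues("The eigenvalue of this matrix is real.", glossary))
```

Looking up only the spotted terms, instead of translating the whole lecture, is what keeps the approach cheap compared with live speech translation.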
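[NLP-45] Evaluating Code-Mixing in LLMs Across 18 Languages
【Quick Read】: This paper addresses the inadequate evaluation of large language models (LLMs) on code-mixed text and the shortage of methods for generating such data. Existing benchmarks such as LinCE and GLUECoS cover only narrow sets of language pairs and tasks, so they cannot fully measure how well LLMs handle code-mixing, and the lack of effective methods for synthesizing code-mixed text has held back research. The key to its solution is a new synthetic data generation method that combines word substitution with GPT-4 prompting, together with a systematic evaluation on code-mixed datasets covering 18 languages from seven language families. The evaluation reveals consistent underperformance of LLMs on code-mixed data involving multiple language families and suggests that larger training data, greater model scale, and few-shot learning could help.
Link: https://arxiv.org/abs/2507.18791
Authors: Yilun Yang,Yekun Chai
Affiliations: Baidu
Categories: Computation and Language (cs.CL)
Comments: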
Abstract:Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs’ performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.
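The word-substitution half of the generation method can be illustrated with a minimal sketch. The `substitute_words` function, the `ratio` parameter, and the tiny English-to-Spanish lexicon are assumptions for illustration. The GPT-4 prompting step that the paper combines with substitution is not reproduced here.

```python
import random
from typing import Dict

def substitute_words(sentence: str, lexicon: Dict[str, str], ratio: float, seed: int = 0) -> str:
    """Swap a fraction of translatable words for their other-language form."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    mixed = []
    for token in sentence.split():
        replacement = lexicon.get(token.lower())
        if replacement is not None and rng.random() < ratio:
            mixed.append(replacement)
        else:
            mixed.append(token)
    return " ".join(mixed)

lexicon = {"house": "casa", "water": "agua"}
print(substitute_words("the house has water", lexicon, ratio=1.0))
```

Substitution alone tends to produce stilted text, which is presumably why the paper pairs it with an LLM prompting step.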
[NLP-46] ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting
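【Quick Read】: This paper addresses multilingual text detoxification, that is, identifying and rewriting toxic text across many languages. The key to its solution is a robust multilingual detoxification pipeline with three components: lexicon-guided tagging based on the multilingual_toxic_lexicon, which improves precision and cross-lingual generalization; a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo) that rewrites the text; and an iterative classifier-based gating mechanism that gives finer control over residual toxicity. The approach outperforms baseline and backtranslation methods across multiple languages and shows consistent detoxification strength.
Link: https://arxiv.org/abs/2507.18769
Authors: Nicole Lai-Lopez,Lusha Wang,Su Yuan,Liza Zhang
Affiliations: University of British Columbia
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, 3 tables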
Abstract:In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team: a robust multilingual text detoxification pipeline that integrates lexicon-guided tagging, a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo) and an iterative classifier-based gatekeeping mechanism. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon to guide detoxification with greater precision and cross-lingual generalization. Our final model achieves the highest STA (0.922) from our previous attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets. It also achieved xCOMET scores of 0.793 (dev) and 0.787 (test). This performance outperforms baseline and backtranslation methods across multiple languages, and shows strong generalization in high-resource settings (English, Russian, French). Despite some trade-offs in SIM, the model demonstrates consistent improvements in detoxification strength. In the competition, our team achieved ninth place with a score of 0.612.
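The three stages named in the abstract (lexicon-guided tagging, rewriting, and an iterative classifier gate) can be wired together as in the sketch below. Everything here is a stand-in: the `<toxic>` tag format, the function names, the toy rewriter that simply drops tagged words, and the lexicon-based classifier are assumptions, not the team's actual components.

```python
import re
from typing import Callable, Set

def tag_toxic(text: str, lexicon: Set[str]) -> str:
    """Wrap lexicon hits in <toxic> tags so the rewriter knows what to change."""
    tagged = []
    for token in text.split():
        hit = token.lower().strip(".,!?") in lexicon
        tagged.append(f"<toxic>{token}</toxic>" if hit else token)
    return " ".join(tagged)

def detoxify(text: str, lexicon: Set[str], rewrite: Callable[[str], str],
             is_toxic: Callable[[str], bool], max_rounds: int = 3) -> str:
    """Rewrite repeatedly until the classifier gate passes or rounds run out."""
    current = text
    for _ in range(max_rounds):
        if not is_toxic(current):  # gate: stop once the classifier is satisfied
            break
        current = rewrite(tag_toxic(current, lexicon))
    return current

# Toy stand-ins for the fine-tuned rewriter and the toxicity classifier.
LEXICON = {"stupid"}

def drop_tagged(tagged: str) -> str:
    return re.sub(r"\s*<toxic>.*?</toxic>", "", tagged).strip()

def lexicon_classifier(text: str) -> bool:
    return any(word in text.lower() for word in LEXICON)

print(detoxify("that was a stupid idea", LEXICON, drop_tagged, lexicon_classifier))
```

The gate means clean inputs pass through untouched, and the round limit bounds the cost when the rewriter fails to remove the toxicity.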
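[NLP-47] The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages
【Quick Read】: This paper addresses the weak performance of multilingual models such as mBERT and XLM-RoBERTa on languages that share the Arabic script but differ markedly in orthographic norms and cultural context, such as Kurdish Sorani, Arabic, Persian, and Urdu. The key to its solution is the Arabic Script RoBERTa (AS-RoBERTa) family of RoBERTa-based models, whose pre-training focuses on the script features and statistics specific to each language rather than on broad language coverage. Experiments show that this script-focused pre-training improves classification performance by 2 to 5 percentage points over the baselines, and an ablation study confirms that script-specific pre-training is central to the gains.
Link: https://arxiv.org/abs/2507.18762
Authors: Abdulhady Abas Abdullah,Amir H. Gandomi,Tarik A Rashid,Seyedali Mirjalili,Laith Abualigah,Milena Živković,Hadi Veisi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: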
Abstract:In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
[NLP-48] Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
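[Quick Read]: This paper addresses the susceptibility of language models (LMs) to in-context reward hacking, where a model exploits loopholes in a flawed or tainted specification to obtain a high score without actually satisfying the user's intent. The key to the solution is Specification Self-Correction (SSC), a test-time framework that repairs the specification through a multi-step inference process: the model first generates a response from the original specification, then critiques that output, then revises the specification it was given to remove the exploitable loophole, and finally generates a more robust response from the revised specification. The method requires no weight updates. In creative writing and agentic coding tasks, models initially game tainted specifications in 50-70% of cases, and SSC reduces this vulnerability by over 90%.
Link: https://arxiv.org/abs/2507.18742
Authors: Víctor Gallego
Affiliations: Komorebi AI Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to SCALR Workshop @ COLM 2025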
Abstract:Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user’s true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at this https URL .
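The four-step SSC loop described in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the `lm` callable, the prompt wording, and the function name `specification_self_correction` are all assumptions made for the example.

```python
from typing import Callable, Dict


def specification_self_correction(
    lm: Callable[[str], str], spec: str, task: str
) -> Dict[str, str]:
    """Run a respond -> critique -> revise-spec -> respond loop.

    `lm` is a hypothetical text-in, text-out callable standing in for any
    language model; the prompts below are illustrative only.
    """
    # Step 1: respond using the (possibly tainted) specification.
    first = lm(f"SPEC:\n{spec}\n\nTASK:\n{task}\n\nWrite a response.")
    # Step 2: critique the response against the specification.
    critique = lm(
        f"SPEC:\n{spec}\n\nRESPONSE:\n{first}\n\n"
        "Does the response exploit a flaw in the spec? Explain."
    )
    # Step 3: revise the specification itself to close the loophole.
    revised_spec = lm(
        f"SPEC:\n{spec}\n\nCRITIQUE:\n{critique}\n\n"
        "Rewrite the spec so the flaw can no longer be exploited."
    )
    # Step 4: respond again, now guided by the revised specification.
    final = lm(f"SPEC:\n{revised_spec}\n\nTASK:\n{task}\n\nWrite a response.")
    return {
        "initial_response": first,
        "critique": critique,
        "revised_spec": revised_spec,
        "final_response": final,
    }
```

Because everything happens through prompting, the sketch needs no access to model weights, which matches the test-time framing in the abstract.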
[NLP-49] People Are Highly Cooperative with Large Language Models Especially When Communication Is Possible or Following Human Interaction
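[Quick Read]: This paper asks whether cooperative behavior changes when people interact with a large language model (LLM) instead of a human in business settings, where communication, collaboration, and stakeholder trust are paramount. Using the Prisoner's Dilemma game as a proxy for real-world managerial and economic scenarios, the study compares cooperation with human opponents, a classic bot, and an LLM (GPT, interacting in real time). The key findings are as follows. First, cooperation rates with LLMs were about 10-15 percentage points lower than with humans but remained high. Second, allowing communication raised the likelihood of cooperation with humans and LLMs equally (by 88%), which is surprising for LLMs given the common assumption that their non-human nature would reduce willingness to cooperate. Third, prior interaction with humans produced a positive spillover, increasing subsequent cooperation with LLMs. The results suggest that, with careful design, LLMs can effectively support human decision-making and collaboration in business settings that have a cooperative component.
Link: https://arxiv.org/abs/2507.18639
Authors: Paweł Niszczota,Tomasz Grzegorczyk,Alexander Pastukhov
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: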
Abstract:Machines driven by large language models (LLMs) have the potential to augment humans across various tasks, a development with profound implications for business settings where effective communication, collaboration, and stakeholder trust are paramount. To explore how interacting with an LLM instead of a human might shift cooperative behavior in such settings, we used the Prisoner’s Dilemma game – a surrogate of several real-world managerial and economic scenarios. In Experiment 1 (N=100), participants engaged in a thirty-round repeated game against a human, a classic bot, and an LLM (GPT, in real-time). In Experiment 2 (N=192), participants played a one-shot game against a human or an LLM, with half of them allowed to communicate with their opponent, enabling LLMs to leverage a key advantage over older-generation machines. Cooperation rates with LLMs – while lower by approximately 10-15 percentage points compared to interactions with human opponents – were nonetheless high. This finding was particularly notable in Experiment 2, where the psychological cost of selfish behavior was reduced. Although allowing communication about cooperation did not close the human-machine behavioral gap, it increased the likelihood of cooperation with both humans and LLMs equally (by 88%), which is particularly surprising for LLMs given their non-human nature and the assumption that people might be less receptive to cooperating with machines compared to human counterparts. Additionally, cooperation with LLMs was higher following prior interaction with humans, suggesting a spillover effect in cooperative behavior. Our findings validate the (careful) use of LLMs by businesses in settings that have a cooperative component.
[NLP-50] Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?
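[Quick Read]: This paper addresses unsupervised speech segmentation: splitting unlabeled speech into word-like units and clustering them into a lexicon. The core challenge is determining word boundaries reliably and clustering the resulting segments well. The paper compares two paradigms. The bottom-up method predicts word boundaries directly from the dissimilarity between adjacent self-supervised features and then clusters the resulting segments. The top-down method, an updated version of the ES-KMeans dynamic programming algorithm, iteratively uses K-means to update its boundaries, so that clustering feeds back into boundary selection. Although the top-down mechanism helps under some conditions (for example, depending on the candidate boundaries), the simple bottom-up strategy achieves comparable performance on the five-language ZeroSpeech benchmarks while running nearly five times faster, indicating that top-down information is often not necessary. The authors also identify the clustering step as the bottleneck for both methods and recommend that future work focus on better clustering techniques and more discriminative word-like representations.
Link: https://arxiv.org/abs/2507.19204
Authors: Simon Malan,Benjamin van Niekerk,Herman Kamper
Affiliations: Stellenbosch University
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 5 figures, 5 tables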
Abstract:We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: this https URL.
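The bottom-up idea of placing boundaries where adjacent features are dissimilar can be sketched as below. The abstract does not specify the distance measure or the peak-picking rule, so cosine dissimilarity, the local-maximum test, and the `threshold` parameter are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_dissimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat a zero vector as maximally dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def predict_boundaries(
    features: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return frame indices i such that a boundary lies between frames i-1 and i.

    A boundary is placed at each local peak of the adjacent-frame
    dissimilarity curve that also reaches `threshold` (an assumed rule).
    """
    curve = [
        cosine_dissimilarity(features[i - 1], features[i])
        for i in range(1, len(features))
    ]
    boundaries = []
    for j, value in enumerate(curve):
        left = curve[j - 1] if j > 0 else float("-inf")
        right = curve[j + 1] if j + 1 < len(curve) else float("-inf")
        if value >= threshold and value >= left and value > right:
            boundaries.append(j + 1)
    return boundaries
```

The resulting segments would then be clustered in a separate step, which is what makes the approach bottom-up: clustering never feeds back into the boundaries.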
[NLP-51] FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems INTERSPEECH2025
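[Quick Read]: This paper addresses the lack of effective evaluation metrics for full-duplex spoken dialogue systems (FDSDS), particularly for measuring performance during user interruptions. Existing benchmarks do not adequately cover core FDSDS capabilities such as responding to interruptions in real time, managing delays, and remaining robust in challenging conditions such as noise. The key to the solution is a comprehensive benchmarking pipeline built on large language models (LLMs), text-to-speech (TTS), and automatic speech recognition (ASR), with a diverse set of novel metrics that quantify how well and how stably a system responds to interruptions in simulated interactions. Applying the pipeline to three open-source FDSDS models (Moshi, Freeze-omni, and VITA-1.5) demonstrates its usefulness and shows that all of them still struggle under frequent disruptions and noisy conditions.
Link: https://arxiv.org/abs/2507.19040
Authors: Yizhou Peng,Yi-Wen Chao,Dianwen Ng,Yukun Ma,Chongjia Ni,Bin Ma,Eng Siong Chng
Affiliations: Alibaba-NTU Global e-Sustainability CorpLab; College of Computing and Data Science; Alibaba
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2025. 5 pages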
Abstract:Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.
Computer Vision
[CV-0] HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars ICCV2025
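[Quick Read]: This paper addresses the difficulty of disentangling face and hair in existing 3D head avatar priors, which model the two holistically. Such models struggle to separate face and hair features, especially when data is limited, and do not support flexible, controllable 3D face and hairstyle swapping. The key to the solution is a prior model that explicitly accounts for hair compositionality. A synthetic hairless data pipeline, which uses a diffusion prior to estimate hairless geometry and texture, yields paired captures with and without hair. These pairs are used to train disentangled priors that learn separate latent spaces for face and hair, with compositionality serving as an inductive bias that encourages effective separation. The design allows face and hair components to be transferred between avatars while preserving identity, and the model can be fine-tuned from a few monocular captures to produce high-fidelity, hair-compositional 3D avatars, improving flexibility and expressiveness in practical applications.
Link: https://arxiv.org/abs/2507.19481
Authors: Byungjun Kim,Shunsuke Saito,Giljoo Nam,Tomas Simon,Jason Saragih,Hanbyul Joo,Junxuan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project Page: this https URL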
Abstract:We present a universal prior model for 3D head avatars with explicit hair compositionality. Existing approaches to build generalizable priors for 3D head avatars often adopt a holistic modeling approach, treating the face and hair as an inseparable entity. This overlooks the inherent compositionality of the human head, making it difficult for the model to naturally disentangle face and hair representations, especially when the dataset is limited. Furthermore, such holistic models struggle to support applications like 3D face and hairstyle swapping in a flexible and controllable manner. To address these challenges, we introduce a prior model that explicitly accounts for the compositionality of face and hair, learning their latent spaces separately. A key enabler of this approach is our synthetic hairless data creation pipeline, which removes hair from studio-captured datasets using estimated hairless geometry and texture derived from a diffusion prior. By leveraging a paired dataset of hair and hairless captures, we train disentangled prior models for face and hair, incorporating compositionality as an inductive bias to facilitate effective separation. Our model’s inherent compositionality enables seamless transfer of face and hair components between avatars while preserving identity. Additionally, we demonstrate that our model can be fine-tuned in a few-shot manner using monocular captures to create high-fidelity, hair-compositional 3D head avatars for unseen subjects. These capabilities highlight the practical applicability of our approach in real-world scenarios, paving the way for flexible and expressive 3D avatar generation.
[CV-1] DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations
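[Quick Read]: This paper addresses the limitations of scene representations in conventional SLAM systems, which struggle to model complex scenes with both high fidelity and efficiency. To improve neural implicit (Neural Radiance Field, NeRF) and explicit (3D Gaussian Splatting, 3DGS) representations in SLAM, the authors propose the DINO-SLAM strategy. Its core is a Scene Structure Encoder (SSE) that enriches raw DINO features into Enhanced DINO (EDINO) features, which capture hierarchical scene elements and their structural relationships more comprehensively. Building on EDINO features, the paper proposes two foundational paradigms for NeRF-based and 3DGS-based SLAM, which outperform state-of-the-art methods on the Replica, ScanNet, and TUM benchmarks.
Link: https://arxiv.org/abs/2507.19474
Authors: Ziren Gong,Xiaohan Li,Fabio Tosi,Youmin Zhang,Stefano Mattoccia,Jun Wu,Matteo Poggi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: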
Abstract:This paper presents DINO-SLAM, a DINO-informed design strategy to enhance neural implicit (Neural Radiance Field – NeRF) and explicit representations (3D Gaussian Splatting – 3DGS) in SLAM systems through more comprehensive scene representations. To this end, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO ones (EDINO) to capture hierarchical scene elements and their structural relationships. Building upon it, we propose two foundational paradigms for NeRF and 3DGS SLAM systems integrating EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM benchmarks compared to state-of-the-art methods.
[CV-2] Efficient Lines Detection for Robot Soccer
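[Quick Read]: This paper addresses self-localization in robot soccer, where the core challenge is detecting field lines and boundary features efficiently and accurately enough for reliable pose estimation. The key to the solution is a lightweight field line detector: the ELSED algorithm is extended with a classification step that analyzes RGB color transitions to identify lines that belong to the field, and a threshold calibration pipeline based on Particle Swarm Optimization (PSO) tunes detection performance from only a small number of annotated samples. The method achieves accuracy comparable to a state-of-the-art deep learning model at higher processing speed, making it suitable for real-time use on low-power robotic platforms.
Link: https://arxiv.org/abs/2507.19469
Authors: João G. Melo,João P. Mafaldo,Edna Barros
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 12 pages, 8 figures, RoboCup Symposium 2025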
Abstract:Self-localization is essential in robot soccer, where accurate detection of visual field features, such as lines and boundaries, is critical for reliable pose estimation. This paper presents a lightweight and efficient method for detecting soccer field lines using the ELSED algorithm, extended with a classification step that analyzes RGB color transitions to identify lines belonging to the field. We introduce a pipeline based on Particle Swarm Optimization (PSO) for threshold calibration to optimize detection performance, requiring only a small number of annotated samples. Our approach achieves accuracy comparable to a state-of-the-art deep learning model while offering higher processing speed, making it well-suited for real-time applications on low-power robotic platforms.
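To make the PSO-based threshold calibration concrete, here is a generic particle swarm optimizer applied to a toy cost. It is a sketch under stated assumptions: the inertia and acceleration constants are common textbook values, and `toy_detection_error` is a hypothetical stand-in for the real objective (for example, a detection error measured on the annotated samples), which the abstract does not describe.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_minimize(
    cost: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    n_particles: int = 20,
    n_iters: int = 100,
    seed: int = 0,
    w: float = 0.7,   # inertia weight (assumed value)
    c1: float = 1.5,  # pull toward each particle's own best (assumed value)
    c2: float = 1.5,  # pull toward the swarm's best (assumed value)
) -> Tuple[List[float], float]:
    """Minimize `cost` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (
                    w * vel[i][d]
                    + c1 * r1 * (pbest[i][d] - pos[i][d])
                    + c2 * r2 * (gbest[d] - pos[i][d])
                )
                lo, hi = bounds[d]
                # Clamp the new position so thresholds stay in their valid range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost


def toy_detection_error(thresholds: Sequence[float]) -> float:
    """Hypothetical cost with its minimum at thresholds (0.3, 0.7)."""
    return (thresholds[0] - 0.3) ** 2 + (thresholds[1] - 0.7) ** 2
```

A call such as `pso_minimize(toy_detection_error, [(0.0, 1.0), (0.0, 1.0)])` returns the best thresholds found and their cost. PSO needs only cost evaluations, not gradients, which suits a detector whose output is a non-differentiable function of its thresholds.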
[CV-3] Back to the Features: DINO as a Foundation for Video World Models
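[Quick Read]: This paper addresses the limited generalization and physical understanding of video world models, that is, how to build a general-purpose video prediction model that captures the temporal dynamics of diverse scenes and shows strong intuitive physics understanding. The key to the solution is to map video frames into the latent space of a pre-trained image encoder (DINOv2) and train a future-frame predictor in that space on a large-scale uncurated video dataset. The approach improves performance on tasks such as segmentation and depth forecasting, and the predictor can be fine-tuned on observation-action trajectories to obtain an action-conditioned world model for simulating candidate trajectories and planning in latent space.
Link: https://arxiv.org/abs/2507.19468
Authors: Federico Baldassarre,Marc Szafraniec,Basile Terver,Vasil Khalidov,Francisco Massa,Yann LeCun,Patrick Labatut,Maximilian Seitzer,Piotr Bojanowski
Affiliations: Meta FAIR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: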
Abstract:We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
[CV-4] Fast Learning of Non-Cooperative Spacecraft 3D Models through Primitive Initialization
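[Quick Read]: This paper addresses two limitations that keep 3D reconstruction from monocular images, such as 3D Gaussian Splatting (3DGS), out of space applications: training requires accurate camera poses, and both training and inference are computationally expensive. The key to the solution is a Convolutional Neural Network (CNN) based initializer that takes a single image and outputs a coarse 3D model, represented as an assembly of geometric primitives, together with an estimate of the target's pose relative to the camera. Initializing 3DGS from this output reduces the number of training iterations and input images needed by at least an order of magnitude. The pipeline also supports training with noisy or implicit pose estimates, and a comparison of CNN variants under imperfect pose supervision shows that high-fidelity 3D representations can still be learned, opening a path to novel view synthesis in space applications.
Link: https://arxiv.org/abs/2507.19459
Authors: Pol Francesch Huc,Emily Bates,Simone D’Amico
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: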
Abstract:The advent of novel view synthesis techniques such as NeRF and 3D Gaussian Splatting (3DGS) has enabled learning precise 3D models only from posed monocular images. Although these methods are attractive, they have two major limitations that prevent their use in space applications: they require poses during training, and have high computational cost at training and inference. To address these limitations, this work contributes: (1) a Convolutional Neural Network (CNN) based primitive initializer for 3DGS using monocular images; (2) a pipeline capable of training with noisy or implicit pose estimates; and (3) an analysis of initialization variants that reduce the training cost of precise 3D models. A CNN takes a single image as input and outputs a coarse 3D model represented as an assembly of primitives, along with the target’s pose relative to the camera. This assembly of primitives is then used to initialize 3DGS, significantly reducing the number of training iterations and input images needed – by at least an order of magnitude. For additional flexibility, the CNN component has multiple variants with different pose estimation techniques. This work performs a comparison between these variants, evaluating their effectiveness for downstream 3DGS training under noisy or implicit pose estimates. The results demonstrate that even with imperfect pose supervision, the pipeline is able to learn high-fidelity 3D representations, opening the door for the use of novel view synthesis in space applications.
[CV-5] GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting ICCV2025
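[Quick Read]: This paper addresses the poor scalability of occupancy perception for autonomous driving, which relies on LiDAR-based annotations and therefore cannot exploit large-scale visual data for auto-labeling. Existing vision-based methods mostly use mesh representations, which suffer from incomplete geometry and require extra post-processing, limiting their use in real scenes. The key to the solution is GS-Occ3D, a scalable vision-only framework whose core innovation is an explicit occupancy representation built from Octree-based Gaussian Surfels, balancing efficiency and accuracy. Scenes are decomposed into static background, ground, and dynamic objects, each with a tailored modeling strategy: the ground is reconstructed explicitly to improve large-area consistency, and dynamic vehicles are modeled separately to capture motion-related occupancy patterns. Together these improve geometric reconstruction quality and generalization on downstream tasks.
Link: https://arxiv.org/abs/2507.19451
Authors: Baijun Ye,Minghui Qin,Saining Zhang,Moonjun Gong,Shaoting Zhu,Zebang Shen,Luan Zhang,Lu Zhang,Hao Zhao,Hang Zhao
Affiliations: IIIS, THU; Shanghai Qi Zhi Institute; AIR, THU; BAAI; Mercedes-Benz Group China Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project Page: this https URL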
Abstract:Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representation, which suffer from incomplete geometry and additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. It highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for autonomous driving perception. Project Page: this https URL
[CV-6] CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing
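[Quick Read]: This paper addresses the poorly understood internal reasoning mechanisms of Large Vision-Language Models (LVLMs) for spatiotemporal understanding. Existing work focuses mostly on object recognition in single images and lacks a systematic account of how spatiotemporal semantics are represented and processed in dynamic scenes. The key to the solution is a systematic circuit-based analysis framework with three components: a visual auditing circuit, a semantic tracing circuit, and an attention flow circuit. Using this framework, the study finds that visual semantics are highly localized to specific object tokens, and removing these tokens can degrade model performance by up to 92.6%. It also shows that interpretable concepts of objects and actions emerge and are progressively refined in the middle-to-late layers of LVLMs, and that these layers exhibit specialized functional localization for spatiotemporal semantics. These findings offer mechanistic insight for building more robust and interpretable LVLMs.
Link: https://arxiv.org/abs/2507.19420
Authors: Yiming Zhang,Chengzhang Yu,Zhuokai Zhao,Kun Wang,Qiankun Li,Zihan Chen,Yang Liu,Zenghui Ding,Yining Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: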
Abstract:The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within these LVLMs. Specifically, our framework comprises three circuits: visual auditing circuit, semantic tracing circuit, and attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens–removing these tokens can degrade model performance by up to 92.6%. Furthermore, we identify that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrast to existing work that focuses solely on objects in one image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into spatiotemporal semantics analysis of LVLMs, laying a foundation for designing more robust and interpretable models.
[CV-7] DEFNet: Multitasks-based Deep Evidential Fusion Network for Blind Image Quality Assessment
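[Quick Read]: This paper addresses the performance limits of Blind Image Quality Assessment (BIQA) methods caused by insufficient integration of auxiliary tasks and a lack of flexible uncertainty estimation. The key to the solution is a multitask-based Deep Evidential Fusion Network (DEFNet) with two core innovations: (1) a trustworthy information fusion strategy that combines features and patterns across sub-regions to enrich information and balances local and global information; and (2) an uncertainty estimation technique inspired by evidential learning that uses a mixture of normal-inverse gamma distributions to quantify the reliability of predictions, improving generalization and robustness in unseen scenarios.
Link: https://arxiv.org/abs/2507.19418
Authors: Yiwei Lou,Yuanpeng He,Rongchao Zhang,Yongzhi Cao,Hanpin Wang,Yu Huang
Affiliations: Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: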
Abstract:Blind image quality assessment (BIQA) methods often incorporate auxiliary tasks to improve performance. However, existing approaches face limitations due to insufficient integration and a lack of flexible uncertainty estimation, leading to suboptimal performance. To address these challenges, we propose a multitasks-based Deep Evidential Fusion Network (DEFNet) for BIQA, which performs multitask optimization with the assistance of scene and distortion type classification tasks. To achieve a more robust and reliable representation, we design a novel trustworthy information fusion strategy. It first combines diverse features and patterns across sub-regions to enhance information richness, and then performs local-global information fusion by balancing fine-grained details with coarse-grained context. Moreover, DEFNet exploits an advanced uncertainty estimation technique inspired by evidential learning with the help of a normal-inverse gamma distribution mixture. Extensive experiments on both synthetic and authentic distortion datasets demonstrate the effectiveness and robustness of the proposed framework. Additional evaluation and analysis are carried out to highlight its strong generalization capability and adaptability to previously unseen scenarios.
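For readers unfamiliar with evidential regression, the sketch below shows how a single normal-inverse gamma (NIG) output is commonly turned into a prediction and two uncertainty terms. It uses the standard NIG moment formulas from the deep evidential regression literature, not DEFNet's mixture formulation, which the abstract does not detail; the function name and return keys are assumptions.

```python
from typing import Dict


def nig_prediction(gamma: float, nu: float, alpha: float, beta: float) -> Dict[str, float]:
    """Summarize a NIG(gamma, nu, alpha, beta) output.

    prediction = E[mu]      = gamma
    aleatoric  = E[sigma^2] = beta / (alpha - 1)
    epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    The moments are finite only for nu > 0, alpha > 1, beta > 0.
    """
    if nu <= 0.0 or alpha <= 1.0 or beta <= 0.0:
        raise ValueError("require nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)          # expected data noise
    epistemic = beta / (nu * (alpha - 1.0))   # uncertainty about the mean
    return {"prediction": gamma, "aleatoric": aleatoric, "epistemic": epistemic}
```

Here a larger `nu` (more "virtual evidence" for the mean) shrinks the epistemic term while leaving the aleatoric term unchanged, which is what lets a model separate the two sources of uncertainty.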
[CV-8] Modality Agnostic Efficient Long Range Encoder
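[Quick Read]: This paper addresses the quadratic computational and memory complexity that large models face when processing long contexts on a single device. Existing approaches for generic single-device implementations, such as token merging and modified attention, are often modality specific and achieve a suboptimal tradeoff between accuracy and efficiency. The key to the solution is MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture that progressively merges tokens at different stages of its internal computational blocks and switches between a lightweight attention approximation and standard dot-product attention: when the number of tokens is large it uses the low-cost approximation to reduce memory footprint and inference cost, and as aggregation shortens the sequence it switches to exact dot-product attention. On classification tasks across text, time series, audio, and vision, MAELRE achieves higher accuracy at lower computational cost.
Link: https://arxiv.org/abs/2507.19409
Authors: Toufiq Parag,Ahmed Elgammal
Affiliations: Amazon Prime Video; Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: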
Abstract:The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single device implementations – such as token merging and modified attentions – are often modality specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture designed for long-range encoding across diverse modalities. MAELRE integrates token merging with attention approximation, progressively merging tokens at different stages of internal computational blocks. It employs a lightweight attention approximation when the number of tokens is large, and switches to standard dot-product attention as the sequence becomes shorter through successive aggregation. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models on classification tasks spanning multiple modalities, including text, time series, audio, and vision.
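The interplay between progressive token merging and the attention switch can be illustrated with a small scheduling sketch. Everything specific here is an assumption: merging by averaging adjacent pairs, halving the sequence at every stage, and the `switch_below` threshold are placeholders, since the abstract does not give MAELRE's actual merge rule or switching criterion.

```python
from typing import List, Sequence, Tuple


def merge_adjacent(tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Halve the sequence by averaging each adjacent pair of token vectors.

    An unpaired final token is carried over unchanged.
    """
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        merged.append([(a + b) / 2.0 for a, b in zip(tokens[i], tokens[i + 1])])
    if len(tokens) % 2 == 1:
        merged.append(list(tokens[-1]))
    return merged


def attention_schedule(
    n_tokens: int, n_stages: int, switch_below: int
) -> List[Tuple[int, int, str]]:
    """List (stage, sequence length, attention kind) for each stage.

    Long sequences use a cheap approximation; once merging has shortened
    the sequence to `switch_below` tokens or fewer, exact dot-product
    attention is used.
    """
    plan = []
    for stage in range(n_stages):
        kind = "approximate" if n_tokens > switch_below else "dot_product"
        plan.append((stage, n_tokens, kind))
        n_tokens = (n_tokens + 1) // 2  # length after one round of pair merging
    return plan
```

For example, `attention_schedule(1024, 4, 256)` uses the approximation at lengths 1024 and 512 and exact attention at 256 and 128, so the quadratic cost is only ever paid on short sequences.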
[CV-9] CXR-CML: Improved zero-shot classification of long-tailed multi-label diseases in Chest X-Rays
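[Quick Read]: This paper addresses the degraded performance of self-supervised deep learning models on chest X-ray (CXR) classification caused by the long-tailed distribution of clinical findings, in particular their weak recognition of rare classes. The core of the solution is a class-weighting mechanism defined in the latent space: Gaussian Mixture Model (GMM) clustering is applied to the latent space, the clusters are refined with a Student t-distribution, and a metric loss adjusts the embeddings. This yields more stable and adaptive feature clustering and improves zero-shot classification, with an average gain of 7 percentage points in AUC across 40 classes on the MIMIC-CXR-JPG dataset.
Link: https://arxiv.org/abs/2507.19398
Authors: Rajesh Madhipati,Sheethal Bhat,Lukas Buess,Andreas Maier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: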
Abstract:Chest radiography (CXR) plays a crucial role in the diagnosis of various diseases. However, the inherent class imbalance in the distribution of clinical findings presents a significant challenge for current self-supervised deep learning models. These models often fail to accurately classify long-tailed classes. Current Vision-Language models such as Contrastive Language Image Pre-training (CLIP) models effectively model the manifold distribution of the latent space, enabling high zero-shot classification accuracies. Although CLIP performs well on most of the primary classes in the dataset, our work reveals that its effectiveness decreases significantly for classes with a long-tailed distribution. Our approach employs a class-weighting mechanism that directly aligns with the distribution of classes within the latent space. This method ensures a substantial improvement in overall classification performance, with particular emphasis on enhancing the recognition and accuracy of rarely observed classes. We accomplish this by applying Gaussian Mixture Model (GMM) clustering to the latent space. The subsequent clusters are further refined by a Student t-distribution, followed by a metric loss that utilizes the altered embeddings. Our approach facilitates stable and adaptive clustering of the features. This results in a notable average improvement of 7 percentage points in zero-shot AUC scores across 40 classes in the MIMIC-CXR-JPG dataset from previous SOTA models.
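One common way to refine cluster assignments with a Student t-distribution is the soft-assignment kernel used in deep embedded clustering. The sketch below shows that kernel as an illustration only; whether CXR-CML uses exactly this form is an assumption, and the function name and `alpha` default are placeholders.

```python
from typing import List, Sequence


def student_t_assignments(
    embedding: Sequence[float],
    centers: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> List[float]:
    """Soft-assign one embedding to cluster centers with a Student t kernel.

    q_j is proportional to (1 + ||z - mu_j||^2 / alpha) ** (-(alpha + 1) / 2),
    normalized so the assignments sum to 1. `alpha` is the degrees of freedom.
    """
    weights = []
    for center in centers:
        sq_dist = sum((z - m) ** 2 for z, m in zip(embedding, center))
        weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]
```

The heavy tails of the t kernel keep distant points from receiving near-zero weight, which is one reason it is often preferred over a Gaussian kernel when some clusters, such as rare classes, contain few samples.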
[CV-10] BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
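[Quick Read]: This paper addresses the lack of transparency and interpretability in autonomous driving decision-making, using natural language scene descriptions to improve human-AI interaction and safety. The core of the solution is BEV-LLM, a model that uses BEVFusion to combine 3D LiDAR point clouds with multi-view images and introduces a novel absolute positional encoding to produce view-specific scene descriptions. Despite using a small 1B-parameter base model, BEV-LLM outperforms existing methods on the nuCaption dataset, improving BLEU scores by up to 5%. The authors also release two new datasets, nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding), to assess scene captioning more comprehensively and fill gaps in current benchmarks.
Link: https://arxiv.org/abs/2507.19370
Authors: Felix Brandstaetter,Erik Schuetz,Katharina Winter,Fabian Flohr
Affiliations: Munich University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: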
Abstract:Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
[CV-11] EA-ViT: Efficient Adaptation for Elastic Vision Transformer ICCV2025
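[Quick Read]: This paper addresses the inefficiency of deploying Vision Transformers (ViTs) under diverse resource constraints, where conventional practice retrains several size-specific ViTs at high cost in time and energy. The key to the solution is EA-ViT, an efficient adaptation framework with two stages. In the first stage, a nested elastic architecture adds structural flexibility to a pre-trained ViT, allowing the MLP expansion ratio, number of attention heads, embedding dimension, and network depth to vary, while a curriculum-based training strategy introduces elasticity progressively to keep adaptation stable and preserve pre-trained knowledge. In the second stage, a lightweight router selects submodels according to the computational budget and downstream task demands; the router is initialized with Pareto-optimal configurations found by a customized NSGA-II algorithm and then optimized jointly with the backbone. A single adaptation process thus produces models of multiple sizes for different platforms.
Link: https://arxiv.org/abs/2507.19360
Authors: Chen Zhu,Wangbo Zhao,Huiwen Zhang,Samir Khaki,Yuhao Zhou,Weidong Tang,Shuo Wang,Zhihang Yuan,Yuzhang Shang,Xiaojiang Peng,Kai Wang,Dawei Yang
Affiliations: National University of Singapore; Xidian University; University of Toronto; Houmo AI; UCF; Shenzhen Technology University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at ICCV 2025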
Abstract:Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at this https URL.
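The notion of Pareto-optimal submodel configurations can be made concrete with a non-dominated filter over (cost, accuracy) pairs. This is only the dominance test that multi-objective methods such as NSGA-II build on, not the customized NSGA-II algorithm from the paper; the configuration names and numbers in the usage note are invented for illustration.

```python
from typing import List, Sequence, Tuple

Config = Tuple[str, float, float]  # (name, compute cost, accuracy)


def pareto_front(configs: Sequence[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one has cost no higher and
    accuracy no lower, and is strictly better in at least one of the two.
    The result is sorted by increasing cost.
    """
    front = []
    for name, cost, acc in configs:
        dominated = any(
            (other_cost <= cost and other_acc >= acc)
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in configs
        )
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda item: item[1])
```

Given hypothetical candidates `[("s", 1.0, 0.70), ("m", 2.0, 0.80), ("bad", 2.5, 0.75), ("l", 4.0, 0.85)]`, the filter drops `"bad"` because `"m"` is both cheaper and more accurate, leaving one best option per budget level for a router to choose from.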
[CV-12] SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning ICCV
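[Quick read]: This paper addresses the lack of semantic consistency in co-speech gesture generation for virtual avatars: existing methods focus on rhythmic beat gestures and neglect the semantic link between gestures and speech content. The key to the solution is a two-stage semantic grounding approach. A vector-quantized variational autoencoder first learns a motion prior; a second-stage module then combines speech, text-based semantics, and speaker identity, and adds semantic coherence and relevance modules so that generated gestures match the speech semantics at both fine-grained and global levels. Experiments show that the method outperforms state-of-the-art approaches on objective and subjective metrics across two benchmark datasets, improving the realism and semantic coherence of the gestures.
Link: https://arxiv.org/abs/2507.19359
Authors: Lanmiao Liu,Esam Ghaleb,Aslı Özyürek,Zerrin Yumak
Affiliations: Max Planck Institute for Psycholinguistics; Donders Institute for Brain, Cognition and Behaviour; Utrecht University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV) 2025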
Abstract:Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module is applied to automatically generate gestures from speech, text-based semantics and speaker identity that ensures consistency between the semantic relevance of generated gestures and co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model, code, dataset and pre-trained models can be viewed at this https URL.
[CV-13] EffiComm: Bandwidth Efficient Multi Agent Communication ITSC2025
[Quick read]: This paper addresses a bottleneck in collaborative perception for connected vehicles: transmitting raw point clouds or full feature maps overwhelms the communication bandwidth, causing high latency and poor scalability. The core solution is the EffiComm framework, whose key is a two-stage data reduction mechanism: Selective Transmission (ST) first prunes low-utility regions with a confidence mask; Adaptive Grid Reduction (AGR) then uses a Graph Neural Network (GNN) to assign keep ratios dynamically according to vehicle role and network load. The remaining features are fused with a soft-gated Mixture-of-Experts (MoE) attention layer. On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 at an average of about 1.5 MB per frame, outperforming existing methods on the accuracy-per-bit curve.
Link: https://arxiv.org/abs/2507.19354
Authors: Melih Yazgan,Allen Xavier Arasan,J. Marius Zöllner
Affiliations: FZI Research Center for Information Technology; Karlsruhe Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for publication at ITSC 2025
Abstract:Collaborative perception allows connected vehicles to exchange sensor information and overcome each vehicle’s blind spots. Yet transmitting raw point clouds or full feature maps overwhelms Vehicle-to-Vehicle (V2V) communications, causing latency and scalability problems. We introduce EffiComm, an end-to-end framework that transmits less than 40% of the data required by prior art while maintaining state-of-the-art 3D object detection accuracy. EffiComm operates on Bird’s-Eye-View (BEV) feature maps from any modality and applies a two-stage reduction pipeline: (1) Selective Transmission (ST) prunes low-utility regions with a confidence mask; (2) Adaptive Grid Reduction (AGR) uses a Graph Neural Network (GNN) to assign vehicle-specific keep ratios according to role and network load. The remaining features are fused with a soft-gated Mixture-of-Experts (MoE) attention layer, offering greater capacity and specialization for effective feature integration. On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 while sending only an average of approximately 1.5 MB per frame, outperforming previous methods on the accuracy-per-bit curve. These results highlight the value of adaptive, learned communication for scalable Vehicle-to-Everything (V2X) perception.
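The two-stage reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based BEV grid, and the fixed `threshold` and `keep_ratio` inputs are assumptions made for illustration. In EffiComm the keep ratio is assigned by a GNN; here it is simply passed in as a number.

```python
def selective_transmission(confidence, threshold):
    """Stage 1 (sketch): keep BEV cells whose confidence reaches the threshold.

    confidence: dict mapping (row, col) -> score in [0, 1] (hypothetical layout).
    """
    return {cell for cell, score in confidence.items() if score >= threshold}


def grid_reduction(cells, confidence, keep_ratio):
    """Stage 2 (sketch): keep the top `keep_ratio` fraction of surviving cells.

    keep_ratio stands in for the per-vehicle ratio a learned model would assign.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not cells:
        return set()
    n_keep = max(1, round(len(cells) * keep_ratio))
    ranked = sorted(cells, key=lambda cell: confidence[cell], reverse=True)
    return set(ranked[:n_keep])


confidence = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.8}
masked = selective_transmission(confidence, threshold=0.5)
sent = grid_reduction(masked, confidence, keep_ratio=0.67)
print(sorted(sent))
```

Only the cells in `sent` would be transmitted, so the payload shrinks with both the mask threshold and the keep ratio.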
[CV-14] NerT-CA: Efficient Dynamic Reconstruction from Sparse-view X-ray Coronary Angiography
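[Quick read]: This paper addresses accurate and fast three-dimensional (3D) and four-dimensional (4D, i.e. 3D+time) reconstruction of coronary arteries from X-ray coronary angiography (CA), in the face of sparse blood-vessel structure, poor distinction between background and vessels, sparse views, and intra-scan motion. Existing methods depend on time-consuming manual segmentation or error-prone automatic segmentation, which limits clinical usability, while methods based on Neural Radiance Fields (NeRF) are promising but train slowly because they rely on multilayer perceptron (MLP) representations. The key to the solution is NerT-CA, a hybrid of neural and tensorial representations: fast tensorial fields model the low-rank static structure, and neural fields capture the sparse dynamic component. The method yields reasonable 4D reconstructions from as few as three angiogram views and outperforms previous work in both training time and reconstruction accuracy.
Link: https://arxiv.org/abs/2507.19328
Authors: Kirsten W.H. Maas,Danny Ruijters,Nicola Pezzotti,Anna Vilanova
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: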
Abstract:Three-dimensional (3D) and dynamic 3D+time (4D) reconstruction of coronary arteries from X-ray coronary angiography (CA) has the potential to improve clinical procedures. However, there are multiple challenges to be addressed, most notably, blood-vessel structure sparsity, poor background and blood vessel distinction, sparse-views, and intra-scan motion. State-of-the-art reconstruction approaches rely on time-consuming manual or error-prone automatic segmentations, limiting clinical usability. Recently, approaches based on Neural Radiance Fields (NeRF) have shown promise for automatic reconstructions in the sparse-view setting. However, they suffer from long training times due to their dependence on MLP-based representations. We propose NerT-CA, a hybrid approach of Neural and Tensorial representations for accelerated 4D reconstructions with sparse-view CA. Building on top of the previous NeRF-based work, we model the CA scene as a decomposition of low-rank and sparse components, utilizing fast tensorial fields for low-rank static reconstruction and neural fields for dynamic sparse reconstruction. Our approach outperforms previous works in both training time and reconstruction accuracy, yielding reasonable reconstructions from as few as three angiogram views. We validate our approach quantitatively and qualitatively on representative 4D phantom datasets.
[CV-15] SIDE: Sparse Information Disentanglement for Explainable Artificial Intelligence
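[Quick read]: This paper addresses the lack of transparency of deep neural networks in high-stakes domains such as medical imaging and autonomous driving, particularly in computer vision. Prototypical-parts-based neural networks offer concept-level explanations, but most are limited to fine-grained classification, and the explanations they produce on large-scale datasets such as ImageNet are too complex. The key to the solution is Sparse Information Disentanglement for Explainability (SIDE), which enforces sparsity through a dedicated training and pruning scheme and replaces softmax with sigmoid activations, so that each class is associated with only a small set of relevant prototypes. SIDE matches the accuracy of existing methods while reducing explanation size by over 90%, making prototype-based explanations substantially easier to understand.
Link: https://arxiv.org/abs/2507.19321
Authors: Viktar Dubovik,Łukasz Struski,Jacek Tabor,Dawid Rymarczyk
Affiliations: Jagiellonian University, Faculty of Mathematics and Computer Science; Ardigen SA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: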
Abstract:Understanding the decisions made by deep neural networks is essential in high-stakes domains such as medical imaging and autonomous driving. Yet, these models often lack transparency, particularly in computer vision. Prototypical-parts-based neural networks have emerged as a promising solution by offering concept-level explanations. However, most are limited to fine-grained classification tasks, with few exceptions such as InfoDisent. InfoDisent extends prototypical models to large-scale datasets like ImageNet, but produces complex explanations. We introduce Sparse Information Disentanglement for Explainability (SIDE), a novel method that improves the interpretability of prototypical parts through a dedicated training and pruning scheme that enforces sparsity. Combined with sigmoid activations in place of softmax, this approach allows SIDE to associate each class with only a small set of relevant prototypes. Extensive experiments show that SIDE matches the accuracy of existing methods while reducing explanation size by over 90%, substantially enhancing the understandability of prototype-based explanations.
[CV-16] Multistream Network for LiDAR and Camera-based 3D Object Detection in Outdoor Scenes IROS2025
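[Quick read]: This paper addresses multimodal fusion for outdoor 3D object detection, in particular how to integrate LiDAR and RGB image information efficiently to improve detection accuracy. The core challenge is extracting task-relevant features from two heterogeneous modalities and fusing them precisely, so as to capture the spatial, textural, and geometric information of objects in complex scenes. The key to the solution is the three-stream MultiStream Detection (MuStD) network: the LiDAR-PillarNet stream extracts sparse 2D pillar features and the LiDAR-Height Compression stream computes Bird's-Eye View (BEV) features; a 3D Multimodal stream then combines RGB and LiDAR features using UV mapping and polar coordinate indexing; finally, the fused features are fed to a detection head for 3D object detection. On the KITTI benchmark the method achieves new state-of-the-art or highly competitive results while remaining computationally efficient.
Link: https://arxiv.org/abs/2507.19304
Authors: Muhammad Ibrahim,Naveed Akhtar,Haitian Wang,Saeed Anwar,Ajmal Mian
Affiliations: The University of Western Australia; The University of Melbourne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by IEEE/RSJ IROS 2025 for oral presentation on 19 Oct. 2025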
Abstract:Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy. To address real-world challenges in outdoor 3D object detection, fusion of LiDAR and RGB input has started gaining traction. However, effective integration of these modalities for precise object detection remains a largely open problem. To address this, we propose a MultiStream Detection (MuStD) network that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird’s-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Finally, the features containing comprehensive spatial, textural and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark using the public testing server at this https URL establishes the efficacy of our method by achieving new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through the MuStD GitHub repository at this https URL
[CV-17] ABCD: Automatic Blood Cell Detection via Attention-Guided Improved YOLOX
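[Quick read]: This paper addresses automatic detection of blood cells in microscopic images, where manual checks are inefficient, time-consuming, and error-prone. To improve accuracy and speed, the authors propose an automatic blood cell detection method (ABCD) based on an improved YOLOX architecture, with three key changes: first, a Convolutional Block Attention Module (CBAM) is added to the backbone to make feature extraction more efficient; second, Adaptively Spatial Feature Fusion (ASFF) is used in the neck to improve multi-scale feature fusion; third, the Intersection over Union (IoU) loss is replaced with the Complete Intersection over Union (CIoU) loss to speed up convergence. On the BCCD dataset the method outperforms existing methods, reaching 95.49% mAP@0.5 and 86.89% mAP@0.5-0.9, and increases detection speed by 2.9% over the baseline, which suits real-time use.
Link: https://arxiv.org/abs/2507.19296
Authors: Ahmed Endris Hasen,Yang Shangming,Chiagoziem C. Ukwuoma,Biniyam Gashaw,Abel Zenebe Yutra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: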
Abstract:Detection of blood cells in microscopic images has become a major focus of medical image analysis, playing a crucial role in gaining valuable insights into a patient’s health. Manual blood cell checks for disease detection are known to be time-consuming, inefficient, and error-prone. To address these limitations, analyzing blood cells using deep learning-based object detectors can be regarded as a feasible solution. In this study, we propose an automatic blood cell detection method (ABCD) based on an improved version of YOLOX, an object detector, for detecting various types of blood cells, including white blood cells, red blood cells, and platelets. Firstly, we introduce the Convolutional Block Attention Module (CBAM) into the network’s backbone to enhance the efficiency of feature extraction. Furthermore, we introduce the Adaptively Spatial Feature Fusion (ASFF) into the network’s neck, which optimizes the fusion of different features extracted from various stages of the network. Finally, to speed up the model’s convergence, we substitute the Intersection over Union (IOU) loss function with the Complete Intersection over Union (CIOU) loss function. The experimental results demonstrate that the proposed method is more effective than other existing methods on the BCCD dataset. Compared to the baseline algorithm, our method ABCD achieved 95.49% mAP@0.5 and 86.89% mAP@0.5-0.9, which are 2.8% and 23.41% higher, respectively, and increased the detection speed by 2.9%, making it highly efficient for real-time applications.
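To make the IoU-to-CIoU substitution concrete, here is a minimal sketch of the CIoU loss in its commonly published form: one minus IoU, plus a normalized center-distance term, plus an aspect-ratio term. It is not taken from the ABCD code; the `(x1, y1, x2, y2)` box format and the `eps` guard are assumptions.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU: intersection area over union area.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared center distance, normalized by the squared diagonal of the
    # smallest box that encloses both boxes.
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # close to 0
print(ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))  # greater than 1
```

Unlike a plain IoU loss, the center-distance term still changes with box position when the boxes do not overlap, which is the usual argument for faster convergence.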
[CV-18] PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups ICCV2025
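[Quick read]: This paper addresses the growing complexity of generating multi-character group interactions as group size increases. Existing conditional diffusion models generate characters incrementally, conditioning on those already generated, and rely on a single shared prompt, which limits fine control and tends to produce overly simplified interactions. The key to the solution is Person-Interaction Noise Optimization (PINO), a training-free framework that decomposes complex group interactions into semantically relevant pairwise interactions and uses pretrained two-person interaction diffusion models to compose the group interaction incrementally. Physics-based penalties applied during noise optimization keep the motion physically plausible and avoid common artifacts such as penetration or overlap between characters, and the approach gives users precise control over character orientation, speed, and spatial relationships.
Link: https://arxiv.org/abs/2507.19292
Authors: Sakuya Ota,Qing Yu,Kent Fujiwara,Satoshi Ikehata,Ikuro Sato
Affiliations: Institute of Science Tokyo; LY Corporation; National Institute of Informatics (NII)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025, Project page: this https URL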
Abstract:Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into semantically relevant pairwise interactions, and leverages pretrained two-person interaction diffusion models to incrementally compose group interactions. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.
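The abstract does not spell out the physics-based penalties, so the following is only a hypothetical illustration of the general idea: a differentiable-style cost that is zero when characters keep a minimum distance and grows as they overlap. The function name, the 2D root positions, and `min_dist` are assumptions, not PINO's actual formulation.

```python
import math


def overlap_penalty(positions, min_dist):
    """Sum of squared violations of a minimum pairwise distance (one frame).

    positions: list of (x, z) ground-plane root positions, one per character.
    min_dist:  hypothetical minimum allowed distance between two characters.
    """
    penalty = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            penalty += max(0.0, min_dist - d) ** 2
    return penalty


print(overlap_penalty([(0.0, 0.0), (1.0, 0.0)], min_dist=0.5))  # no violation
print(overlap_penalty([(0.0, 0.0), (0.3, 0.0)], min_dist=0.5))  # violation
```

A term of this kind could be added to the objective being minimized over the diffusion noise, so that lowering the objective pushes overlapping characters apart.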
[CV-19] Relaxed Total Generalized Variation Regularized Piecewise Smooth Mumford-Shah Model for Triangulated Surface Segmentation
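[Quick read]: This paper addresses a limitation of existing Mumford-Shah (MS) mesh segmentation models, which pursue only the shortest boundary length and handle meshes with irregular structures poorly. Conventional methods are mostly based on total variation (TV) regularization, which leaves the segmentation insufficiently sensitive to geometric detail. The key to the solution is a new piecewise smooth MS mesh segmentation model with relaxed total generalized variation (rTGV) regularization. The model assumes that the feature function of a mesh can be approximated by the sum of a piecewise constant function and a smooth function, which lets it characterize high-order discontinuities of the geometric structure. As a result, segmentation boundaries follow the actual geometry rather than simply minimizing length, improving segmentation quality on meshes with complex shapes.
Link: https://arxiv.org/abs/2507.19284
Authors: Huayan Zhang,Shanqiang Wang,Xiaochao Wang
Affiliations: Tiangong University
Subjects: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: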
Abstract:The Mumford-Shah (MS) model is an important technique for mesh segmentation. Much existing research focuses on the piecewise constant MS mesh segmentation model with total variation regularization, which pursues the shortest boundary length. Different from previous efforts, in this article, we propose a novel piecewise smooth MS mesh segmentation model by utilizing the relaxed total generalized variation regularization (rTGV). The new model assumes that the feature function of a mesh can be approximated by the sum of a piecewise constant function and a smooth function, and the rTGV regularization is able to characterize the high order discontinuity of the geometric structure. The newly introduced method is effective in segmenting meshes with irregular structures and getting better boundaries rather than the shortest boundaries. We solve the new model by alternating minimization and the alternating direction method of multipliers (ADMM). We discuss our algorithm from several aspects and compare it with several state-of-the-art methods. Experimental results show that our method can yield competitive results when compared to other approaches. In addition, our results compare favorably to those of several state-of-the-art techniques when evaluated on the Princeton Segmentation Benchmark. Furthermore, the quantitative errors and computational costs confirm the robustness and efficiency of the proposed method.
[CV-20] RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow
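[Quick read]: This paper addresses reasoning over complex queries on remote sensing imagery: how to understand spatial context and user intent without relying on predefined reasoning paths, while supporting output formats of multiple granularities (such as region-level and pixel-level). Existing methods are confined to supervised fine-tuning, lack reasoning autonomy, and adapt poorly to different task output formats. The key to the solution is the RemoteReasoner framework, whose main elements are: 1) a multi-modal large language model (MLLM) that interprets user instructions and localizes targets; 2) reinforcement learning (RL) training that gives the MLLM sufficient reasoning autonomy; 3) task adaptation strategies that produce diverse output formats at inference time without task-specific decoders or further fine-tuning, giving flexibility and robustness within a unified architecture.
Link: https://arxiv.org/abs/2507.19280
Authors: Liang Yao,Fan Liu,Hongbo Lu,Chuanyi Zhang,Rui Min,Shengxiang Xu,Shimin Di,Pai Peng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: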
Abstract:Remote sensing imagery presents vast, inherently unstructured spatial data, demanding sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should be somewhat autonomous, where predefined ground-truth reasoning paths do not constrain the learning process. Furthermore, its architecture ought to be unified yet flexible, enabling the model to perform diverse reasoning tasks with distinct output formats through a single forward pass. Existing remote sensing approaches fail to address these requirements, as they rely on supervised fine-tuning paradigms that constrain the autonomy of reasoning. To this end, we propose RemoteReasoner, a flexible and robust workflow for remote sensing reasoning tasks. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task adaptation strategies that enable multi-granularity output generation. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient autonomy for precise reasoning. At the inference stage, our adaptation strategies enable diverse output formats without requiring task-specific decoders or further fine-tuning. Preliminary experiments demonstrate that RemoteReasoner achieves remarkable performance across multi-granularity reasoning tasks, including region-level and pixel-level tasks. Additionally, our framework enables novel capabilities such as the contour extraction task beyond the reach of existing reasoning pipelines.
[CV-21] Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception
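[Quick read]: This paper addresses the fact that current self-supervised image encoders (such as DINO) are trained only on static images and therefore miss the temporal information inherent in video. The key to the solution is a video-distilled training method for single-image encoders: a simple objective has the model predict the next-frame representation from the current frame, which injects 3D spatial and temporal priors without optical flow or tracking. Pre-training on a single 2-hour video raises mIoU on ADE20K semantic segmentation from 35.0 to 36.4 while remaining compatible with image-only pipelines, supporting video self-distillation as a lightweight route to geometry-aware perception.
Link: https://arxiv.org/abs/2507.19272
Authors: Marcel Simon,Tae-Ho Kim,Seul-Ki Yeom
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, 2 tables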
Abstract:Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-training on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
[CV-22] SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
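[Quick read]: This paper addresses the performance drop of multimodal models when modalities are missing, that is, when some modalities are unavailable at test time and methods that depend on complete inputs become unstable. The core of the solution is the SimMLM framework, with two key elements: a generic Dynamic Mixture of Modality Experts (DMoME) architecture, whose learnable gating mechanism automatically adjusts each modality's contribution for both full and partial modality inputs; and the More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy does not decrease as more modalities become available, keeping the model robust and consistent when modalities are missing.
Link: https://arxiv.org/abs/2507.19264
Authors: Sijie Li,Chen Chen,Jungong Han
Affiliations: University of Sheffield
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: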
Abstract:In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
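The two ideas in the abstract, gating over whichever modalities are present and a ranking constraint between "more" and "fewer" modalities, can be sketched as follows. This is an illustrative guess, not SimMLM's implementation: the softmax gate restricted to available modalities, the hinge form of the ranking term, and all names are assumptions.

```python
import math


def gated_fusion(expert_outputs, gate_logits, available):
    """Fuse per-modality expert outputs with a softmax over available modalities.

    expert_outputs: dict modality -> list of floats (all the same length)
    gate_logits:    dict modality -> float (a learned gate would produce these)
    available:      list of modality names present for this sample
    """
    if not available:
        raise ValueError("at least one modality must be available")
    peak = max(gate_logits[m] for m in available)
    weights = {m: math.exp(gate_logits[m] - peak) for m in available}
    total = sum(weights.values())
    dim = len(expert_outputs[available[0]])
    return [
        sum(weights[m] / total * expert_outputs[m][k] for m in available)
        for k in range(dim)
    ]


def more_vs_fewer_penalty(loss_more, loss_fewer, margin=0.0):
    """Hinge term that is positive only when more modalities give a worse loss."""
    return max(0.0, loss_more - loss_fewer + margin)


outputs = {"t1": [1.0, 0.0], "flair": [0.0, 1.0]}
logits = {"t1": 0.0, "flair": 0.0}
print(gated_fusion(outputs, logits, ["t1", "flair"]))  # both modalities present
print(gated_fusion(outputs, logits, ["t1"]))           # one modality missing
print(more_vs_fewer_penalty(loss_more=0.7, loss_fewer=0.5))
```

Because the gate renormalizes over the modalities that are present, the same fusion code handles complete and partial inputs, and the penalty vanishes whenever the fuller input is at least as good as the reduced one.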
[CV-23] OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
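[Quick read]: This paper addresses the difficulty large vision-language models have in generating long, factually accurate image captions, and in particular the fact that existing measures of hallucination and factuality are poorly suited to long, diverse captions and depend on human annotations. The key to the solution is OV-Fact, a method that measures the factuality of long captions without human annotations by combining open-vocabulary visual grounding with tool-based verification. It captures both caption descriptiveness (recall) and factual precision, and its reference-free design supports factuality-based data filtering.
Link: https://arxiv.org/abs/2507.19262
Authors: Monika Wysoczańska,Shyamal Buch,Anurag Arnab,Cordelia Schmid
Affiliations: Google DeepMind; Warsaw University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: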
Abstract:Large vision-language models (VLMs) often struggle to generate long and factual captions. However, traditional measures for hallucination and factuality are not well suited for evaluating longer, more diverse captions and in settings where ground-truth human-annotated captions are unavailable. We introduce OV-Fact, a novel method for measuring caption factuality of long captions that leverages open-vocabulary visual grounding and tool-based verification without depending on human annotations. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Furthermore, unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering. We observe models trained on an OVFact-filtered (2.5-5x less) subset of a large-scale, noisy (VLM-generated) pretraining set meaningfully improve factuality precision without sacrificing caption descriptiveness across a range of downstream long caption benchmarks.
[CV-24] BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection
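[Quick read]: This paper addresses the difficulty of capturing 3D depth information with 2D-image-based anomaly detection in industrial settings. With multimodal (RGB and depth) data, disparities between modalities lead to inadequate feature representation, and the scarcity of abnormal samples limits model performance. The key to the solution is a unified multimodal anomaly detection framework: first, it separates depth from appearance, extracting visible depth from the 3D point cloud and representing appearance with 2D RGB images, which supports unified anomaly generation; second, a Multi-Scale Gaussian Anomaly Generator and a Unified Texture Anomaly Generator produce richer anomalies in both RGB and depth; third, all modules share parameters across RGB and depth, so features from both modalities can be used directly without complex fusion, improving cross-modal anomaly detection.
Link: https://arxiv.org/abs/2507.19253
Authors: An Xiang,Zixuan Huang,Xitong Gao,Kejiang Ye,Cheng-zhong Xu
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shenzhen University of Advanced Technology; State Key Lab of IOTSC, Department of CIS, University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: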
Abstract:Industrial anomaly detection for 2D objects has gained significant attention and achieved progress in anomaly detection (AD) methods. However, identifying 3D depth anomalies using only 2D information is insufficient. Despite explicitly fusing depth information into RGB images or using point cloud backbone networks to extract depth features, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of 3 key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) on MVTec-3D AD and Eyecandies datasets. Code available at: this https URL
[CV-25] CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception ICCV2025
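[Quick read]: This paper addresses cooperative sequential perception in multi-vehicle cooperative perception, a problem that has not been thoroughly investigated, and in particular cooperative 3D multi-object tracking. Existing methods focus on single-frame perception and struggle with continuous tracking of objects in dynamic scenes and with fusing information across vehicles. The key to the solution is CoopTrack, a fully instance-level end-to-end framework whose main elements are: 1) learnable instance association for matching objects across vehicles; 2) transmission of sparse instance-level features, which improves perception while keeping communication cost low; 3) two modules, Multi-Dimensional Feature Extraction and Cross-Agent Association and Aggregation, which respectively build instance representations combining semantic and motion features and perform adaptive cross-agent association and fusion based on a feature graph.
Link: https://arxiv.org/abs/2507.19239
Authors: Jiaru Zhong,Jiahao Wang,Jiahui Xu,Xiaofan Li,Zaiqing Nie,Haibao Yu
Affiliations: Institute for AI Industry Research, Tsinghua University; The Hong Kong Polytechnic University; The University of Hong Kong; School of Vehicle and Mobility, Tsinghua University; Baidu Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 (Highlight)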
Abstract:Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at this https URL.
[CV-26] Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
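[Quick read]: This paper addresses the generation of context-aware motion for multiple interacting humans in a dynamic virtual scene, where the core challenge is holistic reasoning over the dynamic human-human and human-scene relationships. The key to the solution is to use a large language model (LLM) to digest the contextual complexity of the text input and break it into a sequence of small, executable events. Each event defines a concrete motion involving the relevant characters and objects; spatially guided sampling and a high-level module then translate events into relative descriptions from which precise coordinates are retrieved, giving scalable yet precise multi-agent behavior synthesis. The authors describe the work as the first to address this problem at scale and with diversity, and they provide a benchmark for assessing contextual reasoning.
Link: https://arxiv.org/abs/2507.19232
Authors: Donggeun Lim,Jinseok Bae,Inwoo Hwang,Seungmin Lee,Hwanhee Lee,Young Min Kim
Affiliations: Seoul National University; Chung-Ang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, project page: this https URL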
Abstract:In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems such that we can generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: this https URL.
[CV-27] Unstable Prompts, Unreliable Segmentations: A Challenge for Longitudinal Lesion Analysis
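[Quick read]: This paper addresses stable and accurate segmentation and tracking of longitudinal lesions in clinical oncology. Universal lesion segmentation models are typically designed for a single time point and degrade markedly when applied to baseline and follow-up CT scans. The root cause is that the model depends heavily on the assumption that the lesion is centered in the input: with inter-scan registration errors, segmentation quality drops sharply, which in turn breaks the lesion correspondence process. The key to the solution is to abandon the cascaded paradigm (segment first, then track) in favor of integrated, end-to-end models designed from the outset for temporal analysis.
Link: https://arxiv.org/abs/2507.19230
Authors: Niels Rocholl,Ewoud Smit,Mathias Prokop,Alessa Hering
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: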
Abstract:Longitudinal lesion analysis is crucial for oncological care, yet automated tools often struggle with temporal consistency. While universal lesion segmentation models have advanced, they are typically designed for single time points. This paper investigates the performance of the ULS23 segmentation model in a longitudinal context. Using a public clinical dataset of baseline and follow-up CT scans, we evaluated the model’s ability to segment and track lesions over time. We identified two critical, interconnected failure modes: a sharp degradation in segmentation quality in follow-up cases due to inter-scan registration errors, and a subsequent breakdown of the lesion correspondence process. To systematically probe this vulnerability, we conducted a controlled experiment where we artificially displaced the input volume relative to the true lesion center. Our results demonstrate that the model’s performance is highly dependent on its assumption of a centered lesion; segmentation accuracy collapses when the lesion is sufficiently displaced. These findings reveal a fundamental limitation of applying single-timepoint models to longitudinal data. We conclude that robust oncological tracking requires a paradigm shift away from cascading single-purpose tools towards integrated, end-to-end models inherently designed for temporal analysis.
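The controlled displacement experiment needs a way to score a segmentation against the true lesion at each offset. The sketch below shows only that measurement side with a toy example; it does not run ULS23, and the use of the Dice coefficient, the voxel-set representation, and the cube-shaped "lesion" are assumptions for illustration, since the abstract does not name the metric.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two sets of (x, y, z) voxel coordinates."""
    if not mask_a and not mask_b:
        return 1.0
    return 2.0 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def shifted(mask, offset):
    """Return a copy of the mask translated by an integer voxel offset."""
    dx, dy, dz = offset
    return {(x + dx, y + dy, z + dz) for x, y, z in mask}


# Toy lesion: a 4x4x4 cube of voxels (hypothetical data).
lesion = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}

# Score a prediction that is offset from the true lesion by 0..4 voxels.
for dx in range(5):
    score = dice(lesion, shifted(lesion, (dx, 0, 0)))
    print(f"offset {dx} voxels -> Dice {score:.2f}")
```

In this toy case the score falls steadily to zero as the offset reaches the lesion width. The paper's finding concerns a different effect, namely that the model's own output degrades when the lesion is no longer centered in its input.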
[CV-28] Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
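[Quick read]: This paper targets the limitations of existing speech-driven talking face generation methods that rely on fixed input speech, in particular their inability to flexibly match face and voice (e.g., face-voice mismatch) and their lack of control over paralinguistic features. The key to the solution is a new framework, Face2VoiceSync, with four core contributions: 1) Voice-Face Alignment, which ensures that generated voices are consistent with facial appearance; 2) Diversity Manipulation, which makes the generated voice controllable in the paralinguistic feature space; 3) an efficient training strategy that bridges large visual and audio models with a lightweight variational autoencoder (VAE), substantially reducing the number of trainable parameters; 4) a new evaluation metric that fairly measures diversity and identity consistency. The method achieves state-of-the-art visual and audio performance on a single 40GB GPU.
Link: https://arxiv.org/abs/2507.19225
Authors: Fang Kang,Yin Cao,Haoyu Chen
Affiliations: Xi’an Jiaotong-Liverpool University
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: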
Abstract:Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed driving speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice over the paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge visual and audio large-pretrained models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing the diversity and identity consistency. Experiments show Face2VoiceSync achieves both visual and audio state-of-the-art performances on a single 40GB GPU.
[CV-29] PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction
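[Quick read]: This paper addresses two problems: existing visual saliency prediction models ignore individual differences in subjective cognition in advertisement video scenarios, and multimodal large language models (MLLMs), being prone to hallucination, produce inconsistently formatted and imprecisely located outputs in point-level gaze prediction. The core solution is SPA-ADV, a large-scale, multi-attribute-annotated eye-tracking dataset for advertisement videos (gaze behavior of over 4,500 participants of varying age and gender on 486 videos), together with the PRE-MAP model. Built on MLLMs, PRE-MAP models personalized visual differences through Reinforcement learning-optimized Eye-tracking guided by Multi-Attribute user profiles to make accurate point-level predictions. The key innovation is Consistency Group Relative Policy Optimization (C-GRPO), which improves the format correctness and spatial accuracy of the point coordinates output by the MLLM, overcoming the difficulty conventional methods have in capturing individual attention patterns.
Link: https://arxiv.org/abs/2507.19213
Authors: Hanbing Wu,Ping Jiang,Anyang Su,Chenxu Zhao,Tianyu Fu,Minghui Wu,Beiping Tan,Huiying Li
Affiliations: Jilin University; Peking University; Mininglamp Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: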
Abstract:Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, making it very costly to strictly adhere to the expected format in tasks involving multiple point predictions, and achieving precise point positioning is challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at this https URL.
[CV-30] Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet
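[Quick read]: This paper addresses the large query errors that arise on the massive point cloud data generated by autonomous vehicles because existing detection models cannot provide reliable object counts. Point cloud queries (RETRIEVAL, COUNT, and AGGREGATION) depend heavily on accurate object counts, and existing approaches perform poorly on 3D point clouds. The key to the solution is CounterNet, a heatmap-based network that improves counting accuracy by detecting object centers rather than localizing objects precisely. It is combined with a feature map partitioning strategy (using overlapping regions) to better handle objects of different sizes in complex traffic scenes, and with a per-frame dynamic model selection mechanism that adapts to the characteristics of each frame. Together these improve counting accuracy by 5% to 20%, making all query types more reliable.
Link: https://arxiv.org/abs/2507.19209
Authors: Xiaoyu Zhang,Zhifeng Bao,Hai Dong,Ziwei Wang,Jiajun Liu
Affiliations: RMIT University; Queensland University; CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: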
Abstract:Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.
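To make the idea of counting by object centers concrete, here is a minimal sketch that counts local maxima in a small 2D heatmap. It is not CounterNet itself: the function name `count_centers`, the 3x3 neighborhood, and the 0.5 threshold are assumptions chosen for illustration, and the heatmap is hand-written rather than predicted by a network.

```python
def count_centers(heatmap, threshold=0.5):
    """Count cells that reach `threshold` and strictly exceed all 3x3 neighbors."""
    rows, cols = len(heatmap), len(heatmap[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            # Strict comparison: tied plateaus are not counted in this simplified sketch.
            if all(value > n for n in neighbors):
                count += 1
    return count


# Hypothetical heatmap with two peaks (0.9 and 0.7); 0.6 sits next to 0.7 and is suppressed.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.4, 0.6],
]
num_objects = count_centers(heatmap)
print(num_objects)
```

The count depends only on where the peaks are, not on box extents, which is the property the abstract attributes to center-based counting.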
[CV-31] Joint Holistic and Lesion Controllable Mammogram Synthesis via Gated Conditional Diffusion Model
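[Quick read]: This paper addresses the limited accuracy and robustness of deep-learning-based analysis in breast cancer screening caused by insufficient data and a lack of diversity in lesion characteristics. Existing generative models for medical image synthesis often fail to emphasize lesion-specific features and their relationships with surrounding tissue, which hurts the realism and clinical applicability of the synthesized images. The key to the solution is a new framework, the Gated Conditional Diffusion Model (GCDM), whose core innovations are: 1) a soft mask embedding introduced into the latent diffusion process, explicitly modeling the anatomical coherence of the breast, the lesion, and their transitional regions; 2) a gated conditioning branch that dynamically selects and fuses radiomic and geometric lesion features, strengthening control over and realism of local lesion detail. Experiments show that GCDM precisely generates small lesions while preserving anatomical consistency and improves the diversity and realism of the synthesized images, offering an effective tool for clinical-grade mammogram synthesis.
Link: https://arxiv.org/abs/2507.19201
Authors: Xin Li,Kaixiang Yang,Qiang Li,Zhiwei Wang
Affiliations: Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted, ACM Multimedia 2025, 10 pages, 5 figures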
Abstract:Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at this https URL
[CV-32] WACA-UNet: Weakness-Aware Channel Attention for Static IR Drop Prediction in Integrated Circuit Design
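[Quick read]: This paper addresses high-accuracy spatial prediction of power integrity issues (such as IR drop) in VLSI design, where traditional simulation-based solvers are computationally expensive and hard to scale. The key to the solution is to reformulate IR drop estimation as a pixel-wise regression task on heterogeneous multi-channel physical maps derived from circuit layouts, and to propose a Weakness-Aware Channel Attention (WACA) mechanism that recursively enhances weak feature channels and suppresses over-dominant ones through a two-stage gating strategy, yielding an adaptive and balanced feature representation. Integrated into a ConvNeXtV2-based attention U-Net, the method reduces mean absolute error by 61.1% and improves F1-score by 71.0% relative to the contest winner on the public ICCAD-2023 benchmark, supporting channel-wise heterogeneity as a key inductive bias in physical layout analysis.
Link: https://arxiv.org/abs/2507.19197
Authors: Youngmin Seo,Yunhyeong Kwon,Younghun Park,HwiRyong Kim,Seungho Eum,Jinha Kim,Taigon Song,Juho Kim,Unsang Park
Affiliations: Sogang University; Kyungpook National University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures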
Abstract:Accurate spatial prediction of power integrity issues, such as IR drop, is critical for reliable VLSI design. However, traditional simulation-based solvers are computationally expensive and difficult to scale. We address this challenge by reformulating IR drop estimation as a pixel-wise regression task on heterogeneous multi-channel physical maps derived from circuit layouts. Prior learning-based methods treat all input layers (e.g., metal, via, and current maps) equally, ignoring their varying importance to prediction accuracy. To tackle this, we propose a novel Weakness-Aware Channel Attention (WACA) mechanism, which recursively enhances weak feature channels while suppressing over-dominant ones through a two-stage gating strategy. Integrated into a ConvNeXtV2-based attention U-Net, our approach enables adaptive and balanced feature representation. On the public ICCAD-2023 benchmark, our method outperforms the ICCAD-2023 contest winner by reducing mean absolute error by 61.1% and improving F1-score by 71.0%. These results demonstrate that channel-wise heterogeneity is a key inductive bias in physical layout analysis for VLSI.
[CV-33] VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
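[Quick read]: This paper addresses the feature entanglement and geometric inconsistency that are prevalent in monocular semantic scene completion. The core solution is VisHall3D, a two-stage framework that decouples scene completion into two separate stages. The first stage introduces VisFrontierNet, a visibility-aware projection module that accurately traces the visual frontier while preserving fine-grained details. The second stage uses OcclusionMAE, a hallucination network based on a noise injection mechanism, to generate plausible geometry for the invisible regions. This staged design mitigates feature entanglement and geometric inconsistency, significantly improves reconstruction quality, and achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
Link: https://arxiv.org/abs/2507.19188
Authors: Haoang Lu,Yuanqi Su,Xiaoning Zhang,Longjun Gao,Yu Xue,Le Wang
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: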
Abstract:This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.
[CV-34] Reconstruct or Generate: Exploring the Spectrum of Generative Modeling for Cardiac MRI
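[Quick read]: This paper addresses the conflicting objectives of generative models in medical imaging between reconstruction and generation, that is, how to improve perceptual quality while preserving data fidelity. The key to the solution is a "generative model zoo" that systematically evaluates how modern latent diffusion models and autoregressive models differ on cardiac imaging tasks, in particular image inpainting under varying masking ratios and unconditional image generation. The study finds that diffusion models offer better perceptual quality for unconditional generation but tend to hallucinate at high masking ratios, whereas autoregressive models maintain stable perceptual performance across masking levels, albeit with lower overall fidelity. This comparison exposes the trade-offs of the two model families along the reconstruction-generation spectrum and provides empirical guidance for model selection.
Link: https://arxiv.org/abs/2507.19186
Authors: Niklas Bubeck,Yundi Zhang,Suprosanna Shit,Daniel Rueckert,Jiazhen Pan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: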
Abstract:In medical imaging, generative models are increasingly relied upon for two distinct but equally critical tasks: reconstruction, where the goal is to restore medical imaging (usually inverse problems like inpainting or superresolution), and generation, where synthetic data is created to augment datasets or carry out counterfactual analysis. Despite shared architecture and learning frameworks, they prioritize different goals: generation seeks high perceptual quality and diversity, while reconstruction focuses on data fidelity and faithfulness. In this work, we introduce a “generative model zoo” and systematically analyze how modern latent diffusion models and autoregressive models navigate the reconstruction-generation spectrum. We benchmark a suite of generative models across representative cardiac medical imaging tasks, focusing on image inpainting with varying masking ratios and sampling strategies, as well as unconditional image generation. Our findings show that diffusion models offer superior perceptual quality for unconditional generation but tend to hallucinate as masking ratios increase, whereas autoregressive models maintain stable perceptual performance across masking levels, albeit with generally lower fidelity.
[CV-35] Continual Learning-Based Unified Model for Unpaired Image Restoration Tasks
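[Quick read]: This paper addresses unified restoration of images degraded by multiple adverse weather conditions, that is, how to build a single restoration model that handles fog, snow, and rain to meet the needs of applications such as autonomous driving. The key to the solution is a continual learning framework with three core innovations: (1) Selective Kernel Fusion layers that dynamically combine global and local features for robust adaptive feature selection; (2) Elastic Weight Consolidation (EWC), which mitigates catastrophic forgetting when moving across multiple restoration tasks; (3) a novel Cycle-Contrastive Loss that enhances feature discrimination while preserving semantic consistency during domain translation. The paper also proposes an unpaired image restoration approach that reduces dependence on training data. The authors report significant improvements in PSNR, SSIM, and perceptual quality.
Link: https://arxiv.org/abs/2507.19184
Authors: Kotha Kartheek,Lingamaneni Gnanesh Chowdary,Snehasis Mukherjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review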
Abstract:Restoration of images contaminated by different adverse weather conditions such as fog, snow, and rain is a challenging task due to the varying nature of the weather conditions. Most of the existing methods focus on one particular weather condition. However, for applications such as autonomous driving, a unified model is necessary to restore images corrupted by different weather conditions. We propose a continual learning approach that yields a unified framework for image restoration. The proposed framework integrates three key innovations: (1) Selective Kernel Fusion layers that dynamically combine global and local features for robust adaptive feature selection; (2) Elastic Weight Consolidation (EWC) to enable continual learning and mitigate catastrophic forgetting across multiple restoration tasks; and (3) a novel Cycle-Contrastive Loss that enhances feature discrimination while preserving semantic consistency during domain translation. Further, we propose an unpaired image restoration approach to reduce the dependence of the proposed approach on the training data. Extensive experiments on standard benchmark datasets for dehazing, desnowing and deraining tasks demonstrate significant improvements in PSNR, SSIM, and perceptual quality over the state-of-the-art.
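For readers unfamiliar with Elastic Weight Consolidation, the sketch below computes the standard EWC quadratic penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, which is added to the loss of the new task. How this paper combines the penalty with its other losses is not stated in the abstract; the parameter values, the Fisher estimates, and the name `ewc_penalty` are illustrative assumptions.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Standard EWC penalty: 0.5 * lam * sum(F_i * (theta_i - theta_old_i) ** 2)."""
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


# Hypothetical numbers: parameters after the previous task (e.g., dehazing),
# diagonal Fisher information estimated on that task, and current parameters.
old_params = [1.0, -2.0, 0.5]
fisher = [4.0, 0.0, 1.0]
params = [1.5, 3.0, 0.5]

penalty = ewc_penalty(params, old_params, fisher, lam=2.0)
print(penalty)
```

The second parameter moves far from its old value but contributes nothing because its Fisher weight is zero; only parameters that mattered for the earlier task are anchored.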
[CV-36] Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
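[Quick read]: This paper addresses the efficiency bottleneck of multi-head self-attention in vision transformers, whose computational complexity is quadratic in the number of input patches. The key to the solution is a patch pruning strategy based on the variance of attention weights across multiple attention heads: it quantifies the importance of each patch to identify and remove redundant patches, yielding higher throughput in both training and inference while maintaining classification accuracy. The study also finds that replacing variance with a robust statistic such as the median absolute deviation performs similarly well, and that introducing overlapping patch embeddings gives better performance at comparable throughput.
Link: https://arxiv.org/abs/2507.19175
Authors: Yuki Igaue,Hiroaki Aizawa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: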
Abstract:Multi-head self-attention is a distinctive feature extraction mechanism of vision transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur a quadratic computational complexity with respect to the number of patches. One promising approach to address this issue is patch pruning, which improves computational efficiency by identifying and removing redundant patches. In this work, we propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads. This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations. The proposed method can be easily applied during both training and inference, and achieves improved throughput while maintaining classification accuracy in scenarios such as fine-tuning with pre-trained models. In addition, we also found that using robust statistical measures, such as the median absolute deviation in place of variance, to assess patch importance can similarly lead to strong performance. Furthermore, by introducing overlapping patch embeddings, our method achieves better performance with comparable throughput to conventional approaches that utilize all patches.
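As a rough illustration of scoring patches by the dispersion of attention weights across heads, the sketch below ranks patches by population variance or by median absolute deviation (MAD) and keeps the top ones. Two points are assumptions rather than details from the abstract: that higher dispersion means higher importance, and that each row holds one head's attention weights over the patches. The helper names are hypothetical.

```python
from statistics import median, pvariance


def mad(values):
    """Median absolute deviation, a robust alternative to variance."""
    m = median(values)
    return median([abs(v - m) for v in values])


def keep_patches(attn, keep, score=pvariance):
    """Return sorted indices of the `keep` patches with the largest score across heads.

    attn[h][j] is the attention weight that head h assigns to patch j.
    """
    num_patches = len(attn[0])
    scores = [score([head[j] for head in attn]) for j in range(num_patches)]
    ranked = sorted(range(num_patches), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical attention weights: 3 heads x 4 patches, each row sums to 1.
attn = [
    [0.10, 0.50, 0.30, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.30, 0.50, 0.10],
]
kept_var = keep_patches(attn, keep=2)
kept_mad = keep_patches(attn, keep=2, score=mad)
print(kept_var, kept_mad)
```

Patches 0 and 3 receive the same weight from every head, so both statistics score them at zero and they are pruned first.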
[CV-37] PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring
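[Quick read]: This paper addresses the performance bottleneck of contactless in-vehicle physiological sensing in real driving scenarios, which stems from scarce datasets, limited modalities, incomplete biometric annotations, and insufficient coverage of environmental conditions. The key to the solution is PhysDrive, the first large-scale multimodal dataset of its kind. It combines synchronized RGB, near-infrared camera, and mmWave radar data from 48 drivers with six physiological ground truths (ECG, BVP, respiration, HR, RR, and SpO2) and covers diverse driving conditions (driver motions, lighting changes, vehicle types, and road conditions). Through a systematic evaluation of signal-processing and deep-learning methods, it provides a comprehensive benchmark and open-source toolchain for multimodal driver monitoring, supporting research on and application of smart-cockpit systems.
Link: https://arxiv.org/abs/2507.19172
Authors: Jiyao Wang,Xiao Yang,Qingyong Hu,Jiankai Tang,Can Liu,Dengbo He,Yuntao Wang,Yingcong Chen,Kaishun Wu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Tsinghua University; Sichuan Agricultural University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: It is the initial version, not the final version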
Abstract:Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.
[CV-38] DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
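[Quick read]: This paper addresses the feature overlap and degraded rendering quality caused by the low-rank assumption in dynamic scene reconstruction, as well as the hash collisions and redundancy that arise when 4D hash encoding is applied directly. The key to the solution is DASH, a real-time dynamic scene rendering framework. It uses a self-supervised decomposition mechanism to separate dynamic and static components without manual annotations or precomputed masks, introduces a multiresolution 4D hash encoder that models dynamic elements explicitly and thus avoids the low-rank constraint, and adds a spatio-temporal smoothness regularization strategy to suppress unstable deformation artifacts. On real-world datasets it achieves real-time rendering at 264 FPS with state-of-the-art visual quality.
Link: https://arxiv.org/abs/2507.19141
Authors: Jie Chen(1),Zhangchi Hu(1),Peixi Wu(1),Huyue Zhu(1),Hebei Li(1),Xiaoyan Sun(1 and 2) ((1) University of Science and Technology of China, Hefei, China, (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China)
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: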
Abstract:Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: this https URL.
[CV-39] Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation
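[Quick read]: This paper addresses the limited segmentation performance in few-shot segmentation (FSS) that results from prototype learning methods predicting conservatively and affinity learning methods predicting aggressively. The key to the solution is the Prototype-Affinity Hybrid Network (PAHNet), which adds two components to each attention block of an affinity learning model: a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module. PFE uses the predictions of a pre-trained prototype learning model to enhance the foreground information in the support and query image representations, while ASC suppresses mismatched foreground-background (FG-BG) relationships between them. This mitigates the over-aggressiveness of affinity learning and improves segmentation accuracy.
Link: https://arxiv.org/abs/2507.19140
Authors: Tianyu Zou,Shengwu Xiong,Ruilin Yao,Yi Rong
Affiliations: Wuhan University of Technology; Wuhan College; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures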
Abstract:This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a Prototype-Affinity Hybrid Network (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5^i and COCO-20^i datasets, suggesting its effectiveness. The code is available on GitHub (tianyu-zou/PAHNet, ICCV’25): this https URL.
[CV-40] MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective ICCV2025
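[Quick read]: This paper addresses the difficulty of balancing inference efficiency and accuracy in quantized vision transformers (such as the window-based Swin Transformer). The core challenge is that conventional uniform-bit-width quantization cannot exploit the sparsity within activations, leading to redundant computation and unnecessary accuracy loss. The key to the solution is the MixA-Q framework, which exploits intra-layer activation sparsity by separating window-level activations within Swin blocks and assigning a lower bit width to the activations of less important windows. It also introduces a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, improving computational efficiency without changing existing quantization training pipelines. Experiments show a training-free 1.35x speedup without accuracy loss; with quantization-aware training (QAT), a lossless 1.25x speedup or a 1.53x speedup with only a 1% mAP drop; and a 0.7% mAP improvement for the W4A4 quantized model, reducing quantization degradation by 24%.
Link: https://arxiv.org/abs/2507.19131
Authors: Weitian Wang,Rai Shubham,Cecilia De La Parra,Akash Kumar
Affiliations: Robert Bosch GmbH; Ruhr University Bochum
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025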
Abstract:In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.
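The per-window bit-width assignment can be pictured with a small fake-quantization sketch. This is not the MixA-Q implementation: the importance criterion (mean absolute activation), the 50% low-precision share, the 8/4-bit pair, and the function names are all assumptions made for illustration, and a window is just a flat list of numbers here.

```python
def quantize(values, bits):
    """Symmetric uniform fake-quantization: quantize to `bits` bits, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero input
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mixed_precision(windows, high_bits=8, low_bits=4, low_fraction=0.5):
    """Give the least important windows `low_bits`, the rest `high_bits`."""
    # Assumed importance proxy: mean absolute activation of the window.
    importance = [sum(abs(v) for v in w) / len(w) for w in windows]
    order = sorted(range(len(windows)), key=lambda i: importance[i])
    low = set(order[: int(len(windows) * low_fraction)])
    bits = [low_bits if i in low else high_bits for i in range(len(windows))]
    return bits, [quantize(w, b) for w, b in zip(windows, bits)]


# Hypothetical activations for four windows.
windows = [
    [0.9, -1.2, 0.4, 1.0],
    [0.01, 0.02, -0.01, 0.0],
    [0.5, 0.6, -0.7, 0.2],
    [0.05, -0.03, 0.02, 0.04],
]
bits, quantized = mixed_precision(windows)
print(bits)
```

The two near-zero windows end up at 4 bits and the two high-magnitude windows keep 8 bits, so the coarser quantization step is spent where activations carry little signal.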
[CV-41] Learned Image Compression with Hierarchical Progressive Context Modeling ICCV2025
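[Quick read]: This paper addresses the insufficient context modeling capacity of existing learned image compression methods, in particular their inefficiency in exploiting long-range dependencies and diverse context information across coding steps. The key to the solution is the Hierarchical Progressive Context Model (HPCM). It uses a hierarchical coding schedule to sequentially model contextual dependencies among latents at multiple scales, which makes long-range context modeling more efficient, and a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, exploiting the diverse context of different coding stages. The result is a better balance between rate-distortion performance and computational complexity.
Link: https://arxiv.org/abs/2507.19125
Authors: Yuqi Li,Haotian Zhang,Li Li,Dong Liu
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 17 pages, ICCV 2025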
Abstract:Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, effectively exploiting diverse contextual information. Experimental results demonstrate that our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity. The code is available at this https URL.
[CV-42] Preserving Topological and Geometric Embeddings for Point Cloud Recovery
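[Quick read]: This paper addresses the difficulty of effectively exploiting topological and geometric features in point cloud recovery, where existing methods often fail to preserve the structure of the original space jointly across the sampling and restoration phases. The core solution is TopGeoFormer, an end-to-end architecture with three key elements: 1) a topological embedding produced by a continuous mapping of the relative relationships between neighboring points, used in both the sampling and restoration phases to preserve the original spatial structure; 2) InterTwining Attention, which fully merges the topological and geometric embeddings to form a locally aware shape context comprising point-wise, point-shape-wise, and intra-shape features; 3) a full geometry loss and a topological constraint loss that optimize the embeddings in Euclidean and topological space respectively, improving reconstruction accuracy and topological consistency.
Link: https://arxiv.org/abs/2507.19121
Authors: Kaiyue Zhou,Zelong Tan,Hongxiao Wang,Ya-li Li,Shengjin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: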
Abstract:Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named TopGeoFormer, which maintains these critical features throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases for preserving the structure of the original space. Second, we propose the InterTwining Attention to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively analyze the circumstances using the conventional and learning-based sampling/upsampling algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.
[CV-43] PatchTraj: Dynamic Patch Representation Learning for Time-Frequency Trajectory Prediction
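[Quick Read]: This paper targets two key problems in current pedestrian trajectory prediction. First, point-based and grid-based methods model human motion dynamics poorly because they struggle to balance local motion details with long-range spatiotemporal dependencies. Second, time-domain representations lack interaction with the frequency domain, so the frequency characteristics of trajectory sequences are not modeled effectively. The key to the solution is the PatchTraj framework, which decomposes a trajectory into raw time sequences and frequency components, applies dynamic patch partitioning, uses an adaptive embedding layer for scale-aware feature extraction, fuses time-domain and frequency-domain information through cross-modal attention, and finally predicts future trajectories with a Transformer encoder-decoder.
Link: https://arxiv.org/abs/2507.19119
Authors: Yanghong Liu,Xingping Dong,Ming Li,Weixing Zhang,Yidong Lou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: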
Abstract:Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods expose two key limitations: they insufficiently model human motion dynamics, as they fail to balance local motion details with long-range spatiotemporal dependencies, and their time representation lacks interaction with the frequency domain when modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based trajectory prediction framework that unifies time-domain and frequency-domain representations. Specifically, we decompose the trajectory into raw time sequences and frequency components, employing dynamic patch partitioning for multi-scale trajectory segmentation to capture hierarchical motion patterns. Each patch is processed by an adaptive embedding layer with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches interact via cross-modal attention, enabling complementary fusion of temporal and spectral cues. Finally, a Transformer encoder-decoder integrates both modalities to autoregressively predict future trajectories. Extensive experiments on ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance with high efficiency.
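The time/frequency decomposition and patching described in this abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: patches here are fixed-size and non-overlapping (the paper's partitioning is dynamic and multi-scale), the frequency view is a naive DFT magnitude spectrum, and the function names `dft_magnitudes` and `make_patches` are hypothetical, not taken from the paper.

```python
import cmath

def dft_magnitudes(values):
    """Naive discrete Fourier transform; returns the magnitude of each frequency bin."""
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values)))
        for k in range(n)
    ]

def make_patches(sequence, patch_len):
    """Cut a sequence into non-overlapping patches, dropping any incomplete tail."""
    last_start = len(sequence) - patch_len
    return [sequence[i:i + patch_len] for i in range(0, last_start + 1, patch_len)]

# Hypothetical 8-step x-coordinate track of one pedestrian (illustrative values only).
xs = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]

time_patches = make_patches(xs, patch_len=4)                   # time-domain branch input
freq_patches = make_patches(dft_magnitudes(xs), patch_len=4)   # frequency-domain branch input
```

In a full model, each list of patches would be embedded and the two branches would exchange information through attention; the sketch stops at producing the two patch sets.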
[CV-44] Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching
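[Quick Read]: This paper addresses the difficulty of describing features for cross-modal remote sensing image matching, which stems from geometric and radiometric differences between modalities; existing methods mostly extract features at the fully connected layer and fail to capture cross-modal similarities effectively. The key to the solution is a Cross Spatial Temporal Fusion (CSTF) mechanism that detects scale-invariant keypoints independently in the reference and query images, builds correspondence maps from information in multiple image regions, and reformulates similarity matching as a classification task based on SoftMax and Fully Convolutional Network (FCN) layers. This keeps the model sensitive to local features while incorporating global context, giving robust matching across diverse remote sensing modalities.
Link: https://arxiv.org/abs/2507.19118
Authors: Abu Sadat Mohammad Salehin Amit,Xiaoli Zhang,Md Masum Billa Shagar,Zhaojun Liu,Xiongfei Li,Fanlong Meng
Affiliations: Jilin University (吉林大学); Taiyuan University of Technology (太原理工大学); Delaware State University (特拉华州立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: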
Abstract:Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: First, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to distinctive local features while incorporating broader contextual information, resulting in robust matching across diverse remote sensing modalities. To demonstrate the practical utility of improved feature matching, we evaluate CSTF on object detection tasks using the HRSC2016 and DOTA benchmark datasets. Our method achieves state-of-the-art performance with an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models. The CSTF model maintains computational efficiency with an inference speed of 12.5 FPS. These results validate that our approach to cross-modal feature matching directly enhances downstream remote sensing applications such as object detection.
[CV-45] LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models
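[Quick Read]: This paper addresses hallucination in Multimodal Large Language Models (MLLMs) on vision-language tasks, in particular object hallucinations in image captioning, where the model describes objects that are not in the image. The key to the solution is a Layer-wise Integration and Suppression Approach (LISA) with two mechanisms: first, zone-specific spectral modulation suppresses over-amplified activations in deeper layers to stabilize attention while preserving the visual alignment cues of shallow layers; second, token-level logits from selected layers are fused through anchor-based routing, where token-wise anchor selection and soft logit fusion enable adaptive integration during decoding. LISA is fully plug-and-play and can be integrated into existing MLLMs such as Qwen2.5-VL. On multiple benchmarks it reduces hallucinations by up to 53.6% and improves POPE F1 by 4.5%, showing good generalization across models and tasks.
Link: https://arxiv.org/abs/2507.19110
Authors: Zhihui Guo,Xin Man,Hui Xu,Jie Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: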
Abstract:Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in CHAIR_I and improves POPE F1 by 4.5%, demonstrating strong generalization across models and tasks.
[CV-46] MedSymmFlow: Bridging Generative Modeling and Classification in Medical Imaging through Symmetrical Flow Matching MICCAI2025
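[Quick Read]: This paper addresses the dual need for accurate predictions and reliable uncertainty estimates in medical image classification, especially in high-stakes clinical settings where conventional discriminative models often fail to quantify prediction confidence. The core of the solution is MedSymmFlow, a generative-discriminative hybrid model based on Symmetrical Flow Matching. It uses a latent-space formulation to scale to high-resolution inputs and introduces a semantic mask conditioning mechanism to enhance diagnostic relevance. The model estimates uncertainty naturally through its generative sampling process. On four MedMNIST datasets it matches or exceeds established baselines in classification accuracy and AUC, and the reliability of its uncertainty estimates is validated under selective prediction.
Link: https://arxiv.org/abs/2507.19098
Authors: Francisco Caetano,Lemar Abdi,Christiaan Viviers,Amaan Valiuddin,Fons van der Sommen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: DGM4MICCAI 2025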
Abstract:Reliable medical image classification requires accurate predictions and well-calibrated uncertainty estimates, especially in high-stakes clinical settings. This work presents MedSymmFlow, a generative-discriminative hybrid model built on Symmetrical Flow Matching, designed to unify classification, generation, and uncertainty quantification in medical imaging. MedSymmFlow leverages a latent-space formulation that scales to high-resolution inputs and introduces a semantic mask conditioning mechanism to enhance diagnostic relevance. Unlike standard discriminative models, it naturally estimates uncertainty through its generative sampling process. The model is evaluated on four MedMNIST datasets, covering a range of modalities and pathologies. The results show that MedSymmFlow matches or exceeds the performance of established baselines in classification accuracy and AUC, while also delivering reliable uncertainty estimates validated by performance improvements under selective prediction.
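The abstract validates uncertainty estimates through selective prediction, that is, accuracy measured only on the samples the model is most confident about. A minimal sketch of that evaluation follows. It assumes a per-sample confidence score is already available (the paper derives uncertainty from generative sampling, which is not reproduced here); the function name `selective_accuracy` and the toy values are hypothetical.

```python
def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the `coverage` fraction of samples with the highest confidence."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Rank samples from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(coverage * len(ranked)))
    kept = ranked[:keep]
    return sum(1 for _, ok in kept if ok) / len(kept)

# Toy values for illustration only, not results from the paper.
conf = [0.95, 0.90, 0.60, 0.55]
hit = [True, True, False, True]

full = selective_accuracy(conf, hit, 1.0)   # evaluate on all four samples
half = selective_accuracy(conf, hit, 0.5)   # evaluate on the two most confident samples
```

If the uncertainty estimates are informative, accuracy should rise as coverage shrinks, which is the kind of improvement the abstract refers to.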
[CV-47] Fine-Grained Traffic Inference from Road to Lane via Spatio-Temporal Graph Node Generation
[Quick Read]: This paper addresses the bottleneck in obtaining lane-level traffic data for fine-grained traffic management and prediction, caused by the limited types and number of sensors and the limited accuracy of tracking algorithms; such data underpins applications like autonomous driving, lane change guidance, and traffic signal control. The core of the solution is the Fine-grained Road Traffic Inference (FRTI) task and a two-stage framework, RoadDiff. The framework uses a Road-Lane Correlation Autoencoder-Decoder and a Lane Diffusion Module to exploit the limited spatio-temporal dependencies and distribution relationships of road data and infer lane-level traffic states accurately.
Link: https://arxiv.org/abs/2507.19089
Authors: Shuhao Li,Weidong Yang,Yue Cui,Xiaoxing Liu,Lingkai Meng,Lipeng Ma,Fan Zhang
Affiliations: Fudan University (复旦大学); Shanghai Key Laboratory of Data Science (上海市数据科学重点实验室); Zhuhai Fudan Innovation Research Institute (珠海复旦创新研究院); The Hong Kong University of Science and Technology (香港科技大学); Guangzhou University (广州大学); GZHU-SCHB Intelligent Transportation Joint Lab (广工-赛宝智能交通联合实验室); Shanghai Jiao Tong University (上海交通大学)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Fine-grained traffic management and prediction are fundamental to key applications such as autonomous driving, lane change guidance, and traffic signal control. However, obtaining lane-level traffic data has become a critical bottleneck for data-driven models due to limitations in the types and number of sensors and issues with the accuracy of tracking algorithms. To address this, we propose the Fine-grained Road Traffic Inference (FRTI) task, which aims to generate more detailed lane-level traffic information using limited road data, providing a more energy-efficient and cost-effective solution for precise traffic management. This task is abstracted as the first scenario of the spatio-temporal graph node generation problem. We designed a two-stage framework, RoadDiff, to solve the FRTI task. This framework leverages the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module to fully utilize the limited spatio-temporal dependencies and distribution relationships of road data to accurately infer fine-grained lane traffic states. Based on existing research, we designed several baseline models with the potential to solve the FRTI task and conducted extensive experiments on six datasets representing different road conditions to validate the effectiveness of the RoadDiff model in addressing the FRTI task. The relevant datasets and code are available at this https URL.
[CV-48] Multi-Task Dense Prediction Fine-Tuning with Mixture of Fine-Grained Experts
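[Quick Read]: This paper addresses how Multi-task Learning (MTL) for dense prediction can balance shared representations with task-specific specialization. The core challenge is that existing methods struggle to transfer knowledge between tasks and suppress interference while staying parameter-efficient. The key to the solution is a Fine-Grained Mixture of Experts (FGMoE) architecture with three innovations: (1) intra-task experts that partition along the intermediate hidden dimension of MLPs, giving a finer decomposition of task information; (2) shared experts that consolidate common information across different contexts of the same task, reducing redundancy and letting routing experts focus on unique aspects; (3) a global expert that transfers knowledge across tasks adaptively based on input features and task requirements, promoting beneficial sharing and avoiding harmful interference. Parameter efficiency is further improved by fine-tuning only the decoder parameters. Experiments show that FGMoE uses fewer parameters and significantly outperforms current MoE-based MTL models on NYUD-v2 and PASCAL-Context.
Link: https://arxiv.org/abs/2507.19077
Authors: Yangyang Xu,Xi Ye,Duo Su
Affiliations: Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2025 (MM’25)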
Abstract:Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. First, we propose intra-task experts that partition along intermediate hidden dimensions of MLPs, enabling finer decomposition of task information while maintaining parameter efficiency. Second, we introduce shared experts that consolidate common information across different contexts of the same task, reducing redundancy, and allowing routing experts to focus on unique aspects. Third, we design a global expert that facilitates adaptive knowledge transfer across tasks based on both input feature and task requirements, promoting beneficial information sharing while preventing harmful interference. In addition, we use a fine-tuning approach that improves parameter efficiency by training only the parameters of the decoder. Extensive experimental results show that the proposed FGMoE uses fewer parameters and significantly outperforms current MoE-based competitive MTL models on two dense prediction datasets (i.e., NYUD-v2, PASCAL-Context) in various metrics.
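The idea of partitioning an MLP along its intermediate hidden dimension can be made concrete with a small sketch. It rests on a general fact rather than on the paper's implementation: for a bias-free two-layer MLP with an element-wise activation, slicing the hidden dimension into contiguous groups yields "experts" whose outputs sum to the dense MLP output. The equal-size contiguous split, the absence of routing, and all names (`split_experts`, `mlp`) are assumptions for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mlp(w_in, w_out, x):
    """Two-layer MLP without biases: w_out @ relu(w_in @ x)."""
    return matvec(w_out, relu(matvec(w_in, x)))

def split_experts(w_in, w_out, num_experts):
    """Partition the hidden dimension into equal contiguous slices, one per expert.

    Assumes the hidden size is divisible by num_experts.
    """
    size = len(w_in) // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        # Each expert owns some rows of w_in and the matching columns of w_out.
        experts.append((w_in[lo:hi], [row[lo:hi] for row in w_out]))
    return experts

w_in = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [2.0, 1.0]]   # hidden=4, input=2
w_out = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 1.0]]       # output=2, hidden=4
x = [1.0, 2.0]

dense = mlp(w_in, w_out, x)
experts = split_experts(w_in, w_out, num_experts=2)
combined = [sum(vals) for vals in zip(*(mlp(wi, wo, x) for wi, wo in experts))]
```

With every expert active, `combined` reproduces `dense`; a router that activates only some slices per input would then compute a cheaper, input-dependent subset of the same MLP.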
[CV-49] SP-Mamba: Spatial-Perception State Space Model for Unsupervised Medical Anomaly Detection
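[Quick Read]: This paper addresses the tension between accuracy and computational efficiency in medical image anomaly detection: convolutional neural networks (CNNs) struggle to capture long-range dependencies, while Transformers have quadratic computational complexity. The key to the solution is the SP-Mamba framework, whose core components are window-sliding prototype learning and a Mamba built on Circular-Hilbert scanning, which together exploit the structural regularity inherent in medical images. The method also exploits the concentration and contrast characteristics of anomaly maps to improve anomaly detection. It reaches state-of-the-art performance on three different medical anomaly detection benchmarks.
Link: https://arxiv.org/abs/2507.19076
Authors: Rui Pan,Ruiying Lu
Affiliations: Xidian University (西安电子科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages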
Abstract:Radiography imaging protocols target specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. The window-sliding prototype learning and Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we exploit the concentration and contrast characteristics of anomaly maps to improve anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method’s state-of-the-art performance, validating its efficacy and robustness. The code is available at this https URL.
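State space models such as Mamba process a 1D sequence, so a 2D feature map has to be flattened along some scan path. The sketch below shows a standard Hilbert-curve scan, which keeps consecutive sequence elements spatially adjacent. It is not the paper's Circular-Hilbert variant, whose details are not given in the abstract; it assumes a square grid whose side is a power of two, and the function names are hypothetical.

```python
def hilbert_d2xy(side, d):
    """Map index d along a Hilbert curve to (x, y) on a side x side grid.

    `side` must be a power of two. This is the classic iterative construction.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect within the current s x s block.
                x, y = s - 1 - x, s - 1 - y
            # Transpose to rotate the block.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan(grid):
    """Flatten a square 2D grid (list of rows) into a 1D list along the Hilbert curve."""
    side = len(grid)
    return [grid[y][x] for x, y in (hilbert_d2xy(side, d) for d in range(side * side))]

# Toy 4x4 "feature map" whose entries are their own row-major indices.
feature_map = [[row * 4 + col for col in range(4)] for row in range(4)]
sequence = hilbert_scan(feature_map)
```

Compared with a row-major scan, this ordering never jumps across the image between consecutive tokens, which is the usual motivation for curve-based scanning.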
[CV-50] A Self-training Framework for Semi-supervised Pulmonary Vessel Segmentation and Its Application in COPD
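[Quick Read]: This paper addresses accurate segmentation and quantification of pulmonary vessels, especially small vessels, in chest CT images of patients with Chronic Obstructive Pulmonary Disease (COPD). The key to the solution is a self-training semi-supervised framework built on a teacher-student architecture: a fully supervised teacher model is first trained on a small labeled set, it then generates pseudo-labels for a large amount of unlabeled data, and reliable pseudo-labels selected by a specific strategy are used to train the student model iteratively. This improves vessel segmentation precision to 90.3% and supports quantitative analysis of pulmonary vessel differences across COPD severity levels.
Link: https://arxiv.org/abs/2507.19074
Authors: Shuiqing Zhao,Meihuan Wang,Jiaxuan Xu,Jie Feng,Wei Qian,Rongchang Chen,Zhenyu Liang,Shouliang Qi,Yanan Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: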
Abstract:Background: Accurate segmentation and quantification of the pulmonary vessels, particularly smaller vessels, from computed tomography (CT) images is fundamental in chronic obstructive pulmonary disease (COPD) patients. Objective: The aim of this study was to segment the pulmonary vasculature using a semi-supervised method. Methods: In this study, a self-training framework is proposed by leveraging a teacher-student model for the segmentation of pulmonary vessels. First, high-quality annotations are acquired for the in-house data in an interactive way. Then, the model is trained in a semi-supervised way. A fully supervised model is trained on a small set of labeled CT images, yielding the teacher model. Following this, the teacher model is used to generate pseudo-labels for the unlabeled CT images, from which reliable ones are selected based on a certain strategy. The training of the student model involves these reliable pseudo-labels. This training process is iteratively repeated until an optimal performance is achieved. Results: Extensive experiments are performed on non-enhanced CT scans of 125 COPD patients. Quantitative and qualitative analyses demonstrate that the proposed method, Semi2, significantly improves the precision of vessel segmentation by 2.3%, achieving a precision of 90.3%. Further, quantitative analysis is conducted on the pulmonary vessels in COPD, providing insights into the differences in the pulmonary vessels across different severities of the disease. Conclusion: The proposed method can not only improve the performance of pulmonary vascular segmentation, but can also be applied in COPD analysis. The code will be made available at this https URL.
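The pseudo-label selection step of the teacher-student loop can be sketched as follows. The abstract only says reliable pseudo-labels are chosen "based on a certain strategy", so the criterion used here (mean per-voxel confidence above a threshold) is an assumption, as are the function names, the 0.9 threshold, and the toy probabilities.

```python
def mean_confidence(prob_map):
    """Average of max(p, 1 - p) over voxels: how decisively the teacher labels a scan."""
    return sum(max(p, 1.0 - p) for p in prob_map) / len(prob_map)

def select_pseudo_labels(teacher_probs, threshold=0.9):
    """Keep scans whose teacher prediction is confident and binarize them into pseudo-labels."""
    selected = {}
    for scan_id, prob_map in teacher_probs.items():
        if mean_confidence(prob_map) >= threshold:
            selected[scan_id] = [1 if p >= 0.5 else 0 for p in prob_map]
    return selected

# Hypothetical flattened per-voxel vessel probabilities from a teacher model.
teacher_probs = {
    "scan_a": [0.98, 0.02, 0.95, 0.01],   # decisive prediction, kept
    "scan_b": [0.55, 0.45, 0.60, 0.50],   # ambiguous prediction, dropped
}
pseudo = select_pseudo_labels(teacher_probs)
```

In the iterative scheme the abstract describes, the student trained on labeled data plus `pseudo` would become the next teacher, and the selection would be repeated on the remaining unlabeled scans.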
[CV-51] Cross-Subject Mind Decoding from Inaccurate Representations
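[Quick Read]: This paper addresses the difficulty of cross-subject mapping when decoding stimulus images from functional magnetic resonance imaging (fMRI) signals. Cognitive variability and subject-specific differences lead to accumulated sequential errors, which reduce the fidelity of reconstructed images. The key to the solution is a Bidirectional Autoencoder Intertwining framework: a Subject Bias Modulation Module unifies the data of multiple subjects, and bidirectional mapping captures data distributions more precisely. A Semantic Refinement Module improves semantic representations and a Visual Coherence Module mitigates the effects of inaccurate visual representations, improving reconstruction fidelity. Integrated with ControlNet and Stable Diffusion, the method outperforms state-of-the-art approaches on benchmark datasets and adapts to new subjects with minimal training samples.
Link: https://arxiv.org/abs/2507.19071
Authors: Yangyang Xu,Bangzhen Liu,Wenqi Shao,Yong Du,Shengfeng He,Tingting Zhu
Affiliations: The University of Oxford (牛津大学); South China University of Technology (华南理工大学); Shanghai AI Lab (上海人工智能实验室); Ocean University of China (中国海洋大学); Singapore Management University (新加坡管理大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: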
Abstract:Decoding stimulus images from fMRI signals has advanced with pre-trained generative models. However, existing methods struggle with cross-subject mappings due to cognitive variability and subject-specific differences. This challenge arises from sequential errors, where unidirectional mappings generate partially inaccurate representations that, when fed into diffusion models, accumulate errors and degrade reconstruction fidelity. To address this, we propose the Bidirectional Autoencoder Intertwining framework for accurate decoded representation prediction. Our approach unifies multiple subjects through a Subject Bias Modulation Module while leveraging bidirectional mapping to better capture data distributions for precise representation prediction. To further enhance fidelity when decoding representations into stimulus images, we introduce a Semantic Refinement Module to improve semantic representations and a Visual Coherence Module to mitigate the effects of inaccurate visual representations. Integrated with ControlNet and Stable Diffusion, our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations. Moreover, our framework exhibits strong adaptability to new subjects with minimal training samples.
[CV-52] Negation-Aware Test-Time Adaptation for Vision-Language Models
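[Quick Read]: This paper addresses a significant weakness of Vision-Language Models (VLMs) in handling negation: they struggle to identify what is false or non-existent, which limits their use in practical settings such as medical image retrieval. Existing methods mostly fine-tune on large-scale data with explicit negation, but such data is costly to obtain and the compute demand is high, which limits sustainable adoption. The proposed solution is Negation-Aware Test-Time Adaptation (NEAT). Its core is to identify and mitigate the dual-concept shifts between the affirmation and negation distributions by adjusting distribution-related parameters during inference, reducing distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics.
Link: https://arxiv.org/abs/2507.19064
Authors: Haochen Han,Alex Jinpeng Wang,Fangming Liu
Affiliations: Pengcheng Laboratory (鹏城实验室); Central South University (中南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper will be submitted to the IEEE for possible publication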
Abstract:In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), i.e., negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, e.g., radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation that fails to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on the various negation understanding tasks verify the effectiveness of the proposed method. The code is available at this https URL.
[CV-53] Revisiting DETR for Small Object Detection via Noise-Resilient Query Optimization ICME
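[Quick Read]: This paper addresses two core problems in Transformer-based Small Object Detection (SOD): the Feature Pyramid Network (FPN) is sensitive to noise during feature fusion, which degrades detection, and existing label assignment strategies produce low-quality queries, which hurts accuracy. The key to the solution is a Noise-Resilient Query Optimization (NRQO) paradigm with two components: 1) a Noise-Tolerance Feature Pyramid Network (NT-FPN) that mitigates noise during FPN feature fusion by preserving the integrity of spatial and semantic information; 2) a Pairwise-Similarity Region Proposal Network (PS-RPN) that uses position and shape similarities to improve the matching between anchors and ground-truth boxes, generating high-quality positive queries without additional hyperparameters.
Link: https://arxiv.org/abs/2507.19059
Authors: Xiaocheng Fang,Jieyi Cai,Huanyu Liu,Wenxiu Cai,Yishu Liu,Bingzhi Chen
Affiliations: South China Normal University (华南师范大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Beijing Institute of Technology, Zhuhai (北京理工大学珠海校区)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE International Conference on Multimedia and Expo (ICME)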
Abstract:Despite advancements in Transformer-based detectors for small object detection (SOD), recent studies show that these detectors still face challenges due to inherent noise sensitivity in feature pyramid networks (FPN) and diminished query quality in existing label assignment strategies. In this paper, we propose a novel Noise-Resilient Query Optimization (NRQO) paradigm, which innovatively incorporates the Noise-Tolerance Feature Pyramid Network (NT-FPN) and the Pairwise-Similarity Region Proposal Network (PS-RPN). Specifically, NT-FPN mitigates noise during feature fusion in FPN by preserving spatial and semantic information integrity. Unlike existing label assignment strategies, PS-RPN generates a sufficient number of high-quality positive queries by enhancing anchor-ground truth matching through position and shape similarities, without the need for additional hyperparameters. Extensive experiments on multiple benchmarks consistently demonstrate the superiority of NRQO over state-of-the-art baselines.
[CV-54] ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
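[Quick Read]: This paper addresses semantic drift in 3D scene generation caused by successive outpainting: when generating long-range view sequences, accumulated deviation makes the scene semantics gradually depart from the initial setting. The key to the solution is the ScenePainter framework, whose core innovation is a hierarchical graph structure, SceneConceptGraph, that models relations among multi-level scene concepts and aligns the outpainter's scene-specific prior with the comprehension of the current scene. This guides the outpainting module to generate semantically consistent novel views, and the graph can be refined dynamically to enhance diversity.
Link: https://arxiv.org/abs/2507.19058
Authors: Chong Xia,Shengjun Zhang,Fangfu Liu,Chang Liu,Khodchaphun Hirunyaratsameewong,Yueqi Duan
Affiliations: Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: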
Abstract:Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable to long-term video synthesis and 3D scene reconstruction. Existing methods follow a “navigate-and-imagine” fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from a semantic drift issue caused by the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter’s scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences. Project Page: this https URL.
[CV-55] Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding
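[Quick Read]: This paper examines how well brain encoding models generalize when predicting brain responses to naturalistic multimodal stimuli, in particular the performance gap between in-distribution (ID) and out-of-distribution (OOD) data. The approach is to build encoding models on state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and evaluate them rigorously on ID data and diverse OOD data. The results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model does best on ID data, while a simpler linear model is more robust on OOD data. Linguistic features did not improve predictive accuracy, suggesting that for familiar languages the continuous audiovisual streams dominate neural encoding. High-fidelity speech representations brought marked gains in the auditory cortex, underscoring the influence of sensory hierarchies on multimodal neural encoding.
Link: https://arxiv.org/abs/2507.19052
Authors: Hamid Abdollahi,Amir Hossein Mansouri Majoumerd,Amir Hossein Bagheri Baboukani,Amir Abolfazl Suratgar,Mohammad Bagher Menhaj
Affiliations: Amirkabir University of Technology (阿米尔卡比尔理工大学); Distributed and Intelligent Optimization Research Laboratory (分布式与智能优化研究实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: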
Abstract:Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding may be dominated by the continuous visual and auditory streams over redundant textual information. Spatially, our approach showed marked performance gains in the auditory cortex, underscoring the benefit of high-fidelity speech representations. Collectively, our findings demonstrate that rigorous OOD testing is essential for building robust neuro-AI models and provides nuanced insights into how model architecture, stimulus characteristics, and sensory hierarchies shape the neural encoding of our rich, multimodal world.
[CV-56] A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation ECAI2025
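[Quick Read]: This paper addresses two challenges for generative-model-based One-Shot Federated Learning (OSFL) in multi-center healthcare scenarios: existing methods train inefficiently and risk privacy leakage, and convergence within a single aggregation round is hard under non-independent and identically distributed (non-IID) data. The key to the solution is a modified OSFL framework with two core components. The first is a client-side Feature-Guided Rectified Flow Model (FG-RF), which synthesizes feature-level rather than pixel-level images, speeding up generative modeling for medical imaging while lowering the risk of privacy leakage. The second is a Dual-Layer Knowledge Distillation (DLKD) aggregation method, in which the global student model simultaneously mimics the output logits and aligns with the intermediate-layer features of the client-side teacher models, which handles non-IID distributions and promotes convergence.
Link: https://arxiv.org/abs/2507.19045
Authors: Yufei Ma,Hanwen Zhang,Qiya Yang,Guibo Luo,Yuesheng Zhu
Affiliations: Peking University (北京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted at ECAI 2025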
Abstract:In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, in this paper a modified OSFL framework is proposed, in which a new Feature-Guided Rectified Flow Model (FG-RF) and Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceed the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images.
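A distillation objective that matches both output logits and intermediate features, as DLKD is described as doing, can be sketched in a few lines. The specific choices below are assumptions, not the paper's formulation: a temperature-softened KL divergence for the logit term, mean squared error for the feature term, a single teacher, and the names `dual_layer_kd_loss`, `temperature`, and `feat_weight`.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_layer_kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                       temperature=2.0, feat_weight=1.0):
    """Logit-matching term (KL on softened outputs) plus feature-alignment term (MSE)."""
    logit_term = kl_divergence(softmax(teacher_logits, temperature),
                               softmax(student_logits, temperature))
    feat_term = mse(student_feat, teacher_feat)
    return logit_term + feat_weight * feat_term

# Toy values for one sample and one client-side teacher (illustrative only).
loss_same = dual_layer_kd_loss([2.0, 0.5], [2.0, 0.5], [0.1, 0.3], [0.1, 0.3])
loss_diff = dual_layer_kd_loss([0.5, 2.0], [2.0, 0.5], [0.9, 0.0], [0.1, 0.3])
```

With several client teachers, one plausible extension is to average this loss over teachers; how the paper weights them is not stated in the abstract.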
[CV-57] Dual Path Learning – learning from noise and context for medical image denoising
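[Quick Read]: This paper addresses noise in medical images, which degrades image quality and in turn compromises the accuracy of clinical diagnosis. Existing denoising methods typically rely on either noise characteristics or image context alone, and most are limited to a single imaging modality and noise type, so they generalize poorly. The key to the solution is a Dual-Pathway Learning (DPL) model architecture that extracts and fuses both noise features and image context, giving robust denoising across modalities and noise types. Experiments show that DPL improves PSNR by 3.35% over the baseline UNet under Gaussian noise.
Link: https://arxiv.org/abs/2507.19035
Authors: Jitindra Fartiyal,Pedro Freire,Yasmeen Whayeb,James S. Wolffsohn,Sergei K. Turitsyn,Sergei G. Sokolov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures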
Abstract:Medical imaging plays a critical role in modern healthcare, enabling clinicians to accurately diagnose diseases and develop effective treatment plans. However, noise, often introduced by imaging devices, can degrade image quality, leading to misinterpretation and compromised clinical outcomes. Existing denoising approaches typically rely either on noise characteristics or on contextual information from the image. Moreover, they are commonly developed and evaluated for a single imaging modality and noise type. Motivated by Geng this http URL CNCL, which integrates both noise and context, this study introduces a Dual-Pathway Learning (DPL) model architecture that effectively denoises medical images by leveraging both sources of information and fusing them to generate the final output. DPL is evaluated across multiple imaging modalities and various types of noise, demonstrating its robustness and generalizability. DPL improves PSNR by 3.35% compared to the baseline UNet when evaluated on Gaussian noise and trained across all modalities. The code is available at https://doi.org/10.5281/zenodo.15836053.
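The reported gain is stated in PSNR, which is computed from the mean squared error between a clean reference and the denoised estimate: PSNR = 10 * log10(MAX^2 / MSE). A small sketch of the metric follows, assuming images are flattened lists of pixel values with an 8-bit peak of 255; the helper names and the sample pixels are hypothetical and not taken from the paper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf              # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def relative_gain_percent(baseline_db, improved_db):
    """Relative PSNR improvement over a baseline, in percent."""
    return 100.0 * (improved_db - baseline_db) / baseline_db

# Toy 4-pixel images (illustrative only).
clean = [10.0, 20.0, 30.0, 40.0]
noisy = [12.0, 18.0, 33.0, 39.0]
score = psnr(clean, noisy)
```

Because PSNR is logarithmic, a percentage gain in dB, as reported in the abstract, corresponds to a larger proportional reduction in mean squared error.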
[CV-58] A Survey of Multimodal Hallucination Evaluation and Detection
[Quick Read]: This survey addresses hallucination in Multi-modal Large Language Models (MLLMs) on Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks, where models produce content that appears plausible but contradicts the input or established facts. Its contributions are: first, a taxonomy of hallucination based on faithfulness and factuality that covers the types commonly seen in practice; second, a comprehensive overview of existing hallucination evaluation benchmarks, including their construction process, evaluation objectives, and metrics; third, a summary of instance-level hallucination detection methods, which serve as a practical complement to benchmark-based evaluation. The paper also points out limitations of current benchmarks and detection methods and outlines directions for future research.
Link: https://arxiv.org/abs/2507.19024
Authors: Zhiyuan Chen(1 and 2),Yuecong Min(1 and 2),Jie Zhang(1 and 2),Bei Yan(1 and 2),Jiahao Wang(3),Xiaozhen Wang(3),Shiguang Shan(1 and 2) ((1) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (CAS) (2) University of Chinese Academy of Sciences (3) Trustworthy Technology and Engineering Laboratory, Huawei)
Affiliations: University of Chinese Academy of Sciences (中国科学院大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Huawei Technologies Co., Ltd. (华为技术有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 5 figures
Abstract:Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information, supporting a wide range of multi-modal tasks. However, these models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-image (T2I) generation tasks. Specifically, we first propose a taxonomy of hallucination based on faithfulness and factuality, incorporating the common types of hallucinations observed in practice. Then we provide an overview of existing hallucination evaluation benchmarks for both T2I and I2T tasks, highlighting their construction process, evaluation objectives, and employed metrics. Furthermore, we summarize recent advances in hallucination detection methods, which aim to identify hallucinated content at the instance level and serve as a practical complement to benchmark-based evaluation. Finally, we highlight key limitations in current benchmarks and detection methods, and outline potential directions for future research.
[CV-59] MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment MICCAI2025
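[Quick Read]: This paper addresses the limited generalization of existing medical image quality assessment (IQA) methods across imaging modalities and clinical scenarios. It proposes MedIQA, described as the first comprehensive foundation model for medical IQA, with three key components: 1) a large-scale multi-modality dataset with manually annotated quality scores to support training; 2) a salient slice assessment module that focuses on diagnostically relevant regions, combined with a feature retrieval mechanism; and 3) an automatic prompt strategy that aligns upstream pre-training on physical parameters with downstream fine-tuning on expert annotations. According to the abstract, MedIQA significantly outperforms baselines on multiple downstream tasks.
Link: https://arxiv.org/abs/2507.19004
Authors: Siyi Xun,Yue Sun,Jingkun Chen,Zitong Yu,Tong Tong,Xiaohong Liu,Mingxiang Wu,Tao Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: We note that the version after peer review of this paper has been provisionally accepted by The 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)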
Abstract:Rapid advances in medical imaging technology underscore the critical need for precise and automated image quality assessment (IQA) to ensure diagnostic accuracy. Existing medical IQA methods, however, struggle to generalize across diverse modalities and clinical scenarios. In response, we introduce MedIQA, the first comprehensive foundation model for medical IQA, designed to handle variability in image dimensions, modalities, anatomical regions, and types. We developed a large-scale multi-modality dataset with plentiful manually annotated quality scores to support this. Our model integrates a salient slice assessment module to focus on diagnostically relevant regions feature retrieval and employs an automatic prompt strategy that aligns upstream physical parameter pre-training with downstream expert annotation fine-tuning. Extensive experiments demonstrate that MedIQA significantly outperforms baselines in multiple downstream tasks, establishing a scalable framework for medical IQA and advancing diagnostic workflows and clinical decision-making.
[CV-60] Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment ICCV2025
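[Quick Read]: This paper addresses the gap between the quality of modern image generation systems and the frameworks used to evaluate them. In particular, human preference reward models fine-tuned from CLIP and BLIP architectures assign inappropriately low scores to images with rich detail and high aesthetic value, which conflicts with actual human aesthetic preferences. The key to the solution is a new evaluation score, the ICT (Image-Contained-Text) score, which assesses the degree to which an image represents the textual content and thereby gives a more accurate measure of text-image alignment. Building on this, the authors train an HP (High-Preference) score model that uses only the image modality, improving aesthetics and detail quality while maintaining text-image alignment. Experiments show that the proposed evaluation model improves scoring accuracy by over 10% compared with existing methods and is effective for optimizing state-of-the-art text-to-image models.
Link: https://arxiv.org/abs/2507.19002
Authors: Ying Ba,Tianyu Zhang,Yalong Bai,Wenyi Mo,Tao Liang,Bing Su,Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Beijing Key Laboratory of Research on Large Models and Intelligent Governance (北京市大模型与智能治理重点实验室); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE (教育部下一代智能搜索与推荐工程研究中心); iN2X
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025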
Abstract:Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at this https URL.
[CV-61] GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution
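[Quick Read]: This paper addresses the loss of global structure in Infrared Image Super-Resolution (IRSR) caused by the low contrast and sparse textures of infrared data. State-Space Models (SSMs) such as Mamba rely on an inherently 1D causal scanning mechanism that fragments the global context of 2D images and limits fine-detail restoration. The proposed framework, Global Phase and Spectral Prompt-guided Mamba (GPSMamba), rests on two components. First, an Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt into the Mamba block, bringing in non-local context to guide reconstruction. Second, a Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision that enforces global structural and spectral fidelity. Together, these mitigate the limitations of causal modeling and improve infrared reconstruction quality.
Link: https://arxiv.org/abs/2507.18998
Authors: Yongsong Huang,Tomo Miyazaki,Xiaofeng Liu,Shinichiro Omachi
Affiliations: 1. National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所); 2. Huawei Technologies Co., Ltd. (华为技术有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript is under review, and copyright will be transferred without notice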
Abstract:Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at this https URL.
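The abstract does not define the Phase Consistency Loss. As a rough illustration of the general idea of comparing Fourier phase between a reconstruction and its target, a minimal sketch might look like the following. The 1D signals, the naive DFT, and the mean absolute wrapped phase difference are all assumptions made for illustration, not the authors' formulation.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued 1D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def phase_difference_loss(prediction, target):
    """Mean absolute phase difference between two spectra (hypothetical loss)."""
    spec_p, spec_t = dft(prediction), dft(target)
    # phase(P * conj(T)) equals phase(P) - phase(T), wrapped to [-pi, pi].
    diffs = [abs(cmath.phase(p * t.conjugate())) for p, t in zip(spec_p, spec_t)]
    return sum(diffs) / len(diffs)
```

Identical signals give a loss of zero, while a shifted copy of the same signal gives a positive loss even though its magnitude spectrum is unchanged. Phase is ill-defined in bins with near-zero magnitude, so a practical loss would likely need some form of weighting or masking.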
[CV-62] UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis ICCV2025
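[Quick Read]: This paper addresses quality problems that are common in real-world point cloud data, namely noise and incompleteness, which degrade the performance of pre-trained point cloud analysis models on downstream tasks. Existing methods improve point cloud quality with separately designed denoising and completion models. Because enhancement is isolated from the downstream task and the denoising and completion objectives conflict, these methods transfer poorly across real-world domains and can damage critical geometric features. The key to the solution is a unified point-level prompting mechanism that reformulates denoising and completion as prompt generation. A Rectification Prompter first predicts rectification vector prompts that filter noise while preserving fine geometric features. A Completion Prompter then generates auxiliary point prompts from the rectified point cloud to improve robustness. Finally, a Shape-Aware Unit efficiently unifies the geometric features, enabling parameter-efficient point cloud analysis.
Link: https://arxiv.org/abs/2507.18997
Authors: Zixiang Ai,Zhenyu Cui,Yuxin Peng,Jiahuan Zhou
Affiliations: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 as a Poster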
Abstract:Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), which is a common issue in real scenarios due to casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between the point cloud enhancement and downstream tasks, these methods fail to work in various real-world domains. In addition, the conflicting objectives between denoising and completing tasks further limit the ensemble paradigm to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter to adapt to noisy points through the predicted rectification vector prompts, effectively filtering noise while preserving intricate geometric features essential for accurate analysis. Subsequently, we further incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating their robustness and adaptability. Finally, a Shape-Aware Unit module is exploited to efficiently unify and capture the filtered geometric features for the downstream point cloud analysis. Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at this https URL.
[CV-63] AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction
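[Quick Read]: This paper addresses the accuracy and computational cost of attribution for AI-generated images. When applied to state-of-the-art (SOTA) latent diffusion models, conventional reconstruction-based methods often suffer from reduced accuracy and high computational overhead. The key to the solution is AEDR (AutoEncoder Double-Reconstruction), a training-free method that performs two consecutive reconstructions with the model's own continuous autoencoder and uses the ratio of the two reconstruction losses as the attribution signal. The signal is further calibrated with an image homogeneity metric to offset biases caused by image complexity, while autoencoder-based reconstruction keeps the computational cost low. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods while requiring only 1% of the computational time.
Link: https://arxiv.org/abs/2507.18988
Authors: Chao Wang,Kejiang Chen,Zijin Yang,Yaofei Wang,Weiming Zhang
Affiliations: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: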
Abstract:The rapid advancement of image-generation technologies has made it possible for anyone to create photorealistic images using generative models, raising significant security concerns. To mitigate malicious use, tracing the origin of such images is essential. Reconstruction-based attribution methods offer a promising solution, but they often suffer from reduced accuracy and high computational costs when applied to state-of-the-art (SOTA) models. To address these challenges, we propose AEDR (AutoEncoder Double-Reconstruction), a novel training-free attribution method designed for generative models with continuous autoencoders. Unlike existing reconstruction-based approaches that rely on the value of a single reconstruction loss, AEDR performs two consecutive reconstructions using the model’s autoencoder, and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy, which inherently cancels out absolute biases caused by image complexity, with autoencoder-based reconstruction ensuring superior computational efficiency. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.
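The double-reconstruction signal described above can be illustrated with a toy sketch. The `toy_autoencoder`, the mean-squared-error loss, the direction of the ratio (first loss over second), and the sample pixel values are assumptions for illustration; the abstract does not specify them, and the homogeneity calibration step is omitted.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_autoencoder(pixels):
    """Hypothetical stand-in for a model's autoencoder: clip to [0, 1], then compress slightly."""
    return [0.9 * min(max(p, 0.0), 1.0) + 0.05 for p in pixels]


def double_reconstruction_ratio(pixels, autoencoder, eps=1e-12):
    """Ratio of the first reconstruction loss to the second (direction is an assumption)."""
    first = autoencoder(pixels)
    second = autoencoder(first)
    return mse(pixels, first) / (mse(first, second) + eps)


# An input inside the toy autoencoder's range versus one far outside it.
in_range = double_reconstruction_ratio([0.2, 0.8, 0.4], toy_autoencoder)
out_of_range = double_reconstruction_ratio([1.5, -0.5, 0.4], toy_autoencoder)
```

In this toy setting the in-range input loses about as much in each pass, so its ratio stays close to 1, while the out-of-range input loses far more in the first pass than in the second. Using a ratio rather than a single loss value makes the signal relative to each image rather than dependent on its absolute reconstruction error.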
[CV-64] Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN
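[Quick Read]: This paper addresses the low accuracy of underwater waste recognition, with the goal of making environmental monitoring and pollution control more effective. It systematically evaluates five object detectors (YOLOv7, YOLOv8, YOLOv9, YOLOv10, and Faster R-CNN) in challenging underwater scenes. After training and testing on a large dataset with fifteen classes, YOLOv8 performed best under conditions such as low visibility and variable depth, reaching a mean Average Precision (mAP) of 80.9%. The authors attribute this result to architectural features of YOLOv8 that they describe as improved anchor-free mechanisms and self-supervised learning.
Link: https://arxiv.org/abs/2507.18967
Authors: UMMPK Nawarathne,HMNS Kumari,HMLS Kumari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)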
Abstract:Underwater pollution is one of today’s most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely YOLO (You Only Look Once) models, including YOLOv7, YOLOv8, YOLOv9, YOLOv10, and Faster Region-Convolutional Neural Network (R-CNN), to identify which model was most effective at recognizing materials in underwater situations. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. From the above-mentioned models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%, indicating a significant performance. This increased performance is attributed to YOLOv8’s architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model’s potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.
[CV-65] YOLO for Knowledge Extraction from Vehicle Images: A Baseline Study
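[Quick Read]: This paper addresses accurate identification of vehicle attributes (make, colour, and shape) in challenging real-world conditions, which is important for law enforcement and intelligence applications. It applies a multi-view inference (MVI) strategy and compares three deep learning approaches, YOLO-v11, YOLO-World, and YOLO-Classification, on a large real-world vehicle image dataset. The study finds that MVI markedly improves prediction accuracy and that the object detection models (YOLO-v11 and YOLO-World) outperform classification-only models for make and shape extraction. Smaller YOLO variants perform comparably to larger ones while offering substantial efficiency benefits for real-time prediction, providing a practical baseline for deployment.
Link: https://arxiv.org/abs/2507.18966
Authors: Saraa Al-Saddik,Manna Elizabeth Philip,Ali Haidar
Affiliations: UNSW, Sydney, Australia; NSW Police Force, NSW Police Force Headquarters, Parramatta, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: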
Abstract:Accurate identification of vehicle attributes such as make, colour, and shape is critical for law enforcement and intelligence applications. This study evaluates the effectiveness of three state-of-the-art deep learning approaches, YOLO-v11, YOLO-World, and YOLO-Classification, on a real-world vehicle image dataset. This dataset was collected under challenging and unconstrained conditions by NSW Police Highway Patrol Vehicles. A multi-view inference (MVI) approach was deployed to enhance the performance of the models’ predictions. To conduct the analyses, datasets with more than 100,000 images were created for each of the three metadata prediction tasks, specifically make, shape and colour. The models were tested on a separate dataset with 29,937 images belonging to 1809 number plates. Different sets of experiments were conducted by varying the model sizes. Classification accuracies of 93.70%, 82.86%, 85.19%, and 94.86% were achieved with the best performing make, shape, colour, and colour-binary models respectively. It was concluded that MVI is needed to obtain usable models on such complex real-world datasets. Our findings indicated that the object detection models YOLO-v11 and YOLO-World outperformed classification-only models in make and shape extraction. Moreover, smaller YOLO variants perform comparably to larger counterparts, offering substantial efficiency benefits for real-time predictions. This work provides a robust baseline for extracting vehicle metadata in real-world scenarios. Such models can be used in filtering and sorting user queries, minimising the time required to search large vehicle image datasets.
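The abstract does not say how multi-view inference combines the predictions from several images of the same vehicle. One plausible reading is a confidence-weighted vote per number plate, sketched below. The function name, the `(plate_id, label, confidence)` tuple layout, and the voting rule are assumptions for illustration.

```python
from collections import defaultdict


def multi_view_predict(predictions):
    """Combine per-image predictions into one label per number plate.

    predictions: iterable of (plate_id, label, confidence) tuples, one per image.
    Returns a dict mapping plate_id to the label with the highest summed confidence.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for plate_id, label, confidence in predictions:
        scores[plate_id][label] += confidence
    return {plate: max(votes, key=votes.get) for plate, votes in scores.items()}


views = [
    ("ABC123", "Toyota", 0.9),
    ("ABC123", "Honda", 0.6),
    ("ABC123", "Honda", 0.5),
    ("XYZ789", "Ford", 0.7),
]
result = multi_view_predict(views)
```

Here two moderately confident views outvote a single confident one, which shows how aggregating over views can correct an error made on one image.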
[CV-66] PerioDet: Large-Scale Panoramic Radiograph Benchmark for Clinical-Oriented Apical Periodontitis Detection MICCAI2025
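[Quick Read]: This paper addresses the lack of a large-scale, high-quality annotated dataset, which has constrained Computer-Aided Diagnosis (CAD) systems for apical periodontitis. The authors release PerioXrays, a panoramic radiograph benchmark with 3,673 images and 5,662 annotated instances, and propose a clinical-oriented apical periodontitis detection paradigm (PerioDet). PerioDet combines Background-Denoising Attention (BDA) with IoU-Dynamic Calibration (IDC) to handle the background noise and small lesion targets found in panoramic radiographs. Experiments on PerioXrays show improved detection performance, and a human-computer collaborative experiment supports its clinical value as an auxiliary diagnostic tool.
Link: https://arxiv.org/abs/2507.18958
Authors: Xiaocheng Fang,Jieyi Cai,Huanyu Liu,Chengju Zhou,Minhua Lu,Bingzhi Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025 (Early Accept)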
Abstract:Apical periodontitis is a prevalent oral pathology that presents significant public health challenges. Despite advances in automated diagnostic systems across various medical fields, the development of Computer-Aided Diagnosis (CAD) applications for apical periodontitis is still constrained by the lack of a large-scale, high-quality annotated dataset. To address this issue, we release a large-scale panoramic radiograph benchmark called “PerioXrays”, comprising 3,673 images and 5,662 meticulously annotated instances of apical periodontitis. To the best of our knowledge, this is the first benchmark dataset for automated apical periodontitis diagnosis. This paper further proposes a clinical-oriented apical periodontitis detection (PerioDet) paradigm, which jointly incorporates Background-Denoising Attention (BDA) and IoU-Dynamic Calibration (IDC) mechanisms to address the challenges posed by background noise and small targets in automated detection. Extensive experiments on the PerioXrays dataset demonstrate the superiority of PerioDet in advancing automated apical periodontitis detection. Additionally, a well-designed human-computer collaborative experiment underscores the clinical applicability of our method as an auxiliary diagnostic tool for professional dentists.
[CV-67] Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation
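[Quick Read]: This paper targets the drop in segmentation accuracy that Semi-supervised Video Object Segmentation (SVOS) suffers under occlusion, object interactions, and high feature similarity, and proposes a method named OASIS. The core of the solution is a lightweight structure refinement module that fuses rough edge priors extracted by the Canny filter with stored object features to generate an object-level structure map and strengthen boundary feature representations. Evidential learning is introduced for uncertainty estimation to make segmentation more robust in occluded regions. The method keeps inference efficient (48 FPS on DAVIS) while improving segmentation quality, reaching an F value of 91.6 (vs. 89.7) on the DAVIS-17 validation set.
Link: https://arxiv.org/abs/2507.18944
Authors: Guanyi Qin,Ziyue Wang,Daiyun Shen,Haofeng Liu,Hantao Zhou,Junde Wu,Runze Hu,Yueming Jin
Affiliations: National University of Singapore (新加坡国立大学); Tsinghua University (清华大学); University of Oxford (牛津大学); Beijing Institute of Technology (北京理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: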
Abstract:Given an object mask, Semi-supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory-based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real-time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object-level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state-of-the-art methods, i.e., achieving the F values of 91.6 (vs. 89.7 on DAVIS-17 validation set) and G values of 86.6 (vs. 86.2 on YouTubeVOS 2019 validation set) while maintaining a competitive speed of 48 FPS on DAVIS.
[CV-68] PDT: Point Distribution Transformation with Diffusion Models
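[Quick Read]: This paper addresses the problem of extracting meaningful structural information from unstructured point cloud distributions and transforming them into semantically meaningful point distributions, which remains under-explored in point cloud learning and processing. The key to the solution is PDT (Point Distribution Transformation), a framework based on diffusion models whose architecture and learning strategy correlate the source and target distributions through a denoising process, transforming a point set from its original geometric distribution into a semantically meaningful one. Experiments show that the method transforms input point clouds into several kinds of structured output, including surface-aligned keypoints, inner sparse joints, and continuous feature lines. It thus captures both geometric and semantic features and is useful for 3D geometry processing tasks that require structured point distributions.
Link: https://arxiv.org/abs/2507.18939
Authors: Jionghao Wang,Cheng Lin,Yuan Liu,Rui Xu,Zhiyang Dou,Xiao-Xiao Long,Hao-Xiang Guo,Taku Komura,Wenping Wang,Xin Li
Affiliations: Texas A&M University (德州农工大学); The University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (香港科技大学); Nanjing University (南京大学); Skywork AI, Kunlun Inc. (昆仑万维AI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL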
Abstract:Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs - ranging from surface-aligned keypoints, and inner sparse joints to continuous feature lines. The results showcase our framework’s ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at this link: this https URL.
[CV-69] MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition
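[Quick Read]: This paper addresses the difficulty of sticker emotion understanding, which depends on multi-view information such as background knowledge and stylistic cues. The key to the solution is a Multi-Granularity Hierarchical Fusion Transformer (MGHFT). First, Multimodal Large Language Models (MLLMs) generate rich textual descriptions of each sticker from multiple views. Second, a hierarchical fusion strategy built on a pyramid visual transformer extracts global and local visual features at multiple stages, and contrastive learning and attention mechanisms inject textual features into the visual backbone at those stages. Finally, a text-guided fusion attention mechanism integrates the overall multimodal features. On two public sticker emotion datasets, MGHFT outperforms existing sticker emotion recognition approaches, and it improves on the best pre-trained visual models by 5.4% in F1 and 4.0% in accuracy.
Link: https://arxiv.org/abs/2507.18929
Authors: Jian Chen,Yuxuan Hu,Haifeng Lu,Wei Wang,Min Yang,Chengming Li,Xiping Hu
Affiliations: Shenzhen MSU-BIT University (深圳北理莫斯科大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACMMM2025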
Abstract:Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on 2 public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at this https URL.
[CV-70] WiSE-OD: Benchmarking Robustness in Infrared Object Detection
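[Quick Read]: This paper addresses object detection (OD) in infrared (IR) imagery, where the scarcity of large-scale annotated data forces models to rely on weights pre-trained on RGB images; fine-tuning on IR improves accuracy but often reduces robustness under distribution shift. The paper introduces two cross-modality out-of-distribution (OOD) benchmarks, LLVIP-C and FLIR-C, to evaluate robustness systematically. It also proposes WiSE-OD, a weight-space ensembling method that combines RGB zero-shot (ZS) weights with IR fine-tuned or linear probing (LP) weights, improving both cross-modality performance and corruption robustness without additional training or inference cost.
Link: https://arxiv.org/abs/2507.18925
Authors: Heitor R. Medeiros,Atif Belal,Masih Aminbeidokhti,Eric Granger,Marco Pedersoli
Affiliations: LIVIA, Dept. of Systems Engineering, ETS Montreal, Canada; International Laboratory on Learning Systems (ILLS)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, conference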
Abstract:Object detection (OD) in infrared (IR) imagery is critical for low-light and nighttime applications. However, the scarcity of large-scale IR datasets forces models to rely on weights pre-trained on RGB images. While fine-tuning on IR improves accuracy, it often compromises robustness under distribution shifts due to the inherent modality gap between RGB and IR. To address this, we introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD) benchmarks built by applying corruption to standard IR datasets. Additionally, to fully leverage the complementary knowledge from RGB and infrared trained models, we propose WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD_ZS, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD_LP, which blends zero-shot and linear probing. Evaluated across three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both cross-modality and corruption robustness without any additional training or inference cost.
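Weight-space ensembling is commonly implemented as a linear interpolation between two sets of weights that share the same architecture. The sketch below shows that general idea on plain Python lists. The function name, the mixing coefficient `alpha`, and the dictionary layout are assumptions for illustration; the abstract does not state how WiSE-OD chooses or applies the mixing weights.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    """Linearly interpolate two weight dictionaries with identical keys and shapes.

    alpha = 0 returns the zero-shot weights, alpha = 1 the fine-tuned weights.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [(1 - alpha) * z + alpha * f for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


rgb_zero_shot = {"layer.weight": [0.0, 2.0]}
ir_fine_tuned = {"layer.weight": [1.0, 4.0]}
merged = interpolate_weights(rgb_zero_shot, ir_fine_tuned, alpha=0.5)
```

Because the result is a single set of weights, the merged model is no more expensive to run than either original, which is consistent with the claim of no additional inference cost.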
[CV-71] Gaussian Set Surface Reconstruction through Per-Gaussian Optimization
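[Quick Read]: This paper addresses the unevenly distributed Gaussians in 3D Gaussian Splatting (3DGS), which deviate from the latent surface and prevent accurate reconstruction of scene geometry. Existing variants such as PGSR add losses for depth and normal maps but still neglect the placement of individual Gaussians. The key to the solution is Gaussian Set Surface Reconstruction (GSSR), which enforces local and global geometric alignment through pixel-level and Gaussian-level single-view normal consistency combined with multi-view photometric consistency. An opacity regularization loss removes redundant Gaussians, and periodic depth- and normal-guided Gaussian reinitialization keeps the Gaussians evenly distributed along the latent surface with their dominant normals aligned to the surface normal. This improves geometric precision and supports intuitive scene editing.
Link: https://arxiv.org/abs/2507.18923
Authors: Zhentao Huang,Di Wu,Zhenbang He,Minglun Gong
Affiliations: University of Guelph (圭尔夫大学); University of Macau (澳门大学); University of British Columbia Okanagan (不列颠哥伦比亚大学奥卡纳根分校)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: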
Abstract:3D Gaussian Splatting (3DGS) effectively synthesizes novel views through its flexible representation, yet fails to accurately reconstruct scene geometry. While modern variants like PGSR introduce additional losses to ensure proper depth and normal maps through Gaussian fusion, they still neglect individual placement optimization. This results in unevenly distributed Gaussians that deviate from the latent surface, complicating both reconstruction refinement and scene editing. Motivated by pioneering work on Point Set Surfaces, we propose Gaussian Set Surface Reconstruction (GSSR), a method designed to distribute Gaussians evenly along the latent surface while aligning their dominant normals with the surface normal. GSSR enforces fine-grained geometric alignment through a combination of pixel-level and Gaussian-level single-view normal consistency and multi-view photometric consistency, optimizing both local and global perspectives. To further refine the representation, we introduce an opacity regularization loss to eliminate redundant Gaussians and apply periodic depth- and normal-guided Gaussian reinitialization for a cleaner, more uniform spatial distribution. Our reconstruction results demonstrate significantly improved geometric precision in Gaussian placement, enabling intuitive scene editing and efficient generation of novel Gaussian-based 3D environments. Extensive experiments validate GSSR’s effectiveness, showing enhanced geometric accuracy while preserving high-quality rendering performance.
[CV-72] HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback
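[Quick Read]: This paper addresses common problems in Video Object Segmentation (VOS): imprecise mask boundaries, deformable objects, tracking drift caused by topological changes, memory redundancy and inefficiency on long video sequences, and prompt ambiguity in SAM (Segment Anything Model). The proposed HQ-SMem method has three key components: (i) SAM with High-Quality masks (SAM-HQ) combined with appearance-based candidate selection to refine coarse masks and improve object boundaries; (ii) a dynamic smart memory mechanism that selectively stores relevant key frames and discards redundant ones, optimizing memory usage and processing efficiency for long videos; and (iii) dynamic updating of the appearance model to handle complex topological variations and reduce drift. According to the abstract, HQ-SMem consistently ranks among the top two on the VOTS and VOTSt 2024 datasets and sets new benchmarks on the Long Video Dataset and LVOS.
Link: https://arxiv.org/abs/2507.18921
Authors: Elham Soltani Kazemi,Imad Eddine Toubal,Gani Rahmon,Jaired Collins,K. Palaniappan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: submit/6651762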
Abstract:Video Object Segmentation (VOS) is foundational to numerous computer vision applications, including surveillance, autonomous driving, robotics and generative video editing. However, existing VOS models often struggle with precise mask delineation, deformable objects, topologically transforming objects, tracking drift and long video sequences. In this paper, we introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory, a novel method that enhances the performance of VOS base models by addressing these limitations. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate-selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones, thereby optimizing memory usage and processing efficiency for long-term videos; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video. These contributions mitigate several limitations of existing VOS models including, coarse segmentations that mix-in background pixels, fixed memory update schedules, brittleness to drift and occlusions, and prompt ambiguity issues associated with SAM. Extensive experiments conducted on multiple public datasets and state-of-the-art base trackers demonstrate that our method consistently ranks among the top two on VOTS and VOTSt 2024 datasets. Moreover, HQ-SMem sets new benchmarks on Long Video Dataset and LVOS, showcasing its effectiveness in challenging scenarios characterized by complex multi-object dynamics over extended temporal durations.
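The abstract describes the smart memory only as storing relevant key frames and discarding redundant ones. One simple way to picture such a rule is to skip any frame whose feature vector is too similar to a frame already in memory. The cosine-similarity criterion, the threshold value, and the function names below are assumptions for illustration, not the selection rule used by HQ-SMem.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(frame_features, threshold=0.95):
    """Return indices of frames kept in memory.

    A frame is stored only if it is less similar than `threshold`
    to every frame already stored (hypothetical redundancy test).
    """
    memory = []  # list of (frame index, feature vector)
    for index, feature in enumerate(frame_features):
        if all(cosine_similarity(feature, kept) < threshold for _, kept in memory):
            memory.append((index, feature))
    return [index for index, _ in memory]


kept = select_key_frames([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
```

The second frame is nearly identical to the first and is dropped, so memory grows with the amount of new appearance information rather than with video length.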
[CV-73] Synthetic-to-Real Camouflaged Object Detection
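[Quick Read]: This paper addresses the scarcity of datasets for Camouflaged Object Detection (COD), which results from the high cost of collecting and labeling real data and is most severe for certain specialized categories. The authors introduce a new task, Syn-to-Real Camouflaged Object Detection (S2R-COD), and propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), which is based on the student-teacher model. CSRDA propagates class information from the labeled source domain (synthetic images) to the unlabeled target domain (real images) through pseudo labeling combined with consistency regularization. It also uses a recurrent learning framework to build an evolving real domain that narrows the intra-domain gap, which improves pseudo-label quality and bridges the source and target domains, improving performance in real-world scenarios.
Link: https://arxiv.org/abs/2507.18911
Authors: Zhihao Luo,Luojun Lin,Zheng Lin
Affiliations: Fuzhou University (福州大学); Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: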
Abstract:Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, training directly on synthetic datasets instead of real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specifically, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and manual annotations in COD. Our code is publicly available at: this https URL
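Student-teacher pseudo labeling is often built from two small pieces: turning confident teacher outputs into labels for unlabeled images, and slowly moving the teacher toward the student. The sketch below shows both in their simplest form. The confidence threshold, the exponential-moving-average update, and the function names are assumptions for illustration; the abstract does not describe how CSRDA implements these steps.

```python
def pseudo_labels(teacher_probs, threshold=0.9):
    """Binary pseudo labels from teacher foreground probabilities.

    Confident pixels become 1 (object) or 0 (background);
    uncertain pixels become None and would be ignored in the loss.
    """
    labels = []
    for p in teacher_probs:
        if p >= threshold:
            labels.append(1)
        elif p <= 1.0 - threshold:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move teacher weights toward student weights (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher_weights, student_weights)]


labels = pseudo_labels([0.95, 0.5, 0.02])
new_teacher = ema_update([0.0], [1.0], momentum=0.5)
```

Discarding uncertain pixels is one common way to limit the noise that pseudo labels introduce when the teacher has only seen synthetic data.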
[CV-74] Dealing with Segmentation Errors in Needle Reconstruction for MRI-Guided Brachytherapy
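Quick read: This paper addresses a challenge in image-guided brachytherapy planning: manually annotating implanted needles is time-consuming and error-prone, and automatic needle reconstruction loses accuracy because deep learning segmentation outputs contain errors. The key to the solution is adapting existing post-processing techniques to make them more robust to various segmentation errors, thereby improving localization of the needle tip, needle bottom, and shaft. On a prostate cancer MRI dataset, the best adapted post-processing technique achieves median needle-tip and needle-bottom localization errors of 1.07 mm and 0.43 mm, respectively, and a median shaft error of 0.75 mm, with no false positive or false negative needles.
Link: https://arxiv.org/abs/2507.18895
Authors: Vangelis Kostoulas,Arthur Guijt,Ellen M. Kerkhof,Bradley R. Pieters,Peter A.N. Bosman,Tanja Alderliesten
Affiliations: Leiden University Medical Center (莱顿大学医学中心); Centrum Wiskunde & Informatica (荷兰数学与计算机科学研究中心); Amsterdam University Medical Centers, University of Amsterdam (阿姆斯特丹大学医学中心,阿姆斯特丹大学); Cancer Center Amsterdam (癌症中心阿姆斯特丹)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in: Proc. SPIE Medical Imaging 2025, Vol. 13408, 1340826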
Abstract:Brachytherapy involves bringing a radioactive source near tumor tissue using implanted needles. Image-guided brachytherapy planning requires, among other things, the reconstruction of the needles. Manually annotating these needles on patient images can be a challenging and time-consuming task for medical professionals. For automatic needle reconstruction, a two-stage pipeline is commonly adopted, comprising a segmentation stage followed by a post-processing stage. While deep learning models are effective for segmentation, their results often contain errors. No currently existing post-processing technique is robust to all possible segmentation errors. We therefore propose adaptations to existing post-processing techniques mainly aimed at dealing with segmentation errors and thereby improving the reconstruction accuracy. Experiments on a prostate cancer dataset, based on MRI scans annotated by medical professionals, demonstrate that our proposed adaptations can help to effectively manage segmentation errors, with the best adapted post-processing technique achieving median needle-tip and needle-bottom point localization errors of 1.07 (IQR ±1.04) mm and 0.43 (IQR ±0.46) mm, respectively, and median shaft error of 0.75 (IQR ±0.69) mm with 0 false positive and 0 false negative needles on a test set of 261 needles.
[CV-75] Perspective from a Higher Dimension: Can 3D Geometric Priors Help Visual Floorplan Localization? ACM-MM2025
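Quick read: This paper addresses self-localization within building floorplans. The core challenge lies in the modal differences and geometric discrepancies between visual perception and floorplans, particularly when frequent visual changes and view occlusions sharply increase the localization error of conventional methods. The key to the solution is injecting 3D geometric priors to make the 2D Floorplan Localization (FLoc) algorithm more robust. First, geometrically aware view invariance is modeled with multi-view constraints, using imaging geometry to provide matching constraints across images. Second, view-scene aligned geometric priors are built by associating the scene's surface reconstruction with RGB frames, strengthening cross-modal geometry-color correspondences. Both 3D priors are modeled through self-supervised contrastive learning and need no additional annotations, so they bridge the modal gap and significantly improve localization accuracy without increasing the computational burden.
Link: https://arxiv.org/abs/2507.18881
Authors: Bolei Chen,Jiaxu Kang,Haonan Yang,Ping Zhong,Jianxin Wang
Affiliations: Central South University (中南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ACM MM 2025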
Abstract:Since a building’s floorplans are easily accessible, consistent over time, and inherently robust to changes in visual appearance, self-localization within the floorplan has attracted researchers’ interest. However, since floorplans are minimalist representations of a building’s structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. While existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance, they fail to address the localization errors caused by frequent visual changes and view occlusions due to variously shaped 3D objects. To tackle these issues, this paper views the 2D Floorplan Localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm. For the 3D geometric prior modeling, we first model geometrically aware view invariance using multi-view constraints, i.e., leveraging imaging geometric principles to provide matching constraints between multiple images that see the same points. Then, we further model the view-scene aligned geometric priors, enhancing the cross-modal geometry-color correspondences by associating the scene’s surface reconstruction with the RGB frames of the sequence. Both 3D priors are modeled through self-supervised contrastive learning, thus no additional geometric or semantic annotations are required. These 3D priors summarized in extensive realistic scenes bridge the modal gap while improving localization success without increasing the computational burden on the FLoc algorithm. Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy. All data and code will be released after the anonymous review.
[CV-76] Transferable and Undefendable Point Cloud Attacks via Medial Axis Transform
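Quick read: This paper addresses the limited transferability of point cloud adversarial attacks and their weakness against defenses: existing attacks perform well in white-box settings but degrade markedly against unseen models or common defense mechanisms. The key to the solution is MAT-Adv, a new adversarial attack framework based on the medial axis transform (MAT) representation. An autoencoder maps the input point cloud into a compact MAT representation that captures its intrinsic geometric structure. This representation is then explicitly perturbed, introducing inherent adversarial characteristics at the structural level so that the resulting adversarial examples transfer better across models and are more resistant to defense strategies. To avoid overfitting and perturbation collapse, the authors add a dropout strategy to the optimization, further improving the effectiveness and generality of the attack.
Link: https://arxiv.org/abs/2507.18870
Authors: Keke Tang,Yuze Gao,Weilong Peng,Xiaofei Wang,Meie Fang,Peican Zhu
Affiliations: Cyberspace Institute of Advanced Technology, Guangzhou University (广州大学先进科技网络研究院); School of Computer Science and Cyber Engineering, Guangzhou University (广州大学计算机科学与网络工程学院); Department of Automation, University of Science and Technology of China (中国科学技术大学自动化系); School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University (西北工业大学人工智能学院、光电与信息工程研究院(iOPEN))
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: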
Abstract:Studying adversarial attacks on point clouds is essential for evaluating and improving the robustness of 3D deep learning models. However, most existing attack methods are developed under ideal white-box settings and often suffer from limited transferability to unseen models and insufficient robustness against common defense mechanisms. In this paper, we propose MAT-Adv, a novel adversarial attack framework that enhances both transferability and undefendability by explicitly perturbing the medial axis transform (MAT) representations, in order to induce inherent adversarialness in the resulting point clouds. Specifically, we employ an autoencoder to project input point clouds into compact MAT representations that capture the intrinsic geometric structure of point clouds. By perturbing these intrinsic representations, MAT-Adv introduces structural-level adversarial characteristics that remain effective across diverse models and defense strategies. To mitigate overfitting and prevent perturbation collapse, we incorporate a dropout strategy into the optimization of MAT perturbations, further improving transferability and undefendability. Extensive experiments demonstrate that MAT-Adv significantly outperforms existing state-of-the-art methods in both transferability and undefendability. Codes will be made public upon paper acceptance.
[CV-77] Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
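Quick read: This paper addresses the high error rates in Visual Automatic Speech Recognition (V-ASR) caused by the absence of auditory cues and by viseme ambiguity (distinct phonemes that look alike on the lips), as well as the heavy reliance of existing methods on large-scale pre-training data. The key to the solution is a phoneme-based two-stage framework. Stage 1 predicts phonemes by fusing visual features with facial landmark motion features, which reduces training complexity and helps model speaker-specific characteristics. Stage 2 uses NLLB (No Language Left Behind), an encoder-decoder large language model (LLM), to reconstruct words from the predicted phonemes, mitigating the effect of phoneme ambiguity. The method reaches word error rates (WER) of 17.4% on LRS2 and 21.0% on LRS3.
Link: https://arxiv.org/abs/2507.18863
Authors: Matthew Kit Khinn Teng,Haibo Zhang,Takeshi Saitoh
Affiliations: Kyushu Institute of Technology (九州工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures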
Abstract:Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes, that is, distinct sounds that appear identical in lip motion. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.
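For readers unfamiliar with the reported metric, the sketch below shows how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenisation are assumptions for illustration; the paper's own evaluation script may normalise text differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Words are obtained by whitespace splitting (an assumption; real
    evaluation scripts usually also normalise case and punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.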
[CV-78] PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis
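Quick read: This paper addresses the limitations of Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) analysis in handling the complexity and heterogeneity of WSIs, in particular the difficulty of aggregating information from diverse patches into a robust WSI representation. The key to the solution is PTCMIL, a Prompt Token Clustering-based ViT for MIL. Learnable prompt tokens unify clustering and prediction in an end-to-end framework and dynamically align the clustering with the downstream task, while a projection-based clustering strategy tailored to each WSI reduces computational complexity and preserves patch heterogeneity. Combined with token merging and prototype-based pooling, the model efficiently captures task-relevant patterns, improving performance and interpretability on classification and survival analysis tasks.
Link: https://arxiv.org/abs/2507.18848
Authors: Beidi Zhao,SangMook Kim,Hao Chen,Chen Zhou,Zu-hua Gao,Gang Wang,Xiaoxiao Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: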
Abstract:Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at this https URL.
[CV-79] Flow Stochastic Segmentation Networks ICCV2025
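Quick read: This paper addresses the limitations of existing generative segmentation models in modeling pixel-wise covariance, in particular the limited expressiveness of low-rank parameterisations. Previous methods typically assume a low-rank covariance structure or must store the full distributional parameters, which limits how well they can fit complex data distributions. The authors propose the Flow Stochastic Segmentation Network (Flow-SSN), with discrete-time autoregressive and continuous-time flow variants, which can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or explicitly storing the distributional parameters. Because most of the model capacity is allocated to learning the base distribution of the flow (an expressive prior), Flow-SSN is also more efficient to sample from than standard diffusion-based segmentation models, and it achieves stronger performance on medical image segmentation tasks.
Link: https://arxiv.org/abs/2507.18838
Authors: Fabio De Sousa Ribeiro,Omar Todd,Charles Jones,Avinash Kori,Raghav Mehta,Ben Glocker
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICCV 2025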
Abstract:We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: this https URL.
[CV-80] RealDeal: Enhancing Realism and Details in Brain Image Generation via Image-to-Image Diffusion Models
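Quick read: This paper addresses the problem that brain magnetic resonance images (MRI) generated by latent diffusion models (LDMs) are overly smooth and lack fine anatomical structures and realistic scan noise. The key to the solution is formulating realism enhancement and detail addition as an image-to-image diffusion model that refines LDM-generated images, restoring sharp edges, fine textures, subtle anatomical features, and imaging noise. The authors also introduce new metrics that quantify noise distribution, sharpness, and texture realism.
Link: https://arxiv.org/abs/2507.18830
Authors: Shen Zhu,Yinzhu Jin,Tyler Spears,Ifrah Zawar,P. Thomas Fletcher
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures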
Abstract:We propose image-to-image diffusion models that are designed to enhance the realism and details of generated brain images by introducing sharp edges, fine textures, subtle anatomical features, and imaging noise. Generative models have been widely adopted in the biomedical domain, especially in image generation applications. Latent diffusion models achieve state-of-the-art results in generating brain MRIs. However, due to latent compression, generated images from these models are overly smooth, lacking fine anatomical structures and scan acquisition noise that are typically seen in real images. This work formulates the realism enhancing and detail adding process as image-to-image diffusion models, which refines the quality of LDM-generated images. We employ commonly used metrics like FID and LPIPS for image realism assessment. Furthermore, we introduce new metrics to demonstrate the realism of images generated by RealDeal in terms of image noise distribution, sharpness, and texture.
[CV-81] Deepfake Detection Via Facial Feature Extraction and Modeling
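Quick read: This paper addresses the difficulty of distinguishing deepfake videos from genuine media, that is, how to detect AI-generated fake video content effectively. The key to the solution is to abandon direct image processing and instead extract features solely from facial landmarks, detecting fakes by analyzing subtle inconsistencies in facial movements. Experiments show that the approach works across several neural network models (RNN, ANN, and CNN): the RNN and ANN models reach accuracies of 96% and 93%, respectively, well above the CNN model's roughly 78%, while requiring fewer parameters, which suggests potential for real-world use.
Link: https://arxiv.org/abs/2507.18815
Authors: Benjamin Carter,Nathan Dilla,Micheal Callahan,Atuhaire Ambala
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Keywords: deepfake, facial recognition, feature extraction, artificial intelligence, recurrent neural network, convolutional neural network, artificial neural network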
Abstract:The rise of deepfake technology brings forth new questions about the authenticity of various forms of media found online today. Videos and images generated by artificial intelligence (AI) have become increasingly difficult to differentiate from genuine media, resulting in the need for new models to detect artificially-generated media. While many models have attempted to solve this, most focus on direct image processing, adapting a convolutional neural network (CNN) or a recurrent neural network (RNN) that directly interacts with the video image data. This paper introduces an approach of using solely facial landmarks for deepfake detection. Using a dataset consisting of both deepfake and genuine videos of human faces, this paper describes an approach for extracting facial landmarks for deepfake detection, focusing on identifying subtle inconsistencies in facial movements instead of raw image processing. Experimental results demonstrated that this feature extraction technique is effective in various neural network models, with the same facial landmarks tested on three neural network models, with promising performance metrics indicating its potential for real-world applications. The findings discussed in this paper include RNN and artificial neural network (ANN) models with accuracies of 96% and 93%, respectively, with a CNN model hovering around 78%. This research challenges the assumption that raw image processing is necessary to identify deepfake videos by presenting a facial feature extraction approach compatible with various neural network models while requiring fewer parameters.
[CV-82] Perpetua: Multi-Hypothesis Persistence Modeling for Semi-Static Environments IROS2025
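Quick read: This paper addresses the difficulty that conventional environment modeling algorithms have in representing semi-static features and predicting their future state when robots are deployed for long periods in complex, dynamic environments. Existing methods usually handle dynamic changes by filtering observations out or by weighted averaging, and so cannot capture how features evolve. The key to the solution is Perpetua, which, within a Bayesian framework, chains together mixtures of "persistence" and "emergence" filters to model the probability that a feature disappears or reappears. It supports multi-hypothesis tracking and online adaptation, giving efficient and robust estimates of both current and future feature states.
Link: https://arxiv.org/abs/2507.18808
Authors: Miguel Saavedra-Ruiz,Samer B. Nashed,Charlie Gauthier,Liam Paull
Affiliations: 1. University of Waterloo (滑铁卢大学); 2. Canadian Institute for Advanced Research (加拿大高级研究院)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). Code available at this https URL. Webpage and additional videos at this https URL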
Abstract:Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to: incorporate prior knowledge about the dynamics of the feature if it exists, track multiple hypotheses, and adapt over time to enable predicting of future feature states. Specifically, we chain together mixtures of “persistence” and “emergence” filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both in the present as well as at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
[CV-83] Tell Me What You See: An Iterative Deep Learning Framework for Image Captioning
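Quick read: This paper studies how the architecture of image captioning models evolves, that is, how systematic, iterative refinement improves performance. Its key finding is that, within the classic CNN-LSTM architecture, upgrading only the visual backbone without adding an attention mechanism can degrade performance, because the single-vector bottleneck cannot transmit the richer visual detail. This supports the shift from the traditional encoder-decoder structure to attention-based architectures. The final model, Nexus, combines an EfficientNetV2B3 backbone with a dynamic attention mechanism and achieves a BLEU-4 score of 31.4 on the MS COCO 2017 dataset, offering a replicable architectural blueprint for vision-language tasks.
Link: https://arxiv.org/abs/2507.18788
Authors: Hitesh Kumar Gupta
Affiliations: Independent Researcher (独立研究员)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 12 total figures (including a 7-figure appendix), 4 tables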
Abstract:Image captioning, a task at the confluence of computer vision and natural language processing, requires a sophisticated understanding of both visual scenes and linguistic structure. While modern approaches are dominated by large-scale Transformer architectures, this paper documents a systematic, iterative development of foundational image captioning models, progressing from a simple CNN-LSTM encoder-decoder to a competitive attention-based system. We present a series of five models, beginning with Genesis and concluding with Nexus, an advanced model featuring an EfficientNetV2B3 backbone and a dynamic attention mechanism. Our experiments chart the impact of architectural enhancements and demonstrate a key finding within the classic CNN-LSTM paradigm: merely upgrading the visual backbone without a corresponding attention mechanism can degrade performance, as the single-vector bottleneck cannot transmit the richer visual detail. This insight validates the architectural shift to attention. Trained on the MS COCO 2017 dataset, our final model, Nexus, achieves a BLEU-4 score of 31.4, surpassing several foundational benchmarks and validating our iterative design process. This work provides a clear, replicable blueprint for understanding the core architectural principles that underpin modern vision-language tasks.
[CV-84] Diffusion-FS: Multimodal Free-Space Prediction via Diffusion for Autonomous Driving IROS2025
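Quick read: This paper addresses corridor prediction for drivable free-space in autonomous driving: identifying the safe, navigable driving corridors within the road region rather than simply segmenting all non-obstacle areas. Existing methods mostly rely on a bird's-eye-view (BEV) representation, which is hard to obtain. This paper instead treats the problem as pure image perception using only monocular camera input. There are two key contributions. First, a self-supervised strategy generates free-space corridor samples from future ego trajectories and front-view images, coupling visual prediction to the vehicle's own motion. Second, the ContourDiff architecture uses a diffusion model over contour points instead of a binary mask representation, giving structured and interpretable free-space predictions. Evaluation on nuScenes and CARLA demonstrates the method's effectiveness in predicting multimodal safe corridors.
Link: https://arxiv.org/abs/2507.18763
Authors: Keshav Gupta,Tejas S. Stanley,Pranjal Paul,Arun K. Singh,K. Madhava Krishna
Affiliations: Robotics Research Center, IIIT-Hyderabad, India; The University of Tartu, Estonia
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 7 figures, IROS 2025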
Abstract:Drivable free-space prediction is a fundamental and crucial problem in autonomous driving. Recent works have addressed the problem by representing the entire non-obstacle road regions as the free-space. In contrast, our aim is to estimate the driving corridors that are a navigable subset of the entire road region. Unfortunately, existing corridor estimation methods directly assume a BEV-centric representation, which is hard to obtain. In contrast, we frame drivable free-space corridor prediction as a pure image perception task, using only monocular camera input. However, such a formulation poses several challenges, as no corresponding data exists for such free-space corridor segments in the image. Consequently, we develop a novel self-supervised approach for free-space sample generation by leveraging future ego trajectories and front-view camera images, making the process of visual corridor estimation dependent on the ego trajectory. We then employ a diffusion process to model the distribution of such segments in the image. However, the existing binary mask-based representation for a segment poses many limitations. Therefore, we introduce ContourDiff, a specialized diffusion-based architecture that denoises over contour points rather than relying on binary mask representations, enabling structured and interpretable free-space predictions. We evaluate our approach qualitatively and quantitatively on both nuScenes and CARLA, demonstrating its effectiveness in accurately predicting safe multimodal navigable corridors in the image.
[CV-85] Learning Efficient and Generalizable Human Representation with Human Gaussian Model
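Quick read: This paper addresses the modeling of animatable human avatars from videos, where the core challenge is capturing the relations between 3D Gaussians at different timestamps. Conventional methods rely on per-instance optimization, while recent feed-forward methods can generate Gaussians but predict each frame independently and ignore cross-frame information. The authors propose the Human Gaussian Graph, a dual-layer graph in which Gaussians are the first-layer nodes and SMPL human mesh vertices are the second-layer nodes. The key innovations are an intra-node operation that aggregates the Gaussians connected to the same mesh vertex and an inter-node operation that passes messages between mesh vertices, so that information from the whole sequence is used to recover an animatable human representation.
Link: https://arxiv.org/abs/2507.18758
Authors: Yifan Liu,Shengjun Zhang,Chensheng Dai,Yang Chen,Hao Liu,Chen Li,Yueqi Duan
Affiliations: Tsinghua University (清华大学); WeChat Vision, Tencent Inc. (微信视觉,腾讯公司); Nanyang Technological University (南洋理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: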
Abstract:Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict Gaussians for each frame independently, without fully capturing the relations of Gaussians from different timestamps. To address this, we propose Human Gaussian Graph to model the connection between predicted Gaussians and human SMPL mesh, so that we can leverage information from all frames to recover an animatable human representation. Specifically, the Human Gaussian Graph contains dual layers where Gaussians are the first layer nodes and mesh vertices serve as the second layer nodes. Based on this structure, we further propose the intra-node operation to aggregate various Gaussians connected to one mesh vertex, and inter-node operation to support message passing among mesh node neighbors. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.
[CV-86] SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning
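Quick read: This paper addresses a bottleneck in the semantic understanding of Synthetic Aperture Radar (SAR) imagery: the lack of large-scale, high-quality SAR image-text datasets, which limits the effectiveness of Vision Language Models (VLMs) in remote sensing. The key to the solution is SAR-Text, a large-scale, high-quality dataset of over 130,000 SAR image-text pairs, together with the SAR-Narrator framework, which generates textual descriptions of SAR images through a multi-stage progressive transfer learning strategy, providing a data foundation and a captioning tool for downstream tasks.
Link: https://arxiv.org/abs/2507.18743
Authors: Xinjun Cheng,Yiguo He,Junjie Zhu,Chunping Qiu,Jun Wang,Qiangjuan Huang,Ke Yang
Affiliations: Intelligent Game and Decision Lab, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Submission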
Abstract:Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-Text, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-Text dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage progressive transfer learning strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 16.43% and 10.54% on the OSdataset-512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets.
[CV-87] KuiSCIMA v2.0: Improved Baselines, Calibration, and Cross-Notation Generalization for Historical Chinese Music Notations in Jiang Kui’s Baishidaoren Gequ ICDAR2025
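Quick read: This paper addresses the challenges that extreme class imbalance and scarce training data pose for Optical Music Recognition (OMR) of historical Chinese musical notations such as suzipu and lülüpu. The key to the solution is building and evaluating a character recognition model for scarce, highly imbalanced data. By optimizing the network architecture and training strategy, it reduces the Character Error Rate (CER) for suzipu from 10.4% to 7.1% and achieves a CER of 0.9% for lülüpu, outperforming human transcribers (average CER 15.9%, best case 7.6%). Temperature scaling yields a well-calibrated model (Expected Calibration Error below 0.0162), and leave-one-edition-out cross-validation ensures robustness across five historical editions, advancing the digitization of historical Chinese music and cultural diversity in OMR.
Link: https://arxiv.org/abs/2507.18741
Authors: Tristan Repolusk,Eduardo Veas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: International Conference on Document Analysis and Recognition. This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in “19th International Conference on Document Analysis and Recognition (ICDAR 2025), Wuhan, China, September 16-21, 2025, Proceedings”, and is available online at the External DOI field below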
Abstract:Optical Music Recognition (OMR) for historical Chinese musical notations, such as suzipu and lülüpu, presents unique challenges due to high class imbalance and limited training data. This paper introduces significant advancements in OMR for Jiang Kui’s influential collection Baishidaoren Gequ from 1202. In this work, we develop and evaluate a character recognition model for scarce imbalanced data. We improve upon previous baselines by reducing the Character Error Rate (CER) from 10.4% to 7.1% for suzipu, despite working with 77 highly imbalanced classes, and achieve a remarkable CER of 0.9% for lülüpu. Our models outperform human transcribers, with an average human CER of 15.9% and a best-case CER of 7.6%. We employ temperature scaling to achieve a well-calibrated model with an Expected Calibration Error (ECE) below 0.0162. Using a leave-one-edition-out cross-validation approach, we ensure robust performance across five historical editions. Additionally, we extend the KuiSCIMA dataset to include all 109 pieces from Baishidaoren Gequ, encompassing suzipu, lülüpu, and jianzipu notations. Our findings advance the digitization and accessibility of historical Chinese music, promoting cultural diversity in OMR and expanding its applicability to underrepresented music traditions.
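The calibration step mentioned above can be illustrated with a small sketch of temperature scaling and Expected Calibration Error (ECE). This is not the authors' code: the ten equal-width confidence bins, the function names, and the toy inputs are assumptions. In practice the temperature is a single scalar fitted on held-out data, which this sketch does not show.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a scalar temperature.

    temperature > 1 softens the distribution (lower confidence);
    temperature < 1 sharpens it.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: booleans, True where the prediction was right.
    Assumption: equal-width bins over [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    print(softmax_with_temperature([2.0, 0.0], temperature=1.0))
    print(softmax_with_temperature([2.0, 0.0], temperature=2.0))
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Dividing the logits by a positive temperature never changes the predicted class, so it adjusts confidence without affecting CER.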
[CV-88] Learned Single-Pixel Fluorescence Microscopy
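Quick read: This paper addresses the challenges single-pixel imaging faces in fluorescence microscopy: slow reconstruction, limited image quality, and difficulty achieving multispectral imaging. Conventional methods reconstruct the image from noisy measurements by total variation minimization, which limits both efficiency and performance. The key to the solution is training an autoencoder through self-supervision, learning an encoder (the measurement matrix) and a decoder so that the measurement process and the reconstruction algorithm are optimized jointly. The learned measurement matrix becomes part of the physical device, which cuts reconstruction time by two orders of magnitude, improves image quality, and enables multispectral reconstruction, offering a low-cost, high-performance alternative for biomedical imaging.
Link: https://arxiv.org/abs/2507.18740
Authors: Serban C. Tudosie,Valerio Gandolfi,Shivaprasad Varakkoth,Andrea Farina,Cosimo D’Andrea,Simon Arridge
Affiliations: 1. University College London (伦敦大学学院); 2. Politecnico di Milano (米兰理工大学); 3. Imperial College London (帝国理工学院); 4. King’s College London (国王学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Optics (physics.optics)
Comments: 10 pages, 6 figures, 1 table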
Abstract:Single-pixel imaging has emerged as a key technique in fluorescence microscopy, where fast acquisition and reconstruction are crucial. In this context, images are reconstructed from linearly compressed measurements. In practice, total variation minimisation is still used to reconstruct the image from noisy measurements of the inner product between orthogonal sampling pattern vectors and the original image data. However, data can be leveraged to learn the measurement vectors and the reconstruction process, thereby enhancing compression, reconstruction quality, and speed. We train an autoencoder through self-supervision to learn an encoder (or measurement matrix) and a decoder. We then test it on physically acquired multispectral and intensity data. During acquisition, the learned encoder becomes part of the physical device. Our approach can enhance single-pixel imaging in fluorescence microscopy by reducing reconstruction time by two orders of magnitude, achieving superior image quality, and enabling multispectral reconstructions. Ultimately, learned single-pixel fluorescence microscopy could advance diagnosis and biological research, providing multispectral imaging at a fraction of the cost.
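The measurement model described in the abstract (inner products between orthogonal sampling patterns and the image) can be sketched in a few lines. The choice of Hadamard patterns, the function names, and the four-pixel toy image are assumptions for illustration. The paper's learned encoder would replace the fixed pattern matrix, and neither the autoencoder nor total variation minimisation is shown here.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (entries +1/-1).

    n must be a power of two. Rows are mutually orthogonal, so
    H^T H = n * I.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def measure(patterns, image):
    """One detector reading per pattern: the inner product <pattern, image>."""
    return [sum(p * x for p, x in zip(row, image)) for row in patterns]


def reconstruct(patterns, measurements):
    """Back-project with the transpose and divide by the pixel count.

    Exact when all n orthogonal patterns are used; with fewer patterns
    (compressed sensing) this is only a rough linear estimate.
    """
    n = len(patterns[0])
    return [
        sum(patterns[k][i] * measurements[k] for k in range(len(patterns))) / n
        for i in range(n)
    ]


if __name__ == "__main__":
    image = [3.0, 1.0, 4.0, 1.0]   # flattened 2x2 toy image
    patterns = hadamard(4)
    readings = measure(patterns, image)
    print(readings)
    print(reconstruct(patterns, readings))
```

Dropping rows from `patterns` mimics compressed acquisition, which is where a learned measurement matrix and decoder are meant to help.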
[CV-89] SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time
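Quick read: This paper addresses the shortcomings of existing sensor simulation methods: slow training and rendering, limited support for non-pinhole cameras and spinning LiDARs, and a coupling between representation and rendering procedure that hinders interoperability. The core of the solution is Sparse Local Fields (SaLF), a new volumetric representation that models a volume as a sparse set of 3D voxel primitives, each containing a local implicit field. It supports both rasterization and raytracing, trains quickly (30 minutes), and renders efficiently (50+ FPS for camera, 600+ FPS for LiDAR). With adaptive pruning and densification it handles large scenes, and it is compatible with non-pinhole cameras and spinning LiDARs, making self-driving simulation more scalable and practical.
Link: https://arxiv.org/abs/2507.18713
Authors: Yun Chen,Matthew Haines,Jingkang Wang,Krzysztof Baron-Lis,Sivabalan Manivasagam,Ze Yang,Raquel Urtasun
Affiliations: Waabi; University of Toronto (多伦多大学); University of Waterloo (滑铁卢大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: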
Abstract:High-fidelity sensor simulation of light-based sensors such as cameras and LiDARs is critical for safe and accurate autonomy testing. Neural radiance field (NeRF)-based methods that reconstruct sensor observations via ray-casting of implicit representations have demonstrated accurate simulation of driving scenes, but are slow to train and render, hampering scale. 3D Gaussian Splatting (3DGS) has demonstrated faster training and rendering times through rasterization, but is primarily restricted to pinhole camera sensors, preventing usage for realistic multi-sensor autonomy evaluation. Moreover, both NeRF and 3DGS couple the representation with the rendering procedure (implicit networks for ray-based evaluation, particles for rasterization), preventing interoperability, which is key for general usage. In this work, we present Sparse Local Fields (SaLF), a novel volumetric representation that supports rasterization and raytracing. SaLF represents volumes as a sparse set of 3D voxel primitives, where each voxel is a local implicit field. SaLF has fast training (30 min) and rendering capabilities (50+ FPS for camera and 600+ FPS LiDAR), has adaptive pruning and densification to easily handle large scenes, and can support non-pinhole cameras and spinning LiDARs. We demonstrate that SaLF has similar realism as existing self-driving sensor simulation methods while improving efficiency and enhancing capabilities, enabling more scalable simulation. this https URL
[CV-90] Concept Probing: Where to Find Human-Defined Concepts (Extended Version)
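Quick read: This paper addresses layer selection in concept probing: how to automatically identify which layer of a neural network should supply the representations when probing for a given human-defined concept. Probe performance depends heavily on the representations being probed, so choosing the right layer is essential. The key idea is an automated method that scores each layer's representations by how informative and how regular they are with respect to the target concept, and selects the most suitable layer accordingly. The method is validated through an extensive empirical analysis across multiple neural network models and datasets.
Link: https://arxiv.org/abs/2507.18681
Authors: Manuel de Sousa Ribeiro,Afonso Leote,João Leite
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: Extended version of the paper published in Proceedings of the International Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)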
Abstract:Concept probing has recently gained popularity as a way for humans to peek into what is encoded within artificial neural networks. In concept probing, additional classifiers are trained to map the internal representations of a model into human-defined concepts of interest. However, the performance of these probes is highly dependent on the internal representations they probe from, making identifying the appropriate layer to probe an essential task. In this paper, we propose a method to automatically identify which layer’s representations in a neural network model should be considered when probing for a given human-defined concept of interest, based on how informative and regular the representations are with respect to the concept. We validate our findings through an exhaustive empirical analysis over different neural network models and datasets.
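To make the layer-selection problem concrete, the sketch below trains one simple probe per layer and compares held-out accuracy. This is the brute-force baseline that an automatic method would replace, not the paper's informativeness and regularity criterion, which the abstract does not specify. The nearest-centroid probe, the toy representations, and all function names are assumptions for illustration.

```python
def centroid(rows):
    """Mean vector of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Fit a nearest-centroid probe on (representation, label) pairs, labels 0/1."""
    c0 = centroid([r for r, y in train if y == 0])
    c1 = centroid([r for r, y in train if y == 1])
    hits = sum((sq_dist(r, c1) < sq_dist(r, c0)) == bool(y) for r, y in test)
    return hits / len(test)


def best_layer(layers):
    """layers maps a layer name to (train, test); returns the best name and all scores."""
    scores = {name: probe_accuracy(tr, te) for name, (tr, te) in layers.items()}
    return max(scores, key=scores.get), scores


# Toy data: the concept is not separable in "early" but is in "late".
layers = {
    "early": (
        [([0.0, 1.0], 0), ([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1)],
        [([0.0, 1.0], 0), ([1.0, 0.0], 1)],
    ),
    "late": (
        [([0.0, 0.1], 0), ([0.1, 0.0], 0), ([1.0, 0.9], 1), ([0.9, 1.0], 1)],
        [([0.05, 0.0], 0), ([1.0, 1.0], 1)],
    ),
}
chosen, scores = best_layer(layers)
```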
[CV-91] Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting ICCV2025
Quick read: This paper addresses the scarcity of large-scale 3D datasets, which constrains progress in spatial intelligence. Unlike abundant 2D imagery, high-quality 3D data usually requires specialized sensors and laborious annotation. The core of the solution is a scalable pipeline that integrates depth estimation, camera calibration, and scale calibration to convert single-view images into comprehensive, scale- and appearance-realistic 3D representations, including point clouds, camera poses, depth maps, and pseudo-RGBD. The approach lowers 3D data collection costs and bridges the gap between vast 2D image repositories and the growing demand for spatial scene understanding.
Link: https://arxiv.org/abs/2507.18678
Authors: Xingyu Miao,Haoran Duan,Quanhao Qian,Jiuniu Wang,Yang Long,Ling Shao,Deli Zhao,Ran Xu,Gongjie Zhang
Affiliations: Durham University; DAMO Academy, Alibaba Group; Tsinghua University; UCAS-Terminus AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 (Highlight)
Abstract:Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
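The step from a depth map and camera calibration to a point cloud can be sketched with the standard pinhole back-projection. This is a generic illustration, not the authors' pipeline: the intrinsics `fx`, `fy`, `cx`, `cy`, the `scale` factor, and the function name `backproject` are assumed for the example.

```python
def backproject(depth, fx, fy, cx, cy, scale=1.0):
    """Lift a depth map (list of rows) to 3D points in the camera frame.

    Pixels with non-positive depth are treated as invalid and skipped.
    `scale` converts relative depth to metric depth (assumed to be known).
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            z = d * scale
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


depth = [[2.0, 0.0],
         [2.0, 4.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Pairing each point with the colour of its source pixel would give a pseudo-RGBD sample of the kind the abstract mentions.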
[CV-92] HeartUnloadNet: A Weakly-Supervised Cycle-Consistent Graph Network for Predicting Unloaded Cardiac Geometry from Diastolic States
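Quick read: This paper addresses the estimation of unloaded cardiac geometry (the configuration of the heart devoid of luminal pressure) from clinical images. This geometry is a key reference for personalized cardiac biomechanical modeling, for understanding healthy and diseased physiology, and for predicting the effects of cardiac interventions. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. The proposed HeartUnloadNet is a deep learning framework that incorporates biophysical priors and predicts the unloaded left ventricular (LV) shape directly from the end-diastolic (ED) mesh. It uses a graph attention architecture with a cycle-consistency strategy for bidirectional (loading and unloading) prediction, so that partial self-supervision improves accuracy and reduces the need for large training sets. On 20,700 FE simulations it reaches sub-millimeter accuracy (average Dice similarity coefficient, DSC, of 0.986 and Hausdorff distance, HD, of 0.083 cm) with an inference time of 0.02 seconds per case, over 10^5 times faster than traditional inverse FE solvers.
Link: https://arxiv.org/abs/2507.18677
Authors: Siyu Mu,Wei Xuan Chan,Choon Hwai Yap
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Codes are available at this https URL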
Abstract:The unloaded cardiac geometry (i.e., the state of the heart devoid of luminal pressure) serves as a valuable zero-stress and zero-strain reference and is critical for personalized biomechanical modeling of cardiac function, to understand both healthy and diseased physiology and to predict the effects of cardiac interventions. However, estimating the unloaded geometry from clinical images remains a challenging task. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. In this work, we introduce HeartUnloadNet, a deep learning framework that predicts the unloaded left ventricular (LV) shape directly from the end diastolic (ED) mesh while explicitly incorporating biophysical priors. The network accepts a mesh of arbitrary size along with physiological parameters such as ED pressure, myocardial stiffness scale, and fiber helix orientation, and outputs the corresponding unloaded mesh. It adopts a graph attention architecture and employs a cycle-consistency strategy to enable bidirectional (loading and unloading) prediction, allowing for partial self-supervision that improves accuracy and reduces the need for large training datasets. Trained and tested on 20,700 FE simulations across diverse LV geometries and physiological conditions, HeartUnloadNet achieves sub-millimeter accuracy, with an average DSC of 0.986 and HD of 0.083 cm, while reducing inference time to just 0.02 seconds per case, over 10^5 times faster and significantly more accurate than traditional inverse FE solvers. Ablation studies confirm the effectiveness of the architecture. Notably, the cycle-consistent design enables the model to maintain a DSC of 97% even with as few as 200 training samples. This work thus presents a scalable and accurate surrogate for inverse FE solvers, supporting real-time clinical applications in the future.
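The two accuracy figures reported here, DSC and HD, follow standard definitions, sketched below on small 2D point sets. How the authors discretize the meshes before computing them is not stated in the abstract, so the set-of-cells input and the helper names are assumptions.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of occupied cells."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
reference = {(0, 1), (1, 1), (0, 2), (1, 2)}
dsc = dice(predicted, reference)      # half of the cells overlap
hd = hausdorff(predicted, reference)  # worst-case nearest-neighbour gap
```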
[CV-93] Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks
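Quick read: This paper addresses the limited generalization of vision-based human action recognition in healthcare settings, in particular the inconsistent behavior and frequent misclassifications of the vision-language model CLIP on complex and diverse actions. The key to the solution is class-specific noise, learned via a custom loss function, which reinforces the model's attention to class-defining features. This improves classification accuracy and confidence while reducing bias, including errors that arise when essential visual cues are obscured.
Link: https://arxiv.org/abs/2507.18675
Authors: Sanyam Jain,Marsha Mariya Kappan,Vijeta Sharma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: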
Abstract:Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.
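The first masking strategy, percentage-based black masking, can be sketched as follows. The abstract gives only the percentages (10%, 30%, 50%); the square patch grid, the random patch choice, the seed, and the function name `black_mask` are assumptions made for this example.

```python
import random


def black_mask(image, fraction, patch=2, seed=0):
    """Return a copy of `image` with roughly `fraction` of its patches set to 0."""
    h, w = len(image), len(image[0])
    cells = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_masked = round(fraction * len(cells))
    masked = [row[:] for row in image]
    for r, c in random.Random(seed).sample(cells, n_masked):
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                masked[i][j] = 0
    return masked


image = [[1] * 8 for _ in range(8)]  # toy 8x8 single-channel frame
zero_counts = [
    sum(v == 0 for row in black_mask(image, f) for v in row)
    for f in (0.1, 0.3, 0.5)
]
```

Because whole patches are masked, the realized percentage is rounded to the nearest patch count (here 2, 5, and 8 of 16 patches).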
[CV-94] Gen-AI Police Sketches with Stable Diffusion
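Quick read: This paper addresses the automation and quality improvement of suspect sketching. It builds and compares three multimodal generation pipelines: (1) a baseline image-to-image Stable Diffusion model, (2) the same model combined with a pre-trained CLIP model for text-image alignment, and (3) a variant in which CLIP is fine-tuned with LoRA and integrated with Stable Diffusion. Model 1 performs best on structural similarity (SSIM = 0.72) and peak signal-to-noise ratio (PSNR = 25 dB). Model 3 improves perceptual similarity (LPIPS) over Model 2 but still trails Model 1.
Link: https://arxiv.org/abs/2507.18667
Authors: Nicholas Fidalgo,Aaron Contreras,Katherine Harvey,Johnny Ni
Affiliations: Harvard College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: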
Abstract:This project investigates the use of multimodal AI-driven approaches to automate and enhance suspect sketching. Three pipelines were developed and evaluated: (1) baseline image-to-image Stable Diffusion model, (2) same model integrated with a pre-trained CLIP model for text-image alignment, and (3) novel approach incorporating LoRA fine-tuning of the CLIP model, applied to self-attention and cross-attention layers, and integrated with Stable Diffusion. An ablation study confirmed that fine-tuning both self- and cross-attention layers yielded the best alignment between text descriptions and sketches. Performance testing revealed that Model 1 achieved the highest structural similarity (SSIM) of 0.72 and a peak signal-to-noise ratio (PSNR) of 25 dB, outperforming Model 2 and Model 3. Iterative refinement enhanced perceptual similarity (LPIPS), with Model 3 showing improvement over Model 2 but still trailing Model 1. Qualitatively, sketches generated by Model 1 demonstrated the clearest facial features, highlighting its robustness as a baseline despite its simplicity.
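For readers unfamiliar with the PSNR figure quoted here, the standard definition is PSNR = 10 log10(MAX^2 / MSE). The sketch below assumes 8-bit images flattened into lists of pixel values; the function name and toy values are illustrative and unrelated to the authors' evaluation code.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


reference = [100, 100, 100, 100]
output = [110, 90, 110, 90]       # every pixel off by 10, so MSE = 100
score = psnr(reference, output)   # 10 * log10(255^2 / 100), about 28.1 dB
```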
[CV-95] Generating real-time detailed ground visualisations from sparse aerial point clouds SIGGRAPH
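Quick read: This paper addresses the high cost and reduced accuracy of building wide-scale outdoor 3D content. When the content must hold up at walking eye level or from a driven vehicle, the usual reliance on artists for modelling, texturing, material shading, and lighting leads to prohibitive costs and a loss of real-world landscape detail. The key to the solution is a process that automatically amplifies real-world scanned data and renders it in real time, so the resulting 3D content keeps high quality at close range for training, simulation, video game, and visualisation applications.
Link: https://arxiv.org/abs/2507.18664
Authors: Aidan Murray,Eddie Waite,Caleb Ross,Scarlet Mitchell,Alexander Bradley,Joanna Jamrozy,Kenny Mitchell
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVMP Short Paper. 1 page, 3 figures, CVMP 2022: The 19th ACM SIGGRAPH European Conference on Visual Media Production, London. This work was supported by the European Union’s Horizon 2020 research and innovation programme under Grant 101017779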
Abstract:Building realistic wide scale outdoor 3D content with sufficient visual quality to observe at walking eye level or from driven vehicles is often carried out by large teams of artists skilled in modelling, texturing, material shading and lighting, which typically leads to both prohibitive costs and reduced accuracy honoring the variety of real world ground truth landscapes. In our proposed method, we define a process to automatically amplify real-world scanned data and render real-time in animated 3D to explore at close range with high quality for training, simulation, video game and visualisation applications.
[CV-96] Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back
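Quick read: This paper addresses a gap in conventional next-location prediction models: they do not explicitly reason over map information such as road connectivity and movement trends, which limits prediction performance and cross-city generalization. The core solution uses the visual perception and reasoning abilities of Vision-Language Models (VLMs) in a two-stage framework. In the first stage, two Supervised Fine-Tuning (SFT) tasks teach the VLM to understand road network and trajectory structures and to acquire basic reasoning ability on such visual inputs. In the second stage, Reinforcement Learning from Visual Map Feedback lets the model improve its next-location prediction through interaction with the environment. The framework achieves state-of-the-art performance and superior cross-city generalization compared to other LLM-based approaches.
Link: https://arxiv.org/abs/2507.18661
Authors: Ruixing Zhang,Yang Zhang,Tongyu Zhu,Leilei Sun,Weifeng Lv
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: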
Abstract:Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
[CV-97] Fuzzy Theory in Computer Vision: A Review
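Quick read: This review addresses the uncertainty, noise, and imprecision in image data that affect computer vision tasks such as object recognition, segmentation, and feature extraction. The key idea is to apply fuzzy logic, which models gradual transitions and human-like reasoning and yields more adaptable and interpretable solutions. The techniques covered include fuzzy clustering, fuzzy inference systems, type-2 fuzzy sets, and fuzzy rule-based decision-making. The review also discusses integrating fuzzy logic with deep learning models such as convolutional neural networks (CNNs) to improve performance on complex vision tasks.
Link: https://arxiv.org/abs/2507.18660
Authors: Adilet Yerkin,Ayan Igali,Elnara Kadyrgali,Maksat Shagyrov,Malika Ziyada,Muragul Muratbekova,Pakizar Shamoi
Affiliations: Kazakh-British Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Journal of Intelligent and Fuzzy Systems for consideration (8 pages, 6 figures, 1 table)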
Abstract:Computer vision applications are omnipresent nowadays. The current paper explores the use of fuzzy logic in computer vision, stressing its role in handling uncertainty, noise, and imprecision in image data. Fuzzy logic is able to model gradual transitions and human-like reasoning and provides a promising approach to computer vision. Fuzzy approaches offer a way to improve object recognition, image segmentation, and feature extraction by providing more adaptable and interpretable solutions compared to traditional methods. We discuss key fuzzy techniques, including fuzzy clustering, fuzzy inference systems, type-2 fuzzy sets, and fuzzy rule-based decision-making. The paper also discusses various applications, including medical imaging, autonomous systems, and industrial inspection. Additionally, we explore the integration of fuzzy logic with deep learning models such as convolutional neural networks (CNNs) to enhance performance in complex vision tasks. Finally, we examine emerging trends such as hybrid fuzzy-deep learning models and explainable AI.
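The "gradual transitions" that fuzzy clustering provides can be seen in the membership rule of fuzzy c-means, sketched here for a single grey-level value and fixed cluster centers. This is the textbook update, not code from the review; the fuzzifier `m = 2`, the two centers, and the function name are assumptions.

```python
def fcm_memberships(value, centers, m=2.0):
    """Fuzzy c-means membership of one intensity value in each cluster.

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), with d the distance to a center.
    """
    dists = [abs(value - c) for c in centers]
    if 0.0 in dists:  # the value sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


centers = [0.0, 1.0]                        # e.g. "dark" and "bright" clusters
near_dark = fcm_memberships(0.25, centers)  # mostly dark, partly bright
midpoint = fcm_memberships(0.5, centers)    # equally dark and bright
```

Unlike a hard threshold, a pixel at 0.25 belongs to both clusters with degrees that sum to one.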
[CV-98] VGS-ATD: Robust Distributed Learning for Multi-Label Medical Image Classification Under Heterogeneous and Imbalanced Conditions
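Quick read: This paper addresses the limitations of centralized learning and of existing decentralized approaches (federated learning and swarm learning) in medical imaging with respect to privacy, heterogeneous and imbalanced data, communication and aggregation efficiency, and continual learning. Both paradigms are prone to catastrophic forgetting during system expansion and often require full retraining to incorporate new data. The proposed solution is VGS-ATD, a distributed learning framework aimed at multi-modality, multi-label settings. In experiments spanning 30 datasets and 80 independent labels, it reaches an overall accuracy of 92.7%, compared with 84.9% for centralized learning and 72.99% for swarm learning. After expansion, accuracy on existing nodes drops by only 1%, versus 20% for centralized learning, and computational costs fall by up to 50%.
Link: https://arxiv.org/abs/2507.18657
Authors: Zehui Zhao,Laith Alzubaidi,Haider A.Alwzwazy,Jinglan Zhang,Yuantong Gu
Affiliations: Queensland University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 15 pages, 8 figures, 6 tables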
Abstract:In recent years, advanced deep learning architectures have shown strong performance in medical imaging tasks. However, the traditional centralized learning paradigm poses serious privacy risks as all data is collected and trained on a single server. To mitigate this challenge, decentralized approaches such as federated learning and swarm learning have emerged, allowing model training on local nodes while sharing only model weights. While these methods enhance privacy, they struggle with heterogeneous and imbalanced data and suffer from inefficiencies due to frequent communication and the aggregation of weights. More critically, the dynamic and complex nature of clinical environments demands scalable AI systems capable of continuously learning from diverse modalities and multilabels. Yet, both centralized and decentralized models are prone to catastrophic forgetting during system expansion, often requiring full model retraining to incorporate new data. To address these limitations, we propose VGS-ATD, a novel distributed learning framework. To validate VGS-ATD, we evaluate it in experiments spanning 30 datasets and 80 independent labels across distributed nodes, VGS-ATD achieved an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%), while federated learning failed under these conditions due to high requirements on computational resources. VGS-ATD also demonstrated strong scalability, with only a 1% drop in accuracy on existing nodes after expansion, compared to a 20% drop in centralized learning, highlighting its resilience to catastrophic forgetting. Additionally, it reduced computational costs by up to 50% relative to both centralized and swarm learning, confirming its superior efficiency and scalability.
[CV-99] ShrinkBox: Backdoor Attack on Object Detection to Disrupt Collision Avoidance in Machine Learning-based Advanced Driver Assistance Systems
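Quick read: This paper addresses security vulnerabilities in the object detection module of machine learning-based Advanced Driver Assistance Systems (ML-ADAS), specifically the risk that a backdoor attack distorts distance estimation and thereby disrupts collision avoidance. It introduces ShrinkBox, a backdoor attack that subtly shrinks ground truth bounding boxes. The attack goes unnoticed in dataset inspections and standard benchmarks while severely degrading downstream distance estimation. On the YOLOv9m detector it reaches an Attack Success Rate (ASR) of 96% with only a 4% poisoning ratio, and it increases the Mean Absolute Error (MAE) of distance estimation on poisoned samples by more than 3x, which can delay or prevent collision warnings.
Link: https://arxiv.org/abs/2507.18656
Authors: Muhammad Zaeem Shahzad,Muhammad Abdullah Hanif,Bassem Ouni,Muhammad Shafique
Affiliations: eBRAIN Lab, New York University Abu Dhabi (NYUAD), UAE; AI and Digital Science Research Center, Technology Innovation Institute (TII), Abu Dhabi, UAE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, 1 table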
Abstract:Advanced Driver Assistance Systems (ADAS) significantly enhance road safety by detecting potential collisions and alerting drivers. However, their reliance on expensive sensor technologies such as LiDAR and radar limits accessibility, particularly in low- and middle-income countries. Machine learning-based ADAS (ML-ADAS), leveraging deep neural networks (DNNs) with only standard camera input, offers a cost-effective alternative. Critical to ML-ADAS is the collision avoidance feature, which requires the ability to detect objects and estimate their distances accurately. This is achieved with specialized DNNs like YOLO, which provides real-time object detection, and a lightweight, detection-wise distance estimation approach that relies on key features extracted from the detections like bounding box dimensions and size. However, the robustness of these systems is undermined by security vulnerabilities in object detectors. In this paper, we introduce ShrinkBox, a novel backdoor attack targeting object detection in collision avoidance ML-ADAS. Unlike existing attacks that manipulate object class labels or presence, ShrinkBox subtly shrinks ground truth bounding boxes. This attack remains undetected in dataset inspections and standard benchmarks while severely disrupting downstream distance estimation. We demonstrate that ShrinkBox can be realized in the YOLOv9m object detector at an Attack Success Rate (ASR) of 96%, with only a 4% poisoning ratio in the training instances of the KITTI dataset. Furthermore, given the low error targets introduced in our relaxed poisoning strategy, we find that ShrinkBox increases the Mean Absolute Error (MAE) in downstream distance estimation by more than 3x on poisoned samples, potentially resulting in delays or prevention of collision warnings altogether.
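Why a slightly smaller box matters downstream can be shown with a back-of-the-envelope calculation. The sketch assumes a simple pinhole relation between box height and distance; the paper's actual distance estimator is a learned model whose form the abstract does not give, and the focal length, object height, and function names here are illustrative only.

```python
def estimate_distance(focal_px, real_height_m, box_height_px):
    """Pinhole approximation: distance = focal length * real height / pixel height."""
    return focal_px * real_height_m / box_height_px


def shrink_box(box, ratio):
    """Scale an (x1, y1, x2, y2) box about its center by `ratio`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


box = (100.0, 100.0, 200.0, 200.0)   # 100 px tall detection
small = shrink_box(box, 0.8)          # 80 px tall after a 20% shrink
d_true = estimate_distance(1000.0, 1.5, box[3] - box[1])
d_shrunk = estimate_distance(1000.0, 1.5, small[3] - small[1])
```

Under this approximation a 20% shrink makes the object appear 25% farther away (15 m becomes 18.75 m), which is the kind of error that could postpone a collision warning.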
[CV-100] Part Segmentation of Human Meshes via Multi-View Human Parsing
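Quick read: This paper addresses per-vertex semantic segmentation of large-scale human meshes, i.e., fine-grained semantic labeling of complex human geometry without relying on texture information. The solution has three parts. First, a pseudo-ground truth labeling pipeline for the Thuman2.1 dataset aligns meshes to a canonical pose, segments them from multiple viewpoints, and backprojects the resulting point-level labels onto the original mesh. Second, a memory-efficient sampling strategy, windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization, downsamples the point clouds. Third, PointTransformer performs purely geometric segmentation, parsing human meshes without image texture.
Link: https://arxiv.org/abs/2507.18655
Authors: James Dickens,Kamyar Hamad
Affiliations: University of Ottawa
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: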
Abstract:Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo ground truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced, a windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization to effectively downsample the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach.
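The sampling strategy builds on iterative farthest point sampling. A minimal sketch of the plain algorithm is shown below; the windowing and the space-filling curve serialization that make the paper's variant memory-efficient are not reproduced, and the function names are assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points, each chosen farthest from those already picked."""
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen


points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
sample = farthest_point_sampling(points, 3)
```

Each iteration scans every point, so the cost grows with cloud size times sample size, which is the motivation for restricting the search to windows on large meshes.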
[CV-101] Diffusion Models for Solving Inverse Problems via Posterior Sampling with Piecewise Guidance
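Quick read: This paper addresses image restoration tasks posed as inverse problems, such as inpainting and super-resolution, where the challenge is to improve computational efficiency while preserving reconstruction quality. The key to the solution is a diffusion-based framework with a piecewise guidance scheme: the guidance term is defined as a piecewise function of the diffusion timestep, so different approximations are used in the high-noise and low-noise phases, balancing accuracy of the guidance term against computational cost. The method requires no task-specific retraining (it is problem-agnostic) and explicitly models measurement noise. Compared with the PGDM baseline, it cuts inference time by 25% for inpainting and by 23% and 24% for 4× and 8× super-resolution, with only negligible loss in PSNR and SSIM.
Link: https://arxiv.org/abs/2507.18654
Authors: Saeed Mohseni-Sehdeh,Walid Saad,Kei Sakaguchi,Tao Yu
Affiliations: Virginia Tech; Institute of Science Tokyo
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: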
Abstract:Diffusion models are powerful tools for sampling from high-dimensional distributions by progressively transforming pure noise into structured data through a denoising process. When equipped with a guidance mechanism, these models can also generate samples from conditional distributions. In this paper, a novel diffusion-based framework is introduced for solving inverse problems using a piecewise guidance scheme. The guidance term is defined as a piecewise function of the diffusion timestep, facilitating the use of different approximations during high-noise and low-noise phases. This design is shown to effectively balance computational efficiency with the accuracy of the guidance term. Unlike task-specific approaches that require retraining for each problem, the proposed method is problem-agnostic and readily adaptable to a variety of inverse problems. Additionally, it explicitly incorporates measurement noise into the reconstruction process. The effectiveness of the proposed framework is demonstrated through extensive experiments on image restoration tasks, specifically image inpainting and super-resolution. Using a class conditional diffusion model for recovery, compared to the PGDM baseline, the proposed framework achieves a reduction in inference time of 25% for inpainting with both random and center masks, and 23% and 24% for 4× and 8× super-resolution tasks, respectively, while incurring only negligible loss in PSNR and SSIM.
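The structure of a piecewise guidance term can be written down generically. In posterior sampling the conditional score splits into the unconditional score plus a guidance term, and the scheme swaps the approximation of that term at some timestep. The threshold \tau and the two approximations g_high and g_low below are placeholder symbols; the abstract does not give their actual forms.

```latex
\nabla_{x_t}\log p_t(x_t \mid y)
  = \nabla_{x_t}\log p_t(x_t) + \nabla_{x_t}\log p_t(y \mid x_t),
\qquad
\nabla_{x_t}\log p_t(y \mid x_t) \approx
\begin{cases}
  g_{\mathrm{high}}(x_t, y, t), & t \ge \tau \quad \text{(high-noise phase)} \\
  g_{\mathrm{low}}(x_t, y, t),  & t < \tau \quad \text{(low-noise phase)}
\end{cases}
```

A cheaper approximation in one phase and a more accurate one in the other is one way such a split could trade accuracy against inference time.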
[CV-102] Adapt, But Don't Forget: Fine-Tuning and Contrastive Routing for Lane Detection under Distribution Shift ICCV2025
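Quick read: This paper addresses catastrophic forgetting in lane detection models under cross-dataset distribution shift: even within the same domain, fine-tuning on a new dataset can severely degrade performance on the original one. The key to the solution is a branch-based, parameter-efficient adaptation framework. A base model is first trained on the source distribution; for each target distribution a separate branch is created and only selected components are fine-tuned, while the original source branch stays fixed. At inference time, a supervised contrastive learning model identifies the input distribution and dynamically routes the input to the corresponding branch, achieving near-optimal F1-scores with significantly fewer parameters than training a separate model per distribution.
Link: https://arxiv.org/abs/2507.18653
Authors: Mohammed Abdul Hafeez Khan,Parth Ganeriwala,Sarah M. Lehman,Siddhartha Bhattacharyya,Amy Alvarez,Natasha Neogi
Affiliations: Florida Institute of Technology; NASA Langley Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025, 2COOOL Workshop. Total 14 pages, 5 tables, and 4 figures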
Abstract:Lane detection models are often evaluated in a closed-world setting, where training and testing occur on the same dataset. We observe that, even within the same domain, cross-dataset distribution shifts can cause severe catastrophic forgetting during fine-tuning. To address this, we first train a base model on a source distribution and then adapt it to each new target distribution by creating separate branches, fine-tuning only selected components while keeping the original source branch fixed. Based on a component-wise analysis, we identify effective fine-tuning strategies for target distributions that enable parameter-efficient adaptation. At inference time, we propose using a supervised contrastive learning model to identify the input distribution and dynamically route it to the corresponding branch. Our framework achieves near-optimal F1-scores while using significantly fewer parameters than training separate models for each distribution.
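The inference-time routing step can be sketched as a nearest-prototype lookup in embedding space. This assumes that the contrastive model yields one embedding per input and that each distribution is summarized by a prototype vector; the cosine rule, the branch names, and the toy vectors are illustrative assumptions, since the abstract does not describe the routing rule in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def route(embedding, prototypes):
    """Return the name of the branch whose prototype is most similar to the input."""
    return max(prototypes, key=lambda name: cosine(embedding, prototypes[name]))


prototypes = {
    "source": [1.0, 0.0],    # hypothetical prototype of the source dataset
    "target_a": [0.0, 1.0],  # hypothetical prototype of one target dataset
}
first = route([0.9, 0.1], prototypes)
second = route([0.2, 0.8], prototypes)
```

The selected name would then pick which fine-tuned branch processes the image, leaving the frozen source branch available for source-like inputs.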
[CV-103] Features extraction for image identification using computer vision
Quick read: This paper examines the relative merits and applicability of feature extraction techniques in computer vision. It compares Generative Adversarial Networks (GANs), deep feature models, traditional algorithms (SIFT, SURF, ORB), and non-contrastive and contrastive feature models, with a primary focus on Vision Transformers (ViTs). It summarizes the ViT architecture, including patch embedding, positional encoding, and multi-head self-attention, the mechanisms with which ViTs outperform conventional convolutional neural networks (CNNs).
Link: https://arxiv.org/abs/2507.18650
Authors: Venant Niyonkuru,Sylla Sekou,Jimmy Jackson Sinzinkayo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This study examines various feature extraction techniques in computer vision, the primary focus of which is on Vision Transformers (ViTs) and other approaches such as Generative Adversarial Networks (GANs), deep feature models, traditional approaches (SIFT, SURF, ORB), and non-contrastive and contrastive feature models. Emphasizing ViTs, the report summarizes their architecture, including patch embedding, positional encoding, and multi-head self-attention mechanisms with which they outperform conventional convolutional neural networks (CNNs). Experimental results determine the merits and limitations of both methods and their utilitarian applications in advancing computer vision.
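Patch embedding, the first ViT stage named here, can be sketched in a few lines: cut the image into non-overlapping patches, flatten each one, and apply a shared linear projection. The single-channel toy image, the hand-written weight matrix, and the function names are assumptions for illustration; positional encoding and attention are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def embed(tokens, weights):
    """Project each flattened patch with a (patch*patch) x D weight matrix."""
    return [
        [sum(t * w for t, w in zip(tok, col)) for col in zip(*weights)]
        for tok in tokens
    ]


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # pixel values 0..15
tokens = patchify(image, 2)                                 # four tokens of length 4
weights = [[1, 0], [0, 1], [1, 0], [0, 1]]                  # toy projection to D = 2
embedded = embed(tokens, weights)
```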
[CV-104] Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching
Quick read: This paper addresses the limited lip-sync accuracy and long-term pose drift of existing audio-driven talking head generation methods. The key to the solution is a flow matching based framework. Combined with system optimizations, it reaches a throughput of 141 FPS and an end-to-end latency of 0.17 s on a single A10 GPU, enabling real-time generation of high-fidelity avatars.
Link: https://arxiv.org/abs/2507.18649
Authors: Haiyang Liu,Xiaolin Hong,Xuancheng Yang,Yudi Ruan,Xiang Lian,Michael Lingelbach,Hongwei Yi,Wei Li
Affiliations: Hedra Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report
Abstract:We present Livatar, a real-time audio-driven talking heads videos generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow matching based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at this https URL with examples at this https URL
[CV-105] XAI-Guided Analysis of Residual Networks for Interpretable Pneumonia Detection in Paediatric Chest X-rays
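Quick read: This paper addresses the need for fast, accurate, and interpretable AI-assisted tools for diagnosing paediatric pneumonia. The key to the solution is an interpretable deep learning model built on Residual Networks (ResNets), paired with Bayesian Gradient-weighted Class Activation Mapping (BayesGrad-CAM), which quantifies uncertainty in visual explanations and localizes the image regions responsible for the model's decisions. The ResNet-50 model achieves high classification performance (accuracy 95.94%, AUC-ROC 98.91%, Cohen's Kappa 0.913) together with clinically meaningful visual explanations.
Link: https://arxiv.org/abs/2507.18647
Authors: Rayyan Ridwan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 14 figures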
Abstract:Pneumonia remains one of the leading causes of death among children worldwide, underscoring a critical need for fast and accurate diagnostic tools. In this paper, we propose an interpretable deep learning model on Residual Networks (ResNets) for automatically diagnosing paediatric pneumonia on chest X-rays. We enhance interpretability through Bayesian Gradient-weighted Class Activation Mapping (BayesGrad-CAM), which quantifies uncertainty in visual explanations, and which offers spatial locations accountable for the decision-making process of the model. Our ResNet-50 model, trained on a large paediatric chest X-rays dataset, achieves high classification accuracy (95.94%), AUC-ROC (98.91%), and Cohen’s Kappa (0.913), accompanied by clinically meaningful visual explanations. Our findings demonstrate that high performance and interpretability are not only achievable but critical for clinical AI deployment.
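Cohen's Kappa, reported alongside accuracy, corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from a confusion matrix; the matrix values are made up for illustration and are not the paper's results.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: truth, columns: prediction)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical binary case: 90% accuracy on perfectly balanced classes.
confusion = [[45, 5],
             [5, 45]]
kappa = cohens_kappa(confusion)  # p_o = 0.9, p_e = 0.5
```

On balanced classes 90% accuracy gives a kappa of 0.8; on imbalanced data the same accuracy yields a lower kappa because chance agreement is higher, which is why the two figures are reported together.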
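[CV-106] Quantum-Cognitive Tunnelling Neural Networks for Military-Civilian Vehicle Classification and Sentiment Analysis
[Quick read]: This paper investigates how to improve AI recognition accuracy on ambiguous information in complex battlefield environments (such as images of military versus civilian vehicles, and sentiment in a domain-specific context), with particular attention to how well AI models can emulate human perception in human-machine teaming scenarios. The key to its solution is a novel neural network architecture based on quantum tunnelling (QT) probability: by embedding the QT mechanism into the model, the AI can better capture the uncertainty and subjectivity of human perception, which strengthens the reasoning and generalization of multimodal AI systems in military applications.
Link: https://arxiv.org/abs/2507.18645
Authors: Milan Maksimovic,Anna Bohdanets,Immaculate Motsi-Omoijiade,Guido Governatori,Ivan S. Maksymov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: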
Abstract:Prior work has demonstrated that incorporating well-known quantum tunnelling (QT) probability into neural network models effectively captures important nuances of human perception, particularly in the recognition of ambiguous objects and sentiment analysis. In this paper, we employ novel QT-based neural networks and assess their effectiveness in distinguishing customised CIFAR-format images of military and civilian vehicles, as well as sentiment, using a proprietary military-specific vocabulary. We suggest that QT-based models can enhance multimodal AI applications in battlefield scenarios, particularly within human-operated drone warfare contexts, imbuing AI with certain traits of human reasoning.
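[CV-107] How good are humans at detecting AI-generated images? Learnings from an experiment
[Quick read]: This paper asks how well humans can distinguish real images from AI-generated or AI-modified ones as generative AI image generation improves. Participants reached an overall accuracy of only 62%, slightly above chance, and performed worst on images without obvious artifacts, such as natural and urban landscapes, indicating a limited human ability to identify high-quality AI-generated content. The proposed way forward is to promote transparency tools, such as watermarks and robust AI detection techniques, to reduce the risk of misinformation from AI-generated content.
Link: https://arxiv.org/abs/2507.18640
Authors: Thomas Roca,Anthony Cintron Roman,Jehú Torres Vega,Marcelo Duarte,Pengce Wang,Kevin White,Amit Misra,Juan Lavista Ferres
Affiliations: Microsoft AI for Good Lab
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: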
Abstract:As AI-powered image generation improves, a key question is how well human beings can differentiate between “real” and AI-generated or modified images. Using data collected from the online game “Real or Not Quiz”, this study investigates how effectively people can distinguish AI-generated images from real ones. Participants viewed a randomized set of real and AI-generated images, aiming to identify their authenticity. Analysis of approximately 287,000 image evaluations by over 12,500 global participants revealed an overall success rate of only 62%, indicating a modest ability, slightly above chance. Participants were most accurate with human portraits but struggled significantly with natural and urban landscapes. These results highlight the inherent challenge humans face in distinguishing AI-generated visual content, particularly images without obvious artifacts or stylistic cues. This study stresses the need for transparency tools, such as watermarks and robust AI detection tools, to mitigate the risks of misinformation arising from AI-generated content.
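[CV-108] SAM2-Aug: Prior knowledge-based Augmentation for Target Volume Auto-Segmentation in Adaptive Radiation Therapy Using Segment Anything Model 2
[Quick read]: This paper addresses tumor segmentation in Adaptive Radiation Therapy (ART), which is inaccurate, time-consuming, and dependent on manual annotation. The core solution enhances Segment Anything Model 2 (SAM2) with prior knowledge-based augmentation strategies, yielding SAM2-Aug. The key ideas are to use prior MR images and their annotations as contextual inputs that supply anatomical priors, and to improve prompt robustness through random bounding box expansion and mask erosion/dilation, which markedly improves segmentation accuracy and generalization across imaging sequences and tumor types.
Link: https://arxiv.org/abs/2507.19282
Authors: Guoping Xu,Yan Dai,Hengrui Zhao,Ying Zhang,Jie Deng,Weiguo Lu,You Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 26 pages, 10 figures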
Abstract:Purpose: Accurate tumor segmentation is vital for adaptive radiation therapy (ART) but remains time-consuming and user-dependent. Segment Anything Model 2 (SAM2) shows promise for prompt-based segmentation but struggles with tumor accuracy. We propose prior knowledge-based augmentation strategies to enhance SAM2 for ART. Methods: Two strategies were introduced to improve SAM2: (1) using prior MR images and annotations as contextual inputs, and (2) improving prompt robustness via random bounding box expansion and mask erosion/dilation. The resulting model, SAM2-Aug, was fine-tuned and tested on the One-Seq-Liver dataset (115 MRIs from 31 liver cancer patients), and evaluated without retraining on Mix-Seq-Abdomen (88 MRIs, 28 patients) and Mix-Seq-Brain (86 MRIs, 37 patients). Results: SAM2-Aug outperformed convolutional, transformer-based, and prompt-driven models across all datasets, achieving Dice scores of 0.86 (liver), 0.89 (abdomen), and 0.90 (brain). It demonstrated strong generalization across tumor types and imaging sequences, with improved performance in boundary-sensitive metrics. Conclusions: Incorporating prior images and enhancing prompt diversity significantly boosts segmentation accuracy and generalizability. SAM2-Aug offers a robust, efficient solution for tumor segmentation in ART. Code and models will be released at this https URL.
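The prompt-robustness strategy mentioned in the abstract (random bounding box expansion) can be pictured with a small sketch. This is not the authors' code: the function name `expand_box`, the `max_frac` parameter, and the uniform per-side sampling are all assumptions made for illustration.

```python
import random


def expand_box(box, image_size, max_frac=0.2, rng=None):
    """Randomly enlarge a bounding box prompt, clipped to the image.

    box is (x0, y0, x1, y1); image_size is (width, height).
    max_frac is a hypothetical cap on how far each side may move outward,
    expressed as a fraction of the box width or height.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    width, height = image_size
    box_w, box_h = x1 - x0, y1 - y0
    # Each side gets its own independent outward shift (an assumption).
    left = rng.uniform(0, max_frac) * box_w
    right = rng.uniform(0, max_frac) * box_w
    top = rng.uniform(0, max_frac) * box_h
    bottom = rng.uniform(0, max_frac) * box_h
    return (
        max(0, x0 - left),
        max(0, y0 - top),
        min(width, x1 + right),
        min(height, y1 + bottom),
    )
```

Because sides only move outward, the perturbed box always contains the original one, so the target stays inside the prompt while its exact extent varies between training samples.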
[CV-109] Enhancing Diabetic Retinopathy Classification Accuracy through Dual Attention Mechanism in Deep Learning
[Quick read]: This paper addresses the limited generalization of deep learning models for automatic Diabetic Retinopathy (DR) classification caused by imbalanced data distribution. The key to its solution is an attention-based model that combines a Global Attention Block (GAB) and a Category Attention Block (CAB), which improves the model's ability to cope with imbalanced data. Experiments show competitive performance on two public fundus image datasets (APTOS and EYEPACS), while the MobileNetV3-small backbone uses markedly fewer parameters, making it a lightweight option.
Link: https://arxiv.org/abs/2507.19199
Authors: Abdul Hannan,Zahid Mahmood,Rizwan Qureshi,Hazrat Ali
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to Computer Methods in Biomechanics and Biomedical Engineering: Imaging Visualization
Abstract:Automatic classification of Diabetic Retinopathy (DR) can assist ophthalmologists in devising personalized treatment plans, making it a critical component of clinical practice. However, imbalanced data distribution in the dataset becomes a bottleneck in the generalization of deep learning models trained for DR classification. In this work, we combine global attention block (GAB) and category attention block (CAB) into the deep learning model, thus effectively overcoming the imbalanced data distribution problem in DR classification. Our proposed approach is based on an attention mechanism-based deep learning model that employs three pre-trained networks, namely, MobileNetV3-small, EfficientNet-b0, and DenseNet-169 as the backbone architecture. We evaluate the proposed method on two publicly available datasets of retinal fundoscopy images for DR. Experimental results show that on the APTOS dataset, the DenseNet-169 yielded 83.20% mean accuracy, followed by the MobileNetV3-small and EfficientNet-b0, which yielded 82% and 80% accuracies, respectively. On the EYEPACS dataset, the EfficientNet-b0 yielded a mean accuracy of 80%, while the DenseNet-169 and MobileNetV3-small yielded 75.43% and 76.68% accuracies, respectively. In addition, we also compute the F1-score of 82.0%, precision of 82.1%, sensitivity of 83.0%, specificity of 95.5%, and a kappa score of 88.2% for the experiments. Moreover, in our work, the MobileNetV3-small has 1.6 million parameters on the APTOS dataset and 0.90 million parameters on the EYEPACS dataset, which is comparatively fewer than other methods. The proposed approach achieves competitive performance that is on par with recently reported works on DR classification.
[CV-110] Extreme Cardiac MRI Analysis under Respiratory Motion: Results of the CMRxMotion Challenge
[Quick read]: This paper addresses the insufficient robustness of deep learning models to respiratory motion artifacts in Cardiac Magnetic Resonance (CMR) analysis. Although deep learning models achieve state-of-the-art performance in automated CMR analysis, their effectiveness depends heavily on high-quality, artifact-free images, and the image degradation caused by respiratory motion that is common in clinical practice remains underexplored. The key to the solution is a public dataset of CMR cine series with a controlled spectrum of motion artifacts (320 series from 40 healthy volunteers) and the MICCAI CMRxMotion challenge, which focuses on two tasks: 1) automated image quality assessment by motion severity; 2) robust myocardial segmentation in the presence of motion artifacts. A systematic evaluation of 22 algorithms reveals how different methods perform under realistic clinical artifacts, and the paper further quantifies the impact of motion artifacts on five clinically relevant biomarkers, providing a benchmark and practical guidance for improving the reliability of deep learning models in real clinical settings.
Link: https://arxiv.org/abs/2507.19165
Authors: Kang Wang,Chen Qin,Zhang Shi,Haoran Wang,Xiwen Zhang,Chen Chen,Cheng Ouyang,Chengliang Dai,Yuanhan Mo,Chenchen Dai,Xutong Kuang,Ruizhe Li,Xin Chen,Xiuzheng Yue,Song Tian,Alejandro Mora-Rubio,Kumaradevan Punithakumar,Shizhan Gong,Qi Dou,Sina Amirrajab,Yasmina Al Khalil,Cian M. Scannell,Lexiaozi Fan,Huili Yang,Xiaowu Sun,Rob van der Geest,Tewodros Weldebirhan Arega,Fabrice Meriaudeau,Caner Özer,Amin Ranem,John Kalkhof,İlkay Öksüz,Anirban Mukhopadhyay,Abdul Qayyum,Moona Mazher,Steven A Niederer,Carles Garcia-Cabrera,Eric Arazo,Michal K. Grzeszczyk,Szymon Płotka,Wanqin Ma,Xiaomeng Li,Rongjun Ge,Yongqing Kou,Xinrong Chen,He Wang,Chengyan Wang,Wenjia Bai,Shuo Wang
Affiliations: Fudan University; Imperial College London; University of Sheffield; University of Oxford; Shanghai Pudong Hospital and Human Phenome Institute, Fudan University; University of Nottingham; Philips Healthcare; University of Alberta; Universidad Autónoma de Manizales; The Chinese University of Hong Kong; Maastricht University; Eindhoven University of Technology; Northwestern University; United Imaging Research; Leiden University Medical Center; Université Bourgogne Europe; Istanbul Technical University; Technical University of Darmstadt; National Heart and Lung Institute, Faculty of Medicine, Imperial College London; Hawkes Institute, Department of Computer Science, University College London; University College Dublin; CeADAR: Ireland’s Centre for AI; Sano Centre for Computational Medicine; Jagiellonian University; The Hong Kong University of Science and Technology; Southeast University; Nanjing University of Aeronautics and Astronautics; Academy for Engineering and Technology, Fudan University; College of Biomedical Engineering, Fudan University; Institute of Science and Technology for Brain-inspired Intelligence, Fudan University; Department of Brain Sciences, Imperial College London; Data Science Institute, Imperial College London
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning models have achieved state-of-the-art performance in automated Cardiac Magnetic Resonance (CMR) analysis. However, the efficacy of these models is highly dependent on the availability of high-quality, artifact-free images. In clinical practice, CMR acquisitions are frequently degraded by respiratory motion, yet the robustness of deep learning models against such artifacts remains an underexplored problem. To promote research in this domain, we organized the MICCAI CMRxMotion challenge. We curated and publicly released a dataset of 320 CMR cine series from 40 healthy volunteers who performed specific breathing protocols to induce a controlled spectrum of motion artifacts. The challenge comprised two tasks: 1) automated image quality assessment to classify images based on motion severity, and 2) robust myocardial segmentation in the presence of motion artifacts. A total of 22 algorithms were submitted and evaluated on the two designated tasks. This paper presents a comprehensive overview of the challenge design and dataset, reports the evaluation results for the top-performing methods, and further investigates the impact of motion artifacts on five clinically relevant biomarkers. All resources and code are publicly available at: this https URL
[CV-111] RealisVSR: Detail-enhanced Diffusion for Real-World 4K Video Super-Resolution
[Quick read]: This paper targets three challenges in Video Super-Resolution (VSR): 1) inconsistent modeling of temporal dynamics in foundation models; 2) limited recovery of high-frequency detail under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and of 4K super-resolution. To address them, the authors propose RealisVSR, whose core innovations are: 1) a Consistency Preserved ControlNet (CPC) integrated with the Wan2.1 video diffusion model to capture smooth, complex motion and suppress artifacts; 2) a High-Frequency Rectified Diffusion Loss (HR-Loss) that combines wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark, containing 1,000 high-definition video-text pairs. The approach markedly improves detail recovery and evaluation reliability in high-resolution scenarios, while requiring only 5-25% of the training data used by existing methods.
Link: https://arxiv.org/abs/2507.19138
Authors: Weisong Zhao,Jingkai Zhou,Xiangyu Zhu,Weihua Chen,Xiao-Yu Zhang,Zhen Lei,Fan Wang
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 4. School of Information Science and Technology, Sun Yat-sen University; 5. Guangdong Key Laboratory of Intelligent Information Processing and Security; 6. Alibaba Group
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video Super-Resolution (VSR) has achieved significant progress through diffusion models, effectively addressing the over-smoothing issues inherent in GAN-based methods. Despite recent advances, three critical challenges persist in the VSR community: 1) inconsistent modeling of temporal dynamics in foundational models; 2) limited high-frequency detail recovery under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and 4K super-resolution, as current methods primarily rely on 720P datasets with inadequate details. To address these challenges, we propose RealisVSR, a high-frequency detail-enhanced video diffusion model with three core innovations: 1) Consistency Preserved ControlNet (CPC) architecture integrated with the Wan2.1 video diffusion model to capture smooth and complex motions and suppress artifacts; 2) High-Frequency Rectified Diffusion Loss (HR-Loss) combining wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark containing 1,000 high-definition video-text pairs. Leveraging the advanced spatio-temporal guidance of Wan2.1, our method requires only 5-25% of the training data volume compared to existing approaches. Extensive experiments on VSR benchmarks (REDS, SPMCS, UDM10, YouTube-HQ, VideoLQ, RealisVideo-720P) demonstrate our superiority, particularly in ultra-high-resolution scenarios.
[CV-112] PGKET: A Photonic Gaussian Kernel Enhanced Transformer
[Quick read]: This paper addresses the inefficiency of Self-Attention Mechanisms (SAMs) on long sequences. The core solution is the Photonic Gaussian Kernel Enhanced Transformer (PGKET), built on the Photonic Gaussian Kernel Self-Attention Mechanism (PGKSAM), which uses photon interferometry and superposition to process multiple inputs in parallel and compute the Photonic Gaussian Kernel Self-Attention Score (PGKSAS). PGKET is reported to outperform some state-of-the-art transformers on multi-classification tasks and is expected to accelerate the convergence of Photonic Computing (PC) and machine learning.
Link: https://arxiv.org/abs/2507.19041
Authors: Ren-Xin Zhao
Affiliations: Central South University; Trinity College Dublin
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Self-Attention Mechanisms (SAMs) enhance model performance by extracting key information but are inefficient when dealing with long sequences. To this end, a Photonic Gaussian Kernel Enhanced Transformer (PGKET) is proposed, based on the Photonic Gaussian Kernel Self-Attention Mechanism (PGKSAM). The PGKSAM calculates the Photonic Gaussian Kernel Self-Attention Score (PGKSAS) using photon interferometry and superposition to process multiple inputs in parallel. Experimental results show that PGKET outperforms some state-of-the-art transformers in multi-classification tasks on MedMNIST v2 and CIFAR-10, and is expected to improve performance in complex tasks and accelerate the convergence of Photonic Computing (PC) and machine learning.
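For readers unfamiliar with kernel-based attention, here is a purely classical sketch of a Gaussian kernel self-attention score. The abstract does not give the PGKSAS formula or its photonic implementation, so the kernel form exp(-||q - k||^2 / (2 sigma^2)), the row normalization, and the names `gaussian_kernel_attention` and `sigma` are assumptions for illustration only.

```python
import math


def gaussian_kernel_attention(queries, keys, values, sigma=1.0):
    """Classical Gaussian-kernel attention over lists of equal-length vectors.

    Each query is scored against every key with a Gaussian kernel on their
    squared Euclidean distance; scores are normalized to sum to 1 and used
    to average the value vectors. This stands in for, and is not, the
    photonic mechanism described in the paper.
    """
    outputs = []
    for q in queries:
        scores = [
            math.exp(-sum((qi - ki) ** 2 for qi, ki in zip(q, k)) / (2 * sigma ** 2))
            for k in keys
        ]
        total = sum(scores)  # assumes at least one key is close enough to avoid underflow to 0
        weights = [s / total for s in scores]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs
```

Unlike dot-product attention, the score depends only on the distance between query and key, so a query that coincides with a key puts almost all of its weight on that key's value.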
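Artificial Intelligence
[AI-0] Let It Go? Not Quite: Addressing Item Cold Start in Sequential Recommendations with Content-Based Initialization
[Quick read]: This paper addresses the cold start problem in sequential recommender systems: new items with sparse or missing interaction data are hard to model because they lack a trained embedding. Content-based approaches often initialize item embeddings from metadata such as textual descriptions, but freezing these embeddings outright gives suboptimal performance, while full fine-tuning can make the representations of new items drift away from their original semantic structure. The key to the solution is a small trainable delta added to the frozen content embeddings, which lets the model adapt item representations while preserving the original semantic structure, yielding consistent improvements across multiple datasets and modalities (e-commerce text descriptions and music audio features).
Link: https://arxiv.org/abs/2507.19473
Authors: Anton Pembek,Artem Fatkulin,Anton Klenitskiy,Alexey Vasilev
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: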
Abstract:Many sequential recommender systems suffer from the cold start problem, where items with few or no interactions cannot be effectively used by the model due to the absence of a trained embedding. Content-based approaches, which leverage item metadata, are commonly used in such scenarios. One possible way is to use embeddings derived from content features such as textual descriptions as initialization for the model embeddings. However, directly using frozen content embeddings often results in suboptimal performance, as they may not fully adapt to the recommendation task. On the other hand, fine-tuning these embeddings can degrade performance for cold-start items, as item representations may drift far from their original structure after training. We propose a novel approach to address this limitation. Instead of entirely freezing the content embeddings or fine-tuning them extensively, we introduce a small trainable delta to frozen embeddings that enables the model to adapt item representations without letting them go too far from their original semantic structure. This approach demonstrates consistent improvements across multiple datasets and modalities, including e-commerce datasets with textual descriptions and a music dataset with audio-based representation.
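The "small trainable delta on frozen embeddings" idea can be sketched in a few lines. The abstract does not say how the delta is kept small, so the norm clipping below, along with the names `adapted_embedding` and `max_norm`, is an assumption rather than the authors' method.

```python
import math


def adapted_embedding(frozen, delta, max_norm=0.1):
    """Return frozen + delta, with delta rescaled so its L2 norm is at most max_norm.

    frozen is the content embedding, which is never modified.
    delta is the trainable correction (in a real model, a learned parameter).
    max_norm is a hypothetical bound on how far an item may drift.
    """
    norm = math.sqrt(sum(d * d for d in delta))
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return [f + scale * d for f, d in zip(frozen, delta)]
```

With a zero delta, a cold-start item keeps exactly its content embedding, and any learned correction stays within a ball of radius `max_norm` around it, which is one way to read "without letting them go too far from their original semantic structure".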
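[AI-1] Hierarchical Deep Reinforcement Learning Framework for Multi-Year Asset Management Under Budget Constraints
[Quick read]: This paper addresses multi-year budget planning and maintenance optimization in infrastructure asset management. The core challenges are combinatorial action spaces, diverse asset deterioration, stringent budget constraints, and environmental uncertainty, which severely limit the scalability of existing methods. The key to the solution is a Hierarchical Deep Reinforcement Learning framework that decomposes the problem into two levels: a high-level Budget Planner that allocates annual budgets within explicit feasibility bounds, and a low-level Maintenance Planner that prioritizes assets within the allocated budget. Integrating a linear programming projection into a hierarchical Soft Actor-Critic framework mitigates the exponential growth of the action space and ensures strict budget compliance, giving fast convergence, good scalability, and consistently near-optimal solutions.
Link: https://arxiv.org/abs/2507.19458
Authors: Amir Fard,Arnold X.-X. Yuan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: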
Abstract:Budget planning and maintenance optimization are crucial for infrastructure asset management, ensuring cost-effectiveness and sustainability. However, the complexity arising from combinatorial action spaces, diverse asset deterioration, stringent budget constraints, and environmental uncertainty significantly limits existing methods’ scalability. This paper proposes a Hierarchical Deep Reinforcement Learning methodology specifically tailored to multi-year infrastructure planning. Our approach decomposes the problem into two hierarchical levels: a high-level Budget Planner allocating annual budgets within explicit feasibility bounds, and a low-level Maintenance Planner prioritizing assets within the allocated budget. By structurally separating macro-budget decisions from asset-level prioritization and integrating linear programming projection within a hierarchical Soft Actor-Critic framework, the method efficiently addresses exponential growth in the action space and ensures rigorous budget compliance. A case study evaluating sewer networks of varying sizes (10, 15, and 20 sewersheds) illustrates the effectiveness of the proposed approach. Compared to conventional Deep Q-Learning and enhanced genetic algorithms, our methodology converges more rapidly, scales effectively, and consistently delivers near-optimal solutions even as network size grows.
[AI-2] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
[Quick read]: This paper addresses the low hardware efficiency of Large Language Models (LLMs) during decoding, especially for long-context reasoning tasks. The core solution is model-system co-design that substantially reduces decoding cost. First, a novel Multi-Matrix Factorization Attention (MFA) mechanism significantly reduces KV cache size and computation while maintaining high attention expressiveness. Second, Attention-FFN Disaggregation (AFD), a distributed inference system, places attention and Feed-Forward Network (FFN) layers in specialized subsystems. With this co-design, Step-3 activates 38B parameters per token yet achieves lower theoretical decoding cost than DeepSeek-V3 and Qwen3 MoE 235B, with the advantage widening at longer contexts. On Hopper GPUs it reaches a decoding throughput of up to 4,039 tokens per second per GPU, setting a new Pareto frontier for LLM decoding.
Link: https://arxiv.org/abs/2507.19427
Authors: StepFun: Bin Wang,Bojun Wang,Changyi Wan,Guanzhe Huang,Hanpeng Hu,Haonan Jia,Hao Nie,Mingliang Li,Nuo Chen,Siyu Chen,Song Yuan,Wuxun Xie,Xiaoniu Song,Xing Chen,Xingping Yang,Xuelin Zhang,Yanbo Yu,Yaoyu Wang,Yibo Zhu,Yimin Jiang,Yu Zhou,Yuanwei Lu,Houyi Li,Jingcheng Hu,Ka Man Lo,Ailin Huang,Binxing Jiao,Bo Li,Boyu Chen,Changxin Miao,Chang Lou,Chen Hu,Chen Xu,Chenfeng Yu,Chengyuan Yao,Daokuan Lv,Dapeng Shi,Deshan Sun,Ding Huang,Dingyuan Hu,Dongqing Pang,Enle Liu,Fajie Zhang,Fanqi Wan,Gulin Yan,Han Zhang,Han Zhou,Hanghao Wu,Hangyu Guo,Hanqi Chen,Hanshan Zhang,Hao Wu,Haocheng Zhang,Haolong Yan,Haoran Lv,Haoran Wei,Hebin Zhou,Heng Wang,Heng Wang,Hongxin Li,Hongyu Zhou,Hongyuan Wang,Huiyong Guo,Jia Wang,Jiahao Gong,Jialing Xie,Jian Zhou,Jianjian Sun,Jiaoren Wu,Jiaran Zhang,Jiayu Liu,Jie Cheng,Jie Luo,Jie Yan,Jie Yang,Jieyi Hou,Jinguang Zhang,Jinlan Cao,Jisheng Yin,Junfeng Liu,Junhao Huang,Junzhe Lin,Kaijun Tan,Kaixiang Li,Kang An,Kangheng Lin,Kenkun Liu,Lei Yang,Liang Zhao,Liangyu Chen,Lieyu Shi,Liguo Tan,Lin Lin,Lin Zhang,Lina Chen,Liwen Huang,Liying Shi,Longlong Gu,Mei Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3’s 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.
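[AI-3] On Arbitrary Predictions from Equally Valid Models
[Quick read]: This paper addresses prediction arbitrariness in medical machine learning arising from model multiplicity: several models that describe the data equally well can give different predictions for the same patient. This can make clinical diagnoses arbitrary and unreliable; standard validation metrics fail to identify a uniquely optimal model, and a substantial number of predictions hinge on arbitrary choices made during model development. The key solution is a small ensemble paired with an abstention strategy, which effectively mitigates measurable predictive multiplicity: predictions with high inter-model consensus may be amenable to automated classification, while cases without sufficient consensus should be deferred to expert review, improving diagnostic reliability.
Link: https://arxiv.org/abs/2507.19408
Authors: Sarah Lockfisch,Kristian Schwethelm,Martin Menten,Rickmer Braren,Daniel Rueckert,Alexander Ziller,Georgios Kaissis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: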
Abstract:Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can admit conflicting predictions for the same patient – a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate/eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial number of predictions hinge on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models – highlighting patients that would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity. Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.
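The "small ensemble paired with an abstention strategy" can be illustrated with a minimal majority-vote sketch. The consensus rule, the `min_agreement` threshold, and the function name `predict_or_abstain` are assumptions; the abstract does not specify how consensus is measured.

```python
from collections import Counter


def predict_or_abstain(votes, min_agreement=0.8):
    """Return the majority label if enough ensemble members agree, else None.

    votes is a non-empty list with one predicted label per model.
    min_agreement is a hypothetical consensus threshold in [0, 1].
    None means "abstain", i.e. defer the case to expert review.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None
```

For example, five models voting `["pneumonia"] * 5` yield an automated prediction, while a 3-to-2 split returns `None` and the case goes to a clinician.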
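[AI-4] SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions
[Quick read]: This paper addresses the slow, inefficient manual diagnosis of failures in connected vehicles, where a complex cloud/edge architecture with a mesh of dependencies must be navigated. The core challenge is to locate the root cause of a failure quickly so that vehicle services remain highly reliable and available. The key to the solution is SDVDiag, an extensible platform for automated diagnosis. It maintains a dynamic graph view that tracks dependencies between functions, supports exchanging modules at runtime for self-adaptive behavior, and enables pipelines that cover every step from data collection to root cause analysis. The system also continuously monitors vital metrics for anomalies; when an incident is investigated, it takes a snapshot of the graph, augments it with relevant anomalies, and traverses the graph to produce a ranking of the most likely causes, improving the speed and accuracy of problem identification.
Link: https://arxiv.org/abs/2507.19403
Authors: Matthias Weiß,Falk Dettinger,Michael Weyrich
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 7 pages, 5 figures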
Abstract:Connected and software-defined vehicles promise to offer a broad range of services and advanced functions to customers, aiming to increase passenger comfort and support autonomous driving capabilities. Due to the high reliability and availability requirements of connected vehicles, it is crucial to resolve any occurring failures quickly. To achieve this, however, a complex cloud/edge architecture with a mesh of dependencies must be navigated to diagnose the responsible root cause. As such, manual analyses become infeasible since they would significantly delay the troubleshooting. To address this challenge, this paper presents SDVDiag, an extensible platform for the automated diagnosis of connected vehicle functions. The platform enables the creation of pipelines that cover all steps from initial data collection to the tracing of potential root causes. In addition, SDVDiag supports self-adaptive behavior through the ability to exchange modules at runtime. Dependencies between functions are detected and continuously updated, resulting in a dynamic graph view of the system. In addition, vital system metrics are monitored for anomalies. Whenever an incident is investigated, a snapshot of the graph is taken and augmented by relevant anomalies. Finally, the analysis is performed by traversing the graph and creating a ranking of the most likely causes. To evaluate the platform, it is deployed inside a 5G test fleet environment for connected vehicle functions. The results show that injected faults can be detected reliably. As such, the platform offers the potential to gain new insights and reduce downtime by identifying problems and their causes at an early stage.
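[AI-5] Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
[Quick read]: This paper addresses the system-level cybersecurity risks introduced when Large Language Models (LLMs) integrate native code interpreters. Unlike traditional prompt-based attacks, these risks mainly take the form of malicious execution that exhausts CPU, memory, or disk resources. The key to the solution is CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a benchmark of 1,260 resource-exhaustion prompts covering both directly malicious and indirect, socially engineered variants. An automated evaluation pipeline checks whether a model refuses or generates risky code, and executes the generated code to assess its correctness, any simplifications made to render it safe, and execution timeouts, thereby systematically quantifying LLM robustness in code interpreter environments.
Link: https://arxiv.org/abs/2507.19399
Authors: Gabriel Chua
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: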
Abstract:As large language models (LLMs) increasingly integrate native code interpreters, they enable powerful real-time execution capabilities, substantially expanding their utility. However, such integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities. To systematically evaluate these interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion. Each risk category includes explicitly malicious (“direct”) and plausibly benign (“indirect”) prompt variants. Our automated evaluation framework assesses not only whether LLMs refuse or generate risky code, but also executes the generated code within the interpreter environment to evaluate code correctness, simplifications made by the LLM to make the code safe, or execution timeouts. Evaluating 7 commercially available models from OpenAI and Google, we uncover significant and inconsistent vulnerabilities. For instance, evaluations show substantial disparities even within providers: OpenAI’s o4-mini correctly refuses risky requests at a rate of 7.1%, notably higher than GPT-4.1 at 0.5%. Results particularly underscore that indirect, socially-engineered prompts substantially weaken model defenses. This highlights an urgent need for interpreter-specific cybersecurity benchmarks, dedicated mitigation tools (e.g., guardrails), and clear industry standards to guide safe and responsible deployment of LLM interpreter integrations. The benchmark dataset and evaluation code are publicly released to foster further research.
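[AI-6] ReCatcher: Towards LLMs Regression Testing for Code Generation
[Quick read]: This paper addresses regressions in LLM-based code generation caused by fine-tuning, merging, or new model releases, which can affect not only logical correctness but also static code quality and execution performance. The key to the solution is ReCatcher, a systematic regression testing framework that quantitatively compares two LLMs (for example, a current model and a candidate update) along three dimensions: logical correctness, static code quality, and execution performance. An empirical study on CodeLlama, DeepSeek-Coder, and GPT-4o reveals notable regressions under different update strategies, and ReCatcher shows better and more consistent accuracy than baseline methods, giving a reliable basis for model update decisions.
Link: https://arxiv.org/abs/2507.19390
Authors: Altaf Allah Abbassi,Leuson Da Silva,Amin Nikanjam,Foutse Khomh
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 24 pages, 3 Figures, 2 Tables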
Abstract:Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios, fine-tuning, merging, and model release, using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Comparing ReCatcher with baseline solutions, it presents better and consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.
[AI-7] Learning neuro-symbolic convergent term rewriting systems
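[Quick read]: This paper addresses the challenge of building neural systems that learn to execute symbolic algorithms, in particular how to achieve strong generalization and out-of-distribution performance. The core problem is that existing methods struggle to reason reliably and accurately on unseen data. The key to the solution is a neuro-symbolic architecture inspired by the rewriting algorithm itself, which learns convergent term rewriting systems so that the model carries out symbolic computation in a structured, interpretable way. The framework has two modular implementations, the Neural Rewriting System (NRS) and the Fast Neural Rewriting System (FastNRS); FastNRS is significantly better in memory efficiency, training speed, and inference time while retaining strong generalization on mathematical formula simplification tasks and in a multi-domain learning setting.
Link: https://arxiv.org/abs/2507.19372
Authors: Flavio Petruzzellis,Alberto Testolin,Alessandro Sperduti
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 48 pages, 31 figures. Submitted for review by Artificial Intelligence Journal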
Abstract:Building neural systems that can learn to execute symbolic algorithms is a challenging open problem in artificial intelligence, especially when aiming for strong generalization and out-of-distribution performance. In this work, we introduce a general framework for learning convergent term rewriting systems using a neuro-symbolic architecture inspired by the rewriting algorithm itself. We present two modular implementations of such architecture: the Neural Rewriting System (NRS) and the Fast Neural Rewriting System (FastNRS). As a result of algorithmic-inspired design and key architectural elements, both models can generalize to out-of-distribution instances, with FastNRS offering significant improvements in terms of memory efficiency, training speed, and inference time. We evaluate both architectures on four tasks involving the simplification of mathematical formulas and further demonstrate their versatility in a multi-domain learning scenario, where a single model is trained to solve multiple types of problems simultaneously. The proposed system significantly outperforms two strong neural baselines: the Neural Data Router, a recent transformer variant specifically designed to solve algorithmic problems, and GPT-4o, one of the most powerful general-purpose large-language models. Moreover, our system matches or outperforms the latest o1-preview model from OpenAI that excels in reasoning benchmarks.
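For readers unfamiliar with term rewriting, the following is a minimal hand-written sketch of the symbolic procedure such systems are meant to learn: rules are applied repeatedly until no rule fires, which yields a normal form. The term encoding (nested tuples) and the five simplification rules are assumptions chosen for illustration; the paper's models learn rewriting rather than hard-coding rules like these.

```python
def rewrite_once(term):
    """Rewrite sub-terms first, then try each rule once at the root."""
    if not isinstance(term, tuple):
        return term  # variables (strings) and constants are already normal
    op, left, right = term
    left, right = rewrite_once(left), rewrite_once(right)
    if op == "+" and right == 0:
        return left                      # x + 0 -> x
    if op == "+" and left == 0:
        return right                     # 0 + x -> x
    if op == "*" and (left == 0 or right == 0):
        return 0                         # x * 0 -> 0, 0 * x -> 0
    if op == "*" and right == 1:
        return left                      # x * 1 -> x
    if op == "*" and left == 1:
        return right                     # 1 * x -> x
    return (op, left, right)


def normalize(term):
    """Apply rewrite passes until the term stops changing (its normal form)."""
    while True:
        rewritten = rewrite_once(term)
        if rewritten == term:
            return term
        term = rewritten


# (x * 1) + (y * 0)
expression = ("+", ("*", "x", 1), ("*", "y", 0))
print(normalize(expression))
```

Every rule here strictly shrinks the term, so the loop terminates; this rule set is also confluent, so the result does not depend on the order in which rules are tried.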
[AI-8] Counterfactual Explanations in Medical Imaging: Exploring SPN-Guided Latent Space Manipulation
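[Quick read]: This paper addresses the limited interpretability of black-box deep learning models in medical image analysis, specifically how to generate counterfactual explanations that humans can understand: hypothetical changes to the input that alter the model's classification while remaining similar to the original data. The key to the solution is a model-specific optimization approach that uses the latent space of a variational autoencoder (VAE) and models the likelihood of that latent space with a sum-product network (SPN), which then serves both as a latent space descriptor and as a classifier for the discrimination task. This formulation allows latent-space counterfactuals to be optimized so that they stay close to the original data distribution while aligning with the target class distribution, with the aim of making counterfactual explanations more plausible and interpretable.
Link: https://arxiv.org/abs/2507.19368
Authors: Julia Siekiera,Stefan Kramer
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures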
Abstract:Artificial intelligence is increasingly leveraged across various domains to automate decision-making processes that significantly impact human lives. In medical image analysis, deep learning models have demonstrated remarkable performance. However, their inherent complexity makes them black box systems, raising concerns about reliability and interpretability. Counterfactual explanations provide comprehensible insights into decision processes by presenting hypothetical “what-if” scenarios that alter model classifications. By examining input alterations, counterfactual explanations provide patterns that influence the decision-making process. Despite their potential, generating plausible counterfactuals that adhere to similarity constraints providing human-interpretable explanations remains a challenge. In this paper, we investigate this challenge by a model-specific optimization approach. While deep generative models such as variational autoencoders (VAEs) exhibit significant generative power, probabilistic models like sum-product networks (SPNs) efficiently represent complex joint probability distributions. By modeling the likelihood of a semi-supervised VAE’s latent space with an SPN, we leverage its dual role as both a latent space descriptor and a classifier for a given discrimination task. This formulation enables the optimization of latent space counterfactuals that are both close to the original data distribution and aligned with the target class distribution. We conduct experimental evaluation on the cheXpert dataset. To evaluate the effectiveness of the integration of SPNs, our SPN-guided latent space manipulation is compared against a neural network baseline. Additionally, the trade-off between latent variable regularization and counterfactual quality is analyzed.
[AI-9] Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges
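[Quick read]: This paper examines how large language models (LLMs) can be used effectively in social simulation, clarifying their potential and limitations for simulating human social behavior from a computational social science perspective. It notes that although LLMs show some ability to replicate Theory of Mind reasoning and social inference, key limitations remain, including cognitive biases, lack of true understanding, and inconsistent behavior. The key proposal is a hybrid modeling approach: integrating LLMs into traditional rule-based agent-based modeling platforms (such as GAMA and NetLogo) to combine the expressive flexibility of language-based reasoning with the transparency and analytical rigor of classical rule-based systems, with attention to behavioral fidelity, calibration, and reproducibility in large-scale simulations.
Link: https://arxiv.org/abs/2507.19364
Authors: Patrick Taillandier,Jean Daniel Zucker,Arnaud Grignard,Benoit Gaudou,Nghi Quang Huynh,Alexis Drogoul
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: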
Abstract:This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing both their potential and their limitations from a computational social science perspective. The first part reviews recent findings on the ability of LLMs to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while also highlighting significant limitations such as cognitive biases, lack of true understanding, and inconsistencies in behavior. The second part surveys emerging applications of LLMs in multi-agent simulation frameworks, focusing on system architectures, scale, and validation strategies. Notable projects such as Generative Agents (Smallville) and AgentSociety are discussed in terms of their design choices, empirical grounding, and methodological innovations. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large-scale LLM-driven simulations. The final section distinguishes between contexts where LLMs, like other black-box systems, offer direct value, such as interactive simulations and serious games, and those where their use is more problematic, notably in explanatory or predictive modeling. The paper concludes by advocating for hybrid approaches that integrate LLMs into traditional agent-based modeling platforms (GAMA, NetLogo, etc.), enabling modelers to combine the expressive flexibility of language-based reasoning with the transparency and analytical rigor of classical rule-based systems.
[AI-10] Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
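[Quick read]: This paper addresses two key problems in current LLM-based tabular data augmentation: dense dependency modeling among features, which can introduce bias, and the high computational cost of sampling. The core of the solution is SPADA (SPArse Dependency-driven Augmentation), a lightweight generative framework that explicitly models sparse dependencies between features through an LLM-induced graph, treating each feature as a node and synthesizing its value conditioned only on its parent nodes. It uses two synthesis strategies, a non-parametric method based on Gaussian kernel density estimation and a conditional normalizing flow that learns invertible mappings, reducing constraint violations by 4% compared with diffusion-based methods and generating data nearly 9,500 times faster than LLM-based baselines.
Link: https://arxiv.org/abs/2507.19334
Authors: Shuo Yang,Zheyu Zhang,Bardh Prenkaj,Gjergji Kasneci
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: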
Abstract:Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.
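The graph-traversal idea in the abstract (sample each feature only after its parents, conditioning on the parents' values) can be sketched as follows. This is an illustration under strong assumptions, not SPADA itself: the two-feature graph and the four-row table are made up, the graph is written by hand rather than induced by an LLM, parents are assumed categorical so conditioning is an exact match, and the kernel density step is the textbook shortcut of picking a data point and adding Gaussian noise.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical sparse dependency graph: feature -> set of parent features.
parents = {"age_group": set(), "income": {"age_group"}}

# Made-up training rows (not data from the paper).
rows = [
    {"age_group": "young", "income": 30.0},
    {"age_group": "young", "income": 35.0},
    {"age_group": "senior", "income": 60.0},
    {"age_group": "senior", "income": 66.0},
]


def sample_feature(feature, partial_row, rng, bandwidth=1.0):
    """Draw one value for `feature` given the already-sampled parent values."""
    matches = [
        r[feature]
        for r in rows
        if all(r[p] == partial_row[p] for p in parents[feature])
    ]
    value = rng.choice(matches)
    if isinstance(value, float):
        # Sampling from a Gaussian KDE: pick a data point, then add Gaussian noise.
        value = rng.gauss(value, bandwidth)
    return value


def sample_row(rng):
    """Visit features in topological order so parents are always sampled first."""
    row = {}
    for feature in TopologicalSorter(parents).static_order():
        row[feature] = sample_feature(feature, row, rng)
    return row


rng = random.Random(0)
print(sample_row(rng))
```

Because each feature looks only at its parents, the cost of one synthetic row grows with the number of edges in the graph rather than with all feature pairs, which is the intuition behind preferring a sparse graph.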
[AI-11] Towards LLM-Enhanced Group Recommender Systems
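[Quick read]: This paper addresses the additional complexity that group recommender systems face in practice, including understanding group dynamics (such as social dependencies within the group), defining effective decision-making processes, ensuring that recommendations suit all group members, and providing both group-level and individual-level explanations. The key idea is to use large language models (LLMs) to support each of these aspects and thereby improve the overall decision support quality and applicability of group recommender systems.
Link: https://arxiv.org/abs/2507.19283
Authors: Sebastian Lubos,Alexander Felfernig,Thi Ngoc Trang Tran,Viet-Man Le,Damian Garber,Manuel Henrich,Reinhard Willfort,Jeremias Fuchs
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: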
Abstract:In contrast to single-user recommender systems, group recommender systems are designed to generate and explain recommendations for groups. This group-oriented setting introduces additional complexities, as several factors - absent in individual contexts - must be addressed. These include understanding group dynamics (e.g., social dependencies within the group), defining effective decision-making processes, ensuring that recommendations are suitable for all group members, and providing group-level explanations as well as explanations for individual users. In this paper, we analyze in which way large language models (LLMs) can support these aspects and help to increase the overall decision support quality and applicability of group recommender systems.
[AI-12] Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects
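[Quick read]: This paper addresses the difficulty of performing code review efficiently in industrial settings, where it is time-consuming and cognitively demanding, and asks how language models (LMs) can automate core review tasks. The key to the solution is monolingual fine-tuning of open-source LMs for three tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. The study finds that fine-tuning on a dataset specific to one programming language (C#), with different configurations of programming and natural languages in the training data, improves accuracy and relevance over multilingual baselines. Although the models can support routine or repetitive review tasks, human reviewers remain better at semantically complex or context-sensitive changes, which underlines the importance of language alignment and task-specific adaptation for automated code review.
Link: https://arxiv.org/abs/2507.19271
Authors: Igli Begolli,Meltem Aksoy,Daniel Neider
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: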
Abstract:Code review is essential for maintaining software quality but often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents the empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C# specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.
[AI-13] Modeling Uncertainty: Constraint-Based Belief States in Imperfect-Information Games
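[Quick read]: This paper addresses the difficulty agents have in making good decisions in imperfect-information games, where they lack full knowledge of the game state. The core challenge is how to represent and update an agent's beliefs about hidden state (such as hidden piece identities) efficiently. The key to the solution is the Belief Stochastic Game model, which delegates state estimation to the game model itself so that agents decide directly on externally provided belief states and no game-specific inference logic is needed. The paper compares two belief representations, a constraint-based model built on Constraint Satisfaction Problems (CSPs) and a probabilistic extension that uses Belief Propagation, and finds that the constraint-based model yields agent performance comparable to probabilistic inference across two different games, suggesting that constraint-based belief states alone may suffice for effective decision-making in many settings.
Link: https://arxiv.org/abs/2507.19263
Authors: Achille Morenville,Éric Piette
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: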
Abstract:In imperfect-information games, agents must make decisions based on partial knowledge of the game state. The Belief Stochastic Game model addresses this challenge by delegating state estimation to the game model itself. This allows agents to operate on externally provided belief states, thereby reducing the need for game-specific inference logic. This paper investigates two approaches to represent beliefs in games with hidden piece identities: a constraint-based model using Constraint Satisfaction Problems and a probabilistic extension using Belief Propagation to estimate marginal probabilities. We evaluated the impact of both representations using general-purpose agents across two different games. Our findings indicate that constraint-based beliefs yield results comparable to those of probabilistic inference, with minimal differences in agent performance. This suggests that constraint-based belief states alone may suffice for effective decision-making in many settings.
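The difference between a constraint-based belief (which identities are still possible for each piece) and a probabilistic one (how likely each identity is) can be shown on a toy example. Everything here is hypothetical: the three pieces, the three identities, and the "only scouts can move" rule are invented for illustration, and the marginals are computed by exhaustive enumeration as a stand-in, not by the Belief Propagation the paper uses.

```python
from itertools import permutations

identities = ["bomb", "flag", "scout"]  # one of each is hidden on the board
pieces = ["p1", "p2", "p3"]


def consistent_worlds(constraints):
    """Enumerate every identity assignment that satisfies all constraints."""
    worlds = []
    for perm in permutations(identities):
        world = dict(zip(pieces, perm))
        if all(check(world) for check in constraints):
            worlds.append(world)
    return worlds


def belief_state(worlds):
    """Constraint-based belief: the identities each piece can still have."""
    return {p: {w[p] for w in worlds} for p in pieces}


def marginals(worlds):
    """Probabilistic belief: share of consistent worlds per identity."""
    return {
        p: {i: sum(w[p] == i for w in worlds) / len(worlds) for i in identities}
        for p in pieces
    }


# Hypothetical observation: p2 moved, and in this toy game only scouts can move.
constraints = [lambda w: w["p2"] == "scout"]
worlds = consistent_worlds(constraints)
print(belief_state(worlds))
print(marginals(worlds))
```

In this example the constraint-based belief already tells the agent that `p2` is the scout and that `p1` and `p3` are the bomb and the flag in some order; the marginals add only the 50/50 split between those two.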
[AI-14] Knowledge Grafting: A Mechanism for Optimizing AI Model Deployment in Resource-Constrained Environments
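[Quick read]: This paper addresses the difficulty of deploying large AI models in resource-constrained environments, that is, how to cut model size and compute requirements substantially while keeping performance high. The key to the solution is a new mechanism called knowledge grafting, which transfers selected features (the scion) from a large donor model to a smaller rootstock model, compressing the model and improving performance together. Experiments show an 88.54% reduction in model size, with validation accuracy rising from the donor's 87.47% to 89.97% and 90.45% accuracy on unseen test data, easing the usual trade-off between model size and performance.
Link: https://arxiv.org/abs/2507.19261
Authors: Osama Almurshed,Ashish Kaushal,Asmail Muftah,Nitin Auluck,Omer Rana
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 18 pages, 4 figures, ArXiv preprint - Novel “knowledge grafting” technique achieving 88.54% AI model size reduction while improving accuracy for resource-constrained deployment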
Abstract:The increasing adoption of Artificial Intelligence (AI) has led to larger, more complex models with numerous parameters that require substantial computing power – resources often unavailable in many real-world application scenarios. Our paper addresses this challenge by introducing knowledge grafting, a novel mechanism that optimizes AI models for resource-constrained environments by transferring selected features (the scion) from a large donor model to a smaller rootstock model. The approach achieves an 88.54% reduction in model size (from 64.39 MB to 7.38 MB), while improving generalization capability of the model. Our new rootstock model achieves 89.97% validation accuracy (vs. donor’s 87.47%), maintains lower validation loss (0.2976 vs. 0.5068), and performs exceptionally well on unseen test data with 90.45% accuracy. It addresses the typical size vs performance trade-off, and enables deployment of AI frameworks on resource-constrained devices with enhanced performance. We have tested our approach on an agricultural weed detection scenario, however, it can be extended across various edge computing scenarios, potentially accelerating AI adoption in areas with limited hardware/software support – by mirroring in a similar manner the horticultural grafting enables productive cultivation in challenging agri-based environments.
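The size-reduction figure can be checked directly from the two model sizes quoted in the abstract. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
donor_mb = 64.39      # donor model size reported in the abstract
rootstock_mb = 7.38   # rootstock model size reported in the abstract

# Relative reduction in size, as a percentage of the donor model.
reduction_percent = (donor_mb - rootstock_mb) / donor_mb * 100

# Equivalent compression ratio (how many times smaller the rootstock is).
compression_ratio = donor_mb / rootstock_mb

print(f"size reduction: {reduction_percent:.2f}%")
print(f"compression ratio: {compression_ratio:.2f}x")
```

The result agrees with the 88.54% stated in the abstract, which corresponds to a model a little under nine times smaller than the donor.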
[AI-15] Transfinite Fixed Points in Alpay Algebra as Ordinal Game Equilibria in Dependent Type Theory
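[Quick read]: This paper addresses how infinite self-referential systems can converge stably within a formal semantic framework, and in particular gives a rigorous proof of the philosophical claim of semantic convergence made in Alpay Algebra. The core question is whether a unique, stable limit exists when a system engages in an unbounded revision dialogue with its environment over infinitely many iterations. The key to the solution is to extend classical fixed point theory to the transfinite domain, using well-founded induction and order-theoretic continuity to construct a transfinite fixed point operator, and then to embed that operator into dependent type theory so that each step of the iteration and its limit can be verified constructively in a modern proof assistant. This establishes that the iterative dialogue necessarily converges to a unique limit and provides formal tools for certifying the convergence of infinite self-referential systems.
Link: https://arxiv.org/abs/2507.19245
Authors: Faruk Alpay,Bugra Kilictas,Taylan Alpay
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 21 pages, 1 figure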
Abstract:This paper contributes to the Alpay Algebra by demonstrating that the stable outcome of a self-referential process, obtained by iterating a transformation through all ordinal stages, is identical to the unique equilibrium of an unbounded revision dialogue between a system and its environment. The analysis initially elucidates how classical fixed point theorems guarantee such convergence in finite settings and subsequently extends the argument to the transfinite domain, relying upon well-founded induction and principles of order-theoretic continuity. Furthermore, the resulting transordinal fixed point operator is embedded into dependent type theory, a formalization which permits every step of the transfinite iteration and its limit to be verified within a modern proof assistant. This procedure yields a machine-checked proof that the iterative dialogue necessarily stabilizes and that its limit is unique. The result provides a foundation for Alpay’s philosophical claim of semantic convergence within the framework of constructive logic. By unifying concepts from fixed point theory, game semantics, ordinal analysis, and type theory, this research establishes a broadly accessible yet formally rigorous foundation for reasoning about infinite self-referential systems and offers practical tools for certifying their convergence within computational environments.
MSC classes: 68T27, 03B70, 68Q55
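The finite case that the abstract mentions as its starting point can be illustrated with ordinary fixed point iteration: apply a monotone transformation repeatedly, starting from the bottom element, until the result stops changing. The sketch below covers only that finite setting, using graph reachability as the example; the graph and function names are assumptions, and nothing here models the transfinite stages or the type-theoretic embedding that are the paper's actual contribution.

```python
def least_fixed_point(step, bottom):
    """Iterate `step` from `bottom` until it stabilizes.

    On a finite lattice with a monotone `step`, this reaches the least
    fixed point after finitely many rounds.
    """
    current = bottom
    while True:
        following = step(current)
        if following == current:
            return current
        current = following


# Hypothetical directed graph: node -> set of successors.
edges = {"a": {"b"}, "b": {"c"}, "c": set(), "d": {"a"}}


def reach_from_a(known):
    """Monotone step: keep what is known, add 'a' and all one-step successors."""
    successors = frozenset(t for s in known for t in edges[s])
    return frozenset({"a"}) | known | successors


reachable = least_fixed_point(reach_from_a, frozenset())
print(sorted(reachable))
```

The iteration stabilizes on the set of nodes reachable from `a`; node `d` is never added because nothing reachable points to it.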
[AI-16] Virne: A Comprehensive Benchmark for Deep RL-based Network Resource Allocation in NFV
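[Quick read]: This paper addresses the lack of a systematic benchmarking framework and thorough analysis for resource allocation (RA) in Network Function Virtualization (NFV), a gap that hinders the development of effective algorithms for emerging networks and leads to inconsistent evaluation. The key to the solution is Virne, a comprehensive benchmarking framework whose main strengths are customizable simulations for diverse network scenarios (such as cloud, edge, and 5G); a modular, extensible implementation pipeline that supports more than 30 methods of various types, with a focus on deep reinforcement learning (deep RL); and evaluation perspectives beyond effectiveness (such as scalability and generalization). Through extensive experiments, Virne offers insights into the performance trade-offs of existing methods and actionable guidance for future research.
Link: https://arxiv.org/abs/2507.19234
Authors: Tianfu Wang,Liwei Deng,Xi Chen,Junyang Wang,Huiguo He,Leilei Ding,Wei Wu,Qilin Fan,Hui Xiong
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: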
Abstract:Resource allocation (RA) is critical to efficient service deployment in Network Function Virtualization (NFV), a transformative networking paradigm. Recently, deep Reinforcement Learning (RL)-based methods have been showing promising potential to address this complexity. However, the lack of a systematic benchmarking framework and thorough analysis hinders the exploration of emerging networks and the development of more robust algorithms while causing inconsistent evaluation. In this paper, we introduce Virne, a comprehensive benchmarking framework for the NFV-RA problem, with a focus on supporting deep RL-based methods. Virne provides customizable simulations for diverse network scenarios, including cloud, edge, and 5G environments. It also features a modular and extensible implementation pipeline that supports over 30 methods of various types, and includes practical evaluation perspectives beyond effectiveness, such as generalization and scalability. Furthermore, we conduct in-depth analysis through extensive experiments to provide valuable insights into performance trade-offs for efficient implementation and offer actionable guidance for future research directions. Overall, with its diverse simulations, rich implementations, and extensive evaluation capabilities, Virne could serve as a comprehensive benchmark for advancing NFV-RA methods and deep RL applications. The code is publicly available at this https URL.
[AI-17] PrompTrend: Continuous Community-Driven Vulnerability Discovery and Assessment for Large Language Models
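[Quick read]: This paper addresses the failure of static benchmarks to capture LLM vulnerabilities that keep emerging through experimentation in online communities. The key to the solution is PrompTrend, a system that collects vulnerability data across platforms and evaluates it with multidimensional scoring, on an architecture designed for scalable monitoring. The study finds that advanced capabilities correlate with increased vulnerability in some architectures, that psychological attacks significantly outperform technical exploits, and that platform dynamics shape attack effectiveness, indicating that LLM security requires comprehensive socio-technical monitoring beyond traditional periodic assessment.
Link: https://arxiv.org/abs/2507.19185
Authors: Tarek Gasmi,Ramzi Guesmi,Mootez Aloui,Jihene Bennaceur
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: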
Abstract:Static benchmarks fail to capture LLM vulnerabilities emerging through community experimentation in online forums. We present PrompTrend, a system that collects vulnerability data across platforms and evaluates them using multidimensional scoring, with an architecture designed for scalable monitoring. Cross-sectional analysis of 198 vulnerabilities collected from online communities over a five-month period (January-May 2025) and tested on nine commercial models reveals that advanced capabilities correlate with increased vulnerability in some architectures, psychological attacks significantly outperform technical exploits, and platform dynamics shape attack effectiveness with measurable model-specific patterns. The PrompTrend Vulnerability Assessment Framework achieves 78% classification accuracy while revealing limited cross-model transferability, demonstrating that effective LLM security requires comprehensive socio-technical monitoring beyond traditional periodic assessment. Our findings challenge the assumption that capability advancement improves security and establish community-driven psychological manipulation as the dominant threat vector for current language models.
[AI-18] Faster Lifting for Ordered Domains with Predecessor Relations
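[Quick read]: This paper addresses the efficiency of lifted inference on ordered domains with predecessor relations, where conventional weighted first-order model counting (WFOMC) encodes the order through a linear order axiom that introduces a binary predicate, and existing algorithms struggle in practice. The key to the solution is to treat predecessor relations as a native part of the axiom rather than encoding them indirectly through the linear order predicate, and to design a new algorithm that inherently supports them. The algorithm gives an exponential speedup for the immediate and second predecessor relations, which are known to be tractable, and also handles general k-th predecessor relations, improving efficiency on lifted inference tasks and combinatorics problems.
Link: https://arxiv.org/abs/2507.19182
Authors: Kuncheng Zou,Jiahao Mai,Yonggang Zhang,Yuyi Wang,Ondřej Kuželka,Yuanhong Wang,Yi Chang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: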
Abstract:We investigate lifted inference on ordered domains with predecessor relations, where the elements of the domain respect a total (cyclic) order, and every element has a distinct (clockwise) predecessor. Previous work has explored this problem through weighted first-order model counting (WFOMC), which computes the weighted sum of models for a given first-order logic sentence over a finite domain. In WFOMC, the order constraint is typically encoded by the linear order axiom introducing a binary predicate in the sentence to impose a linear ordering on the domain elements. The immediate and second predecessor relations are then encoded by the linear order predicate. Although WFOMC with the linear order axiom is theoretically tractable, existing algorithms struggle with practical applications, particularly when the predecessor relations are involved. In this paper, we treat predecessor relations as a native part of the axiom and devise a novel algorithm that inherently supports these relations. The proposed algorithm not only provides an exponential speedup for the immediate and second predecessor relations, which are known to be tractable, but also handles the general k-th predecessor relations. The extensive experiments on lifted inference tasks and combinatorics math problems demonstrate the efficiency of our algorithm, achieving speedups of a full order of magnitude.
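To make the counting problem concrete, here is a naive brute-force model counter for one small sentence over a cyclic domain with an immediate predecessor relation: "no element and its predecessor are both in R". This is the exponential enumeration that lifted algorithms are designed to avoid, not the paper's algorithm; the sentence, the unweighted setting, and the function name are assumptions chosen for illustration.

```python
from itertools import product


def count_models(n):
    """Count unary predicates R over a cyclic domain {0, ..., n-1} such that
    no element i and its cyclic predecessor (i - 1) mod n are both in R.

    Enumerates all 2**n interpretations of R, so it is only usable for small n.
    """
    count = 0
    for interpretation in product([False, True], repeat=n):
        if all(
            not (interpretation[i] and interpretation[(i - 1) % n])
            for i in range(n)
        ):
            count += 1
    return count


for size in (3, 4, 5):
    print(size, count_models(size))
```

For domain sizes 3, 4, and 5 the counts are 4, 7, and 11, matching the number of independent sets on a cycle (the Lucas numbers), which is the kind of combinatorics problem the abstract refers to.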
[AI-19] ReCoDe: Reinforcement Learning-based Dynamic Constraint Design for Multi-Agent Coordination
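[Quick read]: This paper addresses the failure of handcrafted constraint-based optimization controllers in multi-agent settings that demand complex coordination, especially navigation tasks that require context-based movements and consensus. The key to the solution is ReCoDe (Reinforcement-based Constraint Design), a decentralized hybrid framework that keeps the user-defined expert controller and uses multi-agent reinforcement learning (MARL) to learn additional, dynamic constraints that capture subtler behaviors (such as avoiding congested areas), with agents using local communication to jointly constrain their allowed actions and coordinate more effectively. The paper gives evidence that retaining and improving an existing controller is more efficient than learning from scratch, and ReCoDe can dynamically adjust how much it relies on the original controller.
Link: https://arxiv.org/abs/2507.19151
Authors: Michael Amir,Guang Yang,Zhan Gao,Keisuke Okumura,Heedo Woo,Amanda Prorok
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: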
Abstract:Constraint-based optimization is a cornerstone of robotics, enabling the design of controllers that reliably encode task and safety requirements such as collision avoidance or formation adherence. However, handcrafted constraints can fail in multi-agent settings that demand complex coordination. We introduce ReCoDe (Reinforcement-based Constraint Design), a decentralized, hybrid framework that merges the reliability of optimization-based controllers with the adaptability of multi-agent reinforcement learning. Rather than discarding expert controllers, ReCoDe improves them by learning additional, dynamic constraints that capture subtler behaviors, for example, by constraining agent movements to prevent congestion in cluttered scenarios. Through local communication, agents collectively constrain their allowed actions to coordinate more effectively under changing conditions. In this work, we focus on applications of ReCoDe to multi-agent navigation tasks requiring intricate, context-based movements and consensus, where we show that it outperforms purely handcrafted controllers, other hybrid approaches, and standard MARL baselines. We give empirical (real robot) and theoretical evidence that retaining a user-defined controller, even when it is imperfect, is more efficient than learning from scratch, especially because ReCoDe can dynamically change the degree to which it relies on this controller.
[AI-20] Solar Photovoltaic Assessment with Large Language Model
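[Quick read]: This paper addresses the accuracy and generalization of solar photovoltaic (PV) panel detection and localization in satellite imagery. Existing methods lack transparency about their algorithms, rely on large, high-quality training data, and transfer poorly to new geographic regions or environmental conditions, which produces inconsistent detections and hinders large-scale deployment and data-driven optimization of microgrids and active distribution networks (ADNs). The key to the solution is the PV Assessment with LLMs (PVAL) framework, which uses task decomposition for more efficient workflows, output standardization for consistent formatting, few-shot prompting to improve classification accuracy, and fine-tuning on curated PV datasets with detailed annotations, aiming for transparency, scalability, and adaptability across heterogeneous datasets with low computational overhead in an automated, reproducible detection pipeline.
Link: https://arxiv.org/abs/2507.19144
Authors: Muhao Guo,Yang Weng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 7 figures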
Abstract:Accurate detection and localization of solar photovoltaic (PV) panels in satellite imagery is essential for optimizing microgrids and active distribution networks (ADNs), which are critical components of renewable energy systems. Existing methods lack transparency regarding their underlying algorithms or training datasets, rely on large, high-quality PV training data, and struggle to generalize to new geographic regions or varied environmental conditions without extensive re-training. These limitations lead to inconsistent detection outcomes, hindering large-scale deployment and data-driven grid optimization. In this paper, we investigate how large language models (LLMs) can be leveraged to overcome these challenges. Despite their promise, LLMs face several challenges in solar panel detection, including difficulties with multi-step logical processes, inconsistent output formatting, frequent misclassification of visually similar objects (e.g., shadows, parking lots), and low accuracy in complex tasks such as spatial localization and quantification. To overcome these issues, we propose the PV Assessment with LLMs (PVAL) framework, which incorporates task decomposition for more efficient workflows, output standardization for consistent and scalable formatting, few-shot prompting to enhance classification accuracy, and fine-tuning using curated PV datasets with detailed annotations. PVAL ensures transparency, scalability, and adaptability across heterogeneous datasets while minimizing computational overhead. By combining open-source accessibility with robust methodologies, PVAL establishes an automated and reproducible pipeline for solar panel detection, paving the way for large-scale renewable energy integration and optimized grid management.
[AI-21] Graph Structure Learning with Privacy Guarantees for Open Graph Data
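[Quick read]: This paper addresses privacy protection for large-scale open graph datasets at the publishing stage, where existing privacy-preserving data publishing (PPDP) methods struggle to balance privacy and utility, particularly when data publishers and users are distinct. The core of the solution is a structured noise-injection mechanism based on Gaussian differential privacy (GDP) that adds noise when the data is published rather than during model training, allowing unbiased recovery of the graph structure under differential privacy. Unlike traditional methods that perturb gradients or model updates, the approach comes with theoretical guarantees on estimation accuracy, extends to discrete-variable graphs, and shows robust performance in graph learning experiments.
Link: https://arxiv.org/abs/2507.19116
Authors: Muhao Guo,Jiaqi Wu,Yang Weng,Yizheng Liao,Shengzhe Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: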
Abstract:Ensuring privacy in large-scale open datasets is increasingly challenging under regulations such as the General Data Protection Regulation (GDPR). While differential privacy (DP) provides strong theoretical guarantees, it primarily focuses on noise injection during model training, neglecting privacy preservation at the data publishing stage. Existing privacy-preserving data publishing (PPDP) approaches struggle to balance privacy and utility, particularly when data publishers and users are distinct entities. To address this gap, we focus on the graph recovery problem and propose a novel privacy-preserving estimation framework for open graph data, leveraging Gaussian DP (GDP) with a structured noise-injection mechanism. Unlike traditional methods that perturb gradients or model updates, our approach ensures unbiased graph structure recovery while enforcing DP at the data publishing stage. Moreover, we provide theoretical guarantees on estimation accuracy and extend our method to discrete-variable graphs, a setting often overlooked in DP research. Experimental results in graph learning demonstrate robust performance, offering a viable solution for privacy-conscious graph analysis.
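As background for the GDP setting, the plain Gaussian mechanism is sketched below: adding zero-mean Gaussian noise with standard deviation equal to the L2 sensitivity divided by mu satisfies mu-GDP, and because the noise has zero mean the released values remain unbiased. This is the generic textbook mechanism, not the structured noise-injection scheme the paper proposes; the function name, the statistics vector, and the parameter values are assumptions made for illustration.

```python
import random


def gaussian_mechanism(values, l2_sensitivity, mu, rng):
    """Release `values` with i.i.d. N(0, sigma^2) noise, sigma = l2_sensitivity / mu.

    `l2_sensitivity` must bound how far the whole vector can move (in L2 norm)
    when one individual's data changes; a smaller `mu` means stronger privacy.
    """
    sigma = l2_sensitivity / mu
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
true_stats = [0.8, 0.1, 0.0]  # made-up pairwise statistics, not data from the paper
published = gaussian_mechanism(true_stats, l2_sensitivity=1.0, mu=2.0, rng=rng)
print(published)
```

A single release is noisy, but averaging many independent releases converges to the true statistic, which is the sense in which publishing-stage noise can still permit unbiased estimation downstream.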
[AI-22] Automated Code Review Using Large Language Models at Ericsson: An Experience Report
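[Quick read]: This paper addresses the fact that code review depends heavily on experienced developers and is time-consuming; automating it can reduce their cognitive burden so that they can focus on writing new features and fixing bugs. The key to the solution is a lightweight automated code review tool that combines large language models (LLMs) with static program analysis, evaluated in preliminary experiments with experienced developers that gave encouraging results.
Link: https://arxiv.org/abs/2507.19115
Authors: Shweta Ramesh,Joy Bose,Hamender Singh,A K Raghavan,Sujoy Roychowdhury,Giriprasad Sridhara,Nishrith Saini,Ricardo Britto
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: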
Abstract:Code review is one of the primary means of assuring the quality of released software along with testing and static analysis. However, code review requires experienced developers who may not always have the time to perform an in-depth review of code. Thus, automating code review can help alleviate the cognitive burden on experienced software developers allowing them to focus on their primary activities of writing code to add new features and fix bugs. In this paper, we describe our experience in using Large Language Models towards automating the code review process in Ericsson. We describe the development of a lightweight tool using LLMs and static program analysis. We then describe our preliminary experiments with experienced developers in evaluating our code review tool and the encouraging results.
[AI-23] Pareto-NRPA: A Novel Monte-Carlo Search Algorithm for Multi-Objective Optimization ECAI2025
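[Quick read]: This paper addresses multi-objective optimization (MOO) over discrete search spaces, where the core challenge is to explore several conflicting objectives efficiently and produce a high-quality, diverse set of non-dominated solutions. The key to the solution is Pareto-NRPA, a new Monte-Carlo algorithm that extends Nested Rollout Policy Adaptation (NRPA), originally formulated for single-objective problems, by generalizing its nested search and policy update mechanism to the multi-objective setting. The algorithm uses a set of policies to explore different regions of the solution space concurrently, maintains non-dominated fronts at each level of search, and adapts policies according to the diversity and isolation of sequences within the Pareto front.
Link: https://arxiv.org/abs/2507.19109
Authors: Noé Lallouet,Tristan Cazenave,Cyrille Enderli
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint; accepted to ECAI 2025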
Abstract:We introduce Pareto-NRPA, a new Monte-Carlo algorithm designed for multi-objective optimization problems over discrete search spaces. Extending the Nested Rollout Policy Adaptation (NRPA) algorithm originally formulated for single-objective problems, Pareto-NRPA generalizes the nested search and policy update mechanism to multi-objective optimization. The algorithm uses a set of policies to concurrently explore different regions of the solution space and maintains non-dominated fronts at each level of search. Policy adaptation is performed with respect to the diversity and isolation of sequences within the Pareto front. We benchmark Pareto-NRPA on two classes of problems: a novel bi-objective variant of the Traveling Salesman Problem with Time Windows problem (MO-TSPTW), and a neural architecture search task on well-known benchmarks. Results demonstrate that Pareto-NRPA achieves competitive performance against state-of-the-art multi-objective algorithms, both in terms of convergence and diversity of solutions. Particularly, Pareto-NRPA strongly outperforms state-of-the-art evolutionary multi-objective algorithms on constrained search spaces. To our knowledge, this work constitutes the first adaptation of NRPA to the multi-objective setting.
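Maintaining a non-dominated front, as the abstract describes for each search level, can be illustrated with a minimal sketch. It assumes all objectives are minimized and represents each solution only by its objective vector; the names `dominates` and `update_front` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_front(front: List[Point], candidate: Point) -> List[Point]:
    """Offer `candidate` to the front and return the resulting non-dominated set."""
    # Reject the candidate if an existing point dominates or duplicates it.
    if any(dominates(p, candidate) or p == candidate for p in front):
        return front
    # Otherwise drop every point the candidate dominates and add it.
    return [p for p in front if not dominates(candidate, p)] + [candidate]


front: List[Point] = []
for point in [(3.0, 4.0), (2.0, 5.0), (4.0, 4.0), (1.0, 6.0), (2.0, 3.0)]:
    front = update_front(front, point)

print(sorted(front))
```

In this toy run, `(2.0, 3.0)` removes `(3.0, 4.0)` and `(2.0, 5.0)` from the front, while `(1.0, 6.0)` survives because it is better on the first objective.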
[AI-24] PBiLoss: Popularity-Aware Regularization to Improve Fairness in Graph-Based Recommender Systems
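[Quick read]: This paper addresses popularity bias in recommender systems based on graph neural networks (GNNs): models tend to over-recommend popular items, which reduces content diversity and harms fairness. The key to the solution is PBiLoss, a new regularization loss that augments the traditional training objective with a penalty on the model's inclination toward popular items, steering it toward less popular but potentially more personalized items. PBiLoss combines two sampling strategies, Popular Positive (PopPos) and Popular Negative (PopNeg), which modulate the contribution of popular positive and popular negative items during training, and it identifies popular items either with a fixed threshold or without any threshold, which makes the method flexible and adaptive. The method is model-agnostic and can be integrated into state-of-the-art graph-based frameworks such as LightGCN. It maintains or even improves recommendation accuracy while significantly reducing the Popularity-Rank Correlation for Users (PRU) and for Items (PRI), balancing accuracy and fairness.
Link: https://arxiv.org/abs/2507.19067
Authors: Mohammad Naeimi, Mostafa Haghir Chehreghani
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: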
Abstract:Recommender systems, especially those based on graph neural networks (GNNs), have achieved remarkable success in capturing user-item interaction patterns. However, they remain susceptible to popularity bias–the tendency to over-recommend popular items–resulting in reduced content diversity and compromised fairness. In this paper, we propose PBiLoss, a novel regularization-based loss function designed to counteract popularity bias in graph-based recommender models explicitly. PBiLoss augments traditional training objectives by penalizing the model’s inclination toward popular items, thereby encouraging the recommendation of less popular but potentially more personalized content. We introduce two sampling strategies: Popular Positive (PopPos) and Popular Negative (PopNeg), which respectively modulate the contribution of the positive and negative popular items during training. We further explore two methods to distinguish popular items: one based on a fixed popularity threshold and another without any threshold, making the approach flexible and adaptive. Our proposed method is model-agnostic and can be seamlessly integrated into state-of-the-art graph-based frameworks such as LightGCN and its variants. Comprehensive experiments across multiple real-world datasets demonstrate that PBiLoss significantly improves fairness, as demonstrated by reductions in the Popularity-Rank Correlation for Users (PRU) and Popularity-Rank Correlation for Items (PRI), while maintaining or even enhancing standard recommendation accuracy and ranking metrics. These results highlight the effectiveness of directly embedding fairness objectives into the optimization process, providing a practical and scalable solution for balancing accuracy and equitable content exposure in modern recommender systems.
[AI-25] MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster
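[Quick read]: This paper addresses the poor cluster scalability and low memory utilization of reinforcement learning (RL) training systems in large-scale distributed settings. The core difficulty comes from the heavy cross-node dependencies in RL training, in particular the dataflow patterns of the sample flow and the resharding flow. The key to the solution is to reorganize these data dependencies from a distributed view. First, a distributed transfer dock strategy places controllers and warehouses on top of the conventional replay buffer to reduce dispatch overhead in the sample flow. Second, a practical allgather–swap strategy eliminates redundant memory usage in the resharding flow. The system also integrates a range of parallelization strategies and acceleration techniques for end-to-end optimization. Experiments on Qwen and DeepSeek models show a throughput increase of 1.42 to 3.97 times.
Link: https://arxiv.org/abs/2507.19017
Authors: Laingjun Feng, Chenyi Pan, Xinjie Guo, Fei Mei, Benzhe Ning, Jianxiang Zhang, Xinyang Liu, Beirong Zhou, Zeng Shu, Chang Liu, Guang Yang, Zhenyu Han, Jiangben Wang, Bo Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages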
Abstract:Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms utilize multiple workers and can be modeled as a graph, where each node is the status of a worker and each edge represents dataflow between nodes. Owing to the heavy cross-node dependencies, the RL training system usually suffers from poor cluster scalability and low memory utilization. In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training. Unlike existing centralized methods, MindSpeed RL organizes the essential data dependencies in RL training, i.e., sample flow and resharding flow, from a distributed view. On the one hand, a distributed transfer dock strategy, which sets controllers and warehouses on the basis of the conventional replay buffer, is designed to release the dispatch overhead in the sample flow. A practical allgather–swap strategy is presented to eliminate redundant memory usage in resharding flow. In addition, MindSpeed RL further integrates numerous parallelization strategies and acceleration techniques for systematic optimization. Compared with existing state-of-the-art systems, comprehensive experiments on the RL training of popular Qwen2.5-Dense-7B/32B, Qwen3-MoE-30B, and DeepSeek-R1-MoE-671B show that MindSpeed RL increases the throughput by 1.42 ~ 3.97 times. Finally, we open–source MindSpeed RL and perform all the experiments on a super pod of Ascend with 384 neural processing units (NPUs) to demonstrate the powerful performance and reliability of Ascend.
[AI-26] A diffusion-based generative model for financial time series via geometric Brownian motion
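[Quick read]: This paper addresses the limitation that conventional diffusion models, when generating financial time series, do not adequately capture the dynamics of asset prices, in particular heteroskedasticity and key stylized facts such as heavy-tailed return distributions, volatility clustering, and the leverage effect. The key to the solution is to embed geometric Brownian motion (GBM) into the forward noising process so that the injected noise is proportional to the asset price level, which reflects the scale dependence of financial data more realistically. The reverse generative process is trained with denoising score matching using a Transformer-based architecture, which learns to recover realistic price paths from noise and reproduces the stylized facts of financial time series more faithfully.
Link: https://arxiv.org/abs/2507.19003
Authors: Gihun Kim, Sun-Yong Choi, Yeoneung Kim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: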
Abstract:We propose a novel diffusion-based generative framework for financial time series that incorporates geometric Brownian motion (GBM), the foundation of the Black–Scholes theory, into the forward noising process. Unlike standard score-based models that treat price trajectories as generic numerical sequences, our method injects noise proportionally to asset prices at each time step, reflecting the heteroskedasticity observed in financial time series. By accurately balancing the drift and diffusion terms, we show that the resulting log-price process reduces to a variance-exploding stochastic differential equation, aligning with the formulation in score-based generative models. The reverse-time generative process is trained via denoising score matching using a Transformer-based architecture adapted from the Conditional Score-based Diffusion Imputation (CSDI) framework. Empirical evaluations on historical stock data demonstrate that our model reproduces key stylized facts (heavy-tailed return distributions, volatility clustering, and the leverage effect) more realistically than conventional diffusion models.
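The reduction of the log-price process to a variance-exploding SDE can be sketched with Itô's lemma. The symbols $\mu_t$, $\sigma_t$, and $X_t$ below are generic notation chosen for illustration and may differ from the paper's, $\sigma_t$ is assumed to be a deterministic schedule, and the specific drift choice is one way the balance could be struck, not necessarily the authors' exact construction.

$$ dS_t = \mu_t S_t\,dt + \sigma_t S_t\,dW_t \quad\Longrightarrow\quad d\log S_t = \left(\mu_t - \tfrac{1}{2}\sigma_t^2\right)dt + \sigma_t\,dW_t $$

Setting $\mu_t = \tfrac{1}{2}\sigma_t^2$ cancels the drift of the log-price $X_t = \log S_t$:

$$ dX_t = \sigma_t\,dW_t, \qquad X_t \mid X_0 \sim \mathcal{N}\!\left(X_0,\ \int_0^t \sigma_s^2\,ds\right) $$

This has the drift-free form $dx = g(t)\,dW_t$ of a variance-exploding SDE with $g(t) = \sigma_t$, while the noise on the price itself stays proportional to $S_t$.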
[AI-27] GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units
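[Quick read]: This paper addresses the problem that, as AI workloads proliferate, conventional manual or heuristic methods explore the design space of arithmetic units (such as multipliers) inefficiently and incompletely. The core challenge is to generate and optimize hardware implementations efficiently so as to reduce key metrics such as power and area. The key to the solution is GENIAL, a machine-learning-based framework for automatic design and optimization. At its core is a Transformer-based surrogate model trained in two stages, self-supervised pretraining followed by supervised finetuning, that predicts hardware metrics such as power and area from abstracted design representations. By inverting this surrogate model, GENIAL searches for new operand encodings that directly minimize power consumption for a specific input data distribution. Experiments show that GENIAL is more sample-efficient and converges faster than other methods, and achieves up to 18% switching activity savings on representative AI workloads compared with the conventional two's complement representation.
Link: https://arxiv.org/abs/2507.18989
Authors: Maxence Bouvier, Ryan Amaudruz, Felix Arnold, Renzo Andri, Lukas Cavigelli
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Under review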
Abstract:As AI workloads proliferate, optimizing arithmetic units is becoming increasingly important to reduce the footprint of digital systems. Conventional design flows, which often rely on manual or heuristics-based optimization, are limited in their ability to thoroughly explore the vast design space. In this paper, we introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units, more specifically multipliers. At the core of GENIAL is a Transformer-based surrogate model trained in two stages, involving self-supervised pretraining followed by supervised finetuning, to robustly forecast key hardware metrics such as power and area from abstracted design representations. By inverting the surrogate model, GENIAL efficiently searches for new operand encodings that directly minimize power consumption in arithmetic units for specific input data distributions. Extensive experiments on large datasets demonstrate that GENIAL is consistently more sample efficient than other methods, and converges faster towards optimized designs. This makes it possible to deploy a high-effort logic synthesis optimization flow in the loop, improving the accuracy of the surrogate model. Notably, GENIAL automatically discovers encodings that achieve up to 18% switching activity savings within multipliers on representative AI workloads compared with the conventional two’s complement. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines, highlighting GENIAL’s applicability for a wide spectrum of logic functions. Together, these advances mark a significant step toward automated Quality-of-Results-optimized combinational circuit generation for digital systems.
[AI-28] Differentiated Thyroid Cancer Recurrence Classification Using Machine Learning Models and Bayesian Neural Networks with Varying Priors: A SHAP-Based Interpretation of the Best Performing Model
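[Quick read]: This paper addresses the shortage of accuracy, interpretability, and uncertainty quantification in models that predict the recurrence of Differentiated Thyroid Cancer (DTC). The key to the solution is a comprehensive framework. Several machine learning (ML) models are first evaluated on the complete dataset, and the Boruta feature selection algorithm is then used to remove redundant features. Bayesian Neural Networks (BNNs) are then introduced and compared under six prior distributions (including Normal, Laplace, and Cauchy priors) to quantify the uncertainty of predictions and make clinical decisions more reliable. The BNN with the Normal prior $\mathcal{N}(0,10)$ reaches the highest accuracy, 0.9870, after feature selection, outperforming the conventional ML methods.
Link: https://arxiv.org/abs/2507.18987
Authors: HMNS Kumari, HMLS Kumari, UMMPK Nawarathne
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 15 figures, to be published in International Journal of Research in Computing (IJRC)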
Abstract:Differentiated thyroid cancer (DTC) recurrence is a major public health concern, requiring classification and predictive models that are not only accurate but also interpretable and uncertainty aware. This study introduces a comprehensive framework for DTC recurrence classification using a dataset containing 383 patients and 16 clinical and pathological variables. Initially, 11 machine learning (ML) models were employed using the complete dataset, where the Support Vector Machines (SVM) model achieved the highest accuracy of 0.9481. To reduce complexity and redundancy, feature selection was carried out using the Boruta algorithm, and the same ML models were applied to the reduced dataset, where it was observed that the Logistic Regression (LR) model obtained the maximum accuracy of 0.9611. However, these ML models often lack uncertainty quantification, which is critical in clinical decision making. Therefore, to address this limitation, the Bayesian Neural Networks (BNN) with six varying prior distributions, including Normal(0,1), Normal(0,10), Laplace(0,1), Cauchy(0,1), Cauchy(0,2.5), and Horseshoe(1), were implemented on both the complete and reduced datasets. The BNN model with the Normal(0,10) prior distribution exhibited maximum accuracies of 0.9740 and 0.9870 before and after feature selection, respectively.
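[AI-29] Towards Improving Long-Tail Entity Predictions in Temporal Knowledge Graphs through Global Similarity and Weighted Sampling
[Quick read]: This paper addresses two challenges that Temporal Knowledge Graph (TKG) completion models face under incremental training: the model must generalize and assimilate new knowledge, and it must handle new or unseen entities that were absent from training or have sparse connections. The key to the solution is a model-agnostic enhancement layer combined with a weighted sampling strategy. The enhancement layer uses a broader, global definition of entity similarity that goes beyond the local neighborhood proximity on which GNN-based methods rely. The weighted sampling strategy raises the sampling probability of edges linked to infrequent entities during training, which improves the modeling of long-tail entities. The framework can be integrated into existing TKG completion methods and improves link prediction, especially inductive link prediction and the handling of long-tail entities.
Link: https://arxiv.org/abs/2507.18977
Authors: Mehrnoosh Mirtaheri, Ryan A. Rossi, Sungchul Kim, Kanak Mahadik, Tong Yu, Xiang Chen, Mohammad Rostami
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: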
Abstract:Temporal Knowledge Graph (TKG) completion models traditionally assume access to the entire graph during training. This overlooks challenges stemming from the evolving nature of TKGs, such as: (i) the model’s requirement to generalize and assimilate new knowledge, and (ii) the task of managing new or unseen entities that often have sparse connections. In this paper, we present an incremental training framework specifically designed for TKGs, aiming to address entities that are either not observed during training or have sparse connections. Our approach combines a model-agnostic enhancement layer with a weighted sampling strategy that can be added to, and improve, any existing TKG completion method. The enhancement layer leverages a broader, global definition of entity similarity, which moves beyond mere local neighborhood proximity of GNN-based methods. The weighted sampling strategy employed in training accentuates edges linked to infrequently occurring entities. We evaluate our method on two benchmark datasets, and demonstrate that our framework outperforms existing methods in total link prediction, inductive link prediction, and in addressing long-tail entities. Notably, our method achieves a 10% improvement and a 15% boost in MRR for these datasets. The results underscore the potential of our approach in mitigating catastrophic forgetting and enhancing the robustness of TKG completion methods, especially in an incremental training context.
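A weighted sampling strategy that accentuates edges linked to infrequent entities might look like the sketch below. The weighting rule (inverse of the rarer endpoint's frequency) and the name `edge_weights` are assumptions made for illustration; the abstract does not specify the paper's actual formula.

```python
import random
from collections import Counter


def edge_weights(quads):
    """Sampling probability per (subject, relation, object, time) edge.

    Hypothetical rule: weight = 1 / frequency of the rarer endpoint,
    normalized so that the weights sum to 1.
    """
    freq = Counter()
    for s, _, o, _ in quads:
        freq[s] += 1
        freq[o] += 1
    raw = [1.0 / min(freq[s], freq[o]) for s, _, o, _ in quads]
    total = sum(raw)
    return [w / total for w in raw]


quads = [
    ("A", "r1", "B", 0),
    ("A", "r2", "C", 1),  # C appears once: long-tail entity
    ("A", "r1", "B", 2),
    ("A", "r3", "D", 3),  # D appears once: long-tail entity
]
weights = edge_weights(quads)

rng = random.Random(0)
batch = rng.choices(quads, weights=weights, k=2)  # training batch biased toward rare entities
print(weights)
```

Here the edges touching `C` and `D` each receive probability 1/3, twice that of the edges between the frequent entities `A` and `B`.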
[AI-30] HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling
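[Quick read]: This paper addresses the difficulty of achieving both high compression and high reconstruction quality in large-scale speech-to-speech systems, which stems from the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs. The key to the solution is HH-Codec, a neural speech codec with two main contributions: (1) a Vector Quantization (VQ) space designed for spoken language modeling, which optimizes compression efficiency while minimizing information loss; and (2) an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that uses dual supervision and progressive training to improve reconstruction stability and fidelity. With single-quantizer inference, it reaches an extreme compression rate of 24 discrete speech tokens per second and state-of-the-art speech reconstruction at an ultra-low bandwidth of 0.3 kbps.
Link: https://arxiv.org/abs/2507.18897
Authors: Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: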
Abstract:Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at this https URL.
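The relation between token rate and bandwidth can be checked with a short calculation: a single-quantizer codec emitting one index per token needs `tokens_per_second * log2(codebook_size)` bits per second. The codebook sizes below are illustrative assumptions, since the abstract does not state the actual size; only the 24 tokens/s and 0.3 kbps figures come from the abstract.

```python
import math


def bitrate_bps(tokens_per_second: float, codebook_size: int, num_quantizers: int = 1) -> float:
    """Bits per second needed to transmit codebook indices."""
    return tokens_per_second * num_quantizers * math.log2(codebook_size)


# Bits per token implied by the reported figures (0.3 kbps at 24 tokens/s).
implied_bits_per_token = 300 / 24

# Hypothetical codebook sizes on either side of the implied value.
for size in (4096, 8192):
    print(size, bitrate_bps(24, size))
```

Under these assumptions, a codebook of a few thousand entries is consistent with a bandwidth of roughly 0.3 kbps.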
[AI-31] Success in Humanoid Reinforcement Learning under Partial Observation
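[Quick read]: This paper addresses stable and efficient policy learning for robots under partial observability, especially in high-dimensional tasks such as humanoid locomotion, where conventional methods struggle to match full-state performance when only limited state information is available. The key to the solution is a novel history encoder that processes a fixed-length sequence of past observations in parallel and reconstructs essential contextual information from recent observations, so that the policy can make robust decisions without the full state. Integrated into a standard model-free reinforcement learning algorithm, the encoder improves training stability and performance under partial observability, reaching results on par with fully observed baselines.
Link: https://arxiv.org/abs/2507.18883
Authors: Wuhao Wang, Zhiyong Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 3 figures, and 4 tables. Not published anywhere else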
Abstract:Reinforcement learning has been widely applied to robotic control, but effective policy learning under partial observability remains a major challenge, especially in high-dimensional tasks like humanoid locomotion. To date, no prior work has demonstrated stable training of humanoid policies with incomplete state information in the benchmark Gymnasium Humanoid-v4 environment. The objective in this environment is to walk forward as fast as possible without falling, with rewards provided for staying upright and moving forward, and penalties incurred for excessive actions and external contact forces. This research presents the first successful instance of learning under partial observability in this environment. The learned policy achieves performance comparable to state-of-the-art results with full state access, despite using only one-third to two-thirds of the original states. Moreover, the policy exhibits adaptability to robot properties, such as variations in body part masses. The key to this success is a novel history encoder that processes a fixed-length sequence of past observations in parallel. Integrated into a standard model-free algorithm, the encoder enables performance on par with fully observed baselines. We hypothesize that it reconstructs essential contextual information from recent observations, thereby enabling robust decision-making.
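The input side of such a history encoder, a fixed-length window of past partial observations handed to the network as one array, can be sketched as follows. The class name `HistoryBuffer`, the zero padding at episode start, and the window length are assumptions for illustration; the encoder network itself is not shown because the abstract does not describe its architecture.

```python
from collections import deque

import numpy as np


class HistoryBuffer:
    """Keeps the last `length` observations and returns them as one array."""

    def __init__(self, obs_dim: int, length: int) -> None:
        self.obs_dim = obs_dim
        self.length = length
        self.frames = deque(maxlen=length)
        self.reset()

    def reset(self) -> None:
        # Assumption: pad with zeros until enough real observations exist.
        self.frames.clear()
        for _ in range(self.length):
            self.frames.append(np.zeros(self.obs_dim))

    def push(self, obs: np.ndarray) -> np.ndarray:
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(np.asarray(obs, dtype=float))
        return np.stack(list(self.frames))  # shape (length, obs_dim), oldest first


buf = HistoryBuffer(obs_dim=3, length=4)
for t in range(1, 7):
    window = buf.push(np.full(3, float(t)))

print(window.shape)
```

Because the whole window arrives as a single `(length, obs_dim)` array, an encoder can process all time steps in parallel instead of stepping through them recurrently.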
[AI-32] A Comprehensive Review of AI-based Intelligent Tutoring Systems: Applications and Challenges
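[Quick read]: This paper addresses the mixed effectiveness of AI-based Intelligent Tutoring Systems (ITS) in real educational settings, that is, how ITS actually operate in teaching practice and what limits their effectiveness. The key to the solution is a systematic literature review that synthesizes qualified studies published from 2010 to 2025, focusing on pedagogical strategies, Natural Language Processing (NLP), adaptive learning, student modeling, and domain-specific applications. The review identifies the factors that affect ITS effectiveness and the remaining challenges, and proposes research directions and practical recommendations for more rigorous experimental design and data analysis.
Link: https://arxiv.org/abs/2507.18882
Authors: Meriem Zerkouk, Miloud Mihoubi, Belkacem Chikhaoui
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Journal of Computers in Education (2025)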
Abstract:AI-based Intelligent Tutoring Systems (ITS) have significant potential to transform teaching and learning. As efforts continue to design, develop, and integrate ITS into educational contexts, mixed results about their effectiveness have emerged. This paper provides a comprehensive review to understand how ITS operate in real educational settings and to identify the associated challenges in their application and evaluation. We use a systematic literature review method to analyze numerous qualified studies published from 2010 to 2025, examining domains such as pedagogical strategies, NLP, adaptive learning, student modeling, and domain-specific applications of ITS. The results reveal a complex landscape regarding the effectiveness of ITS, highlighting both advancements and persistent challenges. The study also identifies a need for greater scientific rigor in experimental design and data analysis. Based on these findings, suggestions for future research and practical implications are proposed.
[AI-33] A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
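[Quick read]: This paper addresses the lack of systematic generalization in AI systems on compositional tasks, that is, how to construct and understand novel combinations of known components. The key to the solution is the MIRAGE framework, which mimics the cooperation between the hippocampus (HPC) and the prefrontal cortex (PFC). A meta-trained Transformer Neural Decomposer (analogous to neocortical "System 1") iteratively refines the sequence representation, and a Schema Engine (analogous to the HPC-PFC "System 2" loop) dynamically extracts, ranks, and applies reusable schemas, storing variable bindings in episodic memory and expanding them when needed. This design lets the model solve entirely novel compositional tasks with frozen weights, achieving systematic compositional generalization with 99% accuracy on the SCAN benchmark and only 1.19M parameters in the Transformer module.
Link: https://arxiv.org/abs/2507.18868
Authors: Alex Noviello, Claas Beger, Jacob Groner, Kevin Ellis, Weinan Sun
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: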
Abstract:Systematic compositional generalization - constructing and understanding novel combinations of known building blocks - remains a core challenge for AI systems. Human cognition achieves this flexibility via the interplay of the hippocampus (HPC) and prefrontal cortex (PFC): the hippocampus rapidly encodes episodes, and the prefrontal cortex consolidates them into reusable schemas for reasoning. Drawing on these insights, we present MIRAGE (Meta-Inference with Rules and Abstractions from Generalized Experience), a framework that achieves systematic generalization on compositional tasks. MIRAGE has two interacting modules mirroring the brain’s deliberative HPC-PFC loop and intuitive neocortical pattern recognition. (1) The meta-trained Transformer Neural Decomposer, paralleling neocortical “System 1” computation, is trained on a task-agnostic stream of randomly sampled compositional grammars and applies one decomposition step per pass, with successive passes iteratively refining the sequence representation. (2) The Schema Engine, analogous to the HPC-PFC “System 2” loop, dynamically extracts, ranks, and applies reusable schemas, storing variable bindings in episodic memory and expanding them when needed. By explicitly equipping the Transformer component of MIRAGE with actively managed schematic structures, our model performs systematic compositional operations through explicit schema application and transformation, relying solely on frozen weights when solving entirely novel tasks. This approach demonstrates systematic compositional generalization on the SCAN benchmark, achieving 99% accuracy on all task splits with only 1.19M parameters in the transformer module. Ablation studies confirm that MIRAGE’s systematicity critically depends on the quality of extracted schemas and the model’s iterative refinement process.
[AI-34] Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning via Incorporating Generalized Human Expertise
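[Quick read]: This paper addresses inefficient exploration in Multi-Agent Reinforcement Learning (MARL) when agents receive only a team reward, especially in sparse-reward environments. Conventional methods guide exploration with manually designed dense individual reward functions, but such functions lack high-order intelligence, so learning efficiency and generalization fall well short of human level. The key to the solution is LIGHT (Learning Individual Intrinsic reward via Incorporating Generalized Human Expertise), a framework that integrates human knowledge into MARL algorithms end to end. It suppresses unnecessary exploration by matching each agent's individual action distribution against the preference distribution of human experts, and it designs individual intrinsic rewards based on an actionable representational transformation relevant to Q-learning, so that agents align their action preferences with human expertise while maximizing the joint action value. This improves performance and knowledge reusability across tasks.
Link: https://arxiv.org/abs/2507.18867
Authors: Xuefei Wu, Xiao Yin, Yuanyang Zhu, Chunlin Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: IEEE International Conference on Systems, Man, and Cybernetics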
Abstract:Efficient exploration in multi-agent reinforcement learning (MARL) is a challenging problem when receiving only a team reward, especially in environments with sparse rewards. A powerful method to mitigate this issue involves crafting dense individual rewards to guide the agents toward efficient exploration. However, individual rewards generally rely on manually engineered shaping-reward functions that lack high-order intelligence, and are therefore less effective than humans at learning and generalization in complex problems. To tackle these issues, we combine the above two paradigms and propose a novel framework, LIGHT (Learning Individual Intrinsic reward via Incorporating Generalized Human experTise), which can integrate human knowledge into MARL algorithms in an end-to-end manner. LIGHT guides each agent to avoid unnecessary exploration by considering both individual action distribution and human expertise preference distribution. Then, LIGHT designs individual intrinsic rewards for each agent based on actionable representational transformation relevant to Q-learning so that the agents align their action preferences with the human expertise while maximizing the joint action value. Experimental results demonstrate the superiority of our method over representative baselines in performance and knowledge reusability across different sparse-reward tasks in challenging scenarios.
[AI-35] Equivariant Volumetric Grasping
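[Quick read]: This paper addresses the low sample efficiency of 3D grasping models with respect to rotational symmetry: conventional methods cannot exploit data effectively when objects rotate around the vertical axis. The key to the solution is a rotation-equivariant tri-plane volumetric feature representation. 3D features are projected onto three canonical planes, with features on the horizontal plane designed to be equivariant to 90° rotations and the sum of the features on the other two planes invariant to the same rotations. This design relies on a new deformable steerable convolution, which combines the geometric adaptability of deformable convolutions with the rotational equivariance of steerable ones, so that the receptive field adapts to local object geometry while equivariance is preserved. On this basis, the authors build equivariant versions of two state-of-the-art volumetric grasp planners, GIGA and IGD, which improve performance while reducing computational and memory costs.
Link: https://arxiv.org/abs/2507.18847
Authors: Pinhao Song, Yutong Hu, Pengteng Li, Renaud Detry
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages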
Abstract:We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sample efficiency. Our model employs a tri-plane volumetric feature representation – i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to 90° rotations, while the sum of features from the other two planes remains invariant to the same transformations. This design is enabled by a new deformable steerable convolution, which combines the adaptability of deformable convolutions with the rotational equivariance of steerable ones. This allows the receptive field to adapt to local object geometry while preserving equivariance properties. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD’s deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design significantly reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance with only a modest computational overhead. Video and code can be viewed in: this https URL
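What equivariance to 90° rotations means for features on the horizontal plane can be checked numerically with a much simpler operator than the paper's deformable steerable convolution. The sketch below uses a plain cross-correlation whose kernel is averaged over its four 90° rotations; this stand-in, and the names `correlate2d_valid` and `c4_kernel`, are assumptions for illustration only.

```python
import numpy as np


def correlate2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Plain 2D cross-correlation without padding ('valid' mode)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8))  # stand-in for horizontal-plane features
raw_kernel = rng.normal(size=(3, 3))

# Averaging a kernel over its four 90-degree rotations makes it rotation-symmetric.
c4_kernel = sum(np.rot90(raw_kernel, r) for r in range(4)) / 4.0

# Equivariance: rotating the input and then filtering equals filtering and then rotating.
lhs = correlate2d_valid(np.rot90(feature_map), c4_kernel)
rhs = np.rot90(correlate2d_valid(feature_map, c4_kernel))
equivariant = np.allclose(lhs, rhs)

# The unconstrained kernel does not have this property.
generic_equivariant = np.allclose(
    correlate2d_valid(np.rot90(feature_map), raw_kernel),
    np.rot90(correlate2d_valid(feature_map, raw_kernel)),
)
print(equivariant, generic_equivariant)
```

The symmetric kernel passes the check and the unconstrained one fails it, which is the property that lets an equivariant model reuse one training example for all four rotated views.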
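[AI-36] MemoCoder: Automated Function Synthesis using LLM-Supported Agents
[Quick read]: This paper addresses the performance bottleneck that Large Language Models (LLMs) hit on complex programming problems in code generation because they lack iterative debugging, error handling, and mechanisms for accumulating and reusing knowledge. Existing approaches such as fine-tuning or self-repair are either costly or fail to retain and reuse past repair experience. The key to the solution is MemoCoder, a multi-agent collaborative framework. Its core contribution is a Fixing Knowledge Set that stores and retrieves past successful repairs, together with a central Mentor Agent that identifies recurring error patterns and refines high-level fixing strategies. This enables persistent learning and knowledge-guided code generation, and improves results on several benchmarks.
Link: https://arxiv.org/abs/2507.18812
Authors: Yiping Jia, Zhen Ming Jiang, Shayan Noei, Ying Zou
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: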
Abstract:With the widespread adoption of Large Language Models (LLMs) such as GitHub Copilot and ChatGPT, developers increasingly rely on AI-assisted tools to support code generation. While LLMs can generate syntactically correct solutions for well-structured programming tasks, they often struggle with challenges that require iterative debugging, error handling, or adaptation to diverse problem structures. Existing approaches such as fine-tuning or self-repair strategies either require costly retraining or lack mechanisms to accumulate and reuse knowledge from previous attempts. To address these limitations, we propose MemoCoder, a multi-agent framework that enables collaborative problem solving and persistent learning from past fixes. At the core of MemoCoder is a Fixing Knowledge Set, which stores successful repairs and supports retrieval for future tasks. A central Mentor Agent supervises the repair process by identifying recurring error patterns and refining high-level fixing strategies, providing a novel supervisory role that guides the self-repair loop. We evaluate MemoCoder across three public benchmarks – MBPP, HumanEval, and LiveCodeBench – spanning a range of problem complexities. Experimental results show that MemoCoder consistently outperforms both zero-shot prompting and a Self-Repair strategy, with improvements ranging from 3.1% to 12.1% in Pass@10 and from 1.4% to 14.5% in Pass@50, demonstrating its effectiveness in iterative refinement and knowledge-guided code generation.
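A store that keeps successful repairs and retrieves the most relevant one for a new error could be sketched as below. The abstract does not say how retrieval works, so the string-similarity ranking via `difflib.SequenceMatcher`, the class `FixingKnowledgeSet`, and the record fields are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class FixRecord:
    error: str  # error message observed before the repair
    fix: str    # short description of the repair that worked


class FixingKnowledgeSet:
    """Toy store of past repairs with similarity-based retrieval."""

    def __init__(self) -> None:
        self.records: List[FixRecord] = []

    def add(self, error: str, fix: str) -> None:
        self.records.append(FixRecord(error, fix))

    def retrieve(self, error: str, top_k: int = 1) -> List[FixRecord]:
        # Rank stored repairs by character-level similarity to the new error.
        ranked = sorted(
            self.records,
            key=lambda r: SequenceMatcher(None, error, r.error).ratio(),
            reverse=True,
        )
        return ranked[:top_k]


kb = FixingKnowledgeSet()
kb.add("IndexError: list index out of range",
       "Check the loop bound against len(items) before indexing.")
kb.add("TypeError: unsupported operand type(s) for +: 'int' and 'str'",
       "Convert the string operand with int() before adding.")

best = kb.retrieve("IndexError: list index out of range in solve()")[0]
print(best.fix)
```

In a repair loop, the retrieved fix description would be added to the prompt for the next attempt, and any newly successful repair would be written back with `add`.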
[AI-37] DxHF: Providing High-Quality Human Feedback for LLM Alignment via Interactive Decomposition
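[Quick read]: This paper addresses a problem in aligning Large Language Models (LLMs) with human feedback: directly comparing long text passages imposes a high cognitive load on annotators and lowers feedback quality. The core of the solution is the decomposition principle, which breaks the original text into individual claims, together with a new user interface, DxHF, that visually encodes the relevance of each claim to the conversation and links similar claims, reducing cognitive load and improving feedback accuracy. The key contribution is making human judgment more efficient through structured decomposition and interaction design, with the clearest gains in feedback quality when users are uncertain.
Link: https://arxiv.org/abs/2507.18802
Authors: Danqing Shi, Furui Cheng, Tino Weinkauf, Antti Oulasvirta, Mennatallah El-Assady
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: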
Abstract:Human preferences are widely used to align large language models (LLMs) through methods such as reinforcement learning from human feedback (RLHF). However, the current user interfaces require annotators to compare text paragraphs, which is cognitively challenging when the texts are long or unfamiliar. This paper contributes by studying the decomposition principle as an approach to improving the quality of human feedback for LLM alignment. This approach breaks down the text into individual claims instead of directly comparing two long-form text responses. Based on the principle, we build a novel user interface DxHF. It enhances the comparison process by showing decomposed claims, visually encoding the relevance of claims to the conversation and linking similar claims. This allows users to skim through key information and identify differences for better and quicker judgment. Our technical evaluation shows evidence that decomposition generally improves feedback accuracy regarding the ground truth, particularly for users with uncertainty. A crowdsourcing study with 160 participants indicates that using DxHF improves feedback accuracy by an average of 5%, although it increases the average feedback time by 18 seconds. Notably, accuracy is significantly higher in situations where users have less certainty. The finding of the study highlights the potential of HCI as an effective method for improving human-AI alignment.
[AI-38] Simulation-Driven Reinforcement Learning in Queuing Network Routing Optimization
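[Quick read]: This paper addresses the optimization of routing decisions in complex queueing network systems, particularly in dynamic and uncertain environments such as manufacturing and communication, where traditional queueing methods struggle to adapt and maintain efficient performance. The key to the solution is a simulation-driven Reinforcement Learning (RL) framework that combines Deep Deterministic Policy Gradient (DDPG) with Dyna-style planning to form a Dyna-DDPG algorithm. Its main contribution is the use of separate predictive models for next-state transitions and for rewards, which improves training stability and sample efficiency, so that the algorithm quickly learns robust routing policies under various disruptions and at larger network sizes, with good reproducibility and deployability.
Link: https://arxiv.org/abs/2507.18795
Authors: Fatima Al-Ani, Molly Wang, Jevon Charles, Aaron Ong, Joshua Forday, Vinayak Modi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: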
Abstract:This study focuses on the development of a simulation-driven reinforcement learning (RL) framework for optimizing routing decisions in complex queueing network systems, with a particular emphasis on manufacturing and communication applications. Recognizing the limitations of traditional queueing methods, which often struggle with dynamic, uncertain environments, we propose a robust RL approach leveraging Deep Deterministic Policy Gradient (DDPG) combined with Dyna-style planning (Dyna-DDPG). The framework includes a flexible and configurable simulation environment capable of modeling diverse queueing scenarios, disruptions, and unpredictable conditions. Our enhanced Dyna-DDPG implementation incorporates separate predictive models for next-state transitions and rewards, significantly improving stability and sample efficiency. Comprehensive experiments and rigorous evaluations demonstrate the framework’s capability to rapidly learn effective routing policies that maintain robust performance under disruptions and scale effectively to larger network sizes. Additionally, we highlight strong software engineering practices employed to ensure reproducibility and maintainability of the framework, enabling practical deployment in real-world scenarios.
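[AI-39] Initial Steps in Integrating Large Reasoning and Action Models for Service Composition
[Quick read]: This paper addresses the challenges that service composition faces in building adaptive and intelligent software systems, in particular the limited reasoning capabilities and brittle execution mechanisms of traditional approaches. Its core proposal is to combine two emerging paradigms enabled by large language models (LLMs): Large Reasoning Models (LRMs) and Large Action Models (LAMs). The key is an integrated LRM-LAM architecture in which semantic reasoning and dynamic execution work together: LRMs handle the deep semantic understanding of service requirements and constraints, while LAMs provide flexible action execution with cross-system interoperability. This bridges the gap between intention and execution and moves service composition toward a fully automated process driven by natural language intent.
Link: https://arxiv.org/abs/2507.18775
Authors: Ilche Georgievski, Marco Aiello
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 16 pages, 3 figures, 19th Symposium and Summer School on Service-Oriented Computing (SummerSOC)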
Abstract:Service composition remains a central challenge in building adaptive and intelligent software systems, often constrained by limited reasoning capabilities or brittle execution mechanisms. This paper explores the integration of two emerging paradigms enabled by large language models: Large Reasoning Models (LRMs) and Large Action Models (LAMs). We argue that LRMs address the challenges of semantic reasoning and ecosystem complexity while LAMs excel in dynamic action execution and system interoperability. However, each paradigm has complementary limitations - LRMs lack grounded action capabilities, and LAMs often struggle with deep reasoning. We propose an integrated LRM-LAM architectural framework as a promising direction for advancing automated service composition. Such a system can reason about service requirements and constraints while dynamically executing workflows, thus bridging the gap between intention and execution. This integration has the potential to transform service composition into a fully automated, user-friendly process driven by high-level natural language intent.
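[AI-40] Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback
[Quick read]: This paper addresses program repair from test failures in large-scale software systems, in particular automated and scalable repair at large organizations with complex codebases. The core challenge is how to combine large language models (LLMs) with symbolic reasoning and feedback so as to efficiently generate patches that conform to engineering standards and pass the tests. The key to the solution is a neuro-symbolic framework: with Llama as the base model, a ReAct (Reasoning + Acting) harness is used to build an Engineering Agent that starts from a test failure, can run a set of 15 actions ranging from reading a file to generating a patch, and iteratively refines its fix using feedback from static analysis and test execution. An LLM-as-a-Judge checks that each patch conforms to the expected standards before engineers review and land it. In offline benchmarks, the configuration that balances solve rate and error rate against cost and latency reaches a solve rate of 42.3% using an average of 11.8 feedback iterations. In production, 80% of the generated fixes were reviewed over three months, and 31.5% of those were landed (25.5% of all generated fixes).
Link: https://arxiv.org/abs/2507.18755
Authors: Chandra Maddila, Adam Tait, Claire Chang, Daniel Cheng, Nauman Ahmad, Vijayaraghavan Murali, Marshall Roch, Arnaud Avondet, Aaron Meltzer, Victor Montalvao, Michael Hopko, Chris Waterson, Parth Thakkar, Renuka Fernandez, Kristian Kristensen, Sivan Barzily, Sherry Chen, Rui Abreu, Nachiappan Nagappan, Payam Shodjai, Killian Murphy, James Everingham, Aparna Ramani, Peter C. Rigby
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: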
Abstract:Aim: With the advent of LLMs, sophisticated agentic program repair has become viable at large organizations with large codebases. In this work, we develop an Engineering Agent that fixes the source code based on test failures at scale across diverse software offerings internally. Method: Using Llama as the base, we employ the ReAct harness to develop an agent. We start with a test failure that was triaged by a rule-based test failure bot. We then set up an agentic harness and allow the agent to reason and run a set of 15 actions from reading a file to generating a patch. We provide feedback to the agent through static analysis and test failures so it can refine its solution. We leverage an LLM-as-a-Judge to ensure that the patch conforms to the standards followed by a human review to land fixes. Benchmark Findings: We curated offline benchmarks for our patch generator, the Engineering Agent loop, and the LLM-as-a-Judge. In offline evaluations we found that a specialized 70B model is highly competitive with the much larger but vanilla Llama-405B. In an ablation study, we found that the ReAct harness (neural model) benefited from the symbolic information from static analysis tools and test execution traces. A model that strikes a balance between the solve rate and error rate vs the cost and latency has a benchmark solve rate of 42.3% using an average 11.8 feedback iterations. Production Findings: In a three month period, 80% of the generated fixes were reviewed, of which 31.5% were landed (25.5% of the total number of generated fixes). Feedback from Engineers: We used open coding to extract qualitative themes from engineers’ feedback. We saw positive feedback in the form of quick approvals, gratitude, and surprise. We also found mixed feedback when the Engineering Agent’s solution was partially correct and it served as a good starting point. 
Submission history: [v1] Thu, 24 Jul 2025 19:12:32 UTC (958 KB), from Chandra Maddila. DOI: https://doi.org/10.48550/arXiv.2507.18755
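[AI-41] The Right to be Forgotten in Pruning: Unveil Machine Unlearning on Sparse Models
[Quick read]: This paper addresses machine unlearning in sparse models: how to eliminate the memory of deleted data so as to honor the data subject's right to be forgotten. Existing methods are designed mainly for dense models and do not account for the influence of deleted data on the pruned topology of a sparse model. The key to the solution is a new operation, un-pruning, which approximates the pruned topology driven by the retained data alone and thereby removes the influence of deleted data on the model structure. The authors show that the error of un-pruning is upper-bounded in theory, find that Membership Inference Attack (MIA) accuracy is unreliable for judging whether data has been forgotten, and therefore design new metrics for sparse models to evaluate the success of un-pruning. The approach can be integrated with any existing unlearning algorithm and applies to both structured and unstructured sparse models.
Link: https://arxiv.org/abs/2507.18725
Authors: Yang Xiao, Gen Li, Jie Ji, Ruimeng Ye, Xiaolong Ma, Bo Hui
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages for main part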
Abstract:Machine unlearning aims to efficiently eliminate the memory about deleted data from trained models and address the right to be forgotten. Despite the success of existing unlearning algorithms, unlearning in sparse models has not yet been well studied. In this paper, we empirically find that the deleted data has an impact on the pruned topology in a sparse model. Motivated by the observation and the right to be forgotten, we define a new term, "un-pruning", to eliminate the impact of deleted data on model pruning. Then we propose an un-pruning algorithm to approximate the pruned topology driven by retained data. We remark that any existing unlearning algorithm can be integrated with the proposed un-pruning workflow and the error of un-pruning is upper-bounded in theory. Also, our un-pruning algorithm can be applied to both structured sparse models and unstructured sparse models. In the experiment, we further find that Membership Inference Attack (MIA) accuracy is unreliable for assessing whether a model has forgotten deleted data, as a small change in the amount of deleted data can produce arbitrary MIA results. Accordingly, we devise new performance metrics for sparse models to evaluate the success of un-pruning. Lastly, we conduct extensive experiments to verify the efficacy of un-pruning with various pruning methods and unlearning algorithms. Our code is released at this https URL.
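[AI-42] Market Making Strategies with Reinforcement Learning
[Quick read]: This thesis addresses the difficulty market makers (MMs) in financial markets have in designing efficient, adaptive, and profitable strategies in the face of inventory risk, competition, and non-stationary market dynamics. The core of the solution is to formulate market making as a reinforcement learning (RL) problem and to use Deep Reinforcement Learning (DRL) to build autonomous agents that operate in both single-agent and multi-agent settings. The key contributions are: 1) inventory management through reward engineering and Multi-Objective Reinforcement Learning (MORL), where the former uses dynamic reward shaping to guide behavior and the latter uses Pareto front optimization to explicitly balance competing objectives; 2) POW-dTS, a policy weighting algorithm based on Discounted Thompson Sampling that dynamically selects and combines pretrained policies, enabling continual adaptation to non-stationary market conditions. The experiments show that the proposed methods significantly outperform traditional and baseline algorithmic strategies across various performance metrics.
Link: https://arxiv.org/abs/2507.18680
Authors: Óscar Fernández Vicente
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: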
Abstract:This thesis presents the results of a comprehensive research project focused on applying Reinforcement Learning (RL) to the problem of market making in financial markets. Market makers (MMs) play a fundamental role in providing liquidity, yet face significant challenges arising from inventory risk, competition, and non-stationary market dynamics. This research explores how RL, particularly Deep Reinforcement Learning (DRL), can be employed to develop autonomous, adaptive, and profitable market making strategies. The study begins by formulating the MM task as a reinforcement learning problem, designing agents capable of operating in both single-agent and multi-agent settings within a simulated financial environment. It then addresses the complex issue of inventory management using two complementary approaches: reward engineering and Multi-Objective Reinforcement Learning (MORL). While the former uses dynamic reward shaping to guide behavior, the latter leverages Pareto front optimization to explicitly balance competing objectives. To address the problem of non-stationarity, the research introduces POW-dTS, a novel policy weighting algorithm based on Discounted Thompson Sampling. This method allows agents to dynamically select and combine pretrained policies, enabling continual adaptation to shifting market conditions. The experimental results demonstrate that the proposed RL-based approaches significantly outperform traditional and baseline algorithmic strategies across various performance metrics. Overall, this research thesis contributes new methodologies and insights for the design of robust, efficient, and adaptive market making agents, reinforcing the potential of RL to transform algorithmic trading in complex financial systems. 
Submission history: [v1] Thu, 24 Jul 2025 16:17:49 UTC (32,154 KB), from Oscar Fernández Vicente. DOI: https://doi.org/10.48550/arXiv.2507.18680
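The idea of choosing among pretrained policies with Discounted Thompson Sampling can be illustrated with a minimal sketch. This is a generic discounted Thompson sampler with Bernoulli rewards and Beta posteriors, not the thesis's POW-dTS algorithm, whose details are not given in the abstract. The class name, the discount `gamma`, and the 0/1 reward encoding are all assumptions made for illustration.

```python
import random


class DiscountedThompsonSampler:
    """Generic discounted Thompson sampling over a fixed set of policies.

    Hypothetical illustration: rewards are assumed to be 0 or 1, and old
    evidence is forgotten geometrically with factor `gamma`.
    """

    def __init__(self, n_policies, gamma=0.95, seed=0):
        self.gamma = gamma
        self.successes = [0.0] * n_policies  # discounted success counts
        self.failures = [0.0] * n_policies   # discounted failure counts
        self.rng = random.Random(seed)

    def select(self):
        # Sample one value per policy from Beta(s + 1, f + 1) and pick the largest.
        samples = [
            self.rng.betavariate(s + 1.0, f + 1.0)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, chosen, reward):
        # Discount every count first, so stale evidence fades for all policies.
        self.successes = [self.gamma * s for s in self.successes]
        self.failures = [self.gamma * f for f in self.failures]
        self.successes[chosen] += reward
        self.failures[chosen] += 1 - reward
```

Because the counts decay, a policy that stops paying off loses its advantage within roughly `1 / (1 - gamma)` steps, which is what lets the sampler track a shifting environment.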
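[AI-43] Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling
[Quick read]: This report addresses the catastrophic forgetting that arises when a large language model (LLM) is further pretrained on science data, i.e. injecting scientific knowledge severely degrades general ability. The key to the solution is a four-stage upcycle training paradigm that converts a pretrained dense LLM into a fine-grained Mixture-of-Experts (MoE) model, in which a shared expert handles general tasks, 64 specialized scientific experts (8 activated) learn knowledge from different disciplines, and science-aware routing decouples scientific from general capability. This design avoids interference between domains: the model retains 99% of its general-task performance while improving by 25% on average across 30 scientific tasks, with a win rate of 70%.
Link: https://arxiv.org/abs/2507.18671
Authors: Ning Liao, Xiaoxing Wang, Zehao Lin, Weiyang Guo, Feng Hong, Shixiang Song, Geng Yu, Zihua Zhao, Sitao Xie, Longxuan Wei, Xiangqi Jin, Xiaohan Qin, Jiale Ma, Kai Chen, Jiangchao Yao, Zhouhan Lin, Junchi Yan, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Linfeng Zhang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Technical Report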
Abstract:A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continuing to pretrain an LLM on science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm allows knowledge in the general domain and in different scientific disciplines to be decoupled, avoiding negative interference among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate of 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.
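The combination of one always-active shared expert with 8 of 64 routed experts can be sketched on scalar inputs. This is a minimal illustration, not Innovator's implementation: the softmax over only the selected experts, the function names, and the scalar "experts" are assumptions, and only the 64/8 split comes from the abstract.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, shared_expert, routed_experts, router_logits, k=8):
    """Shared expert output plus the weighted outputs of the k routed experts."""
    out = shared_expert(x)
    for index, weight in top_k_route(router_logits, k):
        out += weight * routed_experts[index](x)
    return out
```

Only the selected experts are evaluated, which is why the activated parameter count (13.3B in the abstract) can be far below the total (53.3B).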
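[AI-44] Efficient Knowledge Tracing Leveraging Higher-Order Information in Integrated Graphs
[Quick read]: This paper addresses the sharply rising computational cost of knowledge tracing (KT) models on large graphs and long learning sequences. Existing methods typically model the full global graph, which leads to high memory use and low computational efficiency and makes them hard to apply in real online learning settings. The key to the solution is Dual Graph Attention-based Knowledge Tracing (DGAKT), which builds a subgraph-based representation and computes attention only over the local subgraph relevant to each target interaction. This substantially reduces memory and compute requirements while preserving high-order student-exercise-knowledge component (KC) relationships, improving both predictive performance and resource efficiency.
Link: https://arxiv.org/abs/2507.18668
Authors: Donghee Han, Daehee Kim, Minjun Lee, Daeyoung Roh, Keejun Han, Mun Yong Yi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: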
Abstract:The rise of online learning has led to the development of various knowledge tracing (KT) methods. However, existing methods have overlooked the problem of increasing computational cost when utilizing large graphs and long learning sequences. To address this issue, we introduce Dual Graph Attention-based Knowledge Tracing (DGAKT), a graph neural network model designed to leverage high-order information from subgraphs representing student-exercise-KC relationships. DGAKT incorporates a subgraph-based approach to enhance computational efficiency. By processing only relevant subgraphs for each target interaction, DGAKT significantly reduces memory and computational requirements compared to full global graph models. Extensive experimental results demonstrate that DGAKT not only outperforms existing KT models but also sets a new standard in resource efficiency, addressing a critical need that has been largely overlooked by prior KT approaches.
[AI-45] Prompt Engineering and the Effectiveness of Large Language Models in Enhancing Human Productivity
[Quick read]: This paper examines how the structure and clarity of user prompts affect the effectiveness and productivity of large language model (LLM) outputs in education, professional work, and creative domains. It emphasizes the importance of prompt engineering: based on survey data from 243 respondents, users who employ clear, structured, and context-aware prompts report higher task efficiency and better outcomes, which yields practical guidance for the everyday use of generative AI.
Link: https://arxiv.org/abs/2507.18638
Authors: Rizal Khoirul Anam
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 38 pages, 15 tables, 5 figures. Submitted as a research paper draft for arXiv. Based on survey data collected in 2025
Abstract:The widespread adoption of large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek has significantly changed how people approach tasks in education, professional work, and creative domains. This paper investigates how the structure and clarity of user prompts impact the effectiveness and productivity of LLM outputs. Using data from 243 survey respondents across various academic and occupational backgrounds, we analyze AI usage habits, prompting strategies, and user satisfaction. The results show that users who employ clear, structured, and context-aware prompts report higher task efficiency and better outcomes. These findings emphasize the essential role of prompt engineering in maximizing the value of generative AI and provide practical implications for its everyday use.
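[AI-46] More Expert-like Eye Gaze Movement Patterns are Related to Better X-ray Reading
[Quick read]: This paper studies how novices acquire and refine visual search skills, using the demanding visual task of diagnosing dental radiographs. The key to the approach is to model eye-gaze scanpaths as directed graphs with network analysis methods and to apply time series clustering to each network metric in order to identify patterns of visual search strategy over multiple semesters. Transition entropy is negatively correlated with diagnostic performance, while the number of nodes, the number of edges, and average PageRank are positively correlated with it. Changes in individual students' metrics over time suggest a developmental shift from intermediate to expert-level processing, which can inform the design of AI-assisted learning interventions.
Link: https://arxiv.org/abs/2507.18637
Authors: Pingjing Yang, Jennifer Cromley, Jana Diesner
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This work will appear at the 26th International Conference on Artificial Intelligence in Education (AIED 2025)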
Abstract:Understanding how novices acquire and hone visual search skills is crucial for developing and optimizing training methods across domains. Network analysis methods can be used to analyze graph representations of visual expertise. This study investigates the relationship between eye-gaze movements and learning outcomes among undergraduate dentistry students who were diagnosing dental radiographs over multiple semesters. We use network analysis techniques to model eye-gaze scanpaths as directed graphs and examine changes in network metrics over time. Using time series clustering on each metric, we identify distinct patterns of visual search strategies and explore their association with students’ diagnostic performance. Our findings suggest that the network metric of transition entropy is negatively correlated with performance scores, while the number of nodes and edges as well as average PageRank are positively correlated with performance scores. Changes in network metrics for individual students over time suggest a developmental shift from intermediate to expert-level processing. These insights contribute to understanding expertise acquisition in visual tasks and can inform the design of AI-assisted learning interventions.
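The scanpath-as-directed-graph idea can be made concrete with a short sketch. It assumes a scanpath is a sequence of area-of-interest labels and uses one common definition of transition entropy (conditional entropy of the next region given the current one, weighted by how often each region is left). The abstract does not state the paper's exact definition, so the function name and formula are assumptions.

```python
import math
from collections import Counter


def scanpath_graph_metrics(scanpath):
    """Node count, edge count, and transition entropy (bits) of a gaze scanpath."""
    # Each consecutive pair of fixated regions is one directed transition.
    transitions = Counter(zip(scanpath, scanpath[1:]))
    out_totals = Counter()
    for (src, _dst), count in transitions.items():
        out_totals[src] += count
    total = sum(transitions.values())

    entropy = 0.0
    for (src, _dst), count in transitions.items():
        p_src = out_totals[src] / total   # share of transitions leaving src
        p_cond = count / out_totals[src]  # P(next = dst | current = src)
        entropy -= p_src * p_cond * math.log2(p_cond)

    return {
        "nodes": len(set(scanpath)),
        "edges": len(transitions),
        "transition_entropy": entropy,
    }
```

A strictly alternating scanpath such as A, B, A, B is fully predictable and has zero transition entropy; a viewer who leaves region A for B or C with equal probability contributes uncertainty at A only.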
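[AI-47] Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery
[Quick read]: This paper addresses unequal economic accessibility and coverage in digital health services, in particular the shortfall in preventive care and continuous patient monitoring for underserved populations. Its core proposal is to use voice assistants powered by large language models (LLMs) to build a cost-effective AI agent system, Agent PULSE, that automates routine monitoring tasks to improve the accessibility and efficiency of care. The paper presents an economic model for deploying AI agents where human intervention is economically unfeasible, and analyzes technical challenges (real-time conversational processing, integration with healthcare systems, privacy compliance) alongside policy considerations around regulation, bias mitigation, and patient autonomy.
Link: https://arxiv.org/abs/2507.16229
Authors: Bo Wen, Chen Wang, Qiwei Han, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Jeffrey L. Rogers
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: IEEE International Conference on Digital Health (ICDH) 2025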
Abstract:The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) – a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine – we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.
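[AI-48] Controlling Topological Defects in Polar Fluids via Reinforcement Learning
[Quick read]: This paper addresses how feedback control can steer integer-charged topological defects in a confined active polar fluid, specifically how spatially modulated active stress can guide defects along prescribed trajectories. The key to the solution is to combine a continuum hydrodynamic model with a reinforcement learning framework. First, the model shows that localized control of activity induces flow fields that exploit non-linear couplings to reposition and direct defects. Second, reinforcement learning discovers effective control strategies, so that the AI agent learns the underlying dynamics and spatially structures the activity, producing robust defect transport on both trained and novel trajectories. The approach offers insight into the controllability of active matter and the design of adaptive, self-organized materials.
Link: https://arxiv.org/abs/2507.19298
Authors: Abhinav Singh, Petros Koumoutsakos
Affiliations: unknown
Categories: Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: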
Abstract:Topological defects in active polar fluids exhibit complex dynamics driven by internally generated stresses, reflecting the deep interplay between topology, flow, and non-equilibrium hydrodynamics. Feedback control offers a powerful means to guide such systems, enabling transitions between dynamic states. We investigated closed-loop steering of integer-charged defects in a confined active fluid by modulating the spatial profile of activity. Using a continuum hydrodynamic model, we show that localized control of active stress induces flow fields that can reposition and direct defects along prescribed trajectories by exploiting non-linear couplings in the system. A reinforcement learning framework is used to discover effective control strategies that produce robust defect transport across both trained and novel trajectories. The results highlight how AI agents can learn the underlying dynamics and spatially structure activity to manipulate topological excitations, offering insights into the controllability of active matter and the design of adaptive, self-organized materials.
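[AI-49] Assessment of Personality Dimensions Across Situations Using Conversational Speech
[Quick read]: This paper addresses a gap in automatic personality perception (APP) research, which has treated personality as a static trait and ignored how the situation changes perceived personality. The authors analyze the relationship between conversational speech and perceived personality in two work situations (a neutral interview and a stressful client interaction). The key findings are: first, perceived personalities differ significantly across the two interactions; second, loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while in stressful contexts these features correlate with neuroticism; third, handcrafted acoustic and non-verbal features outperform speaker embeddings for inferring perceived personality; finally, stressful interactions are more predictive of neuroticism, in line with existing psychological research.
Link: https://arxiv.org/abs/2507.19137
Authors: Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss
Affiliations: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: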
Abstract:Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual’s perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.
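[AI-50] CNN-based Surface Temperature Forecasts with Ensemble Numerical Weather Prediction over Medium-range Forecast Periods
[Quick read]: This paper addresses the systematic and random errors in medium-range surface temperature forecasts (lead times beyond the five-day short-range period), which rely on low-resolution numerical weather prediction (NWP) models because of limited computational resources. The key to the solution is twofold. First, a convolutional neural network (CNN) post-processes each ensemble member, performing bias correction and spatial super-resolution to reduce systematic errors. Second, the CNN-corrected members are averaged to reduce random errors. The experiments show that CNN correction followed by ensemble averaging is more accurate than the reverse order, and that the method, although based on low-resolution ensemble forecasts, notably outperforms high-resolution deterministic NWP models.
Link: https://arxiv.org/abs/2507.18937
Authors: Takuya Inoue, Takuya Kawabata
Affiliations: Meteorological Research Institute, Tsukuba, Japan
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages, 10 figures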
Abstract:This study proposes a method that integrates convolutional neural networks (CNNs) with ensemble numerical weather prediction (NWP) models, enabling surface temperature forecasting at lead times beyond the short-range (five-day) forecast period. Owing to limited computational resources, operational medium-range temperature forecasts typically rely on low-resolution NWP models, which are prone to systematic and random errors. To resolve these limitations, the proposed method first reduces systematic errors through CNN-based post-processing (bias correction and spatial super-resolution) on each ensemble member, reconstructing high-resolution temperature fields from low-resolution model outputs. Second, it reduces random errors through ensemble averaging of the CNN-corrected members. This study also investigates whether the sequence of CNN correction and ensemble averaging affects the forecast accuracy. For comparison with the proposed method, we additionally conducted experiments with the CNN trained on ensemble-averaged forecasts. The first approach–CNN correction before ensemble averaging–consistently achieved higher accuracy than the reverse approach. Although based on low-resolution ensemble forecasts, the proposed method notably outperformed the high-resolution deterministic NWP models. These findings indicate that combining CNN-based correction with ensemble averaging effectively reduces both the systematic and random errors in NWP model outputs. The proposed approach is a practical and scalable solution for improving medium-range temperature forecasts, and is particularly valuable at operational centers with limited computational resources.
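The two orderings compared in the abstract can be written down in a few lines. In this sketch the `correct` argument is a hypothetical pointwise stand-in for the CNN; the real model operates on whole temperature fields and also changes resolution. Each ensemble member is represented as a flat list of grid-point temperatures, which is an assumption made for illustration.

```python
def ensemble_mean(members):
    """Average the members grid point by grid point."""
    n = len(members)
    return [sum(values) / n for values in zip(*members)]


def correct_then_average(members, correct):
    """Apply the correction to every member, then average (the proposed order)."""
    corrected = [[correct(v) for v in member] for member in members]
    return ensemble_mean(corrected)


def average_then_correct(members, correct):
    """Average the raw members first, then correct the mean (the reverse order)."""
    return [correct(v) for v in ensemble_mean(members)]
```

For an affine correction such as subtracting a constant bias, both orders give the same field. They differ only when the correction is non-linear, as a trained CNN generally is, which is why the ordering is an empirical question.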
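[AI-51] Multi-Year Maintenance Planning for Large-Scale Infrastructure Systems: A Novel Network Deep Q-Learning Approach
[Quick read]: This paper addresses the scalability and computational limits of traditional maintenance and rehabilitation planning in large-scale infrastructure asset management, in particular optimizing decisions for thousands of assets under budget constraints. The key to the solution is a deep reinforcement learning (DRL) framework that decomposes the network-level Markov Decision Process (MDP) into individual asset-level MDPs while using a unified neural network architecture, which reduces computational complexity and improves learning efficiency and scalability. The framework incorporates annual budget constraints directly through a budget allocation mechanism, so that maintenance plans remain cost-effective. In a case study on a pavement network of 68,800 segments, it improves on Progressive Linear Programming and genetic algorithms in both efficiency and network performance.
Link: https://arxiv.org/abs/2507.18732
Authors: Amir Fard, Arnold X.-X. Yuan
Affiliations: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: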
Abstract:Infrastructure asset management is essential for sustaining the performance of public infrastructure such as road networks, bridges, and utility networks. Traditional maintenance and rehabilitation planning methods often face scalability and computational challenges, particularly for large-scale networks with thousands of assets under budget constraints. This paper presents a novel deep reinforcement learning (DRL) framework that optimizes asset management strategies for large infrastructure networks. By decomposing the network-level Markov Decision Process (MDP) into individual asset-level MDPs while using a unified neural network architecture, the proposed framework reduces computational complexity, improves learning efficiency, and enhances scalability. The framework directly incorporates annual budget constraints through a budget allocation mechanism, ensuring maintenance plans are both optimal and cost-effective. Through a case study on a large-scale pavement network of 68,800 segments, the proposed DRL framework demonstrates significant improvements over traditional methods like Progressive Linear Programming and genetic algorithms, both in efficiency and network performance. This advancement contributes to infrastructure asset management and the broader application of reinforcement learning in complex, large-scale environments.
Machine Learning
[LG-0] Forest-Guided Clustering – Shedding Light into the Random Forest Black Box
Link: https://arxiv.org/abs/2507.19455
Authors: Lisa Barros de Andrade e Sousa, Gregor Miller, Ronan Le Gleut, Dominik Thalmeier, Helena Pelin, Marie Piraud
Categories: Machine Learning (cs.LG)
Comments:
Abstract:As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model’s internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
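The abstract says FGC groups instances "according to shared decision paths" but does not spell out the similarity measure. A minimal sketch of one common way to quantify this, assuming the classic random-forest proximity (the fraction of trees in which two samples fall into the same leaf), is shown below. The `leaf_ids` table is hypothetical; in practice it could come from a trained forest, for example the leaf indices returned by scikit-learn's `RandomForestClassifier.apply`. This is an illustration of the general idea, not the authors' implementation.

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Random-forest proximity: share of trees where two samples land in the same leaf.

    leaf_ids[i][t] is the leaf index that sample i reaches in tree t.
    """
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[1.0 if i == j else 0.0 for j in range(n_samples)] for i in range(n_samples)]
    for i, j in combinations(range(n_samples), 2):
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments for 4 samples in a 3-tree forest.
leaf_ids = [
    [0, 2, 1],
    [0, 2, 1],
    [0, 3, 4],
    [5, 3, 4],
]
prox = proximity_matrix(leaf_ids)
```

A distance such as `1 - prox[i][j]` could then be passed to any clustering routine that accepts a precomputed distance matrix; which routine FGC uses is not stated in the abstract.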
[LG-1] Observations Meet Actions: Learning Control-Sufficient Representations for Robust Policy Generalization
Link: https://arxiv.org/abs/2507.19437
Authors: Yuliang Gu, Hongpeng Cao, Marco Caccamo, Naira Hovakimyan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract: Capturing latent variations (“contexts”) is key to deploying reinforcement-learning (RL) agents beyond their training regime. We recast context-based RL as a dual inference-control problem and formally characterize two properties and their hierarchy: observation sufficiency (preserving all predictive information) and control sufficiency (retaining decision-making relevant information). Exploiting this dichotomy, we derive a contextual evidence lower bound (ELBO)-style objective that cleanly separates representation learning from policy learning and optimize it with Bottlenecked Contextual Policy Optimization (BCPO), an algorithm that places a variational information-bottleneck encoder in front of any off-policy policy learner. On standard continuous-control benchmarks with shifting physical parameters, BCPO matches or surpasses other baselines while using fewer samples and retaining performance far outside the training regime. The framework unifies theory, diagnostics, and practice for context-based RL.
[LG-2] SILS: Strategic Influence on Liquidity Stability and Whale Detection in Concentrated-Liquidity DEXs
Link: https://arxiv.org/abs/2507.19411
Authors: Ali RajabiNekoo, Laleh Rasoul, Amirfarhad Farhadi, Azadeh Zamanifar
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Comments:
Abstract: Traditional methods for identifying impactful liquidity providers (LPs) in Concentrated Liquidity Market Makers (CLMMs) rely on broad measures, such as nominal capital size or surface-level activity, which often lead to inaccurate risk analysis. The SILS framework offers a significantly more detailed approach, characterizing LPs not just as capital holders but as dynamic systemic agents whose actions directly impact market stability. This represents a shift from static, volume-based analysis to a dynamic, impact-focused understanding. The approach uses on-chain event logs and smart contract execution traces to compute Exponential Time-Weighted Liquidity (ETWL) profiles and apply unsupervised anomaly detection. Most importantly, it defines an LP’s functional importance through the Liquidity Stability Impact Score (LSIS), a counterfactual metric that measures the potential degradation of the market if the LP withdraws. This combined approach provides a more detailed and realistic characterization of an LP’s impact, moving beyond the binary and often misleading classifications used by existing methods. It enables SILS to accurately identify high-impact LPs, including those missed by traditional methods, and supports essential applications such as a protective oracle layer and actionable trader signals, thereby significantly enhancing the DeFi ecosystem. The framework provides transparency into the underlying liquidity structure and associated risks, reducing the common false positives and uncovering critical false negatives found in traditional models. SILS therefore provides an effective mechanism for proactive risk management, transforming how DeFi protocols safeguard their ecosystems against asymmetric liquidity behavior.
[LG-3] FD4QC: Application of Classical and Quantum-Hybrid Machine Learning for Financial Fraud Detection A Technical Report
Link: https://arxiv.org/abs/2507.19402
Authors: Matteo Cardaioli, Luca Marangoni, Giada Martini, Francesco Mazzolin, Luca Pajola, Andrea Ferretto Parodi, Alessandra Saitta, Maria Chiara Vernillo
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: This is a technical report
Abstract: The increasing complexity and volume of financial transactions pose significant challenges to traditional fraud detection systems. This technical report investigates and compares the efficacy of classical, quantum, and quantum-hybrid machine learning models for the binary classification of fraudulent financial activities. Methodologically, we first develop a comprehensive behavioural feature engineering framework to transform raw transactional data into a rich, descriptive feature set. Second, we implement and evaluate a range of models on the IBM Anti-Money Laundering (AML) dataset. The classical baseline models include Logistic Regression, Decision Tree, Random Forest, and XGBoost. These are compared against three hybrid classical-quantum architectures: a Quantum Support Vector Machine (QSVM), a Variational Quantum Classifier (VQC), and a Hybrid Quantum Neural Network (HQNN). Furthermore, we propose Fraud Detection for Quantum Computing (FD4QC), a practical, API-driven system architecture designed for real-world deployment, featuring a classical-first, quantum-enhanced philosophy with robust fallback mechanisms. Our results demonstrate that classical tree-based models, particularly Random Forest, significantly outperform the quantum counterparts in the current setup, achieving high accuracy (97.34%) and F-measure (86.95%). Among the quantum models, QSVM shows the most promise, delivering high precision (77.15%) and a low false-positive rate (1.36%), albeit with lower recall and significant computational overhead. This report provides a benchmark for a real-world financial application, highlights the current limitations of quantum machine learning in this domain, and outlines promising directions for future research.
[LG-4] A Data-Driven Approach to Estimate LEO Orbit Capacity Models
Link: https://arxiv.org/abs/2507.19365
Authors: Braden Stock, Maddox McVarthy, Simone Servadio
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 15 figures
Abstract: Utilizing the Sparse Identification of Nonlinear Dynamics algorithm (SINDy) and Long Short-Term Memory Recurrent Neural Networks (LSTM), the population of resident space objects in LEO, divided into Active, Derelict, and Debris, can be accurately modeled to predict future satellite and debris propagation. The proposed approach uses a data set from a computationally expensive high-fidelity model, MOCAT-MC, to build a lightweight, low-fidelity counterpart that provides accurate forecasting in a shorter time frame.
[LG-5] Reconstruction of Sparse Urban Wireless Signals via Group Equivariant Non-Expansive Operators
Link: https://arxiv.org/abs/2507.19349
Authors: Lorenzo Mario Amorosa, Francesco Conti, Nicola Quercioli, Flavio Zabini, Tayebeh Lotfi Mahyari, Yiqun Ge, Patrizio Frosini
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:In emerging communication systems such as sixth generation (6G) wireless networks, efficient resource management and service delivery rely on accurate knowledge of spatially-varying quantities like signal-to-interference-noise ratio (SINR) maps, which are costly to acquire at high resolution. This work explores the reconstruction of such spatial signals from sparse measurements using Group Equivariant Non-Expansive Operators (GENEOs), offering a low-complexity alternative to traditional neural networks. The concept of GENEO, which originated in topological data analysis (TDA), is a mathematical tool used in machine learning to represent agents modelled as functional operators acting on data while incorporating application-specific invariances. Leveraging these invariances reduces the number of parameters with respect to traditional neural networks and mitigates data scarcity by enforcing known algebraic and geometric constraints that reflect symmetries in the agents’ actions. In this paper, we introduce a novel GENEO-based approach for SINR map reconstruction in urban wireless communication networks using extremely sparse sampling. We demonstrate that this mathematical framework achieves competitive performance compared to established methods. Our evaluation, conducted using both statistical and TDA metrics, highlights the advantages of our approach in accurately reconstructing spatial signals under severe data limitations on the number of samples.
[LG-6] Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges
Link: https://arxiv.org/abs/2507.19346
Authors: Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.
[LG-7] Negative news posts are less prevalent and generate lower user engagement than non-negative news posts across six countries
Link: https://arxiv.org/abs/2507.19300
Authors: Szymon Talaga, Dominik Batorski, Magdalena Wojcieszak
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:
Abstract: Although news negativity is often studied, comparative evidence on the prevalence of and engagement with negative political and non-political news posts on social media is missing. We use 6,081,134 Facebook posts published between January 1, 2020, and April 1, 2024, by 97 media organizations in six countries (U.S., UK, Ireland, Poland, France, Spain) and develop two multilingual classifiers for labeling posts as (non-)political and (non-)negative. We show that: (1) negative news posts constitute a relatively small fraction (12.6%); (2) political news posts are neither more nor less negative than non-political news posts; (3) U.S. political news posts are less negative relative to the other countries on average (40% lower odds); (4) negative news posts get 15% fewer likes and 13% fewer comments than non-negative news posts. Lastly, (5) we provide estimates of the proportion of the total volume of user engagement with negative news posts and show that only 10.2% to 13.1% of engagement is linked to negative posts by the analyzed news organizations.
[LG-8] Query Efficient Structured Matrix Learning
Link: https://arxiv.org/abs/2507.19290
Authors: Noah Amsel, Pratyush Avi, Tyler Chen, Feyza Duman Keles, Chinmay Hegde, Cameron Musco, Christopher Musco, David Persson
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract: We study the problem of learning a structured approximation (low-rank, sparse, banded, etc.) to an unknown matrix $A$ given access to matrix-vector product (matvec) queries of the form $x \rightarrow Ax$ and $x \rightarrow A^T x$. This problem is of central importance to algorithms across scientific computing and machine learning, with applications to fast multiplication and inversion for structured matrices, building preconditioners for first-order optimization, and as a model for differential operator learning. Prior work focuses on obtaining query complexity upper and lower bounds for learning specific structured matrix families that commonly arise in applications. We initiate the study of the problem in greater generality, aiming to understand the query complexity of learning approximations from general matrix families. Our main result focuses on finding a near-optimal approximation to $A$ from any finite-sized family of matrices, $\mathcal{F}$. Standard results from matrix sketching show that $O(\log|\mathcal{F}|)$ matvec queries suffice in this setting. This bound can also be achieved, and is optimal, for vector-matrix-vector queries of the form $x, y \rightarrow x^T A y$, which have been widely studied in work on rank-$1$ matrix sensing. Surprisingly, we show that, in the matvec model, it is possible to obtain a nearly quadratic improvement in complexity, to $\tilde{O}(\sqrt{\log|\mathcal{F}|})$. Further, we prove that this bound is tight up to log-log factors. Via covering number arguments, our result extends to well-studied infinite families. As an example, we establish that a near-optimal approximation from any linear matrix family of dimension $q$ can be learned with $\tilde{O}(\sqrt{q})$ matvec queries, improving on an $O(q)$ bound achievable via sketching techniques and vector-matrix-vector queries.
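To make the matvec query model concrete, the sketch below illustrates the sketching-style baseline that the abstract contrasts against, not the paper's improved algorithm: probe the hidden matrix with a few random Gaussian vectors and pick the family member whose products agree best. The function names, the toy family, and the number of probes are assumptions made for illustration only.

```python
import random


def matvec(matrix, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]


def select_from_family(query, family, dim, num_queries, rng):
    """Return the index of the candidate that best matches the matvec answers.

    `query` is the only access to the unknown matrix A: it maps x to A x.
    """
    probes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]
    answers = [query(x) for x in probes]  # num_queries matvec queries in total

    def sketch_error(candidate):
        # Squared residual between A x and (candidate) x, summed over all probes.
        return sum(
            (r - s) ** 2
            for x, ax in zip(probes, answers)
            for r, s in zip(ax, matvec(candidate, x))
        )

    return min(range(len(family)), key=lambda i: sketch_error(family[i]))


# Toy setup (hypothetical): the hidden matrix is one of three candidates.
hidden = [[2.0, 0.0], [0.0, 1.0]]
family = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
best = select_from_family(lambda x: matvec(hidden, x), family, dim=2,
                          num_queries=3, rng=random.Random(0))
```

This sketch only uses queries of the form $x \rightarrow Ax$; the abstract's model also allows transpose queries $x \rightarrow A^T x$.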
[LG-9] Component-Based Machine Learning for Indoor Flow and Temperature Fields Prediction Latent Feature Aggregation and Flow Interaction
Link: https://arxiv.org/abs/2507.19233
Authors: Shaofan Wang, Nils Thuerey, Philipp Geyer
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract: Accurate and efficient prediction of indoor airflow and temperature distributions is essential for building energy optimization and occupant comfort control. However, traditional CFD simulations are computationally intensive, limiting their integration into real-time or design-iterative workflows. This study proposes a component-based machine learning (CBML) surrogate modeling approach to replace conventional CFD simulation for fast prediction of indoor velocity and temperature fields. The model consists of three neural networks: a convolutional autoencoder with residual connections (CAER) to extract and compress flow features, a multilayer perceptron (MLP) to map inlet velocities to latent representations, and a convolutional neural network (CNN) as an aggregator to combine single-inlet features into dual-inlet scenarios. A two-dimensional room with varying left and right air inlet velocities is used as a benchmark case, with CFD simulations providing training and testing data. Results show that the CBML model quickly and accurately predicts two-component aggregated velocity and temperature fields across both training and testing datasets.
[LG-10] Dependency-aware synthetic tabular data generation
Link: https://arxiv.org/abs/2507.19211
Authors: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 3 figures, submitted to Pattern Recognition
Abstract:Synthetic tabular data is increasingly used in privacy-sensitive domains such as health care, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are rarely or often poorly retained in synthetic datasets. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for synthetic tabular data generation. We created benchmark datasets with known dependencies to evaluate our proposed HFGF. The framework first generates independent features using any standard generative model, and then reconstructs dependent features based on predefined FD and LD rules. Our experiments on four benchmark datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Our findings demonstrate that HFGF can significantly enhance the structural fidelity and downstream utility of synthetic tabular data.
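The two-stage idea described in the abstract (generate independent features first, then rebuild dependent features from predefined rules) can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the `zip -> city` functional dependency, the column names, and the random sampler that stands in for a real generative model such as CTGAN or TVAE.

```python
import random

# Hypothetical functional dependency: the zip code determines the city.
CITY_BY_ZIP = {"10115": "Berlin", "18055": "Rostock", "80331": "Munich"}


def sample_independent_features(n_rows, rng):
    """Stand-in for any tabular generative model; emits only independent features."""
    zips = sorted(CITY_BY_ZIP)
    return [{"zip": rng.choice(zips), "age": rng.randint(18, 90)} for _ in range(n_rows)]


def reconstruct_dependent_features(rows):
    """Fill dependent columns from predefined rules instead of sampling them."""
    completed = []
    for row in rows:
        row = dict(row)  # copy so the generator's output is left untouched
        row["city"] = CITY_BY_ZIP[row["zip"]]  # functional dependency zip -> city
        completed.append(row)
    return completed


rng = random.Random(42)
synthetic = reconstruct_dependent_features(sample_independent_features(5, rng))
```

Because `city` is derived rather than sampled, the dependency holds in every synthetic row by construction, which is the property the abstract reports existing generators often fail to preserve.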
[LG-11] Physics-Informed Graph Neural Networks for Transverse Momentum Estimation in CMS Trigger Systems
Link: https://arxiv.org/abs/2507.19205
Authors: Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Muhammad Mostafa Monowar, Md. Abdul Hamid
Subjects: Machine Learning (cs.LG)
Comments:
Abstract: Real-time particle transverse momentum ($p_T$) estimation in high-energy physics demands algorithms that are both efficient and accurate under strict hardware constraints. Static machine learning models degrade under high pileup and lack physics-aware optimization, while generic graph neural networks (GNNs) often neglect domain structure critical for robust $p_T$ regression. We propose a physics-informed GNN framework that systematically encodes detector geometry and physical observables through four distinct graph construction strategies: station-as-node, feature-as-node, bending angle-centric, and pseudorapidity ($\eta$)-centric representations. This framework integrates these tailored graph structures with a novel Message Passing Layer (MPL), featuring intra-message attention and gated updates, and domain-specific loss functions incorporating $p_T$-distribution priors. Our co-design methodology yields superior accuracy-efficiency trade-offs compared to existing baselines. Extensive experiments on the CMS Trigger Dataset validate the approach: a station-informed EdgeConv model achieves a state-of-the-art MAE of 0.8525 with $\ge$55% fewer parameters than deep learning baselines, especially TabNet, while an $\eta$-centric MPL configuration also demonstrates improved accuracy with comparable efficiency. These results establish the promise of physics-guided GNNs for deployment in resource-constrained trigger systems.
[LG-12] Latent Granular Resynthesis using Neural Audio Codecs
Link: https://arxiv.org/abs/2507.19202
Authors: Nao Tokui, Tom Baker
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted at ISMIR 2025 Late Breaking Demos
Abstract:We introduce a novel technique for creative audio resynthesis that operates by reworking the concept of granular synthesis at the latent vector level. Our approach creates a “granular codebook” by encoding a source audio corpus into latent vector segments, then matches each latent grain of a target audio signal to its closest counterpart in the codebook. The resulting hybrid sequence is decoded to produce audio that preserves the target’s temporal structure while adopting the source’s timbral characteristics. This technique requires no model training, works with diverse audio materials, and naturally avoids the discontinuities typical of traditional concatenative synthesis through the codec’s implicit interpolation during decoding. We include supplementary material at this https URL , as well as a proof-of-concept implementation to allow users to experiment with their own sounds at this https URL .
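The matching step described in the abstract (replace each latent grain of the target with its closest counterpart in a codebook built from the source) can be sketched as a nearest-neighbour lookup. This is a minimal illustration under stated assumptions: the short vectors below stand in for latent segments that a neural audio codec would produce, squared Euclidean distance is an assumed choice of "closest", and the encoding and decoding steps are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def match_grains(target_grains, codebook):
    """For each target grain, pick the closest source grain from the codebook."""
    return [
        min(codebook, key=lambda entry: squared_distance(grain, entry))
        for grain in target_grains
    ]


# Hypothetical 2-D "latent grains"; real codec latents have many more dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]       # encoded source corpus
target = [[0.9, 1.2], [4.0, 6.0], [0.1, -0.2]]        # encoded target signal
hybrid = match_grains(target, codebook)
```

The `hybrid` sequence keeps the target's ordering in time while every element comes from the source codebook, which mirrors the abstract's description of preserving temporal structure while adopting the source's timbre.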
[LG-13] Automatic Cough Analysis for Non-Small Cell Lung Cancer Detection
Link: https://arxiv.org/abs/2507.19174
Authors: Chiara Giangregorio (1), Cristina Maria Licciardello (1), Vanja Miskovic (1 and 2), Leonardo Provenzano (1 and 2), Alessandra Laura Giulia Pedrocchi (1), Andra Diana Dumitrascu (2), Arsela Prelaj (2), Marina Chiara Garassino (3), Emilia Ambrosini (1), Simona Ferrante (1 and 4) ((1) Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy, (2) Fondazione IRCCS Istituto Nazionale dei Tumori di Milano, Milan, Italy, (3) Department of Medicine, Section of Hematology/Oncology, University of Chicago, Chicago, IL, USA, (4) IRCCS Istituto Neurologico Carlo Besta, Milan, Italy)
Subjects: Machine Learning (cs.LG)
Comments: Emilia Ambrosini and Simona Ferrante equally contributed to the work
Abstract: Early detection of non-small cell lung cancer (NSCLC) is critical for improving patient outcomes, and novel approaches are needed to facilitate early diagnosis. In this study, we explore the use of automatic cough analysis as a pre-screening tool for distinguishing between NSCLC patients and healthy controls. Cough audio recordings were prospectively acquired from a total of 227 subjects, divided into NSCLC patients and healthy controls. The recordings were analyzed using machine learning techniques, such as support vector machine (SVM) and XGBoost, as well as deep learning approaches, specifically convolutional neural networks (CNN) and transfer learning with VGG16. To enhance the interpretability of the machine learning model, we utilized Shapley Additive Explanations (SHAP). The fairness of the models across demographic groups was assessed by comparing the performance of the best model across age groups (58 years or younger versus older than 58 years) and gender, using the equalized odds difference on the test set. The results demonstrate that CNN achieves the best performance, with an accuracy of 0.83 on the test set. SVM achieves slightly lower performance (accuracy of 0.76 in validation and 0.78 on the test set), making it suitable in contexts with low computational power. The use of SHAP for SVM interpretation further enhances model transparency, making it more trustworthy for clinical applications. Fairness analysis shows slightly higher disparity across age (0.15) than gender (0.09) on the test set. Therefore, to strengthen the reliability of our findings, a larger, more diverse, and unbiased dataset is needed, particularly one including individuals at risk of NSCLC and those in early disease stages.
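The fairness figures above are equalized odds differences. As a rough sketch, assuming the common definition (the larger of the between-group gaps in true positive rate and false positive rate), the metric can be computed as below. The toy labels, predictions, and group names are made up for illustration and are unrelated to the study's data; the abstract does not say which implementation the authors used.

```python
def positive_rate(y_true, y_pred, label):
    """Share of positive predictions among samples whose true label equals `label`.

    With label=1 this is the true positive rate, with label=0 the false positive rate.
    Raises ZeroDivisionError if a group has no samples with that true label.
    """
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(preds) / len(preds)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap across demographic groups."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        tprs.append(positive_rate(yt, yp, 1))
        fprs.append(positive_rate(yt, yp, 0))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data (hypothetical): 1 = patient, 0 = healthy control.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["younger"] * 4 + ["older"] * 4
eod = equalized_odds_difference(y_true, y_pred, groups)
```

A value of 0 would mean both groups have identical true and false positive rates; larger values indicate a bigger disparity in at least one of the two.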
[LG-14] Explainable AI guided unsupervised fault diagnostics for high-voltage circuit breakers
Link: https://arxiv.org/abs/2507.19168
Authors: Chi-Ching Hsu, Gaëtan Frusque, Florent Forest, Felipe Macedo, Christian M. Franck, Olga Fink
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract: Commercial high-voltage circuit breaker (CB) condition monitoring systems rely on directly observable physical parameters such as gas filling pressure with pre-defined thresholds. While these parameters are crucial, they only cover a small subset of malfunctioning mechanisms and usually can be monitored only if the CB is disconnected from the grid. To facilitate online condition monitoring while CBs remain connected, non-intrusive measurement techniques such as vibration or acoustic signals are necessary. Currently, CB condition monitoring studies using these signals typically utilize supervised methods for fault diagnostics, where ground-truth fault types are known due to artificially introduced faults in laboratory settings. This supervised approach is, however, not feasible in real-world applications, where fault labels are unavailable. In this work, we propose a novel unsupervised fault detection and segmentation framework for CBs based on vibration and acoustic signals. This framework can detect deviations from the healthy state. An explainable artificial intelligence (XAI) approach is then applied to the detected faults for fault diagnostics. The specific contributions are: (1) we propose an integrated unsupervised fault detection and segmentation framework that is capable of detecting faults and clustering different faults with only healthy data required during training; and (2) we provide an unsupervised explainability-guided fault diagnostics approach using XAI to offer domain experts potential indications of the aged or faulty components, achieving fault diagnostics without the prerequisite of ground-truth fault labels. These contributions are validated using an experimental dataset from a high-voltage CB under healthy and artificially introduced fault conditions, contributing to more reliable CB system operation.
[LG-15] Game-Theoretic Gradient Control for Robust Neural Network Training
Link: https://arxiv.org/abs/2507.19143
Authors: Maria Zaitseva, Ivan Tomilov, Natalia Gusarova
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 19 pages, 6 figures
Abstract:Feed-forward neural networks (FFNNs) are vulnerable to input noise, reducing prediction performance. Existing regularization methods like dropout often alter network architecture or overlook neuron interactions. This study aims to enhance FFNN noise robustness by modifying backpropagation, interpreted as a multi-agent game, and exploring controlled target variable noising. Our “gradient dropout” selectively nullifies hidden layer neuron gradients with probability 1 - p during backpropagation, while keeping forward passes active. This is framed within compositional game theory. Additionally, target variables were perturbed with white noise or stable distributions. Experiments on ten diverse tabular datasets show varying impacts: improvement or diminishing of robustness and accuracy, depending on dataset and hyperparameters. Notably, on regression tasks, gradient dropout (p = 0.9) combined with stable distribution target noising significantly increased input noise robustness, evidenced by flatter MSE curves and more stable SMAPE values. These results highlight the method’s potential, underscore the critical role of adaptive parameter tuning, and open new avenues for analyzing neural networks as complex adaptive systems exhibiting emergent behavior within a game-theoretic framework.
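The "gradient dropout" mechanism in the abstract can be illustrated with a tiny one-hidden-layer network: the forward pass is left untouched, and during backpropagation each hidden neuron's gradient is kept with probability p and set to zero with probability 1 - p. The sketch below is a minimal reading of that description; the network shape, the tanh activation, the squared-error loss, and the choice not to rescale kept gradients are all assumptions, not details from the paper.

```python
import math
import random


def forward(x, w1, w2):
    """Hidden activations h_j = tanh(w1[j] . x); scalar output y = w2 . h."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    return h, y


def backward_with_gradient_dropout(x, target, w1, w2, p, rng):
    """Gradients of the squared error, with hidden-neuron gradients masked."""
    h, y = forward(x, w1, w2)            # forward pass stays fully active
    dy = 2.0 * (y - target)              # d(loss)/dy for loss = (y - target)^2
    grad_w2 = [dy * hj for hj in h]      # output-layer gradient is not masked
    # Gradient with respect to each hidden neuron's pre-activation.
    delta = [dy * w * (1.0 - hj * hj) for w, hj in zip(w2, h)]
    # Gradient dropout: keep each hidden gradient with probability p.
    delta = [d if rng.random() < p else 0.0 for d in delta]
    grad_w1 = [[d * xi for xi in x] for d in delta]
    return grad_w1, grad_w2


x = [1.0, 2.0]
w1 = [[0.1, -0.2], [0.3, 0.4]]
w2 = [0.5, -0.5]
grad_w1, grad_w2 = backward_with_gradient_dropout(x, 1.0, w1, w2, p=0.9,
                                                  rng=random.Random(0))
```

Unlike standard dropout, the network's prediction is unaffected by the mask; only the update to the hidden-layer weights changes. With p = 1 this reduces to ordinary backpropagation.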
[LG-16] GCL-GCN: Graphormer and Contrastive Learning Enhanced Attributed Graph Clustering Network
Link: https://arxiv.org/abs/2507.19095
Authors: Binxiong Li, Xu Xiang, Xue Li, Binyu Zhao, Yujie Liu, Huijie Tang, Benhan Yang, Zhixuan Chen
Subjects: Machine Learning (cs.LG)
Comments: The source code for this study is available at this https URL
Abstract:Attributed graph clustering holds significant importance in modern data analysis. However, due to the complexity of graph data and the heterogeneity of node attributes, leveraging graph information for clustering remains challenging. To address this, we propose a novel deep graph clustering model, GCL-GCN, specifically designed to address the limitations of existing models in capturing local dependencies and complex structures when dealing with sparse and heterogeneous graph data. GCL-GCN introduces an innovative Graphormer module that combines centrality encoding and spatial relationships, effectively capturing both global and local information between nodes, thereby enhancing the quality of node representations. Additionally, we propose a novel contrastive learning module that significantly enhances the discriminative power of feature representations. In the pre-training phase, this module increases feature distinction through contrastive learning on the original feature matrix, ensuring more identifiable initial representations for subsequent graph convolution and clustering tasks. Extensive experimental results on six datasets demonstrate that GCL-GCN outperforms 14 advanced methods in terms of clustering quality and robustness. Specifically, on the Cora dataset, it improves ACC, NMI, and ARI by 4.94%, 13.01%, and 10.97%, respectively, compared to the primary comparison method MBN.
[LG-17] Clustering-Oriented Generative Attribute Graph Imputation ACM-MM’25
Link: https://arxiv.org/abs/2507.19085
Authors: Mulin Chen, Bocheng Wang, Jiaxin Zhong, Zongcheng Miao, Xuelong Li
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ACM MM’25
Abstract:Attribute-missing graph clustering has emerged as a significant unsupervised task, where only attribute vectors of partial nodes are available and the graph structure is intact. The related models generally follow the two-step paradigm of imputation and refinement. However, most imputation approaches fail to capture class-relevant semantic information, leading to sub-optimal imputation for clustering. Moreover, existing refinement strategies optimize the learned embedding through graph reconstruction, while neglecting the fact that some attributes are uncorrelated with the graph. To remedy the problems, we establish the Clustering-oriented Generative Imputation with reliable Refinement (CGIR) model. Concretely, the subcluster distributions are estimated to reveal the class-specific characteristics precisely, and constrain the sampling space of the generative adversarial module, such that the imputation nodes are impelled to align with the correct clusters. Afterwards, multiple subclusters are merged to guide the proposed edge attention network, which identifies the edge-wise attributes for each class, so as to avoid the redundant attributes in graph reconstruction from disturbing the refinement of overall embedding. To sum up, CGIR splits attribute-missing graph clustering into the search and mergence of subclusters, which guides to implement node imputation and refinement within a unified framework. Extensive experiments prove the advantages of CGIR over state-of-the-art competitors.
[LG-18] Exploring molecular assembly as a biosignature using mass spectrometry and machine learning
Link: https://arxiv.org/abs/2507.19057
Authors: Lindsay A. Rutter,Abhishek Sharma,Ian Seet,David Obeh Alobo,An Goto,Leroy Cronin
Subjects: Machine Learning (cs.LG)
*Comments: 35 pages, 7 figures, 62 references
Abstract:Molecular assembly offers a promising path to detect life beyond Earth, while minimizing assumptions based on terrestrial life. As mass spectrometers will be central to upcoming Solar System missions, predicting molecular assembly from their data without needing to elucidate unknown structures will be essential for unbiased life detection. An ideal agnostic biosignature must be interpretable and experimentally measurable. Here, we show that molecular assembly, a recently developed approach to measure objects that have been produced by evolution, satisfies both criteria. First, it is interpretable for life detection, as it reflects the assembly of molecules with their bonds as building blocks, in contrast to approaches that discount construction history. Second, it can be determined without structural elucidation, as it can be physically measured by mass spectrometry, a property that distinguishes it from other approaches that use structure-based information measures for molecular complexity. Whilst molecular assembly is directly measurable using mass spectrometry data, there are limits imposed by mission constraints. To address this, we developed a machine learning model that predicts molecular assembly with high accuracy, reducing error by three-fold compared to baseline models. Simulated data shows that even small instrumental inconsistencies can double model error, emphasizing the need for standardization. These results suggest that standardized mass spectrometry databases could enable accurate molecular assembly prediction, without structural elucidation, providing a proof-of-concept for future astrobiology missions.
[LG-19] Dynamics-Informed Reservoir Computing with Visibility Graphs
Link: https://arxiv.org/abs/2507.19046
Authors: Charlotte Geier(1),Merten Stender(2) ((1) Dynamics Group, Hamburg University of Technology, (2) Chair of Cyber-Physical Systems in Mechanical Engineering, Technische Universität Berlin, Germany)
Subjects: Machine Learning (cs.LG)
*Comments: 7 pages, 4 figures. The following article has been submitted to Chaos: An Interdisciplinary Journal of Nonlinear Science
Abstract:Accurate prediction of complex and nonlinear time series remains a challenging problem across engineering and scientific disciplines. Reservoir computing (RC) offers a computationally efficient alternative to traditional deep learning by training only the read-out layer while employing a randomly structured and fixed reservoir network. Despite its advantages, the largely random reservoir graph architecture often results in suboptimal and oversized networks with poorly understood dynamics. Addressing this issue, we propose a novel Dynamics-Informed Reservoir Computing (DyRC) framework that systematically infers the reservoir network structure directly from the input training sequence. This work proposes to employ the visibility graph (VG) technique, which converts time series data into networks by representing measurement points as nodes linked by mutual visibility. The reservoir network is constructed by directly adopting the VG network from a training data sequence, leveraging the parameter-free visibility graph approach to avoid expensive hyperparameter tuning. This process results in a reservoir that is directly informed by the specific dynamics of the prediction task under study. We assess the DyRC-VG method through prediction tasks involving the canonical nonlinear Duffing oscillator, evaluating prediction accuracy and consistency. Compared to an Erdős-Rényi graph of the same size, spectral radius, and comparable density, we observe higher prediction quality and more consistent performance over repeated implementations in the DyRC-VG.
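The visibility graph construction mentioned in the abstract above can be sketched as follows. The sketch assumes the natural visibility criterion (two samples are linked when the straight line between them passes above every sample in between); the abstract does not say which visibility variant is used, and the function name `natural_visibility_graph` is hypothetical.

```python
def natural_visibility_graph(series):
    """Return the adjacency matrix of the natural visibility graph.

    Node i stands for sample series[i] taken at time index i. Nodes a < b are
    linked when every intermediate sample c lies strictly below the straight
    line joining (a, series[a]) and (b, series[b]).
    """
    n = len(series)
    adjacency = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                adjacency[a][b] = adjacency[b][a] = 1
    return adjacency


for row in natural_visibility_graph([1, 3, 2, 4]):
    print(row)
```

Neighbouring samples are always linked because there is nothing between them to block the view. In a reservoir setting the resulting matrix would presumably still be rescaled (for example to a target spectral radius, which the abstract mentions as a comparison criterion) before use; that step is left out here.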
[LG-20] Neural Ordinary Differential Equations for Learning and Extrapolating System Dynamics Across Bifurcations
Link: https://arxiv.org/abs/2507.19036
Authors: Eva van Tegelen,George van Voorn,Ioannis Athanasiadis,Peter van Heijster
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*Comments:
Abstract:Forecasting system behaviour near and across bifurcations is crucial for identifying potential shifts in dynamical systems. While machine learning has recently been used to learn critical transitions and bifurcation structures from data, most studies remain limited as they exclusively focus on discrete-time methods and local bifurcations. To address these limitations, we use Neural Ordinary Differential Equations which provide a continuous, data-driven framework for learning system dynamics. We apply our approach to a predator-prey system that features both local and global bifurcations, presenting a challenging test case. Our results show that Neural Ordinary Differential Equations can recover underlying bifurcation structures directly from timeseries data by learning parameter-dependent vector fields. Notably, we demonstrate that Neural Ordinary Differential Equations can forecast bifurcations even beyond the parameter regions represented in the training data. We also assess the method’s performance under limited and noisy data conditions, finding that model accuracy depends more on the quality of information that can be inferred from the training data, than on the amount of data available.
[LG-21] ProGMLP: A Progressive Framework for GNN-to-MLP Knowledge Distillation with Efficient Trade-offs
Link: https://arxiv.org/abs/2507.19031
Authors: Weigang Lu,Ziyu Guan,Wei Zhao,Yaming Yang,Yujie Sun,Zheng Liang,Yibing Zhan,Dapeng Tao
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:GNN-to-MLP (G2M) methods have emerged as a promising approach to accelerate Graph Neural Networks (GNNs) by distilling their knowledge into simpler Multi-Layer Perceptrons (MLPs). These methods bridge the gap between the expressive power of GNNs and the computational efficiency of MLPs, making them well-suited for resource-constrained environments. However, existing G2M methods are limited by their inability to flexibly adjust inference cost and accuracy dynamically, a critical requirement for real-world applications where computational resources and time constraints can vary significantly. To address this, we introduce a Progressive framework designed to offer flexible and on-demand trade-offs between inference cost and accuracy for GNN-to-MLP knowledge distillation (ProGMLP). ProGMLP employs a Progressive Training Structure (PTS), where multiple MLP students are trained in sequence, each building on the previous one. Furthermore, ProGMLP incorporates Progressive Knowledge Distillation (PKD) to iteratively refine the distillation process from GNNs to MLPs, and Progressive Mixup Augmentation (PMA) to enhance generalization by progressively generating harder mixed samples. Our approach is validated through comprehensive experiments on eight real-world graph datasets, demonstrating that ProGMLP maintains high accuracy while dynamically adapting to varying runtime scenarios, making it highly effective for deployment in diverse application settings.
[LG-22] Adapting to Fragmented and Evolving Data: A Fisher Information Perspective
Link: https://arxiv.org/abs/2507.18996
Authors: Behraj Khan,Tahir Qasim Syed,Nouman Muhammad Durrani
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Modern machine learning systems operating in dynamic environments often face sequential covariate shift (SCS), where input distributions evolve over time while the conditional distribution remains stable. We introduce FADE (Fisher-based Adaptation to Dynamic Environments), a lightweight and theoretically grounded framework for robust learning under SCS. FADE employs a shift-aware regularization mechanism anchored in Fisher information geometry, guiding adaptation by modulating parameter updates based on sensitivity and stability. To detect significant distribution changes, we propose a Cramer-Rao-informed shift signal that integrates KL divergence with temporal Fisher dynamics. Unlike prior methods requiring task boundaries, target supervision, or experience replay, FADE operates online with fixed memory and no access to target labels. Evaluated on seven benchmarks spanning vision, language, and tabular data, FADE achieves up to 19% higher accuracy under severe shifts, outperforming methods such as TENT and DIW. FADE also generalizes naturally to federated learning by treating heterogeneous clients as temporally fragmented environments, enabling scalable and stable adaptation in decentralized settings. Theoretical analysis guarantees bounded regret and parameter consistency, while empirical results demonstrate FADE’s robustness across modalities and shift intensities.
[LG-23] Agent0: Leveraging LLM Agents to Discover Multi-value Features from Text for Enhanced Recommendations KDD’25
Link: https://arxiv.org/abs/2507.18993
Authors: Blaž Škrlj,Benoît Guilleminot,Andraž Tori
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: Agent4IR, KDD '25
Abstract:Large language models (LLMs) and their associated agent-based frameworks have significantly advanced automated information extraction, a critical component of modern recommender systems. While these multitask frameworks are widely used in code generation, their application in data-centric research is still largely untapped. This paper presents Agent0, an LLM-driven, agent-based system designed to automate information extraction and feature construction from raw, unstructured text. Categorical features are crucial for large-scale recommender systems but are often expensive to acquire. Agent0 coordinates a group of interacting LLM agents to automatically identify the most valuable text aspects for subsequent tasks (such as models or AutoML pipelines). Beyond its feature engineering capabilities, Agent0 also offers an automated prompt-engineering tuning method that utilizes dynamic feedback loops from an oracle. Our findings demonstrate that this closed-loop methodology is both practical and effective for automated feature discovery, which is recognized as one of the most challenging phases in current recommender system development.
[LG-24] Reinforcement Learning via Conservative Agent for Environments with Random Delays
Link: https://arxiv.org/abs/2507.18992
Authors: Jongsoo Lee,Jangwon Kim,Jiseok Jeong,Soohee Han
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Real-world reinforcement learning applications are often hindered by delayed feedback from environments, which violates the Markov assumption and introduces significant challenges. Although numerous delay-compensating methods have been proposed for environments with constant delays, environments with random delays remain largely unexplored due to their inherent variability and unpredictability. In this study, we propose a simple yet robust agent for decision-making under random delays, termed the conservative agent, which reformulates the random-delay environment into its constant-delay equivalent. This transformation enables any state-of-the-art constant-delay method to be directly extended to the random-delay environments without modifying the algorithmic structure or sacrificing performance. We evaluate the conservative agent-based algorithm on continuous control tasks, and empirical results demonstrate that it significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.
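The abstract above does not spell out how the random-delay environment is turned into a constant-delay one. One plausible reading, sketched here purely as an assumption, is that every observation is treated as if it had suffered the worst-case delay: observations that arrive early are held back and released exactly `max_delay` steps after they were generated. The class `ConstantDelayView`, the helper `simulate`, and the bounded-delay assumption are all hypothetical and are not taken from the paper.

```python
class ConstantDelayView:
    """Presents randomly delayed observations as if every delay were max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.arrived = {}  # generation step -> observation

    def receive(self, generated_at, observation):
        """Store an observation when it arrives, keyed by its generation step."""
        self.arrived[generated_at] = observation

    def release(self, now):
        """Return the observation generated exactly max_delay steps ago, if any."""
        return self.arrived.pop(now - self.max_delay, None)


def simulate(delays, max_delay):
    """Replay observations with the given per-step delays through the view."""
    view = ConstantDelayView(max_delay)
    released = []
    for now in range(len(delays) + max_delay):
        # Deliver every observation whose random delay has just elapsed.
        for step, delay in enumerate(delays):
            if step + delay == now:
                view.receive(step, f"obs{step}")
        out = view.release(now)
        if out is not None:
            released.append((now, out))
    return released


print(simulate([2, 0, 1], max_delay=2))
```

As long as no delay exceeds `max_delay`, the agent sees the observations in generation order with a fixed lag, which is the interface a constant-delay method expects.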
[LG-25] KASPER: Kolmogorov Arnold Networks for Stock Prediction and Explainable Regimes
Link: https://arxiv.org/abs/2507.18983
Authors: Vidhi Oad,Param Pathak,Nouhaila Innan,Shalini D,Muhammad Shafique
Subjects: Machine Learning (cs.LG)
*Comments: 11 pages, 7 figures, 3 tables
Abstract:Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov-Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov-Arnold networks with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an R^2 score of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.
[LG-26] Secure Best Arm Identification in the Presence of a Copycat
Link: https://arxiv.org/abs/2507.18975
Authors: Asaf Cohen,Onur Günlü
Subjects: Machine Learning (cs.LG)
*Comments: To appear in ITW 2025
Abstract:Consider the problem of best arm identification with a security constraint. Specifically, assume a setup of stochastic linear bandits with K arms of dimension d. In each arm pull, the player receives a reward that is the sum of the dot product of the arm with an unknown parameter vector and independent noise. The player’s goal is to identify the best arm after T arm pulls. Moreover, assume a copycat Chloe is observing the arm pulls. The player wishes to keep Chloe ignorant of the best arm. While a minimax-optimal algorithm identifies the best arm with an $\Omega\left(\frac{T}{\log(d)}\right)$ error exponent, it easily reveals its best-arm estimate to an outside observer, as the best arms are played more frequently. A naive secure algorithm that plays all arms equally results in an $\Omega\left(\frac{T}{d}\right)$ exponent. In this paper, we propose a secure algorithm that plays with coded arms. The algorithm does not require any key or cryptographic primitives, yet achieves an $\Omega\left(\frac{T}{\log^2(d)}\right)$ exponent while revealing almost no information on the best arm.
[LG-27] TiVy: Time Series Visual Summary for Scalable Visualization IEEE-VIS2025
Link: https://arxiv.org/abs/2507.18972
Authors: Gromit Yeuk-Yin Chan,Luis Gustavo Nonato,Themis Palpanas,Cláudio T. Silva,Juliana Freire
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
*Comments: to be published in TVCG (IEEE VIS 2025)
Abstract:Visualizing multiple time series presents fundamental tradeoffs between scalability and visual clarity. Time series capture the behavior of many large-scale real-world processes, from stock market trends to urban activities. Users often gain insights by visualizing them as line charts, juxtaposing or superposing multiple time series to compare them and identify trends and patterns. However, existing representations struggle with scalability: covering long time spans leads to visual clutter from too many small multiples or overlapping lines. We propose TiVy, a new algorithm that summarizes time series using sequential patterns. It transforms the series into a set of symbolic sequences based on subsequence visual similarity using Dynamic Time Warping (DTW), then constructs a disjoint grouping of similar subsequences based on the frequent sequential patterns. The grouping result, a visual summary of time series, provides uncluttered superposition with fewer small multiples. Unlike common clustering techniques, TiVy extracts similar subsequences (of varying lengths) aligned in time. We also present an interactive time series visualization that renders large-scale time series in real time. Our experimental evaluation shows that our algorithm (1) extracts clear and accurate patterns when visualizing time series data, and (2) achieves a significant speed-up (1000x) compared to a straightforward DTW clustering. We also demonstrate the efficiency of our approach to explore hidden structures in massive time series data in two usage scenarios.
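For readers unfamiliar with the Dynamic Time Warping measure named in the abstract above, the textbook dynamic-programming form is sketched below. This shows only the classic DTW distance with an absolute-difference step cost; it is not TiVy's symbolization or grouping procedure, and the function name `dtw_distance` is an assumption.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference cost."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a, advance in b, advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed by warping
print(dtw_distance([0, 0], [1, 1]))           # 2.0: every aligned pair differs by 1
```

Because warping lets one sample align with several, two subsequences of different lengths can still have distance zero, which is why DTW suits comparisons of visually similar shapes of varying length.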
[LG-28] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction
Link: https://arxiv.org/abs/2507.18926
Authors: Trung Nguyen,Md Masud Rana,Farjana Tasnim Mukta,Chang-Guo Zhan,Duc Duy Nguyen
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.9704 and 0.9685) and in regressing continuous permeability values (RMSE of 0.4609, Pearson correlation of 0.7759). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.
[LG-29] Early Mortality Prediction in ICU Patients with Hypertensive Kidney Disease Using Interpretable Machine Learning
Link: https://arxiv.org/abs/2507.18866
Authors: Yong Si,Junyi Fan,Li Sun,Shuheng Chen,Minoo Ahmadi,Elham Pishgar,Kamiar Alaei,Greg Placencia,Maryam Pishgar
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Background: Hypertensive kidney disease (HKD) patients in intensive care units (ICUs) face high short-term mortality, but tailored risk prediction tools are lacking. Early identification of high-risk individuals is crucial for clinical decision-making. Methods: We developed a machine learning framework to predict 30-day in-hospital mortality among ICU patients with HKD using early clinical data from the MIMIC-IV v2.2 database. A cohort of 1,366 adults was curated with strict criteria, excluding malignancy cases. Eighteen clinical features (including vital signs, labs, comorbidities, and therapies) were selected via random forest importance and mutual information filtering. Several models were trained and compared with stratified five-fold cross-validation; CatBoost demonstrated the best performance. Results: CatBoost achieved an AUROC of 0.88 on the independent test set, with sensitivity of 0.811 and specificity of 0.798. SHAP values and Accumulated Local Effects (ALE) plots showed the model relied on meaningful predictors such as altered consciousness, vasopressor use, and coagulation status. Additionally, the DREAM algorithm was integrated to estimate patient-specific posterior risk distributions, allowing clinicians to assess both predicted mortality and its uncertainty. Conclusions: We present an interpretable machine learning pipeline for early, real-time risk assessment in ICU patients with HKD. By combining high predictive performance with uncertainty quantification, our model supports individualized triage and transparent clinical decisions. This approach shows promise for clinical deployment and merits external validation in broader critical care populations.
[LG-30] Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models
Link: https://arxiv.org/abs/2507.18858
Authors: Ruimeng Ye,Zihan Wang,Xiao Yang,Zinan Ling,Manling Li,Bo Hui
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. While existing W2SG studies focus on simple tasks like binary classification, we extend this paradigm to complex interactive decision-making environments. Specifically, we fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Motivated by the human learning process, we propose to generalize not only success knowledge but also failure experience so that the strong model can learn from failed trajectories accumulated by weak models. To effectively and efficiently elicit the potential of strong agents, we further construct "trajectory trees," a hierarchical representation that organizes weak model-generated action trajectories, coupled with Monte Carlo Tree Search (MCTS) to optimize the strong model. Through theoretical analysis, we provide formal guarantees for the effectiveness of our method in improving W2SG performance. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains, validating the scalability and robustness of our proposed framework. Our code is available at: this https URL
[LG-31] Scale-Consistent Learning for Partial Differential Equations
Link: https://arxiv.org/abs/2507.18813
Authors: Zongyi Li,Samuel Lanthaler,Catherine Deng,Michael Chen,Yixuan Wang,Kamyar Azizzadenesheli,Anima Anandkumar
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:Machine learning (ML) models have emerged as a promising approach for solving partial differential equations (PDEs) in science and engineering. Previous ML models typically cannot generalize outside the training data; for example, a trained ML model for the Navier-Stokes equations only works for a fixed Reynolds number (Re) on a pre-defined domain. To overcome these limitations, we propose a data augmentation scheme based on scale-consistency properties of PDEs and design a scale-informed neural operator that can model a wide range of scales. Our formulation leverages two facts: (i) PDEs can be rescaled, or more concretely, a given domain can be re-scaled to unit size, and the parameters and the boundary conditions of the PDE can be appropriately adjusted to represent the original solution, and (ii) the solution operators on a given domain are consistent on the sub-domains. We leverage these facts to create a scale-consistency loss that encourages matching the solutions evaluated on a given domain and the solution obtained on its sub-domain from the rescaled PDE. Since neural operators can fit to multiple scales and resolutions, they are the natural choice for incorporating scale-consistency loss during training of neural PDE solvers. We experiment with scale-consistency loss and the scale-informed neural operator model on the Burgers’ equation, Darcy Flow, Helmholtz equation, and Navier-Stokes equations. With scale-consistency, the model trained on Re of 1000 can generalize to Re ranging from 250 to 10000, and reduces the error by 34% on average across all datasets compared to baselines.
[LG-32] Even Faster Simulations with Flow Matching: A Study of Zero Degree Calorimeter Responses
Link: https://arxiv.org/abs/2507.18811
Authors: Maksymilian Wojnar
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Recent advances in generative neural networks, particularly flow matching (FM), have enabled the generation of high-fidelity samples while significantly reducing computational costs. A promising application of these models is accelerating simulations in high-energy physics (HEP), helping research institutions meet their increasing computational demands. In this work, we leverage FM to develop surrogate models for fast simulations of zero degree calorimeters in the ALICE experiment. We present an effective training strategy that enables the training of fast generative models with an exceptionally low number of parameters. This approach achieves state-of-the-art simulation fidelity for both neutron (ZN) and proton (ZP) detectors, while offering substantial reductions in computational costs compared to existing methods. Our FM model achieves a Wasserstein distance of 1.27 for the ZN simulation with an inference time of 0.46 ms per sample, compared to the current best of 1.20 with an inference time of approximately 109 ms. The latent FM model further improves the inference speed, reducing the sampling time to 0.026 ms per sample, with a minimal trade-off in accuracy. Similarly, our approach achieves a Wasserstein distance of 1.30 for the ZP simulation, outperforming the current best of 2.08. The source code is available at this https URL.
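As background for the flow matching (FM) models discussed in the abstract above, the following sketch shows the generic recipe under a straight-line probability path: regress a velocity model onto `x1 - x0` at an interpolated point, then sample by integrating the learned velocity with Euler steps. The linear path, the function names, and the stand-in `model` callables are assumptions; the abstract does not describe the paper's actual path, architecture, or training strategy.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def fm_loss(model, x0, x1, t):
    """Squared error between the predicted and the straight-line velocity."""
    target = [b - a for a, b in zip(x0, x1)]  # d/dt of interpolate(x0, x1, t)
    predicted = model(interpolate(x0, x1, t), t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target))


def euler_sample(model, x0, steps):
    """Integrate dx/dt = model(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        velocity = model(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Stand-in for a trained network: a constant velocity field (hypothetical).
constant_field = lambda x, t: [1.0, -2.0]
print(euler_sample(constant_field, [0.0, 0.0], steps=4))
print(fm_loss(constant_field, [0.0, 0.0], [1.0, -2.0], t=0.3))
```

Sampling cost is governed by the number of Euler steps and the size of the velocity network, which is one reason small FM models can be fast at inference.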
[LG-33] Test-time Offline Reinforcement Learning on Goal-related Experience
Link: https://arxiv.org/abs/2507.18809
Authors: Marco Bagatella,Mert Albaba,Jonas Hübotter,Georg Martius,Andreas Krause
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Foundation models compress a large amount of information in a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goal-conditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We find similarly that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at minimal compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.
[LG-34] Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator ICML2025
Link: https://arxiv.org/abs/2507.18807
Authors: YuXin Li,Felix Dangel,Derek Tam,Colin Raffel
Subjects: Machine Learning (cs.LG)
*Comments: 19 pages, 2 figures. Accepted as a spotlight poster at ICML 2025
Abstract:The diagonal of a model’s Fisher Information Matrix (the “Fisher diagonal”) has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher diagonal is estimated via squared sampled gradients of the model’s likelihood with respect to its parameters, averaged over a few hundred or thousand examples – a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher diagonal can be obtained “for free” by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher diagonal, we demonstrate that the “Squisher” (SQUared gradient accumulator as an approximation of the FISHER) consistently performs similarly to the Fisher diagonal while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher diagonal and provide empirical quantification of their respective impact.
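The two quantities compared in the abstract above can be written side by side. The sketch below is illustrative only: `fisher_diagonal` averages squared per-example gradients, while `adam_second_moment` replays the exponential moving average of squared gradients that Adam maintains (with the usual bias correction). The function names are hypothetical, and any rescaling the paper applies to turn the accumulator into the "Squisher" is not stated in the abstract and is therefore omitted.

```python
def fisher_diagonal(per_example_grads):
    """Average of squared per-example gradients, one entry per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[k] ** 2 for g in per_example_grads) / n for k in range(dim)]


def adam_second_moment(grad_history, beta2=0.999):
    """Bias-corrected moving average of squared gradients, as tracked by Adam."""
    v = [0.0] * len(grad_history[0])
    for g in grad_history:
        v = [beta2 * vk + (1.0 - beta2) * gk ** 2 for vk, gk in zip(v, g)]
    correction = 1.0 - beta2 ** len(grad_history)
    return [vk / correction for vk in v]


print(fisher_diagonal([[1.0, 2.0], [3.0, 0.0]]))
print(adam_second_moment([[2.0, -1.0]] * 10))
```

The accumulator is built from the gradients seen during training (typically minibatch gradients of the training loss at changing parameter values), whereas the Fisher diagonal uses per-example gradients at a fixed parameter setting; the paper states that it quantifies the effect of such differences.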
[LG-35] Ralts: Robust Aggregation for Enhancing Graph Neural Network Resilience on Bit-flip Errors
Link: https://arxiv.org/abs/2507.18804
Authors: Wencheng Zou,Nan Wu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Graph neural networks (GNNs) have been widely applied in safety-critical applications, such as financial and medical networks, in which compromised predictions may cause catastrophic consequences. While existing research on GNN robustness has primarily focused on software-level threats, hardware-induced faults and errors remain largely underexplored. As hardware systems progress toward advanced technology nodes to meet high-performance and energy efficiency demands, they become increasingly susceptible to transient faults, which can cause bit flips and silent data corruption, a prominent issue observed by major technology companies (e.g., Meta and Google). In response, we first present a comprehensive analysis of GNN robustness against bit-flip errors, aiming to reveal system-level optimization opportunities for future reliable and efficient GNN systems. Second, we propose Ralts, a generalizable and lightweight solution to bolster GNN resilience to bit-flip errors. Specifically, Ralts exploits various graph similarity metrics to filter out outliers and recover compromised graph topology, and incorporates these protective techniques directly into aggregation functions to support any message-passing GNNs. Evaluation results demonstrate that Ralts effectively enhances GNN robustness across a range of GNN models, graph datasets, error patterns, and both dense and sparse architectures. On average, under a BER of $3\times10^{-5}$, these robust aggregation functions improve prediction accuracy by at least 20% when errors are present in model weights or node embeddings, and by at least 10% when errors occur in adjacency matrices. Ralts is also optimized to deliver execution efficiency comparable to built-in aggregation functions in PyTorch Geometric.
[LG-36] Semantic IDs for Music Recommendation RECSYS2025
Link: https://arxiv.org/abs/2507.18800
Authors: M. Jeffrey Mei,Florian Henkel,Samuel E. Sandberg,Oliver Bembom,Andreas F. Ehmann
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: RecSys 2025 Industry Track
Abstract:Training recommender systems for next-item recommendation often requires unique embeddings to be learned for each item, which may take up most of the trainable parameters for a model. Shared embeddings, such as using content information, can reduce the number of distinct embeddings to be stored in memory. This allows for a more lightweight model; correspondingly, model complexity can be increased due to having fewer embeddings to store in memory. We show the benefit of using shared content-based features (‘semantic IDs’) in improving recommendation accuracy and diversity, while reducing model size, for two music recommendation datasets, including an online A/B test on a music streaming service.
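The memory argument can be made concrete with a parameter count. The sketch assumes, as an illustration only, that a semantic ID is a short tuple of codes and that an item embedding is the sum of one shared embedding per code; the abstract does not specify this construction, and all sizes and names below are hypothetical.

```python
def unique_embedding_params(num_items, dim):
    """One trainable vector per item."""
    return num_items * dim

def semantic_id_params(num_levels, codebook_size, dim):
    """One trainable vector per code, shared by every item using that code."""
    return num_levels * codebook_size * dim

def item_embedding(semantic_id, tables):
    """Sum the shared embeddings selected by each code in the semantic ID."""
    dim = len(tables[0][0])
    out = [0.0] * dim
    for level, code in enumerate(semantic_id):
        out = [o + t for o, t in zip(out, tables[level][code])]
    return out

# Hypothetical sizes: one million items versus 4 codebooks of 256 codes each.
per_item = unique_embedding_params(1_000_000, 64)
shared = semantic_id_params(4, 256, 64)
```

Under these assumed sizes the shared tables hold roughly a thousand times fewer embedding parameters, which is the budget the abstract suggests can be reinvested in model complexity.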
[LG-37] CLEAR: Unlearning Spurious Style-Content Associations with Contrastive LEarning with Anti-contrastive Regularization
Link: https://arxiv.org/abs/2507.18794
Authors: Minghui Sun,Benjamin A. Goldstein,Matthew M. Engelhard
Categories: Machine Learning (cs.LG)
Comments: 10 pages main text, 24 pages in total
Abstract:Learning representations unaffected by superficial characteristics is important to ensure that shifts in these characteristics at test time do not compromise downstream prediction performance. For instance, in healthcare applications, we might like to learn features that contain information about pathology yet are unaffected by race, sex, and other sources of physiologic variability, thereby ensuring predictions are equitable and generalizable across all demographics. Here we propose Contrastive LEarning with Anti-contrastive Regularization (CLEAR), an intuitive and easy-to-implement framework that effectively separates essential (i.e., task-relevant) characteristics from superficial (i.e., task-irrelevant) characteristics during training, leading to better performance when superficial characteristics shift at test time. We begin by supposing that data representations can be semantically separated into task-relevant content features, which contain information relevant to downstream tasks, and task-irrelevant style features, which encompass superficial attributes that are irrelevant to these tasks, yet may degrade performance due to associations with content present in training data that do not generalize. We then prove that our anti-contrastive penalty, which we call Pair-Switching (PS), minimizes the Mutual Information between the style attributes and content labels. Finally, we instantiate CLEAR in the latent space of a Variational Auto-Encoder (VAE), then perform experiments to quantitatively and qualitatively evaluate the resulting CLEAR-VAE over several image datasets. Our results show that CLEAR-VAE allows us to: (a) swap and interpolate content and style between any pair of samples, and (b) improve downstream classification performance in the presence of previously unseen combinations of content and style. Our code will be made publicly available.
[LG-38] Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation RECSYS’25
Link: https://arxiv.org/abs/2507.18756
Authors: Pedro R. Pires,Gregorio F. Azevedo,Pietro L. Campos,Rafael T. Sereicikas,Tiago A. Almeida
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Accepted to be published in RecSys’25, 10 pages, 3 figures
Abstract:Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.
[LG-39] An Explainable Equity-Aware P2P Energy Trading Framework for Socio-Economically Diverse Microgrid
Link: https://arxiv.org/abs/2507.18738
Authors: Abhijan Theja,Mayukha Pal
Categories: Systems and Control (eess.SY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:Fair and dynamic energy allocation in community microgrids remains a critical challenge, particularly when serving socio-economically diverse participants. Static optimization and cost-sharing methods often fail to adapt to evolving inequities, leading to participant dissatisfaction and unsustainable cooperation. This paper proposes a novel framework that integrates multi-objective mixed-integer linear programming (MILP), cooperative game theory, and a dynamic equity-adjustment mechanism driven by reinforcement learning (RL). At its core, the framework utilizes a bi-level optimization model grounded in Equity-regarding Welfare Maximization (EqWM) principles, which incorporate Rawlsian fairness to prioritize the welfare of the least advantaged participants. We introduce a Proximal Policy Optimization (PPO) agent that dynamically adjusts socio-economic weights in the optimization objective based on observed inequities in cost and renewable energy access. This RL-powered feedback loop enables the system to learn and adapt, continuously striving for a more equitable state. To ensure transparency, Explainable AI (XAI) is used to interpret the benefit allocations derived from a weighted Shapley value. Validated across six realistic scenarios, the framework demonstrates peak demand reductions of up to 72.6%, and significant cooperative gains. The adaptive RL mechanism further reduces the Gini coefficient over time, showcasing a pathway to truly sustainable and fair energy communities.
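The Gini coefficient cited as the equity measure can be computed with the standard sorted-rank formula. The sketch below uses that textbook definition; the abstract does not say how the authors compute it, and the sample cost values are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n

# Hypothetical per-household energy costs before and after rebalancing.
before = gini([10.0, 12.0, 40.0, 80.0])
after = gini([30.0, 32.0, 38.0, 42.0])
```

A falling value over time, as the abstract reports for the adaptive mechanism, means costs are spread more evenly across participants.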
[LG-40] SCORE-SET: A dataset of GuitarPro files for Music Phrase Generation and Sequence Learning
Link: https://arxiv.org/abs/2507.18723
Authors: Vishakh Begari
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 6 figures
Abstract:A curated dataset of Guitar Pro tablature files (.gp5 format), tailored for tasks involving guitar music generation, sequence modeling, and performance-aware learning is provided. The dataset is derived from MIDI notes in MAESTRO and GiantMIDI which have been adapted into rhythm guitar tracks. These tracks are further processed to include a variety of expression settings typical of guitar performance, such as bends, slides, vibrato, and palm muting, to better reflect the nuances of real-world guitar playing.
[LG-41] Linearly Convergent Algorithms for Nonsmooth Problems with Unknown Smooth Pieces
Link: https://arxiv.org/abs/2507.19465
Authors: Zhe Zhang,Suvrit Sra
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:We develop efficient algorithms for optimizing piecewise smooth (PWS) functions where the underlying partition of the domain into smooth pieces is unknown. For PWS functions satisfying a quadratic growth (QG) condition, we propose a bundle-level (BL) type method that achieves global linear convergence – to our knowledge, the first such result for any algorithm for this problem class. We extend this method to handle approximately PWS functions and to solve weakly-convex PWS problems, improving the state-of-the-art complexity to match the benchmark for smooth non-convex optimization. Furthermore, we introduce the first verifiable and accurate termination criterion for PWS optimization. Similar to the gradient norm in smooth optimization, this certificate tightly characterizes the optimality gap under the QG condition, and can moreover be evaluated without knowledge of any problem parameters. We develop a search subroutine for this certificate and embed it within a guess-and-check framework, resulting in an almost parameter-free algorithm for both the convex QG and weakly-convex settings.
[LG-42] Gradient-based grand canonical optimization enabled by graph neural networks with fractional atomic existence
Link: https://arxiv.org/abs/2507.19438
Authors: Mads-Peter Verner Christiansen,Bjørk Hammer
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Abstract:Machine learning interatomic potentials have become an indispensable tool for materials science, enabling the study of larger systems and longer timescales. State-of-the-art models are generally graph neural networks that employ message passing to iteratively update atomic embeddings that are ultimately used for predicting properties. In this work we extend the message passing formalism with the inclusion of a continuous variable that accounts for fractional atomic existence. This allows us to calculate the gradient of the Gibbs free energy with respect to both the Cartesian coordinates of atoms and their existence. Using this we propose a gradient-based grand canonical optimization method and document its capabilities for a Cu(110) surface oxide.
[LG-43] Perfect Clustering in Very Sparse Diverse Multiplex Networks
Link: https://arxiv.org/abs/2507.19423
Authors: Marianna Pensky
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 5 figures
Abstract:The paper studies the DIverse MultiPLEx Signed Generalized Random Dot Product Graph (DIMPLE-SGRDPG) network model (Pensky (2024)), where all layers of the network have the same collection of nodes. In addition, all layers can be partitioned into groups such that the layers in the same group are embedded in the same ambient subspace, but otherwise the matrices of connection probabilities can all be different. This setting includes the majority of multilayer network models as particular cases. The key task in this model is to recover the groups of layers with unique subspace structures, since the case where all layers of the network are embedded in the same subspace has been fairly well studied. Until now, clustering of layers in such networks was based on layer-per-layer analysis, which required the multilayer network to be sufficiently dense. In this paper, by contrast, we succeed in pooling information across all layers and provide a tensor-based methodology that ensures perfect clustering for a much sparser network. Our theoretical results, established under intuitive non-restrictive assumptions, assert that the new technique achieves perfect clustering under sparsity conditions that, up to logarithmic factors, coincide with the computational lower bound derived for a much simpler model.
[LG-44] Human-AI Synergy in Adaptive Active Learning for Continuous Lithium Carbonate Crystallization Optimization
Link: https://arxiv.org/abs/2507.19316
Authors: Shayan S. Mousavi Masouleh,Corey A. Sanz,Ryan P. Jansonius,Cara Cronin,Jason E. Hein,Jason Hattrick-Simpers
Categories: Materials Science (cond-mat.mtrl-sci); Other Condensed Matter (cond-mat.other); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Abstract:As demand for high-purity lithium surges with the growth of the electric vehicle (EV) industry, cost-effective extraction from lower-grade North American sources like the Smackover Formation is critical. These resources, unlike high-purity South American brines, require innovative purification techniques to be economically viable. Continuous crystallization is a promising method for producing battery-grade lithium carbonate, but its optimization is challenged by a complex parameter space and limited data. This study introduces a Human-in-the-Loop (HITL) assisted active learning framework to optimize the continuous crystallization of lithium carbonate. By integrating human expertise with data-driven insights, our approach accelerates the optimization of lithium extraction from challenging sources. Our results demonstrate the framework’s ability to rapidly adapt to new data, significantly improving the process’s tolerance to critical impurities like magnesium from the industry standard of a few hundred ppm to as high as 6000 ppm. This breakthrough makes the exploitation of low-grade, impurity-rich lithium resources feasible, potentially reducing the need for extensive pre-refinement processes. By leveraging artificial intelligence, we have refined operational parameters and demonstrated that lower-grade materials can be used without sacrificing product quality. This advancement is a significant step towards economically harnessing North America’s vast lithium reserves, such as those in the Smackover Formation, and enhancing the sustainability of the global lithium supply chain.
[LG-45] Bespoke multiresolution analysis of graph signals
Link: https://arxiv.org/abs/2507.19181
Authors: Giacomo Elefante,Gianluca Giacchi,Michael Multerer,Jacopo Quizi
Categories: Signal Processing (eess.SP); Discrete Mathematics (cs.DM); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:We present a novel framework for discrete multiresolution analysis of graph signals. The main analytical tool is the samplet transform, originally defined in the Euclidean framework as a discrete wavelet-like construction, tailored to the analysis of scattered data. The first contribution of this work is defining samplets on graphs. To this end, we subdivide the graph into a fixed number of patches, embed each patch into a Euclidean space, where we construct samplets, and eventually pull the construction back to the graph. This ensures orthogonality, locality, and the vanishing moments property with respect to properly defined polynomial spaces on graphs. Compared to classical Haar wavelets, this framework broadens the class of graph signals that can efficiently be compressed and analyzed. Along this line, we provide a definition of a class of signals that can be compressed using our construction. We support our findings with different examples of signals defined on graphs whose vertices lie on smooth manifolds. For efficient numerical implementation, we combine heavy edge clustering, to partition the graph into meaningful patches, with landmark Isomap, which provides low-dimensional embeddings for each patch. Our results demonstrate the method’s robustness, scalability, and ability to yield sparse representations with controllable approximation error, significantly outperforming traditional Haar wavelet approaches in terms of compression efficiency and multiresolution fidelity.
[LG-46] Graph Neural Network-Based Predictor for Optimal Quantum Hardware Selection
Link: https://arxiv.org/abs/2507.19093
Authors: Antonio Tudisco,Deborah Volpe,Giacomo Orlandi,Giovanna Turvani
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:The growing variety of quantum hardware technologies, each with unique peculiarities such as connectivity and native gate sets, creates challenges when selecting the best platform for executing a specific quantum circuit. This selection process usually involves a brute-force approach: compiling the circuit on various devices and evaluating performance based on factors such as circuit depth and gate fidelity. However, this method is computationally expensive and does not scale well as the number of available quantum processors increases. In this work, we propose a Graph Neural Network (GNN)-based predictor that automates hardware selection by analyzing the Directed Acyclic Graph (DAG) representation of a quantum circuit. Our study evaluates 498 quantum circuits (up to 27 qubits) from the MQT Bench dataset, compiled using Qiskit on four devices: three superconducting quantum processors (IBM-Kyiv, IBM-Brisbane, IBM-Sherbrooke) and one trapped-ion processor (IONQ-Forte). Performance is estimated using a metric that integrates circuit depth and gate fidelity, resulting in a dataset where 93 circuits are optimally compiled on the trapped-ion device, while the remaining circuits prefer superconducting platforms. By exploiting graph-based machine learning, our approach avoids extracting circuit features for model evaluation and instead embeds the circuit directly as a graph, significantly accelerating the optimal target decision-making process while retaining all the information. Experimental results show 94.4% accuracy and an 85.5% F1 score for the minority class, effectively predicting the best compilation target. The developed code is publicly available on GitHub (this https URL).
[LG-47] Probably Approximately Correct Causal Discovery
Link: https://arxiv.org/abs/2507.18903
Authors: Mian Wei,Somesh Jha,David Page
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:The discovery of causal relationships is a foundational problem in artificial intelligence, statistics, epidemiology, economics, and beyond. While elegant theories exist for accurate causal discovery given infinite data, real-world applications are inherently resource-constrained. Effective methods for inferring causal relationships from observational data must perform well under finite data and time constraints, where “performing well” implies achieving high, though not perfect accuracy. In his seminal paper A Theory of the Learnable, Valiant highlighted the importance of resource constraints in supervised machine learning, introducing the concept of Probably Approximately Correct (PAC) learning as an alternative to exact learning. Inspired by Valiant’s work, we propose the Probably Approximately Correct Causal (PACC) Discovery framework, which extends PAC learning principles to the causal field. This framework emphasizes both computational and sample efficiency for established causal methods such as propensity score techniques and instrumental variable approaches. Furthermore, we show that it can also provide theoretical guarantees for other widely used methods, such as the Self-Controlled Case Series (SCCS) method, which had previously lacked such guarantees.
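For readers unfamiliar with the PAC setting that the framework builds on, the classical sample-size bound for a finite hypothesis class gives a feel for what "probably approximately correct" buys. The sketch below is the textbook bound for a consistent learner in the realizable case, m >= (ln|H| + ln(1/delta)) / epsilon. It is not a bound from the PACC paper, and the function name is hypothetical.

```python
import math

def pac_sample_bound(num_hypotheses, epsilon, delta):
    """Samples sufficient for error <= epsilon with probability >= 1 - delta.

    Textbook bound for a finite hypothesis class and a learner that returns
    any hypothesis consistent with the data (realizable case).
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    needed = (math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon
    return math.ceil(needed)

# Halving the tolerated error roughly doubles the number of samples needed.
loose = pac_sample_bound(1000, 0.1, 0.05)
tight = pac_sample_bound(1000, 0.05, 0.05)
```

The bound grows only logarithmically in the number of hypotheses and in 1/delta, which is why finite-sample guarantees of this kind remain usable in practice.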
[LG-48] Optimizing Metachronal Paddling with Reinforcement Learning at Low Reynolds Number
Link: https://arxiv.org/abs/2507.18849
Authors: Alana A. Bailey,Robert D. Guy
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 14 figures, to be published in EPJ E
Abstract:Metachronal paddling is a swimming strategy in which an organism oscillates sets of adjacent limbs with a constant phase lag, propagating a metachronal wave through its limbs and propelling it forward. This limb coordination strategy is utilized by swimmers across a wide range of Reynolds numbers, which suggests that this metachronal rhythm was selected for its optimality of swimming performance. In this study, we apply reinforcement learning to a swimmer at zero Reynolds number and investigate whether the learning algorithm selects this metachronal rhythm, or if other coordination patterns emerge. We design the swimmer agent with an elongated body and pairs of straight, inflexible paddles placed along the body for various fixed paddle spacings. Based on paddle spacing, the swimmer agent learns qualitatively different coordination patterns. At tight spacings, a back-to-front metachronal wave-like stroke emerges which resembles the commonly observed biological rhythm, but at wide spacings, different limb coordinations are selected. Across all resulting strokes, the fastest stroke depends on the number of paddles, whereas the most efficient stroke is a back-to-front wave-like stroke regardless of the number of paddles.
[LG-49] Central limit theorems for the eigenvalues of graph Laplacians on data clouds
Link: https://arxiv.org/abs/2507.18803
Authors: Chenghui Li,Nicolás García Trillos,Housen Li,Leo Suchan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Differential Geometry (math.DG); Probability (math.PR)
Abstract:Given i.i.d. samples $X_n = \{x_1, \dots, x_n\}$ from a distribution supported on a low dimensional manifold $M$ embedded in Euclidean space, we consider the graph Laplacian operator $\Delta_n$ associated to an $\varepsilon$-proximity graph over $X_n$ and study the asymptotic fluctuations of its eigenvalues around their means. In particular, letting $\hat{\lambda}_l^{\varepsilon}$ denote the $l$-th eigenvalue of $\Delta_n$, and under suitable assumptions on the data generating model and on the rate of decay of $\varepsilon$, we prove that $\sqrt{n}\,(\hat{\lambda}_l^{\varepsilon} - \mathbb{E}[\hat{\lambda}_l^{\varepsilon}])$ is asymptotically Gaussian with a variance that we can explicitly characterize. A formal argument allows us to interpret this asymptotic variance as the dissipation of a gradient flow of a suitable energy with respect to the Fisher-Rao geometry. This geometric interpretation allows us to give, in turn, a statistical interpretation of the asymptotic variance in terms of a Cramer-Rao lower bound for the estimation of the eigenvalues of certain weighted Laplace-Beltrami operator. The latter interpretation suggests a form of asymptotic statistical efficiency for the eigenvalues of the graph Laplacian. We also present CLTs for multiple eigenvalues and through several numerical experiments explore the validity of our results when some of the assumptions that we make in our theoretical analysis are relaxed.
[LG-50] Discovering the dynamics of Sargassum rafts' centers of mass
Link: https://arxiv.org/abs/2507.18771
Authors: Francisco J. Beron-Vera,Gage Bonner
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Submitted to Chaos
Abstract:Since 2011, rafts of floating Sargassum seaweed have frequently obstructed the coasts of the Intra-Americas Seas. The motion of the rafts is represented by a high-dimensional nonlinear dynamical system. Referred to as the eBOMB model, this builds on the Maxey–Riley equation by incorporating interactions between clumps of Sargassum forming a raft and the effects of Earth’s rotation. The absence of a predictive law for the rafts’ centers of mass suggests a need for machine learning. In this paper, we evaluate and contrast Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) and Sparse Identification of Nonlinear Dynamics (SINDy). In both cases, a physics-inspired closure modeling approach rooted in eBOMB is taken. Specifically, the LSTM model learns a mapping from a collection of eBOMB variables to the difference between raft center-of-mass and ocean velocities. The SINDy model’s library of candidate functions is suggested by eBOMB variables and includes windowed velocity terms incorporating far-field effects of the carrying flow. Both LSTM and SINDy models perform most effectively in conditions with tightly bonded clumps, despite declining precision with rising complexity, such as with wind effects and when assessing loosely connected clumps. The LSTM model delivered the best results when designs were straightforward, with fewer neurons and hidden layers. While the LSTM model serves as an opaque black-box model lacking interpretability, the SINDy model brings transparency by discerning explicit functional relationships through the function libraries. Integration of the windowed velocity terms enabled effective modeling of nonlocal interactions, particularly in datasets featuring sparsely connected rafts.
[LG-51] Adaptive Neural Quantum States: A Recurrent Neural Network Perspective
Link: https://arxiv.org/abs/2507.18700
Authors: Jake McNaughton,Mohamed Hibat-Allah
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
Comments: 14 pages, 7 figures, 3 tables. Link to GitHub repository: this https URL
Abstract:Neural-network quantum states (NQS) are powerful neural-network ansätzes that have emerged as promising tools for studying quantum many-body physics through the lens of the variational principle. These architectures are known to be systematically improvable by increasing the number of parameters. Here we demonstrate an Adaptive scheme to optimize NQSs, through the example of recurrent neural networks (RNN), using a fraction of the computation cost while reducing training fluctuations and improving the quality of variational calculations targeting ground states of prototypical models in one- and two-spatial dimensions. This Adaptive technique reduces the computational cost through training small RNNs and reusing them to initialize larger RNNs. This work opens up the possibility for optimizing graphical processing unit (GPU) resources deployed in large-scale NQS simulations.
[LG-52] Interpretable inverse design of optical multilayer thin films based on extended neural adjoint and regression activation mapping
Link: https://arxiv.org/abs/2507.18644
Authors: Sungjun Kim,Jungho Kim
Categories: Optics (physics.optics); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Abstract:We propose an extended neural adjoint (ENA) framework, which meets six key criteria for artificial intelligence-assisted inverse design of optical multilayer thin films (OMTs): accuracy, efficiency, diversity, scalability, flexibility, and interpretability. To enhance the scalability of the existing neural adjoint method, we present a novel forward neural network architecture for OMTs and introduce a material loss function into the existing neural adjoint loss function, facilitating the exploration of material configurations of OMTs. Furthermore, we present the detailed formulation of the regression activation mapping for the presented forward neural network architecture (F-RAM), a feature visualization method aimed at improving interpretability. We validated the efficacy of the material loss by conducting an ablation study, where each component of the loss function is systematically removed and evaluated. The results indicated that the inclusion of the material loss significantly improves accuracy and diversity. To substantiate the performance of the ENA-based inverse design, we compared it against the residual network-based global optimization network (Res-GLOnet). The ENA yielded the OMT solutions of an inverse design with higher accuracy and better diversity compared to the Res-GLOnet. To demonstrate the interpretability, we applied F-RAM to diverse OMT structures with similar optical properties, obtained by the proposed ENA method. We showed that distributions of feature importance for various OMT structures exhibiting analogous optical properties are consistent, despite variations in material configurations, layer number, and thicknesses. Furthermore, we demonstrate the flexibility of the ENA method by restricting the initial layer of OMTs to SiO2 and 100 nm.
[LG-53] A Regression-Based Share Market Prediction Model for Bangladesh
Link: https://arxiv.org/abs/2507.18643
Authors: Syeda Tasnim Fabiha,Rubaiyat Jahan Mumu,Farzana Aktar,B M Mainul Hossain
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: Originally written in 2018. Updated in 2025 for open-access archiving. Not previously published
Abstract:The share market is one of the most important sectors in the economic development of a country. Every day, almost all companies issue their shares, and investors buy and sell shares of these companies. Generally, investors want to buy shares of the companies whose market liquidity is comparatively greater. Market liquidity depends on the average price of a share. In this paper, a thorough linear regression analysis has been performed on the stock market data of the Dhaka Stock Exchange. The linear model is then compared with a random forest on different metrics, with the random forest model showing better results. In addition, the individual significance of different factors for the variability of stock price has been identified and explained. This paper also shows that the time series data is not capable of generating a predictive linear model for analysis.
[LG-54] A comparison of stretched-grid and limited-area modelling for data-driven regional weather forecasting
Link: https://arxiv.org/abs/2507.18378
Authors: Jasper S. Wijnands,Michiel Van Ginderachter,Bastien François,Sophie Buurman,Piet Termonia,Dieter Van den Bleeken
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Abstract:Regional machine learning weather prediction (MLWP) models based on graph neural networks have recently demonstrated remarkable predictive accuracy, outperforming numerical weather prediction models at lower computational costs. In particular, limited-area model (LAM) and stretched-grid model (SGM) approaches have emerged for generating high-resolution regional forecasts, based on initial conditions from a regional (re)analysis. While LAM uses lateral boundaries from an external global model, SGM incorporates a global domain at lower resolution. This study aims to understand how the differences in model design impact relative performance and potential applications. Specifically, the strengths and weaknesses of these two approaches are identified for generating deterministic regional forecasts over Europe. Using the Anemoi framework, models of both types are built by minimally adapting a shared architecture and trained using global and regional reanalyses in a near-identical setup. Several inference experiments have been conducted to explore their relative performance and highlight key differences. Results show that both LAM and SGM are competitive deterministic MLWP models with generally accurate and comparable forecasting performance over the regional domain. Various differences were identified in the performance of the models across applications. LAM is able to successfully exploit high-quality boundary forcings to make predictions within the regional domain and is suitable in contexts where global data is difficult to acquire. SGM is fully self-contained for easier operationalisation, can take advantage of more training data and significantly surpasses LAM in terms of (temporal) generalisability. Our paper can serve as a starting point for meteorological institutes to guide their choice between LAM and SGM in developing an operational data-driven forecasting system.
Information Retrieval
[IR-0] On the Security of a Code-Based PIR Scheme
Link: https://arxiv.org/abs/2507.19295
Authors: Svenja Lage,Hannes Bartz
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Abstract:Private Information Retrieval (PIR) schemes allow clients to retrieve files from a database without disclosing the requested file’s identity to the server. In the pursuit of post-quantum security, most recent PIR schemes rely on hard lattice problems. In contrast, the so called CB-cPIR scheme stands out as a pioneering effort to base PIR schemes on hard problems in coding theory, thereby contributing significantly to the diversification of security foundations. However, our research reveals a critical vulnerability in CB-cPIR, substantially diminishing its security levels. Moreover, a comparative analysis with state-of-the-art PIR schemes shows that CB-cPIR’s advantages are reduced, making it less competitive in terms of the communication cost. Nevertheless, our findings highlight the importance of continued research into code-based PIR schemes, as they have the potential to provide a valuable alternative to lattice-based approaches.
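The PIR goal itself, fetching a file without revealing which one, is easiest to see in the classic two-server XOR construction. The sketch below shows that construction only. It is information-theoretic and assumes two non-colluding servers holding identical copies of equal-length files, so it is neither CB-cPIR nor a lattice-based scheme, and the function names are hypothetical.

```python
import secrets

def make_queries(num_files, wanted):
    """Build one selection vector per server; each alone is uniformly random."""
    q1 = [secrets.randbelow(2) for _ in range(num_files)]
    q2 = q1[:]
    q2[wanted] ^= 1  # the two vectors differ only at the wanted index
    return q1, q2

def answer(database, query):
    """Server side: XOR together every file selected by the query."""
    out = bytes(len(database[0]))
    for file, bit in zip(database, query):
        if bit:
            out = bytes(a ^ b for a, b in zip(out, file))
    return out

def recover(answer1, answer2):
    """Client side: all files selected by both servers cancel out."""
    return bytes(x ^ y for x, y in zip(answer1, answer2))

# Toy database of equal-length files, replicated on both servers.
database = [b"aaaa", b"bbbb", b"cccc"]
q1, q2 = make_queries(len(database), wanted=1)
result = recover(answer(database, q1), answer(database, q2))
```

Each server sees only a random bit vector, so neither learns the index on its own. Single-server computational schemes such as the one analyzed in the paper must obtain the same privacy from a hardness assumption instead, which is why a weakness in that assumption is critical.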
[IR-1] SelfRACG: Enabling LLMs to Self-Express and Retrieve for Code Generation
Link: https://arxiv.org/abs/2507.19033
Authors: Qian Dong, Jia Chen, Qingyao Ai, Hongning Wang, Haitao Li, Yi Wu, Yao Hu, Yiqun Liu, Shaoping Ma
Categories: Information Retrieval (cs.IR)
Comments: Tsinghua & Xiaohongshu
Abstract:Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets used for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, as external retrieval modules based on content matching fail to infer the specific information need of LLMs to generate the next code fragment. Therefore, we propose SelfRACG, a novel paradigm that enables large language models (LLMs) to Self-express their information needs to enhance RACG. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which encourages LLMs to express their information need. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM’s own information needs, resulting in superior generation performance compared to vanilla RACG.
[IR-2] CityHood: An Explainable Travel Recommender System for Cities and Neighborhoods
Link: https://arxiv.org/abs/2507.18778
Authors: Gustavo H Santos, Myriam Delgado, Thiago H Silva
Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Accepted at ASONAM’25
Abstract:We present CityHood, an interactive and explainable recommendation system that suggests cities and neighborhoods based on users’ areas of interest. The system models user interests leveraging large-scale Google Places reviews enriched with geographic, socio-demographic, political, and cultural indicators. It provides personalized recommendations at city (Core-Based Statistical Areas - CBSAs) and neighborhood (ZIP code) levels, supported by an explainable technique (LIME) and natural-language explanations. Users can explore recommendations based on their stated preferences and inspect the reasoning behind each suggestion through a visual interface. The demo illustrates how spatial similarity, cultural alignment, and interest understanding can be used to make travel recommendations transparent and engaging. This work bridges gaps in location-based recommendation by combining a kind of interest modeling, multi-scale analysis, and explainability in a user-facing system.