本篇博文主要内容为 2025-07-09 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-07-09)
今日共更新519篇论文,其中:
- 自然语言处理共85篇(Computation and Language (cs.CL))
- 人工智能共175篇(Artificial Intelligence (cs.AI))
- 计算机视觉共105篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共138篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Agent KB: Leverag ing Cross-Domain Experience for Agent ic Problem Solving
【速读】: 该论文试图解决语言代理在处理复杂任务时存在的有效错误纠正能力不足以及跨领域经验复用困难的问题。其解决方案的关键在于提出Agent KB,这是一个分层经验框架,通过一种新颖的Reason-Retrieve-Refine(推理-检索-精炼)管道实现复杂的代理问题求解。Agent KB的核心创新在于能够捕获高层次策略和详细执行日志,从而构建一个共享的知识库,实现跨代理的知识迁移。
链接: https://arxiv.org/abs/2507.06229
作者: Xiangru Tang,Tianrui Qin,Tianhao Peng,Ziyang Zhou,Daniel Shao,Tingting Du,Xinming Wei,Peng Xia,Fang Wu,He Zhu,Ge Zhang,Jiaheng Liu,Xingyao Wang,Sirui Hong,Chenglin Wu,Hao Cheng,Chi Wang,Wangchunshu Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other’s experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.
zh
[NLP-1] Efficiency-Effectiveness Reranking FLOPs for LLM -based Rerankers
【速读】: 该论文试图解决基于大语言模型(Large Language Models, LLMs)的重排序器在实际部署中因计算需求高而面临的效率评估问题。现有研究依赖于受硬件和运行时配置影响的代理指标,如延迟、前向传递次数、输入和输出token数量,这些指标难以准确反映模型规模对效率的影响,导致效率与效果权衡的评估不清晰。解决方案的关键是提出E²R-FLOPs,包括每PetaFLOP的排名指标(RPP)和每PetaFLOP的查询吞吐量(QPP),以实现与硬件无关的效率评估,并构建一个可解释的FLOPs估算器,无需实际运行实验即可估计LLM-based reranker的计算量。
链接: https://arxiv.org/abs/2507.06223
作者: Zhiyuan Peng,Ting-ruen Wei,Tingyu Song,Yilun Zhao,Yi Fang
机构: Santa Clara University (圣克拉拉大学); Yale University (耶鲁大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: under review
Abstract:Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (\eg parallel or not, batch size, etc), and often fail to account for model size, making it difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E\textsuperscript2R-FLOPs, for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Companied with the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architecture, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
zh
[NLP-2] CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
【速读】: 该论文试图解决预训练视觉-语言模型(VLMs)在处理具有细微文化差异的视觉概念时表现不足的问题,具体表现为难以区分视觉相似但文化背景不同的概念。解决方案的关键在于设计了一个数据集构建流程,利用开源的VLM和文本到图像扩散模型生成CulTwin数据集,该数据集包含视觉相似但文化语境不同的概念-标题-图像三元组。随后,通过定制的对比学习对CLIP进行微调,构建出CultureCLIP模型,从而实现文化概念与上下文增强标题及合成图像的对齐,提升文化细粒度区分能力的同时保持模型的泛化性能。
链接: https://arxiv.org/abs/2507.06210
作者: Yuchen Huang,Zhiyuan Fan,Zhitao He,Sandeep Polisetty,Wenyan Li,Yi R. Fung
机构: Hong Kong University of Science and Technology (香港科技大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 25 pages, COLM 2025
Abstract:Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
zh
[NLP-3] DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media
【速读】: 该论文旨在解决科学网络话语检测问题,即识别推文中是否包含科学主张、科学研究报告或出版物的引用以及科学实体(如大学或科学家)的提及。解决方案的关键在于探索三种建模方法:基于Transformer的微调、大型语言模型(LLM)的少样本提示以及由前期实验指导设计的组合集成模型。其中,集成模型在竞赛中取得了最佳性能,最终团队获得第7名,宏平均F1得分为0.8611,优于DeBERTaV3基线模型的0.8375。
链接: https://arxiv.org/abs/2507.06205
作者: Ayush Parikh,Hoang Thanh Thanh Truong,Jeanette Schofield,Maximilian Heil
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at this https URL.
zh
[NLP-4] Differential Mamba
【速读】: 该论文试图解决序列模型(如Transformer和RNN)中注意力机制过度关注无关上下文导致的中间表示噪声问题,这一问题会降低大语言模型(LLM)的能力,表现为幻觉增加、长距离依赖和检索能力减弱以及鲁棒性下降。研究提出的关键解决方案是将针对Transformer的微分设计(differential design)应用于Mamba架构,通过引入一种新的微分机制来优化其注意力分配,从而有效缓解Mamba模型中的过量注意力分配问题,并在语言建模基准测试中验证了该方法的有效性。
链接: https://arxiv.org/abs/2507.06204
作者: Nadav Schneider,Itamar Zimerman,Eliya Nachmani
机构: Ben-Gurion University(本-古里安大学); IAEC(以色列原子能委员会); Tel-Aviv University(特拉维夫大学); IBM Research(IBM研究院); School of Electrical and Computer Engineering(电气与计算机工程学院); Ben Gurion University of the Negev(本-古里安大学内盖夫分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
zh
[NLP-5] A Survey on Latent Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在进行链式推理(Chain-of-Thought, CoT)时依赖自然语言表达所导致的表达带宽受限问题。其解决方案的关键在于引入隐式推理(latent reasoning),通过在模型的连续隐藏状态中完成多步骤推理,从而避免对词元级监督的依赖,提升推理效率与表达能力。
链接: https://arxiv.org/abs/2507.06203
作者: Rui-Jie Zhu,Tianhao Peng,Tianhao Cheng,Xingwei Qu,Jinfa Huang,Dawei Zhu,Hao Wang,Kaiwen Xue,Xuanliang Zhang,Yong Shan,Tianle Cai,Taylor Kergan,Assel Kembay,Andrew Smith,Chenghua Lin,Binh Nguyen,Yuqi Pan,Yuhong Chou,Zefan Cai,Zhenhe Wu,Yongchi Zhao,Tianyu Liu,Jian Yang,Wangchunshu Zhou,Chujie Zheng,Chongxuan Li,Yuyin Zhou,Zhoujun Li,Zhaoxiang Zhang,Jiaheng Liu,Ge Zhang,Wenhao Huang,Jason Eshraghian
机构: UCSC(加州大学圣克鲁兹分校); FDU(福州大学); NJU(南京大学); PKU(北京大学); RUC(中国人民大学); UoM(曼彻斯特大学); UW-Madison(威斯康星大学麦迪逊分校); PolyU(香港理工大学); M-A-P(机器智能与计算实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: this https URL.
zh
[NLP-6] UQLM: A Python Package for Uncertainty Quantification in Large Language Models ALT
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)生成虚假或误导性内容的问题,即幻觉(hallucinations),这会影响下游应用的安全性和可信度。解决方案的关键在于引入UQLM,这是一个基于先进不确定性量化(uncertainty quantification, UQ)技术的Python包,通过提供一系列基于UQ的评分工具,计算从0到1的响应级置信度分数,从而实现对LLM幻觉的有效检测。
链接: https://arxiv.org/abs/2507.06196
作者: Dylan Bouchard,Mohit Singh Chauhan,David Skarbrevik,Ho-Kyeong Ra,Viren Bajaj,Zeya Ahmad
机构: CVS Health(CVS 健康)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to Journal of Machine Learning Research (MLOSS); UQLM Repository: this https URL
Abstract:Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
zh
[NLP-7] DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification
【速读】: 该论文旨在解决数值性陈述(numerical claims)在自动事实核查中的挑战,包括涉及数量、比较和时间参考的声明。其解决方案的关键在于通过构建证据检索管道并评估基于QuanTemp数据集的建模策略,重点研究了三个关键因素:使用ModernBERT模型扩展输入上下文窗口以获取更多证据的影响、右到左(R2L)分词的效果,以及两者的综合影响。研究发现,证据质量是影响验证性能的主要瓶颈,而非上下文长度或分词方式,这为后续优化数值性陈述的自动核查系统提供了重要启示。
链接: https://arxiv.org/abs/2507.06195
作者: Maximilian Heil,Aleksandar Pramov
机构: Georgia Institute of Technology(乔治亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidences with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does also not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at this https URL.
zh
[NLP-8] SQLBarber: A System Leverag ing Large Language Models to Generate Customized and Realistic SQL Workloads
【速读】: 该论文试图解决在数据库研究与开发中,获取真实世界SQL查询的困难问题,以及现有SQL生成方法在定制化和满足现实约束方面的局限性。解决方案的关键在于提出SQLBarber系统,该系统基于大型语言模型(Large Language Models, LLMs)生成定制化且符合现实特征的SQL工作负载,其核心包括:提供一种声明式接口以轻松生成定制SQL模板、一个结合自校正模块的LLM驱动流水线以根据查询成本优化模板、以及一种贝叶斯优化器以高效探索谓词值并生成满足目标成本分布的查询集。
链接: https://arxiv.org/abs/2507.06192
作者: Jiale Lao,Immanuel Trummer
机构: Cornell University(康奈尔大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in customization and in satisfying realistic constraints. To address this issue, we present SQLBarber, a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads. SQLBarber (i) eliminates the need for users to manually craft SQL templates in advance, while providing the flexibility to accept natural language specifications to constrain SQL templates, (ii) scales efficiently to generate large volumes of queries matching any user-defined cost distribution (e.g., cardinality and execution plan cost), and (iii) uses execution statistics from Amazon Redshift and Snowflake to derive SQL template specifications and query cost distributions that reflect real-world query characteristics. SQLBarber introduces (i) a declarative interface for users to effortlessly generate customized SQL templates, (ii) an LLM-powered pipeline augmented with a self-correction module that profiles, refines, and prunes SQL templates based on query costs, and (iii) a Bayesian Optimizer to efficiently explore different predicate values and identify a set of queries that satisfy the target cost distribution. We construct and open-source ten benchmarks of varying difficulty levels and target query cost distributions based on real-world statistics from Snowflake and Amazon Redshift. Extensive experiments on these benchmarks show that SQLBarber is the only system that can generate customized SQL templates. It reduces query generation time by one to three orders of magnitude, and significantly improves alignment with the target cost distribution, compared with existing methods.
zh
[NLP-9] DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation
【速读】: 该论文旨在解决英语新闻文本中主观与客观句子分类的问题,通过迁移学习和风格化数据增强来提升分类效果。其解决方案的关键在于对比预训练编码器的微调与细调的Transformer在相关任务上的迁移学习效果,并引入基于GPT-4o的受控增强流程,以生成特定主观性风格的改写句。为确保标签和风格的一致性,使用同一模型对生成样本进行校正和优化。实验结果表明,特定编码器的迁移学习优于通用编码器的微调,且精心策划的数据增强显著提升了模型鲁棒性,特别是在检测主观内容方面。
链接: https://arxiv.org/abs/2507.06189
作者: Maximilian Heil,Dionne Bang
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformer on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us 16^th of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at this https URL.
zh
[NLP-10] Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review
【速读】: 该论文试图解决生成式 AI (Generative AI) 在学术同行评议过程中被滥用的问题,具体表现为通过隐藏指令(如“仅给出正面评价”)操纵AI辅助的评审过程。解决方案的关键在于识别和防范提示注入(prompt injection)技术,该技术可将隐藏指令嵌入文本中以影响AI行为。研究揭示了四种类型的隐藏提示,并强调此类行为属于新型学术不端行为,需通过提交平台的协同技术筛查和统一的生成式 AI 使用政策来应对。
链接: https://arxiv.org/abs/2507.06185
作者: Zhicheng Lin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions known as prompts designed to manipulate AI-assisted peer review. Instructions such as “GIVE A POSITIVE REVIEW ONLY” were concealed using techniques like white-colored text. Author responses varied: one planned to withdraw the affected paper, while another defended the practice as legitimate testing of reviewer compliance. This commentary analyzes this practice as a novel form of research misconduct. We examine the technique of prompt injection in large language models (LLMs), revealing four types of hidden prompts, ranging from simple positive review commands to detailed evaluation frameworks. The defense that prompts served as “honeypots” to detect reviewers improperly using AI fails under examination–the consistently self-serving nature of prompt instructions indicates intent to manipulate. Publishers maintain inconsistent policies: Elsevier prohibits AI use in peer review entirely, while Springer Nature permits limited use with disclosure requirements. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, including plagiarism detection and citation indexing. Our analysis underscores the need for coordinated technical screening at submission portals and harmonized policies governing generative AI (GenAI) use in academic evaluation.
zh
[NLP-11] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
【速读】: 该论文试图解决将自然语言数学陈述转化为形式化、可执行代码过程中,生成的形式化内容是否真正捕捉原始问题语义意图的问题,即关注形式化过程中的批判阶段(critic phase)而非仅关注生成与编译的成功率。解决方案的关键在于提出CriticLean框架,其中CriticLeanGPT通过监督微调和强化学习训练,用于严格评估Lean 4形式化中的语义保真度,同时引入CriticLeanBench基准来衡量模型区分语义正确与错误形式化的能力,从而提升形式化结果的可靠性。
链接: https://arxiv.org/abs/2507.06181
作者: Zhongyuan Peng,Yifan Yao,Kaijing Ma,Shuyue Guo,Yizhe Li,Yichi Zhang,Chenchen Zhang,Yifan Zhang,Zhouliang Yu,Luming Li,Minghao Liu,Yihang Xia,Jiawei Shen,Yuchen Wu,Yixin Cao,Zhaoxiang Zhang,Wenhao Huang,Jiaheng Liu,Ge Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase-the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
zh
[NLP-12] Skywork-R1V3 Technical Report
【速读】: 该论文旨在解决如何将文本仅大型语言模型(LLM)中的推理能力有效地迁移至视觉任务的问题,从而提升视觉语言模型(VLM)的推理性能。解决方案的关键在于提出了一种精心设计的后训练强化学习(RL)框架,该框架无需额外的持续预训练即可激活并增强模型的推理能力,同时揭示了连接模块在实现多模态推理模型中稳健跨模态对齐中的基础作用。
链接: https://arxiv.org/abs/2507.06167
作者: Wei Shen,Jiangbo Pei,Yi Peng,Xuchen Song,Yang Liu,Jian Peng,Haofeng Sun,Yunzhuo Hao,Peiyu Wang,Yahui Zhou
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
zh
[NLP-13] Evaluation of Habitat Robotics using Large Language Models
【速读】: 该论文试图解决大型语言模型在具身机器人任务中的有效性评估问题,具体是在Meta PARTNER基准下测试其在模拟室内厨房场景中执行协作任务的能力。论文提出的解决方案的关键在于利用简化环境和随机化室内厨房场景,通过对比不同前沿模型的表现,发现具有推理能力的模型如OpenAI o3-mini在多种配置下均优于非推理模型,如OpenAI GPT-4o和Llama 3。
链接: https://arxiv.org/abs/2507.06157
作者: William Li,Lei Hamilton,Kaise Al-natour,Sanjeev Mohindra
机构: MIT Lincoln Lab (麻省理工学院林肯实验室)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: 6 pages, IEEE HPEC submission
Abstract:This paper focuses on evaluating the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNER benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is given a task where two robotic agents cooperatively work together to solve the task. We evaluated multiple frontier models on Meta PARTNER environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR’s robotic embodied environments. o3-mini displayed outperform across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.
zh
[NLP-14] Coding Triangle: How Does Large Language Model Understand Code?
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在代码生成任务中虽然表现出一定的自洽性,但其解决方案缺乏人类程序员的多样性和鲁棒性的问题。论文提出的解决方案关键在于引入Code Triangle框架,通过编辑分析、代码实现和测试用例生成三个维度系统评估LLMs,并结合人类生成的题解、解决方案和多样化测试用例,以及模型集成方法,以显著提升LLMs的性能和鲁棒性。
链接: https://arxiv.org/abs/2507.06138
作者: Taolin Zhang,Zihan Ma,Maosong Cao,Junnan Liu,Songyang Zhang,Kai Chen
机构: Shanghai AI Laboratory(上海人工智能实验室); Tsinghua University(清华大学); Xi’an Jiaotong University(西安交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
zh
[NLP-15] NeoBabel: A Multilingual Open Tower for Visual Generation
【速读】: 该论文试图解决文本到图像生成技术中以英语为中心所带来的语言障碍和数字不平等问题,以及现有系统依赖翻译管道所导致的语义漂移、计算开销和文化偏差问题。其解决方案的关键在于提出NeoBabel框架,该框架通过大规模多语言预训练和高分辨率指令微调相结合的方式进行模型训练,从而在性能、效率和包容性方面达到了新的帕累托前沿。
链接: https://arxiv.org/abs/2507.06137
作者: Mohammad Mahdi Derakhshani,Dheeraj Varghese,Marzieh Fadaee,Cees G. M. Snoek
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 12 figures
Abstract:Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.
zh
[NLP-16] Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India
【速读】: 该论文试图解决消费者法律领域中人工智能工具的缺失问题,特别是在印度,AI驱动的司法辅助和案件预测在刑事和民事领域已被广泛研究,但在消费者法领域仍处于未探索状态。论文提出的解决方案是Nyay-Darpan框架,其关键在于结合消费者案件文件的自动摘要生成与相似判决检索功能,以支持消费者纠纷的决策过程。该框架不仅填补了消费者法AI工具的空白,还引入了一种创新的方法来评估摘要质量,从而提升了实际应用的有效性。
链接: https://arxiv.org/abs/2507.06090
作者: Swapnil Bhattacharyya,Shrey Ganatra,Harshvivek Kashid,Spandan Anaokar,Shruti Nair,Reshma Sekhar,Siddharth Manohar,Rahul Hemrajani,Pushpak Bhattacharyya
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); National Law School of India University, Bangalore (印度国家法律大学,班加罗尔)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools but also introduces an innovative approach to evaluate the quality of the summary. The term ‘Nyay-Darpan’ translates into ‘Mirror of Justice’, symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent accuracy in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.
zh
[NLP-17] A Survey on Prompt Tuning
【速读】: 该论文旨在解决如何在保持语言模型参数冻结的前提下,通过参数高效的方式对模型进行适应性调整的问题。其解决方案的关键在于采用提示调优(prompt tuning),即通过在输入前添加可训练的连续向量来实现模型的微调,而非更新整个模型参数。这种方法显著降低了计算成本并提高了模型适配的灵活性。
链接: https://arxiv.org/abs/2507.06085
作者: Zongqian Li,Yixuan Su,Nigel Collier
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.
zh
[NLP-18] Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLM s
【速读】: 该论文试图解决如何表征训练数据在大型语言模型(Large Language Models, LLMs)中的记忆难度这一基础但研究不足的问题。其解决方案的关键在于通过实证实验发现熵与记忆分数之间的线性关系,即熵-记忆定律(Entropy-Memorization Law),并基于此提出一种简单而有效的方法来区分训练数据和测试数据,从而实现数据集推断(Dataset Inference, DI)。
链接: https://arxiv.org/abs/2507.06056
作者: Yizhan Huang,Zhe Yang,Meifang Chen,Jianping Zhang,Michael R. Lyu
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or “gibberish”, we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
zh
[NLP-19] Conditional Multi-Stage Failure Recovery for Embodied Agents
【速读】: 该论文试图解决具身智能体在执行复杂任务时容易发生执行失败的问题,从而需要有效的失败恢复机制。解决方案的关键在于引入一种基于零样本链式提示的条件多阶段失败恢复框架,该框架通过利用大语言模型(LLM)的推理能力,在环境上下文中分析执行挑战并制定战略解决方案,从而实现高效的失败恢复。
链接: https://arxiv.org/abs/2507.06016
作者: Youmna Farag,Svetlana Stoyanchev,Mohan Li,Simon Keizer,Rama Doddipatla
机构: Cambridge Research Laboratory, Toshiba Europe Ltd(剑桥研究中心,东芝欧洲有限公司)
类目: Computation and Language (cs.CL)
备注: Accepted at REALM 2025
Abstract:Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.
zh
[NLP-20] DocIE@XLLM 25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
【速读】: 该论文试图解决在零样本或少样本设置下,文档级实体和关系抽取任务中高质量标注语料库稀缺的问题。其解决方案的关键在于提出一种完全自动化的、基于大语言模型(Large Language Model, LLM)的合成数据生成与上下文学习相结合的流水线,通过推理优化的语言模型生成高质量的示例数据库,并在推理时动态检索相关例子,从而提升模型在文档级实体和关系抽取任务中的表现。
链接: https://arxiv.org/abs/2507.05997
作者: Nicholas Popovič,Ashish Kangen,Tim Schopf,Michael Färber
机构: ScaDS.AI & TU Dresden( ScaDS.AI 和德累斯顿工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
zh
[NLP-21] Evolution without Large Models: Training Language Model with Task Principles
【速读】: 该论文试图解决语言模型训练中因依赖大规模人工标注数据而导致的高成本问题,以及在使用闭源大语言模型(Large Language Model, LLM)进行数据增强时产生的高碳排放和数据泄露风险。其解决方案的关键在于提出一种自进化方法,包括多层级原则生成(Multi-level Principle Generation)和基于原则的实例生成(Principle-based Instance Generation),通过小样本任务数据总结出任务完成原则,再由小规模语言模型依据这些原则生成大量数据用于模型训练,从而有效降低碳排放并避免数据泄露。
链接: https://arxiv.org/abs/2507.05991
作者: Minghang Zhu,Shen Gao,Zhengliang Shi,Jiabao Fang,Pengjie Ren,Zhaochun Ren,Zhumin Chen,Shuo Shang
机构: Shandong University (山东大学); University of Electronic Science and Technology of China (中国电子科技大学); Leiden University (莱顿大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model this http URL method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.
zh
[NLP-22] Development and Evaluation of HopeBot: an LLM -based chatbot for structured and interactive PHQ-9 depression screening
【速读】: 该论文试图解决传统静态工具如患者健康问卷-9(Patient Health Questionnaire-9, PHQ-9)在抑郁筛查中缺乏交互性和适应性的问题。其解决方案的关键在于开发HopeBot,这是一个基于大型语言模型(Large Language Model, LLM)的聊天机器人,能够通过检索增强生成和实时澄清来执行PHQ-9评估,从而提升筛查过程的互动性、结构清晰度和用户信任度。
链接: https://arxiv.org/abs/2507.05984
作者: Zhijun Guo,Alvina Lai,Julia Ive,Alexandru Petcu,Yutong Wang,Luyuan Qi,Johan H Thygesen,Kezhi Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.
zh
[NLP-23] RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)及其安全分类器在低资源语言中表现不佳的问题,主要原因是训练数据和评估基准的不足。解决方案的关键在于构建一个名为RabakBench的多语言安全基准,该基准针对新加坡独特的语言环境,涵盖Singlish、中文、马来语和泰米尔语。RabakBench通过可扩展的三阶段流程构建:生成阶段利用LLM驱动的红队技术增强真实Singlish网络内容以生成对抗样本;标注阶段采用多数投票的LLM标签器进行半自动化多标签安全标注,确保与人类判断一致;翻译阶段则通过高保真翻译保持语言细微差别和毒性特征。该数据集包含超过5,000个跨四种语言的安全标记示例,并覆盖六个细粒度的安全类别及严重程度等级。
链接: https://arxiv.org/abs/2507.05980
作者: Gabriel Chua,Leanne Tan,Ziyu Ge,Roy Ka-Wei Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore’s unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
zh
[NLP-24] We Should Evaluate Real-World Impact
【速读】: 该论文试图解决自然语言处理(Natural Language Processing, NLP)系统在现实世界中的影响评估不足的问题。研究表明,ACL社区对NLP系统的实际应用效果关注较少,仅有约0.1%的论文包含此类评估,且多数论文对影响评估的描述较为简略,更倾向于关注指标评估。论文提出的解决方案的关键在于:研究人员应更加重视并系统地理解和评估NLP技术的实际应用影响,从而提升其实用性和被采纳的速度。
链接: https://arxiv.org/abs/2507.05973
作者: Ehud Reiter
机构: University of Aberdeen (阿伯丁大学)
类目: Computation and Language (cs.CL)
备注: This paper will appear in Computational Linguistics journal as a “Last Word” opinion piece. The Arxiv version is a pre-MIT Press publication version
Abstract:The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore most papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.
zh
[NLP-25] OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation EMNLP2025
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)生成文本的事实性评估问题,旨在提供一种开放源代码的实现方案以替代原有依赖闭源模型的FActScore框架。解决方案的关键在于引入OpenFActScore,它通过原子事实生成(Atomic Fact Generation, AFG)和原子事实验证(Atomic Fact Validation, AFV)两个步骤,分别提取并验证文本中的个体事实陈述,同时支持使用任何Hugging Face兼容模型进行这两项任务,从而实现了评估过程的透明性、可复现性和成本效益。
链接: https://arxiv.org/abs/2507.05965
作者: Lucas Fonseca Lage,Simon Ostermann
机构: Philipps-Universität Marburg (马尔堡菲利普斯大学); German Research Centre for Artificial Intelligence (DFKI) (德国人工智能研究中心); Centre for European Research in Trusted AI (CERTAIN) (欧洲可信人工智能研究中心); Saarland University (萨尔兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to EMNLP 2025 System Demonstrations track
Abstract:We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: this https URL.
zh
[NLP-26] Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems
【速读】: 该论文试图解决对话系统中的Ghosting问题,即通过预测用户输入的文本以实现内联查询自动补全,从而提升用户体验。其关键解决方案在于对多种现有自动补全方法(包括基于前缀树、n-gram和深度学习的方法)进行系统性评估,并引入一种基于熵的动态提前停止策略。研究还发现,在已知前缀的情况下,统计n-gram模型和前缀树在模型性能和推理效率上优于深度学习模型,而在未知查询场景下,神经网络模型如T5和Phi-2表现更优,同时加入对话上下文显著提升了Ghosting的质量。
链接: https://arxiv.org/abs/2507.05940
作者: Sandeep Mishra,Anubhab Mandal,Bishal Santra,Tushar Abhishek,Pawan Goyal,Manish Gupta
机构: Indian Institute of Technology Kharagpur, India; Microsoft Research, India; Microsoft, India
类目: Computation and Language (cs.CL)
备注:
Abstract:Ghosting, the ability to predict a user’s intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available
zh
[NLP-27] Remember Past Anticipate Future: Learning Continual Multimodal Misinformation Detectors ACM-MM2025
【速读】: 该论文试图解决持续性多模态虚假信息检测(continual Multimodal Misinformation Detection, continual MMD)中的两个主要挑战:一是新数据训练导致旧知识遗忘,二是社会环境变化影响模型对未来数据的泛化能力。解决方案的关键在于通过基于狄利克雷过程的专家混合结构隔离事件特定参数间的干扰,以记忆过去知识,并通过学习连续时间动力学模型来预测未来环境分布,从而提出一种新的持续性MMD方法DAEDCMD。
链接: https://arxiv.org/abs/2507.05939
作者: Bing Wang,Ximing Li,Mengzhe Ye,Changchun Li,Bo Fu,Jianfeng Qu,Lin Yuanbo Wu
机构: Jilin University(吉林大学); Liaoning Normal University(辽宁师范大学); Soochow University(苏州大学); Swansea University(斯旺西大学)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted by ACM MM 2025. 10 pages, 6 figures. Code: this https URL
Abstract:Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.
zh
[NLP-28] owards a Principled Evaluation of Knowledge Editors ACL2025
【速读】: 该论文试图解决知识编辑(Knowledge Editing)评估方法的鲁棒性问题以及不同编辑方法对模型整体能力的影响盲点。其关键解决方案在于通过实验表明,选择不同的评估指标、评估方法及编辑批次大小会导致知识编辑器排名的变化,并进一步在通用语言理解任务中验证了这一影响。此外,论文还对基于字符串匹配的评估方法进行了人工评估,揭示了该方法存在产生假阳性结果的倾向。
链接: https://arxiv.org/abs/2507.05937
作者: Sebastian Pohl,Max Ploner,Alan Akbik
机构: Humboldt Universität zu Berlin (洪堡大学); Science Of Intelligence (智能科学)
类目: Computation and Language (cs.CL)
备注: Accepted at L2M2 workshop at ACL 2025
Abstract:Model editing has been gaining increasing attention over the past few years. For Knowledge Editing in particular, more challenging evaluation datasets have recently been released. These datasets use different methodologies to score the success of editors. Yet, it remains under-explored how robust these methodologies are and whether they unfairly favor some editors. Moreover, the disruptive impact of these editors on overall model capabilities remains a constant blind spot. We address both of these problems and show that choosing different metrics and evaluation methodologies as well as different edit batch sizes can lead to a different ranking of knowledge editors. Crucially we demonstrate this effect also on general language understanding tasks evaluated alongside the knowledge editing tasks. Further we include a manual assessment of the string matching based evaluation method for knowledge editing that is favored by recently released datasets, revealing a tendency to produce false positive matches. Comments: Accepted at L2M2 workshop at ACL 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2507.05937 [cs.CL] (or arXiv:2507.05937v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.05937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-29] Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation
【速读】: 该论文试图解决向量检索系统在不同查询间性能差异显著的问题,这一问题主要由嵌入质量的异质性引起。其解决方案的关键在于提出一种轻量级框架,通过结合量化鲁棒性和邻域密度度量来预测查询级别的检索性能。该方法基于高质量嵌入在嵌入空间中占据几何稳定区域并表现出一致邻域结构的观察,从而实现对检索性能的有效预测。
链接: https://arxiv.org/abs/2507.05933
作者: Y. Du
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 7 pages
Abstract:Vector retrieval systems exhibit significant performance variance across queries due to heterogeneous embedding quality. We propose a lightweight framework for predicting retrieval performance at the query level by combining quantization robustness and neighborhood density metrics. Our approach is motivated by the observation that high-quality embeddings occupy geometrically stable regions in the embedding space and exhibit consistent neighborhood structures. We evaluate our method on 4 standard retrieval datasets, showing consistent improvements of 9.4 \pm 1.2% in Recall@10 over competitive baselines. The framework requires minimal computational overhead (less than 5% of retrieval time) and enables adaptive retrieval strategies. Our analysis reveals systematic patterns in embedding quality across different query types, providing insights for targeted training data augmentation.
zh
[NLP-30] Few-shot text-based emotion detection
【速读】: 该论文旨在解决文本情感检测中的跨语言和跨文化差异问题,即SemEval 2025 Workshop任务11“弥合基于文本的情感检测差距”。其解决方案的关键在于利用大型语言模型(如Gemini、Qwen、DeepSeek)进行少样本提示或微调,以提升多标签情感检测的性能。
链接: https://arxiv.org/abs/2507.05918
作者: Teodor-George Marchitan,Claudiu Creanga,Liviu P. Dinu
机构: University of Bucharest (布加勒斯特大学); Faculty of Mathematics and Computer Science (数学与计算机科学学院); Interdisciplinary School of Doctoral Studies (跨学科博士研究学院); HLT Research Center (语言技术研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (\textbf1/31 teams) for the Emakhuwa subset.
zh
[NLP-31] AI-Reporter: A Path to a New Genre of Scientific Communication
【速读】: 该论文试图解决学术报告内容向正式科学文献转化效率低的问题,其解决方案的关键在于利用AI-Reporter系统,通过技术手段将临时性的学术演示内容快速转化为可发表的章节,从而缩短从演讲到出版的时间周期。
链接: https://arxiv.org/abs/2507.05903
作者: Gerd Graßhoff
机构: Humboldt-Universität zu Berlin(洪堡大学)
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:
Abstract:The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters – in less than three minutes. Using Arno Simons’ lecture on Large Language Models from the ``Large Language Models for the History, Philosophy, and Sociology of Science’’ workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.
zh
[NLP-32] MusiScene: Leverag ing MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation
【速读】: 该论文试图解决音乐场景想象(Music Scene Imagination, MSI)问题,即根据音乐生成与之相匹配的视觉场景描述。传统音乐字幕模型仅关注音乐元素,而未能充分利用视频与音乐之间的跨模态信息。解决方案的关键在于构建一个大规模的视频-音频字幕数据集,并在此基础上微调Music Understanding LLaMA模型,从而得到MusiScene模型,该模型能够生成更具上下文相关性的场景描述,相较于MU-LLaMA表现更优。
链接: https://arxiv.org/abs/2507.05894
作者: Fathinah Izzati,Xinyue Li,Yuxuan Wu,Gus Xia
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models which focusing solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and prove that our MusiScene is more capable of generating contextually relevant captions compared to MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.
zh
[NLP-33] Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
【速读】: 该论文试图解决在大规模语言模型(Large Language Models, LLMs)评估中,如何高效生成具有结构效度的问卷题目问题。传统方法依赖于成本高昂的大规模人工数据收集,而本文提出了一种基于LLMs的虚拟被试模拟框架。解决方案的关键在于引入“中介因素”(mediators),即影响同一特质在不同问卷题目上产生差异性回答的因素。通过模拟具有多样化中介因素的被试,能够识别出能够稳健测量目标特质的问卷题目。
链接: https://arxiv.org/abs/2507.05890
作者: Sungjib Lim,Woojung Song,Eun-Ju Lee,Yohan Jo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures
Abstract:As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.
zh
[NLP-34] How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures
【速读】: 该论文试图解决自动语音识别(ASR)系统在不同说话人及说话人群体中存在偏差的问题,特别是由于性别、年龄或口音等因素导致的识别性能差异。其解决方案的关键在于比较和评估不同的性能与偏差衡量指标,以更全面地反映系统的性能和偏差情况,而不仅仅依赖于平均错误率这一传统指标。研究还提出了针对不同说话人群体的偏差缓解策略,并最终给出了改进ASR性能与偏差报告的建议,以更好地体现系统对多样化说话人群体的表现。
链接: https://arxiv.org/abs/2507.05885
作者: Tanvina Patel,Wiebke Hutiri,Aaron Yi Ding,Odette Scharenborg
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:There is increasingly more evidence that automatic speech recognition (ASR) systems are biased against different speakers and speaker groups, e.g., due to gender, age, or accent. Research on bias in ASR has so far primarily focused on detecting and quantifying bias, and developing mitigation approaches. Despite this progress, the open question is how to measure the performance and bias of a system. In this study, we compare different performance and bias measures, from literature and proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. Our experiments use several bias mitigation strategies to address bias against different speaker groups. The findings reveal that averaged error rates, a standard in ASR research, alone is not sufficient and should be supplemented by other measures. The paper ends with recommendations for reporting ASR performance and bias to better represent a system’s performance for diverse speaker groups, and overall system bias.
zh
[NLP-35] Affective-ROPTester: Capability and Bias Analysis of LLM s in Predicting Retinopathy of Prematurity
【速读】: 该论文试图解决大型语言模型(LLM)在早产儿视网膜病变(ROP)风险预测中的能力不足问题,尤其是其在缺乏外部知识支持时的表现及情感偏见问题。解决方案的关键在于提出一种名为Affective-ROPTester的自动化评估框架,该框架结合了基于指令、思维链(CoT)和上下文学习(ICL)的提示策略,并在提示层面引入情感元素,以系统性地评估LLM的预测能力及其情感偏见,从而提升临床语言建模系统的诊断可靠性。
链接: https://arxiv.org/abs/2507.05816
作者: Shuai Zhao,Yulin Zhang,Luwei Xiao,Xinyi Wu,Yanhao Jia,Zhongliang Guo,Xiaobao Wu,Cong-Duy Nguyen,Guoming Zhang,Anh Tuan Luu
机构: Nanyang Technological University(南洋理工大学); Huizhou First Hospital(惠州第一医院); East China Normal University(华东师范大学); Shanghai Jiao Tong University(上海交通大学); University of St Andrews(圣安德鲁斯大学); Southern Medical University(南方医科大学)
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:
Abstract:Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs’ intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model’s ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
zh
[NLP-36] Bridging Perception and Language: A Systematic Benchmark for LVLMs Understanding of Amodal Completion Reports
【速读】: 该论文试图解决大型视觉语言模型(LVLMs)在处理与无模补全(amodal completion)相关的文本时的推理能力不足的问题。其解决方案的关键在于构建一个基于基础形式本体论(Basic Formal Ontology)的基准,以实现对无模补全的系统分类,从而评估LVLMs在该任务上的表现。
链接: https://arxiv.org/abs/2507.05799
作者: Amane Watahiki,Tomoki Doi,Taiga Shinozaki,Satoshi Nishida,Takuya Niikawa,Katsunori Miyahara,Hitomi Yanaka
机构: The University of Tokyo(东京大学); Keio University(庆应大学); NICT(国家信息通信技术研究所); Osaka University(大阪大学); Hokkaido University(北海道大学); CiNET(信息网络中心); Kobe University(神户大学)
类目: Computation and Language (cs.CL)
备注: To appear in the Proceedings of the 47th Annual Meeting of the Cognitive Science Society (COGSCI 2025)
Abstract:One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
zh
[NLP-37] Flippi: End To End GenAI Assistant for E-Commerce
【速读】: 该论文旨在解决用户在面对庞大且复杂的电商平台产品信息时,难以高效发现和选择合适商品的问题。其解决方案的关键在于构建一个基于大型语言模型(LLM)的端到端对话助手Flippi,通过自然语言处理技术如查询重写、意图识别、检索增强生成(RAG)、命名实体识别(NER)和上下文精简,精准理解用户需求并提供个性化推荐,同时具备对比产品特性、价格等信息的能力,从而提升用户的购物体验与决策效率。
链接: https://arxiv.org/abs/2507.05788
作者: Anand A. Rajasekar,Praveen Tangarajan,Anjali Nainani,Amogh Batwal,Vinay Rao Dandin,Anusua Trivedi,Ozan Ersoy
机构: Flipkart US R&D Center (Flipkart 美国研发中心)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 7 tables
Abstract:The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi-a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi’s unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. Additionally, the paper discusses Flippi’s comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system’s robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.
zh
[NLP-38] DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities SIGDIAL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多轮对话任务中存在能力与训练范式不匹配的问题,因为其预训练数据主要由连续文本构成。解决方案的关键在于通过从现有文本语料库中合成对话数据,构建一个包含多轮、多主题信息检索对话的预训练语料库,即DocTalk。该语料库由超过730k条长对话组成,旨在通过在预训练阶段引入此类合成对话结构,提升LLMs在上下文记忆和理解等方面的基础多轮对话能力。
链接: https://arxiv.org/abs/2507.05750
作者: Jing Yang Lee,Hamed Bonab,Nasser Zalmout,Ming Zeng,Sanket Lokegaonkar,Colin Lockard,Binxuan Huang,Ritesh Sarkhel,Haodong Wang
机构: Amazon(亚马逊); Nanyang Technological University(南洋理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted at SIGDIAL 2025
Abstract:Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at this https URL.
zh
[NLP-39] GPT KB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge
【速读】: 该论文试图解决语言模型(Language Models)的事实性知识仍然难以被即时浏览和进行可扩展统计分析的问题。其解决方案的关键在于提出GPTKB v1.5,这是一个由14,000个GPT-4实例通过大规模递归大语言模型知识实体化(massive-recursive LLM knowledge materialization)方法构建的密集互联的1亿三元组知识库(KB)。这一方法为系统性分析LLM知识以及自动化知识库构建提供了突破性的机会。
链接: https://arxiv.org/abs/2507.05740
作者: Yujia Hu,Tuan-Phong Nguyen,Shrestha Ghosh,Moritz Müller,Simon Razniewski
机构: ScaDS.AI Dresden/Leipzig & TU Dresden, Germany; Institute for AI, VNU University of Engineering and Technology, Hanoi, Vietnam; University of Tübingen, Germany
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 figures, 1 table
Abstract:Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for 14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at this https URL.
zh
[NLP-40] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
【速读】: 该论文试图解决多专家(Mixture-of-experts, MoE)架构在自动语音识别(ASR)任务中,由于各层路由器独立决策导致专家间协作不足和专业化程度有限的问题。解决方案的关键在于引入一个跨不同MoE层共享的路由器,称为\emphOmni-router Transformer,以增强不同层专家之间的协作并促进更深层次的专业化。实验结果表明,该模型在降低训练损失和提升ASR性能方面优于密集模型和Switch Transformer模型。
链接: https://arxiv.org/abs/2507.05724
作者: Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly
机构: Apple Inc. (苹果公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model \emphOmni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
zh
[NLP-41] MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
【速读】: 该论文旨在解决现有基于视觉的GUI代理在离线环境中训练所导致的可扩展性差、过拟合特定UI模板以及在面对未见过的环境时策略脆弱的问题。其解决方案的关键在于提出MobileGUI-RL框架,该框架通过在线环境训练GUI代理,并包含两个核心组件:一是通过自我探索和过滤生成可学习任务的课程,二是通过对GRPO算法进行适应性改进,引入基于轨迹的优势和复合奖励机制,以平衡任务成功率与执行效率。
链接: https://arxiv.org/abs/2507.05720
作者: Yucheng Shi,Wenhao Yu,Zaitang Li,Yonglin Wang,Hongming Zhang,Ninghao Liu,Haitao Mi,Dong Yu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages, 4 figures
Abstract:Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods trained GUI agent in the offline environment using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environment. We present MobileGUI-RL, a scalable framework that trains GUI agent in online environment. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.
zh
[NLP-42] HIRAG : Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
【速读】: 该论文试图解决传统检索增强生成(RAG)系统在处理实时信息和领域特定问题时存在的文档质量不一致以及检索系统不完善等问题。其解决方案的关键在于提出RAG模型应具备的三种逐级递进的能力:过滤能力(选择相关信息)、组合能力(跨段落语义信息整合)以及RAG特定推理能力(利用内部知识进一步处理外部知识)。为此,作者引入了Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG)方法,该方法通过“先思考后回答”的策略,利用多层级渐进式思维链提升模型的开卷考试能力,从而显著提升模型在多个数据集上的性能。
链接: https://arxiv.org/abs/2507.05714
作者: YiHan Jiao,ZheHao Tan,Dan Yang,DuoLin Sun,Jie Feng,Jian Wang,Peng Wei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often \textitlack a granular focus on RAG task or \textita deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
zh
[NLP-43] DRAG ON: Dynamic RAG Benchmark On News
【速读】: 该论文试图解决现有检索增强生成(Retrieval-Augmented Generation, RAG)评估资源在非英语语言中缺乏动态性与更新能力的问题,尤其针对俄语环境下的RAG系统评估需求。解决方案的关键在于构建DRAGON(Dynamic RAG Benchmark On News),这是一个基于定期更新的俄语新闻和公共文档语料库的动态基准,支持对RAG系统的检索器和生成器组件进行全面评估,并通过知识图谱自动生成涵盖四种核心问题类型的测试集,以更好地反映实际应用场景的动态特性。
链接: https://arxiv.org/abs/2507.05713
作者: Fedor Chernogorskii,Sergei Averkiev,Liliya Kudraleeva,Zaven Martirosian,Maria Tikhonova,Valentin Malykh,Alena Fenogenova
机构: SberAI(СберАИ); ITMO(ИТМО); MISIS(МИСИС); HSE(Высшая школа экономики); MWS AI(МВС ИИ)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpora. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically with the use of Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05713 [cs.CL] (or arXiv:2507.05713v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.05713 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-44] Agent ic-R1: Distilled Dual-Strategy Reasoning
【速读】: 该论文试图解决当前长链式思维(long-CoT)模型在数学推理中依赖缓慢且易出错的自然语言轨迹,以及工具增强代理在复杂逻辑任务上表现不佳的问题。其解决方案的关键在于提出一种微调框架DualDistill,该框架将多种教师模型中的互补推理策略提炼到一个统一的学生模型中,从而使得Agentic-R1能够动态选择最优策略处理不同查询,既通过工具执行解决算术和算法问题,又通过文本推理处理抽象问题。
链接: https://arxiv.org/abs/2507.05707
作者: Weihua Du,Pranjal Aggarwal,Sean Welleck,Yiming Yang
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. 15 pages. Project available at this https URL
Abstract:Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at this https URL
zh
[NLP-45] AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLM s
【速读】: 该论文旨在解决深度学习中内核(kernel)开发的性能优化问题,即在不同硬件平台上平衡内存管理、并行性和硬件特定优化,从而实现高效计算。其解决方案的关键在于引入AutoTriton,这是首个基于强化学习(Reinforcement Learning, RL)的Triton编程模型。AutoTriton通过监督微调(SFT)获取必要的Triton编程知识,并结合规则奖励和执行奖励的Group Relative Policy Optimization (GRPO)算法进行强化学习,以提升编程能力。实验表明,AutoTriton在多个评估基准上表现出与主流大模型相当的性能,展示了强化学习在自动生成高性能内核方面的潜力。
链接: https://arxiv.org/abs/2507.05687
作者: Shangzhan Li,Zefan Wang,Ye He,Yuxuan Li,Qi Shi,Jianling Li,Yonggang Hu,Wanxiang Che,Xu Han,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Harbin Institute of Technology (哈尔滨工业大学); Tianjin University (天津大学); OpenBMB (开放模型基金会)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline, and conducts RL with Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve Triton programming ability, sequentially. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at this https URL.
zh
[NLP-46] Smoothie-Qwen : Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLM s
【速读】: 该论文试图解决多语言大语言模型(Multilingual Large Language Models, LLMs)中存在的语言混淆问题,即模型在生成响应时倾向于使用主导语言,而非用户提示所使用的语言。解决方案的关键在于提出了一种轻量级的后处理方法——Smoothie-Qwen,该方法通过选择性调整令牌级别的输出概率,有效抑制不需要的语言生成,而无需重新训练模型。
链接: https://arxiv.org/abs/2507.05686
作者: SeungWon Ji,Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt’s language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.
zh
[NLP-47] uneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data
【速读】: 该论文试图解决在使用特定对话数据集对大语言模型(Large Language Model, LLM)进行微调时,如何有效缓解毒性内容的问题,尤其是在面对不可信训练数据的情况下。解决方案的关键在于提出TuneShield框架,该框架利用基于LLM的毒性分类能力,结合指令遵循和安全对齐特性,高效识别有毒样本,并生成合成对话样本(称为“healing data”)以减轻毒性同时强化有益行为。此外,TuneShield通过对齐过程进一步引导聊天机器人生成期望响应,从而在保持对话质量的同时有效抵御毒性注入攻击及适应性对抗攻击。
链接: https://arxiv.org/abs/2507.05660
作者: Aravind Cheruvu,Shravya Kanchi,Sifat Muhammad Abdullah,Nicholas Kong,Daphne Yao,Murtuza Jadliwala,Bimal Viswanath
机构: Virginia Tech(弗吉尼亚理工大学); The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Pre-print
Abstract:Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge. To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality. TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services. TuneShield generates synthetic conversation samples, termed ‘healing data’, based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning. It performs an alignment process to further nudge the chatbot towards producing desired responses. Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased. TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks. Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).
zh
[NLP-48] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
【速读】: 该论文试图解决在电商客户支持领域中评估具备多模态能力的大语言模型代理(LLM agent)缺乏系统性基准框架的问题。解决方案的关键在于构建ECom-Bench,这是一个基于真实电商客户交互中收集的用户角色信息进行动态用户模拟,并从真实的电商对话中提取任务数据集的基准框架,从而全面反映现实世界的复杂性。
链接: https://arxiv.org/abs/2507.05639
作者: Haoxin Wang,Xianhan Peng,Xucheng Huang,Yizhe Huang,Ming Gong,Chenghan Yang,Yang Liu,Ling Jiang
机构: Xiaoduo AI Lab(小多人工智能实验室); Shanghai Jiao Tong University(上海交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
zh
[NLP-49] SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression
【速读】: 该论文试图解决检索增强生成(Retrieval-augmented Generation, RAG)中面临的两个关键问题:受限的有效上下文长度和检索文档中的冗余信息。解决方案的关键在于提出一种统一的RAG框架SARA,该框架通过结合自然语言文本片段与语义压缩向量,在有限的上下文预算下平衡局部精度与全局知识覆盖。SARA在两个互补层次上表示上下文:细粒度的自然语言片段以保留关键实体和数值,以及紧凑且可解释的语义压缩向量以概括高层语义,并通过迭代证据选择模块利用压缩向量对上下文进行动态重排序,从而提升答案的相关性、正确性和语义相似性。
链接: https://arxiv.org/abs/2507.05633
作者: Yiqiao Jin,Kartik Sharma,Vineeth Rakesh,Yingtong Dou,Menghai Pan,Mahashweta Das,Srijan Kumar
机构: Georgia Institute of Technology (佐治亚理工学院); Visa Research (维萨研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 20 pages
Abstract:Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.
zh
[NLP-50] Flipping Knowledge Distillation: Leverag ing Small Models Expertise to Enhance LLM s in Text Matching ACL2025
【速读】: 该论文试图解决在文本匹配任务中,如何有效结合小型语言模型(SLM)的领域特定表示能力和大型语言模型(LLM)丰富的语义理解能力的问题。其解决方案的关键在于提出一种反转的知识蒸馏范式,即让LLM从SLM中学习。通过将LLM重新解释为编码器-解码器结构,并利用LoRA技术弥补架构差异,编码器生成压缩表示,解码器将其映射到输出空间。在训练过程中,采用提出的Margin-aware Contrastive Learning (MCL) 方法对齐编码器生成的表示相似性与教师模型的相似度得分,从而确保正负样本的准确相似性建模并自适应处理样本内部差异。
链接: https://arxiv.org/abs/2507.05617
作者: Mingzhe Li,Jing Xiang,Qishen Zhang,Kaiyang Wan,Xiuying Chen
机构: ByteDance(字节跳动); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 main
Abstract:Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.
zh
[NLP-51] Self-Review Framework for Enhancing Instruction Following Capability of LLM
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在遵循复杂指令和格式约束方面的不足,以及现有迭代修订方法导致的高昂成本和输出质量下降问题。其解决方案的关键在于提出Re5框架,该框架通过提取用户指令中的任务与约束组件,进行结构化评估以防止错误累积,并实施细粒度的约束特定内容评估,随后进行选择性修订,从而实现精确且质量保持的改进。
链接: https://arxiv.org/abs/2507.05598
作者: Sihyun Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Various techniques have been proposed to improve large language models (LLMs) adherence to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data while maintaining response quality with a 64.24%-win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.
zh
[NLP-52] he Landscape of Memorization in LLM s: Mechanisms Measurement and Mitigation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中对训练数据的回忆(memorization)问题,这一现象可能引发模型行为不确定性、隐私风险以及学习与记忆边界模糊等关键问题。解决方案的关键在于深入分析影响记忆的因素,包括训练数据重复性、训练动态和微调过程,并探索有效的检测方法,如基于前缀的提取、成员推理和对抗性提示,同时提出缓解策略,如数据清洗、差分隐私和训练后遗忘技术,以平衡有害记忆的最小化与模型性能之间的关系。
链接: https://arxiv.org/abs/2507.05578
作者: Alexander Xiong,Xuandong Zhao,Aneesh Pappu,Dawn Song
机构: University of California, Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the minimization of harmful memorization with utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.
zh
[NLP-53] Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLM s for Biomedical QA
【速读】: 该论文旨在解决生物医学语义问答问题,以帮助研究人员、医疗专业人员和普通用户从快速发展的生物医学文献中高效获取基于证据的相关知识。其解决方案的关键在于构建一个检索增强生成(Retrieval-Augmented Generation, RAG)系统,该系统通过检索相关的PubMed文档和片段来生成答案。在检索任务中,采用从生物医学文章生成的密集嵌入进行初步检索,并利用微调的交叉编码器和大语言模型(LLM)进行重排序,从而提高相关文档的检索效果;在答案生成阶段,则采用指令微调的大语言模型进行少样本提示,以提升问答的准确性与质量。
链接: https://arxiv.org/abs/2507.05577
作者: Shashank Verma,Fengyi Jiang,Xiangning Xue
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Paper submitted to CLEF 2025 CEUR-WS
Abstract:Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with vast, rapidly evolving and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals and even layman users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a competitive platform for advancement of this space. This paper presents the methodologies and results from our participation in this challenge where we built a Retrieval-Augmented Generation (RAG) system that can answer biomedical questions by retrieving relevant PubMed documents and snippets to generate answers. For the retrieval task, we generated dense embeddings from biomedical articles for initial retrieval, and applied an ensemble of finetuned cross-encoders and large language models (LLMs) for re-ranking to identify top relevant documents. Our solution achieved an MAP@10 of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer generation, we employed few-shot prompting of instruction-tuned LLMs. Our system achieved macro-F1 score of 0.95 for yes/no questions (rank 12), Mean Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), mean-F1 score of 0.63 for list questions (rank 5), and ROUGE-SU4 F1 score of 0.29 for ideal answers (rank 11).
zh
[NLP-54] Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS
【速读】: 该论文旨在解决大型语言模型(LLM)在测试阶段通过增加计算资源提升性能时,依赖于从更先进模型中蒸馏链式思维(CoT)训练数据的问题。其解决方案的关键在于提出R2-LLMs框架,该框架通过引入双层级检索增强的上下文学习机制:在粗粒度层级,从复杂推理问题中提取抽象模板并检索相似的问题-答案对以促进高层次的上下文学习;在细粒度层级,结合蒙特卡洛树搜索(MCTS)高效检索参考数学问题数据集中的类似中间解题步骤,并利用过程奖励模型(PRM)进行评分以优化步骤级推理。
链接: https://arxiv.org/abs/2507.05557
作者: Alex ZH Dou,Zhongwei Wan,Dongfei Cui,Xin Wang,Jing Xiong,Haokun Lin,Chaofan Tao,Shen Yan,Mi Zhang
机构: The Ohio State University (俄亥俄州立大学); Case Western Reserve University (凯斯西储大学); Duke University (杜克大学); University of Hong Kong (香港大学); City University of Hong Kong (香港城市大学); ByteDance (字节跳动)
类目: Computation and Language (cs.CL)
备注: Technical Report
Abstract:Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.
zh
[NLP-55] Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
【速读】: 该论文试图解决现有教育人工智能系统在可扩展性方面的不足以及缺乏对教学质量进行评估的框架问题。其解决方案的关键在于提出WikiHowAgent,这是一个基于大型语言模型(Large Language Models, LLMs)的多智能体工作流,通过整合教师代理、学习者代理、交互管理器和评估器,模拟互动式的教学-学习对话,从而促进程序性学习并评估教学质量。
链接: https://arxiv.org/abs/2507.05528
作者: Jiahuan Pei,Fanghua Ye,Xin Sun,Wentao Deng,Koen Hindriks,Junxiao Wang
机构: Vrije University of Amsterdam(自由大学阿姆斯特丹); University College London(伦敦大学学院); University of Amsterdam(阿姆斯特丹大学); Shandong University(山东大学); Guangzhou University(广州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages
Abstract:Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow’s effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.
zh
[NLP-56] Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
【速读】: 该论文旨在解决临床自然语言处理(NLP)中的两个高影响任务:从护士口述中生成结构化表格报告以及从医患对话中提取医疗医嘱。由于数据稀缺性和敏感性,这些任务尚未得到充分研究。论文的关键解决方案是利用私有和开源临床数据集,评估开放和封闭权重大型语言模型(LLMs)的性能,并提出一种代理式流程来生成真实但非敏感的护士口述,从而支持结构化临床观察的提取。此外,研究团队发布了SYNUR和SIMORD两个开源数据集,以促进相关领域的进一步研究。
链接: https://arxiv.org/abs/2507.05517
作者: Jean-Philippe Corbeil,Asma Ben Abacha,George Michalopoulos,Phillip Swazinna,Miguel Del-Agua,Jerome Tremblay,Akila Jeeson Daniel,Cari Bader,Kevin Cho,Pooja Krishnan,Nathan Bodenstab,Thomas Lin,Wenxuan Teng,Francois Beaulieu,Paul Vozila
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
zh
[NLP-57] Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在增强现实 (AR) 培训中应用不足的问题,特别是在细粒度视觉-语言对齐方面的挑战。其解决方案的关键在于构建一个针对 AR 训练的综合性数据集,并系统化设计视觉-语言任务,以此评估当前最先进的视觉语言模型 (VLMs) 的性能,从而揭示现有模型在细粒度任务上的局限性,并推动更高质量的数据集、基准测试和研究的发展。
链接: https://arxiv.org/abs/2507.05515
作者: Haochen Huang,Jiahuan Pei,Mohammad Aliannejadi,Xin Sun,Moonisa Ahsan,Pablo Cesar,Chuang Yu,Zhaochun Ren,Junxiao Wang
机构: Vrije University of Amsterdam (阿姆斯特丹自由大学); University of Amsterdam (阿姆斯特丹大学); Centrum Wiskunde & Informatica (数学与计算机中心); TU Delft (代尔夫特理工大学); Leiden University (莱顿大学); University College London (伦敦大学学院); Guangzhou University (广州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages
Abstract:Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.
zh
[NLP-58] ModelCitizens:Representing Community Voices in Online Safety
【速读】: 该论文试图解决现有毒性语言检测模型在处理具有文化或语境敏感性的语言时表现不佳的问题,尤其是那些因社区规范和个体经历而具有不同毒性感知的语言。解决方案的关键在于构建MODELCITIZENS数据集,该数据集包含6.8K社交媒体帖子和40K针对不同身份群体的毒性标注,并通过大语言模型(LLM)生成的对话场景增强上下文信息,以更准确地捕捉毒性语言的语境依赖性。此外,研究还提出了基于LLaMA和Gemma的微调模型LLAMACITIZEN-8B和GEMMACITIZEN-12B,以提升对复杂毒性语言的检测能力。
链接: https://arxiv.org/abs/2507.05455
作者: Ashima Suvarna,Christina Chance,Hamid Palangi,Sophie Hao,Thomas Hartvigsen,Saadia Gabriel
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Virginia (弗吉尼亚大学); Google(谷歌); New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation.
zh
[NLP-59] On the Semantics of Large Language Models
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在多大程度上真正理解语言。论文通过聚焦于LLMs在词和句子层面的语义,探讨其潜在的语义能力。解决方案的关键在于分析LLMs的内部工作机制及其生成的语言表征,并结合弗雷格(Frege)和罗素(Russell)的经典语义理论,以获得对LLMs语义能力更细致的理解。
链接: https://arxiv.org/abs/2507.05448
作者: Martin Schuele
机构: Zurich University of Applied Sciences (苏黎世应用科学大学); Université Paris 1 - Panthéon-Sorbonne (巴黎第一大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.
zh
[NLP-60] PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs
【速读】: 该论文旨在解决第二语言(L2)学习者在学习语系差异较大的语言(如英语和韩语)时,词汇习得面临的挑战。其解决方案的关键在于开发PhoniTale系统,该系统通过基于语音相似性的第一语言(L1)关键词序列检索,并利用大语言模型(LLMs)生成记忆术,从而辅助L2词汇的学习。
链接: https://arxiv.org/abs/2507.05444
作者: Sana Kang,Myeongseok Gwon,Su Young Kwon,Jaewook Lee,Andrew Lan,Bhiksha Raj,Rita Singh
机构: KAIST(韩国科学技术院); University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Carnegie Mellon University(卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner’s first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.
zh
[NLP-61] Gendered Divides in Online Discussions about Reproductive Rights
【速读】: 该论文试图解决性别与地方社会政治背景如何共同影响公众对堕胎议题的讨论和态度的问题。其解决方案的关键在于利用近1000万条通过用户推断出的性别、意识形态和地理位置的堕胎相关推文数据,分析性别在保守地区对堕胎态度和情感表达的显著调节作用,以及这种作用独立于意识形态的影响,从而揭示出在保守地区堕胎态度中性别差距的加剧。
链接: https://arxiv.org/abs/2507.05443
作者: Ashwin Rao,Sze Yuh Nina Wang,Kristina Lerman
机构: University of Southern California(南加州大学); Information Sciences Institute(信息科学研究所); Cornell University(康奈尔大学); Indiana University(印第安纳大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:The U.S. Supreme Court’s 2022 ruling in Dobbs v. Jackson Women’s Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.
zh
[NLP-62] “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models
【速读】: 该论文试图解决大型语言模型在处理上下文信息时存在的位置偏差问题,即“lost-in-the-later”现象,该现象表现为模型倾向于忽略或低优先级处理上下文中较后出现的信息,从而影响上下文的准确理解与整合。解决方案的关键在于提出一种新的评估框架CoPE,用于系统地测量不同模型和语言中的上下文知识(CK)和参数知识(PK),并通过分析发现基于提示的方法可以有效提升模型对输入上下文的利用能力,进而改善事实性基础并减少幻觉。
链接: https://arxiv.org/abs/2507.05424
作者: Yufei Tao,Adam Hiatt,Rahul Seetharaman,Ameeta Agrawal
机构: Portland State University (波特兰州立大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
zh
[NLP-63] Learn Globally Speak Locally: Bridging the Gaps in Multilingual Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在低资源语言如斯瓦希里语或泰语中的多语言推理能力不足的问题,以及模型对高资源语言的隐性偏倚导致的事实准确性、可解释性和信任度下降的问题。解决方案的关键在于提出BRIDGE训练方法,该方法通过语言一致性奖励引导监督微调和测试时的强化学习,以使模型的推理过程与输入语言保持一致。
链接: https://arxiv.org/abs/2507.05418
作者: Jaedong Hwang,Kumar Tanmay,Seok-Jin Lee,Ayush Agrawal,Hamid Palangi,Kumar Ayush,Ila Fiete,Paul Pu Liang
机构: Massachusetts Institute of Technology (麻省理工学院); Havard Univeristy (哈佛大学); LG CNS (LG CNS); Université de Montréal (蒙特利尔大学); Mila (Mila); Google(谷歌); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. this https URL
zh
[NLP-64] Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences
【速读】: 该论文试图解决用户在使用大型语言模型(Large Language Models, LLMs)时,因通过商业API访问而可能暴露数据的问题。解决方案的关键在于引入隐私配置文件(privacy profiles),即用户提供的简单自然语言指令,用于指定哪些信息应被披露、哪些应被隐藏。通过构建一个框架,本地模型根据这些指令重写查询,在将查询发送至外部模型之前仅隐藏用户认定的敏感信息,从而在保护隐私与保持性能之间取得平衡。
链接: https://arxiv.org/abs/2507.05391
作者: Guillem Ramírez,Alexandra Birch,Ivan Titov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.
zh
[NLP-65] he Generalization Ridge: Information Flow in Natural Language Generation
【速读】: 该论文试图解决Transformer模型在自然语言生成任务中,其内部机制如何合成与任务相关的信息这一问题,特别是中间层与最终层在表征泛化能力上的差异及其演变过程。解决方案的关键在于提出InfoRidge框架,该框架基于信息理论,用于表征隐藏表征与目标输出之间的互信息随模型深度的变化情况,从而追踪任务相关信息在整个模型中的流动。实验结果揭示了预测信息在上-middle层达到峰值,形成泛化脊(generalization ridge),随后在最终层下降,反映了从泛化到记忆的转变。此外,通过引入可训练的残差缩放系数作为功能探针,进一步验证了中间层在分布偏移下对泛化的关键作用。
链接: https://arxiv.org/abs/2507.05387
作者: Ruidi Chang,Chunyuan Deng,Hanjie Chen
机构: Rice University (莱斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information-the mutual information between hidden representations and target outputs-varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers-forming a generalization ridge-before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients-trainable scalar parameters applied to each residual block-which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
zh
[NLP-66] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
【速读】: 该论文旨在解决持续后训练(Continual Post-Training, CPT)过程中模型在学习新下游任务时出现的灾难性遗忘问题,以及如何有效保持模型的通用知识。其解决方案的关键在于对比分析监督微调(Supervised Fine-Tuning, SFT)与强化学习微调(Reinforcement Fine-Tuning, RFT)两种核心后训练范式,并揭示RFT在知识保留和性能稳定性方面的优势。研究发现,RFT能够内在地保护已有知识并提升模型在标准基准上的泛化能力,而SFT则会导致模型性能显著下降。此外,研究指出RFT中的隐式正则化是缓解遗忘的关键因素,而非显式的机制如KL惩罚或思维链推理。
链接: https://arxiv.org/abs/2507.05386
作者: Song Lai,Haohan Zhao,Rong Feng,Changyi Ma,Wenzhuo Liu,Hongbo Zhao,Xi Lin,Dong Yi,Min Xie,Qingfu Zhang,Hongbin Liu,Gaofeng Meng,Fei Zhu
机构: Centre for Artificial Intelligence and Robotics, HKISI, CAS(Centre for Artificial Intelligence and Robotics, HKISI, CAS); City University of Hong Kong(University of Hong Kong); Institute of Automation, CAS(Institute of Automation, CAS); University of Chinese Academy of Sciences(University of Chinese Academy of Sciences)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
zh
[NLP-67] EduCoder: An Open-Source Annotation System for Education Transcript Data
【速读】: 该论文试图解决教育对话转录文本在进行话语层级标注时所面临的复杂性问题,包括定义用于复杂教学特征的编码手册、支持开放式与分类编码以及将话语与外部特征(如课程目的和教学价值)进行上下文关联。解决方案的关键在于EduCoder平台,它为研究者和领域专家提供了一个协作定义复杂编码手册的环境,并集成了分类和开放式标注类型以及上下文材料,同时支持多标注者响应的并排比较,以提高数据的一致性和可靠性。
链接: https://arxiv.org/abs/2507.05385
作者: Guanzhong Pan,Mei Tan,Hyunji Nam,Lucía Langlois,James Malamut,Liliana Deonizio,Dorottya Demszky
机构: Carnegie Mellon University (卡内基梅隆大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.
zh
[NLP-68] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning : A Shortest-Path Case Study
【速读】: 该论文试图解决如何在大型语言模型(Large Language Models, LLMs)中提升推理能力的问题,特别是探讨在推理过程中计算资源的分配与推理结构对模型泛化能力的影响。其解决方案的关键在于通过控制实验环境,利用分层图中的最短路径任务,对比分析基于最优自底向上动态规划轨迹与包含回溯的较长有效轨迹训练模型的效果。研究发现,在相同训练令牌预算下,基于低效轨迹的模型在未见过的图上表现出更好的泛化能力,这一优势并非源于轨迹长度本身,而是与模型在下一步token预测中的置信度相关,表明长而连贯且局部递增的推理轨迹有助于优化训练信号。
链接: https://arxiv.org/abs/2507.05362
作者: Riccardo Alberghi,Elizaveta Demyanenko,Luca Biggio,Luca Saglietti
机构: Unibocconi(博科尼大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone-injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
zh
[NLP-69] LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks
【速读】: 该论文试图解决在特定任务和领域中,如何高效选择和组合微调的语言模型专家的问题。其解决方案的关键在于提出LoRA-Augmented Generation (LAG),该方法利用大规模知识库和任务相关的LoRA适配器,在无需额外训练或数据访问的情况下,实现基于每个标记和层的专家过滤、检索与应用。
链接: https://arxiv.org/abs/2507.05346
作者: William Fleshman,Benjamin Van Durme
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG’s compatibility with alternative solutions such as retrieval-augmented generation (RAG).
zh
[NLP-70] MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents
【速读】: 该论文试图解决电子商务客户服务中复杂多模态场景下大型语言模型(LLMs)能力受限的问题。其解决方案的关键在于提出MindFlow,这是一个基于CoALA框架的开源多模态LLM代理,它集成了记忆、决策和行动模块,并采用“MLLM-as-Tool”策略以实现有效的视觉-文本推理。
链接: https://arxiv.org/abs/2507.05330
作者: Ming Gong,Xucheng Huang,Chenghan Yang,Xianhan Peng,Haoxin Wang,Yang Liu,Ling Jiang
机构: Xiaoduo AI(小多智能); University of Dayton(戴顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular “MLLM-as-Tool” strategy for effect visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.
zh
[NLP-71] LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review ACL
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动生成出院摘要时存在的幻觉问题,如生成不准确内容或无依据的信息,以及电子医疗记录(Electronic Medical Records, EMRs)中的长文本数据使LLMs难以进行内容溯源的问题。其解决方案的关键在于提出LCDS系统,该系统通过计算EMRs与出院摘要之间的文本相似性构建源映射表,以限制摘要内容的范围,并结合全面的逻辑规则生成更可靠的银质出院摘要,同时支持生成内容的来源追溯,便于专家审核与修正,最终生成的黄金标准出院摘要用于LLMs的增量微调。
链接: https://arxiv.org/abs/2507.05319
作者: Cheng Yuan,Xinkai Rui,Yongqi Fan,Yawei Fan,Boyang Zhong,Jiacheng Wang,Weiyan Zhang,Tong Ruan
机构: East China University of Science and Technology (华东理工大学); Ruijin Hospital, Shanghai Jiaotong University School of Medicine (瑞金医院,上海交通大学医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL Demo 2025
Abstract:Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository this https URL.
zh
[NLP-72] Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLM s as a Viable Alternative to Proprietary Models for Pedagogical Tools
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在教育场景中应用时存在的计算规模大、成本高以及过度辅助等问题,旨在为初学者提供更有效且可行的编程错误解释工具。论文提出的解决方案关键在于利用监督微调(Supervised Fine-Tuning, SFT)对小型、专用的语言模型进行优化,使其在教育任务中表现出与大型模型相当的性能,同时具备更高的效率和可扩展性。
链接: https://arxiv.org/abs/2507.05305
作者: Lorenzo Lee Solano,Charles Koutcheme,Juho Leinonen,Alexandra Vassar,Jake Renzella
机构: University of New South Wales(新南威尔士大学); Aalto University(阿尔托大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 7 pages, 3 tables, 1 figure
Abstract:Frontier Large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors, which we used to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
zh
[NLP-73] News Source Citing Patterns in AI Search Systems
【速读】: 该论文试图解决AI搜索系统在引用新闻来源方面的模式和偏差问题,特别是其作为信息中介对用户获取新闻和信息方式的影响。解决方案的关键在于通过分析来自AI Search Arena平台的大量对话和响应数据,揭示不同提供商的模型在引用行为上的共性和差异,以及新闻引用集中度和政治倾向性等特征。
链接: https://arxiv.org/abs/2507.05301
作者: Kai-Cheng Yang
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 15 pages, 7 figures
Abstract:AI-powered search systems are emerging as new information gatekeepers, fundamentally transforming how users access news and information. Despite their growing influence, the citation patterns of these systems remain poorly understood. We address this gap by analyzing data from the AI Search Arena, a head-to-head evaluation platform for AI search systems. The dataset comprises over 24,000 conversations and 65,000 responses from models across three major providers: OpenAI, Perplexity, and Google. Among the over 366,000 citations embedded in these responses, 9% reference news sources. We find that while models from different providers cite distinct news sources, they exhibit shared patterns in citation behavior. News citations concentrate heavily among a small number of outlets and display a pronounced liberal bias, though low-credibility sources are rarely cited. User preference analysis reveals that neither the political leaning nor the quality of cited news sources significantly influences user satisfaction. These findings reveal significant challenges in current AI search systems and have important implications for their design and governance.
zh
[NLP-74] Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)
【速读】: 该论文试图解决生成式 AI (Generative AI) 在文本到图像生成任务中对提示(prompt)的遵循问题,这一问题主要源于大规模数据集(如LAION-5B)的噪声和非结构化特性。解决方案的关键在于在训练过程中强制执行一致的标题结构,通过引入Re-LAION-Caption 19M数据集,其中包含1900万张具有结构化描述的图像,这些描述遵循由主体、场景、美学和相机细节组成的四部分模板,从而提升模型的可控性和文本-图像对齐效果。
链接: https://arxiv.org/abs/2507.05300
作者: Nicholas Merchant,Haitz Sáez de Ocáriz Borde,Andrei Cristian Popescu,Carlos Garcia Jurado Suarez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 7-page main paper + appendix, 18 figures
Abstract:We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt- \Sigma and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at this https URL.
zh
[NLP-75] A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models ACL2025
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在关键领域部署后所带来的算法生成虚假信息的社会风险问题。传统检测方法难以有效应对LLM生成的虚假信息,因其具有自我强化、高度可信和跨语言快速传播的特性。论文提出的解决方案关键在于构建一种主动防御范式,即“Three Pillars框架”,包括:(1)知识可信度,强化训练和部署数据的完整性;(2)推理可靠性,嵌入推理过程中的自修正机制;(3)输入鲁棒性,提升模型接口对对抗攻击的抵抗能力。通过综合调查和比较元分析,论文表明主动防御策略在防止虚假信息方面相比传统方法有高达63%的性能提升。
链接: https://arxiv.org/abs/2507.05288
作者: Shuliang Liu,Hongyi Liu,Aiwei Liu,Bingchen Duan,Qi Zheng,Yibo Yan,He Geng,Peijie Jiang,Jia Liu,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Harbin Institute of Technology; Ant Group, Alibaba; Northeast Forest University
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2025 Findings
Abstract:The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.
zh
[NLP-76] Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag prompt engineering and cross-modal fusion
【速读】: 该论文旨在解决远程学习中学生辍学这一关键问题,该问题对社会和经济产生深远影响。传统机器学习模型虽然利用结构化的社会人口学和行为数据进行预测,但难以捕捉非结构化学生互动中的情感和情境因素。论文提出的解决方案核心在于构建一个融合检索增强生成(RAG)、提示工程和跨模态注意力融合的AI框架,通过领域特定的情感分析、学术压力因子的解码以及文本、行为和社会人口学信息的动态对齐,提升辍学预测的准确性与解释性。
链接: https://arxiv.org/abs/2507.05285
作者: Miloud Mihoubi,Meriem Zerkouk,Belkacem Chikhaoui
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures, 5 tables. Submitted to the 38th Canadian Conference on Artificial Intelligence (Canadian AI 2025)
Abstract:Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4 423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems
zh
[NLP-77] Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management
【速读】: 该论文旨在解决传统定时交通信号控制中因需手动创建和更新信号配时计划而带来的繁琐工作问题,特别是在不同时段或星期几使用不同计划时,单个交叉口可能涉及多个计划,导致重复的人工参数输入。解决方案的关键在于提出Chat2SPaT方法,该方法能够将用户对信号控制计划的半结构化和模糊描述转化为精确的信号相位与时间(SPaT)结果,并进一步生成结构化的基于阶段或环状的计划,以适配智能交通系统(ITS)软件和交通信号控制器。该方法通过精心设计的提示语利用大语言模型(LLM)的理解能力,将用户描述重构为包含相位序列和相位属性的JSON格式结果,再通过Python脚本完成周期内相位定位、交通信号控制细节处理及完整控制计划的组装。
链接: https://arxiv.org/abs/2507.05283
作者: Yue Wang,Miao Zhou,Guijing Huang,Rui Zhuo,Chao Yi,Zhenliang Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Pre-timed traffic signal control, commonly used for operating signalized intersections and coordinated arterials, requires tedious manual work for signaling plan creating and updating. When the time-of-day or day-of-week plans are utilized, one intersection is often associated with multiple plans, leading to further repetitive manual plan parameter inputting. To enable a user-friendly traffic signal control plan management process, this study proposes Chat2SPaT, a method to convert users’ semi-structured and ambiguous descriptions on the signal control plan to exact signal phase and timing (SPaT) results, which could further be transformed into structured stage-based or ring-based plans to interact with intelligent transportation system (ITS) software and traffic signal controllers. With curated prompts, Chat2SPaT first leverages large language models’ (LLMs) capability of understanding users’ plan descriptions and reformulate the plan as a combination of phase sequence and phase attribute results in the json format. Based on LLM outputs, python scripts are designed to locate phases in a cycle, address nuances of traffic signal control, and finally assemble the complete traffic signal control plan. Within a chat, the pipeline can be utilized iteratively to conduct further plan editing. Experiments show that Chat2SPaT can generate plans with an accuracy of over 94% for both English and Chinese cases, using a test dataset with over 300 plan descriptions. As the first benchmark for evaluating LLMs’ capability of understanding traffic signal control plan descriptions, Chat2SPaT provides an easy-to-use plan management pipeline for traffic practitioners and researchers, serving as a potential new building block for a more accurate and versatile application of LLMs in the field of ITS. The source codes, prompts and test dataset are openly accessible at this https URL.
zh
[NLP-78] CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
【速读】: 该论文旨在解决现有仓库级基准在评估大型语言模型(Large Language Models, LLMs)工程级代码处理能力时存在的局限性,如仅关注单一场景、未能充分反映真实软件的多样性和复杂性以及测试用例生成的可靠性问题。其解决方案的关键在于提出CorePipe自动化流水线,将仓库转换为全面的测试用例,并引入可配置的多场景仓库级基准CoreCodeBench,通过生成三种原子问题和复合问题来模拟真实的工程场景,从而更准确地评估LLMs在实际工程项目中的适用性。
链接: https://arxiv.org/abs/2507.05281
作者: Lingyue Fu,Hao Guan,Bolun Zhang,Haowei Yuan,Yaoming Zhu,Jun Xu,Zongyu Wang,Lin Qiu,Xunliang Cai,Xuezhi Cao,Weiwen Liu,Weinan Zhang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); AGI-EVAL; Meituan (美团)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at this https URL, and the data for CoreCodeBench can be accessed at this https URL.
zh
[NLP-79] ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy
【速读】: 该论文旨在解决大型语言模型(LLMs)在代码开发和Reservoir Computing领域回答复杂问题时存在的幻觉和事实准确性不足的问题。其解决方案的关键在于通过引入检索增强生成(Retrieval-Augmented Generation, RAG)和知识图谱,将外部知识整合到模型中,从而提升生成响应的准确性和可靠性。该方法不仅增强了模型在编码任务上的表现,还提供了类似ChatGPT的交互体验,专门针对ReservoirPy库进行优化。
链接: https://arxiv.org/abs/2507.05279
作者: Virgile Boraud(Mnemosyne),Yannis Bendi-Ouis(Mnemosyne),Paul Bernard(Mnemosyne),Xavier Hinaut(Mnemosyne)
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:We introduce a tool designed to improve the capabilities of Large Language Models (LLMs) in assisting with code development using the ReservoirPy library, as well as in answering complex questions in the field of Reservoir Computing. By incorporating external knowledge through Retrieval-Augmented Generation (RAG) and knowledge graphs, our approach aims to reduce hallucinations and increase the factual accuracy of generated responses. The system provides an interactive experience similar to ChatGPT, tailored specifically for ReservoirPy, enabling users to write, debug, and understand Python code while accessing reliable domain-specific insights. In our evaluation, while proprietary models such as ChatGPT-4o and NotebookLM performed slightly better on general knowledge questions, our model outperformed them on coding tasks and showed a significant improvement over its base model, Codestral-22B.
zh
[NLP-80] An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks
【速读】: 该论文旨在解决隐性性别歧视内容在社交媒体上的检测问题,此类内容常被传统方法忽视。其解决方案的关键在于提出一种自适应监督对比学习框架(ASCEND),核心创新在于引入基于阈值的对比学习机制,通过计算嵌入向量之间的余弦相似度,仅将相似度超过可学习阈值的样本对视为正例,从而在嵌入空间中更稳健地聚合语义相似文本表示并分离不相似文本,降低误报和漏报率。
链接: https://arxiv.org/abs/2507.05271
作者: Mohammad Zia Ur Rehman,Aditya Shah,Nagendra Kumar
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is the incorporation of threshold-based contrastive learning: by computing cosine similarities between embeddings, we selectively treat only those sample pairs as positive if their similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module. Additionally, we employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.
zh
[NLP-81] User Behavior Prediction as a Generic Robust Scalable and Low-Cost Evaluation Strategy for Estimating Generalization in LLM s
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)泛化能力评估的问题,特别是在数据污染背景下确保测试任务和用例在训练阶段未被见过变得几乎不可能。论文认为知识检索和推理任务并非衡量泛化能力的理想指标,因为LLMs并非针对特定任务进行训练。解决方案的关键在于提出用户行为预测作为理论上有依据、可扩展且稳健的替代方法,并引入一种新的框架进行验证,实验结果表明该框架能够有效评估模型的泛化能力。
链接: https://arxiv.org/abs/2507.05266
作者: Sougata Saha,Monojit Choudhury
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework’s predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.
zh
[NLP-82] okenShapley: Token Level Context Attribution with Shapley Value
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成响应时,如何准确地对特定关键词(如数字、年份或名称)进行数据溯源的问题。现有方法仅能在句子级别进行归因,无法满足用户对细粒度信息的归因需求。解决方案的关键在于提出TokenShapley,这是一种结合基于Shapley值的数据归因与基于KNN的检索技术的新型token级归因方法,通过预计算的数据存储实现上下文检索,并利用Shapley值量化token的重要性,从而实现更精确的细粒度数据归因。
链接: https://arxiv.org/abs/2507.05261
作者: Yingtai Xiao,Yuqing Zhu,Sirat Samyoun,Wanrong Zhang,Jiachen T. Wang,Jian Du
机构: TikTok Inc.(TikTok公司); Princeton University(普林斯顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.
zh
[NLP-83] ABench-Physics: Benchmarking Physical Reasoning in LLM s via High-Difficulty and Dynamic Physics Problems
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在物理领域推理和泛化能力不足的问题,尤其是其在物理建模、概念理解和精确计算方面的局限性。现有基准测试在难度、题型和评估环境方面存在不足,无法有效评估模型的物理推理能力。论文提出的解决方案是构建ABench-Physics基准,其关键在于包含两个组成部分:Phy_A,一个包含400道研究生或竞赛级别静态问题的集合;Phy_B,一个配备自动变化引擎的100道动态问题子集,用于测试模型在变化条件下的鲁棒性。该基准要求精确数值答案,并设有严格的格式和容差约束,从而为评估和提升LLMs的科学推理能力提供了一个具有挑战性和诊断性的框架。
链接: https://arxiv.org/abs/2507.04766
作者: Yiming Zhang,Yingfan Ma,Yanmei Gu,Zhengkai Yang,Yihong Zhuang,Feng Wang,Zenan Huang,Yuanyuan Wang,Chao Huang,Bowen Song,Cheng Lin,Junbo Zhao
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.
zh
[NLP-84] ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark
【速读】: 该论文试图解决传统自动语音识别(Automatic Speech Recognition, ASR)模型在上下文建模、记忆及基于世界知识的推理能力方面的不足,这些问题限制了其在复杂场景下的性能评估。解决方案的关键在于提出ContextASR-Bench:一个涵盖多领域、大规模的基准测试平台,用于评估ASR系统在不同粒度上下文信息下的表现,并特别关注模型对语音输入中提及的命名实体的识别能力。通过这一基准,研究验证了具备强大世界知识和上下文学习能力的大型音频语言模型(Large Audio Language Models, LALMs)在性能上显著优于传统ASR模型。
链接: https://arxiv.org/abs/2507.05727
作者: He Wang,Linhan Ma,Dake Guo,Xiong Wang,Lei Xie,Jin Xu,Junyang Lin
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 18 pages, 4 figures
Abstract:Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at this https URL.
zh
计算机视觉
[CV-0] Learning to Track Any Points from Human Motion
【速读】:该论文试图解决人体运动中点跟踪任务的训练数据获取困难问题,尤其是在面对非刚性形变、关节运动、衣物扭曲和频繁遮挡等复杂情况时,手动标注数据成本高昂且难以大规模获取。其解决方案的关键在于提出了一种自动化管道AnthroTAP,通过拟合Skinned Multi-Person Linear (SMPL)模型到视频帧中检测到的人体,将生成的3D网格顶点投影到2D图像平面以创建伪轨迹,并利用射线投射处理遮挡,结合光流一致性过滤不可靠轨迹,从而生成高质量的伪标签数据用于训练点跟踪模型。
链接: https://arxiv.org/abs/2507.06233
作者: Inès Hyeonsu Kim,Seokju Cho,Jahyeok Koo,Junghyun Park,Jiahui Huang,Joon-Young Lee,Seungryong Kim
机构: KAIST AI; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this by proposing an automated pipeline to generate pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on AnthroTAP annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day in 4 GPUs, compared to 256 GPUs used in recent state-of-the-art.
zh
[CV-1] RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models
【速读】:该论文旨在解决遥感图像分割中由于多模态耦合处理机制导致的语义关系复杂性管理不足和跨模态对齐不精确的问题。其关键解决方案是提出RSRefSeg 2,采用解耦范式,将传统流程重构为粗略定位与精细分割的协作双阶段框架,通过战略基础模型协作整合CLIP的跨模态对齐能力和SAM的分割泛化能力,从而提升分割精度与复杂语义理解能力。
链接: https://arxiv.org/abs/2507.06231
作者: Keyan Chen,Chenyang Liu,Bowen Chen,Jiafan Zhang,Zhengxia Zou,Zhenwei Shi
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP’s cross-modal alignment strength with SAM’s segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. To mitigate CLIP’s misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: this https URL.
zh
[CV-2] Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion ICCV2025
【速读】:该论文试图解决从单张图像中同时推断场景的3D几何结构和语义信息的问题,即语义场景补全(Semantic Scene Completion, SSC)。传统方法依赖于昂贵的真值标注,而本文在无监督设置下提出了一种新方法——SceneDINO,其关键在于利用自监督表示学习和2D无监督场景理解的技术,仅通过多视角一致性自监督信号进行训练,无需任何语义或几何真值。通过一种新颖的3D特征蒸馏方法,SceneDINO能够获得无监督的3D语义,并在3D和2D无监督场景理解任务中达到了最先进的分割精度。
链接: https://arxiv.org/abs/2507.06230
作者: Aleksandar Jevtić,Christoph Reich,Felix Wimbauer,Oliver Hahn,Christian Rupprecht,Stefan Roth,Daniel Cremers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at ICCV 2025. Christoph Reich and Aleksandar Jevtić - both authors contributed equally. Code: this https URL Project page: this https URL
Abstract:Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
zh
[CV-3] Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
【速读】:该论文旨在解决科学文献中图表等半结构化数据的视觉问答(Visual Question Answering, VQA)问题,特别是针对科学数据解释所需的精度挑战,包括数值处理、多步骤视觉元素推理以及视觉观察与文本推理之间的一致性。其解决方案的关键在于采用大规模模型(参数量为5B至8B),并通过提示优化、思维链推理和集成建模策略提升模型在视觉问答任务中的性能。其中,InternVL3模型在SciVQA测试集上表现最佳,而集成模型在验证集上的误差分析也表明其性能优于多数单一模型。
链接: https://arxiv.org/abs/2507.06183
作者: Prahitha Movva,Naga Harshita Marupaka
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Technical reports and articles often contain valuable information in the form of semi-structured data like charts, and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of \textbf0.740 and a BERTScore of \textbf0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model’s ability in visual question answering. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.06183 [cs.CV] (or arXiv:2507.06183v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.06183 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-4] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
【速读】:该论文试图解决生成式3D资产时缺乏显式、可编辑的部件结构的问题,现有方法通常生成整体形状,限制了其在交互应用中的实用性。解决方案的关键在于提出OmniPart框架,该框架通过两个协同阶段实现组件间的高语义解耦与结构稳固性:第一阶段利用自回归结构规划模块生成可控制的、变长的3D部件边界框序列,由灵活的2D部件掩码引导,实现无需直接对应关系或语义标签的直观部件分解;第二阶段则通过空间条件修正流模型,从预训练的整体3D生成器高效适配,同步且一致地合成所有3D部件。
链接: https://arxiv.org/abs/2507.06165
作者: Yunhan Yang,Yufan Zhou,Yuan-Chen Guo,Zi-Xin Zou,Yukun Huang,Ying-Tian Liu,Hao Xu,Ding Liang,Yan-Pei Cao,Xihui Liu
机构: The University of Hong Kong(香港大学); Harbin Institute of Technology(哈尔滨工业大学); VAST(视觉艺术与科技研究所); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
zh
[CV-5] Normalizing Diffusion Kernels with Optimal Transport
【速读】:该论文试图解决在非结构化数据(如点云、稀疏体素网格或高斯混合模型)上进行平滑处理的问题,传统方法依赖于基于微分几何的拉普拉斯算子,但其构建需要精确定义的域结构,而这种结构在实际应用中并不总能获得。解决方案的关键在于引入一类广义的平滑算子,这些算子基于相似性或邻接矩阵构造,并通过对称的Sinkhorn算法进行归一化,使其具备类似热扩散的结构行为,从而实现类似于拉普拉斯算子的平滑效果,并保留拉普拉斯算子的谱信息。
链接: https://arxiv.org/abs/2507.06161
作者: Nathan Kessler,Robin Magnet,Jean Feydy
机构: Centre Borelli, ENS Paris-Saclay; Inria, Université Paris Cité, Inserm, HeKA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 25 figures
Abstract:Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
zh
[CV-6] SoftReMish: A Novel Activation Function for Enhanced Convolutional Neural Networks for Visual Recognition Performance
【速读】:该论文试图解决卷积神经网络(CNN)在图像分类任务中性能优化的问题,特别是通过设计一种新的激活函数来提升模型的收敛行为和泛化能力。解决方案的关键在于提出了一种名为SoftReMish的新激活函数,并通过实验验证其在标准CNN架构上的优越表现,结果显示SoftReMish在最小训练损失和最大验证准确率方面均优于ReLU、Tanh和Mish等现有激活函数。
链接: https://arxiv.org/abs/2507.06148
作者: Mustafa Bayram Gücen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions including ReLU, Tanh, and Mish by replacing the activation function in all trainable layers. The model performance was assessed in terms of minimum training loss and maximum validation accuracy. Results showed that SoftReMish achieved a minimum loss (3.14e-8) and a validation accuracy (99.41%), outperforming all other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.
zh
[CV-7] Prompt-Free Conditional Diffusion for Multi-object Image Augmentation IJCAI2025
【速读】:该论文试图解决多对象图像生成中因依赖文本条件导致的生成对象与原始数据偏差,以及过度依赖原始图像导致生成图像多样性不足的问题。解决方案的关键在于提出一种无需提示的条件扩散框架,通过引入局部-全局语义融合策略从图像中提取语义以替代文本,并利用LoRA注入知识以缓解原始模型与目标数据集之间的类别偏差;同时设计基于计数损失的奖励模型来辅助传统重建损失,通过约束每类对象的数量而非逐像素约束,从而弥合生成数据与原始数据之间的数量差异并提升生成数据的多样性。
链接: https://arxiv.org/abs/2507.06146
作者: Haoyu Wang,Lei Zhang,Wei Wei,Chen Ding,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学); Xi’an University of Posts & Telecommunications (西安邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCAI 2025
Abstract:Diffusion models has underpinned much recent advances of dataset augmentation in various computer vision tasks. However, when involving generating multi-object images as real scenarios, most existing methods either rely entirely on text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems with one stone, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward model based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of pixel-by-pixel constraints, bridging the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at \hrefthis https URLhere.
zh
[CV-8] Omni-Video: Democratizing Unified Video Understanding and Generation
【速读】:该论文试图解决当前基础模型主要聚焦于图像处理,而在视频理解与生成的统一模型发展上存在明显不足的问题。其解决方案的关键在于教导现有的多模态大语言模型(MLLMs)生成连续的视觉线索,这些线索作为扩散解码器的输入,用于生成高质量视频。通过引入轻量级架构设计和高效的多阶段训练方案,以实现MLLMs与扩散解码器之间的快速连接,并充分发挥系统在统一视频建模中的潜力。
链接: https://arxiv.org/abs/2507.06119
作者: Zhiyu Tan,Hao Yang,Luozheng Qin,Jia Gong,Mengping Yang,Hao Li
机构: Shanghai Academy of Artificial Intelligence for Science; Fudan University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report, project page: this https URL
Abstract:Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that respectively attaches a vision head on the top of MLLMs and a adapter before the input of diffusion decoders, the former produce visual tokens for the latter, which adapts these visual tokens to the conditional space of diffusion decoders; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
zh
[CV-9] LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures
【速读】:该论文试图解决使用手持设备进行全景式运动采集数据时,由于旋转主导的运动和窄基线导致的相机位姿估计与三维点云重建困难的问题,尤其是在无纹理的室内场景中。其解决方案的关键在于提出LighthouseGS框架,该框架借鉴了全景视图的灯塔式扫动运动,利用粗略的几何先验信息(如手机摄像头位姿和单目深度估计)以及室内环境中常见的平面结构,通过新的平面脚手架组装初始化方法生成一致的三维点,并结合稳定剪枝策略提升几何精度与优化稳定性,同时引入几何与光度校正以解决运动漂移和自动曝光带来的不一致性。
链接: https://arxiv.org/abs/2507.06109
作者: Seungoh Han,Jaehoon Jang,Hyunsu Kim,Jaeheung Surh,Junhyung Kwak,Hyowon Ha,Kyungdon Joo
机构: Artificial Intelligence Graduate School, UNIST(UNIST人工智能研究生院); Bucketplace, Co., Ltd.(Bucketplace有限公司)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time novel view synthesis (NVS) with impressive quality in indoor scenes. However, achieving high-fidelity rendering requires meticulously captured images covering the entire scene, limiting accessibility for general users. We aim to develop a practical 3DGS-based NVS framework using simple panorama-style motion with a handheld camera (e.g., mobile device). While convenient, this rotation-dominant motion and narrow baseline make accurate camera pose and 3D point estimation challenging, especially in textureless indoor scenes. To address these challenges, we propose LighthouseGS, a novel framework inspired by the lighthouse-like sweeping motion of panoramic views. LighthouseGS leverages rough geometric priors, such as mobile device camera poses and monocular depth estimation, and utilizes the planar structures often found in indoor environments. We present a new initialization method called plane scaffold assembly to generate consistent 3D points on these structures, followed by a stable pruning strategy to enhance geometry and optimization stability. Additionally, we introduce geometric and photometric corrections to resolve inconsistencies from motion drift and auto-exposure in mobile devices. Tested on collected real and synthetic indoor scenes, LighthouseGS delivers photorealistic rendering, surpassing state-of-the-art methods and demonstrating the potential for panoramic view synthesis and object placement.
zh
[CV-10] Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering
【速读】:该论文试图解决反射表面在新视角合成中的准确渲染问题,现有方法如NeRF和3D Gaussian Splatting(3DGS)常将反射误认为物理几何,导致重建质量下降。其解决方案的关键在于提出Ref-Unlock框架,该框架基于3DGS,通过显式解耦透射与反射成分,以更好地捕捉复杂反射并提升真实场景中的几何一致性。该方法采用双分支表示和高阶球面谐波来捕获高频反射细节,并引入反射去除模块提供伪无反射监督以指导清洁分解,同时结合伪深度图和几何感知双边平滑约束以增强三维几何一致性和分解稳定性。
链接: https://arxiv.org/abs/2507.06103
作者: Jiayi Song,Zihan Ye,Qingyuan Zhou,Weidong Yang,Ben Fei,Jingyi Xu,Ying He,Wanli Ouyang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, in this work, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible vision foundation models (VFMs) driven reflection editing. Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at this https URL.
zh
[CV-11] -Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification
【速读】:该论文旨在解决植被样方图像中多物种植物识别的问题。其解决方案的关键在于构建一个结合微调的Vision Transformer模型ViTD2PC24All进行局部特征推理、4x4分块策略以匹配网络的518x518感受野,以及通过PaCMAP + K-Means视觉聚类和地理定位过滤实现领域先验适应的流水线。通过对分块预测结果进行多数投票并结合聚类相关的贝叶斯先验进行重新加权,最终在无需额外训练的情况下实现了0.348的宏平均F1分数(私有排行榜)。
链接: https://arxiv.org/abs/2507.06093
作者: Murilo Gustineli,Anthony Miyaguchi,Adrian Cheung,Divyansh Khattak
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:We describe DS@GT’s second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network’s 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at this https URL.
zh
[CV-12] CAST-Phys: Contactless Affective States Through Physiological signals Database
【速读】:该论文试图解决情感识别系统发展中因缺乏多模态情感数据集而导致的准确率不足问题,以及传统接触式设备在情绪诱发过程中可能干扰真实自发情绪反应的局限性。其解决方案的关键在于提出一个名为CAST-Phys的高质量多模态非接触生理情感数据库,该数据库结合了面部视频和多种生理信号(如PPG、EDA和RR),实现了无需物理接触即可提取情感线索,从而为远程生理情感识别提供了可靠的数据支持。
链接: https://arxiv.org/abs/2507.06080
作者: Joaquim Comas,Alexander Joel Vera,Xavier Vives,Eleonora De Filippi,Alexandre Pereda,Federico Sukno
机构: Pompeu Fabra University (庞培法布拉大学); Eurecat Centre Tecnològic (欧瑞凯特技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.
zh
[CV-13] ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models
【速读】:该论文试图解决深度学习模型在面对对抗攻击时的脆弱性问题,特别是现有方法多依赖于ℓp-norm扰动约束,与人类感知能力不匹配,而生成自然、无约束的对抗样本(UAEs)仍是挑战。其解决方案的关键在于提出一种基于扩散模型的新型方法——ScoreAdv,该方法通过引入可解释的对抗引导机制,逐步将采样分布向对抗分布转移,并利用可解释的显著图将参考图像的视觉信息注入生成样本中,从而有效生成高质量且数量无限的自然对抗样本。
链接: https://arxiv.org/abs/2507.06078
作者: Chihan Huang,Hao Tang
机构: 北京大学(Beijing University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on \ell_p -norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
zh
[CV-14] Discontinuity-aware Normal Integration for Generic Central Camera Models
【速读】:该论文试图解决从表面法线图中恢复三维表面的问题,即正常积分(normal integration),这是光度形状重建技术如基于阴影的形状(shape-from-shading)和光度立体(photometric stereo)的关键组成部分。现有方法通常隐式处理深度不连续性,并仅限于正交或理想针孔相机模型。该论文提出了一种新的公式,能够显式建模不连续性并处理通用中心相机模型。其关键思想是基于局部平面性假设,通过表面法线与光线方向之间的约束进行建模。
链接: https://arxiv.org/abs/2507.06075
作者: Francesco Milano,Manuel López-Antequera,Naina Dhingra,Roland Siegwart,Robert Thiel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 13 figures, 8 tables
Abstract:Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
zh
[CV-15] MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
【速读】:该论文试图解决自动驾驶视频理解中驾驶行为识别与推理的准确性问题,现有方法往往仅挖掘浅层因果关系,无法处理多模态间的虚假相关性,并忽略了自车层级的因果建模。解决方案的关键在于提出一种新颖的多模态因果分析模型(Multimodal Causal Analysis Model, MCAM),该模型通过构建视觉与语言模态之间的潜在因果结构,结合多级特征提取器捕捉长程依赖、基于有向无环图(DAG)动态建模驾驶场景以及利用视觉-语言Transformer对齐关键视觉特征与语义表达,从而提升视觉-语言因果关系学习的效果。
链接: https://arxiv.org/abs/2507.06072
作者: Tongtong Cheng,Rongzhen Li,Yixin Xiong,Tao Zhang,Jing Wang,Kai Liu
机构: Chongqing University(重庆大学); National Elite Institute of Engineering, Chongqing University(重庆大学精英工程学院); National University of Deffense Technology(国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often tend to dig out the shallow causal, fail to address spurious correlations across modalities, and ignore the ego-vehicle level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X, and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at this https URL.
zh
[CV-16] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
【速读】:该论文旨在解决现有音频驱动情感3D面部动画方法中情感标签静态且预定义导致的表达多样性与自然性受限的问题。其解决方案的关键在于提出MEDTalk框架,通过精心设计的交叉重构过程将内容与情感嵌入空间从运动序列中解耦,实现对唇部动作和面部表情的独立控制,并结合音频与语音文本预测帧级强度变化,动态调整静态情感特征以生成逼真的情感表达。
链接: https://arxiv.org/abs/2507.06071
作者: Chang Liu,Ye Pan,Chenyang Ding,Susanto Rahardja,Xiaokang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 11 pages, 8 figures
Abstract:Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline.
zh
[CV-17] VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
【速读】:该论文旨在解决传统3D面部动画方法在利用2D计算机视觉和图形学快速进展方面的局限性,特别是在生成逼真、高保真度的面部动画时。其解决方案的关键在于提出VisualSpeaker方法,该方法通过使用基于视觉语音识别的监督信号,结合光栅化渲染技术,实现更高质量的3D面部动画生成。核心创新点在于引入了感知唇读损失,通过将逼真3D高斯泼溅头像渲染输入预训练的视觉自动语音识别模型来计算该损失,从而提升生成动画的感知质量与准确性。
链接: https://arxiv.org/abs/2507.06060
作者: Alexandre Symeonidis-Herzig,Özge Mercanoğlu Sincan,Richard Bowden
机构: University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.
zh
[CV-18] xtPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision
【速读】:该论文试图解决生成式 AI (Generative AI) 在生成图像中无法准确生成可读、有意义且拼写正确的文本这一问题,该问题显著限制了其在广告、学习和创意设计等实际应用场景中的使用。解决方案的关键在于提出一种新的框架——Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA),该框架通过三个精心设计的模块进行改进:首先,采用双流文本编码器以获得富含语义和字符信息的输入文本表示;其次,引入一种基于字符感知的注意力机制及新的注意力分离损失,以减少字符分布的扭曲伪影;最后,通过 OCR-in-the-loop 的微调阶段,利用全文本感知损失优化模型以提升可读性和拼写准确性。
链接: https://arxiv.org/abs/2507.06033
作者: Syeda Anshrah Gillani,Mirza Samad Ahmed Baig,Osama Ahmed Khan,Shahid Munir Shah,Umema Mujeeb,Maheen Ali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages
Abstract:The modern text-to-image diffusion models boom has opened a new era in digital content production as it has proven the previously unseen ability to produce photorealistic and stylistically diverse imagery based on the semantics of natural-language descriptions. However, the consistent disadvantage of these models is that they cannot generate readable, meaningful, and correctly spelled text in generated images, which significantly limits the use of practical purposes like advertising, learning, and creative design. This paper introduces a new framework, namely Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), using which a typical diffusion backbone is extended by three well-designed modules. To begin with, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, resulting in a character-aware representation of the input text that is rich in nature. Second, an attention mechanism that is aware of the character is proposed with a new attention segregation loss that aims to limit the attention distribution of each character independently in order to avoid distortion artifacts. Lastly, GCDA has an OCR-in-the-loop fine-tuning phase, where a full text perceptual loss, directly optimises models to be legible and accurately spell. Large scale experiments to benchmark datasets, such as MARIO-10M and T2I-CompBench, reveal that GCDA sets a new state-of-the-art on all metrics, with better character based metrics on text rendering (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25), human perception, and comparable image synthesis quality on high-fidelity (FID: 14.3).
zh
[CV-19] ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
【速读】:该论文试图解决在资源受限的边缘设备上进行实时视觉分析时,如何实现能耗与检测精度之间的联合优化问题。解决方案的关键在于提出ECORE框架,该框架集成多种动态路由策略,包括基于估计的技术和贪心选择算法,以将图像处理请求导向最合适的边缘设备-模型组合,并根据目标特征动态平衡能效与检测性能。
链接: https://arxiv.org/abs/2507.06011
作者: Daghash K. Alqahtani,Maria A. Rodriguez,Muhammad Aamir Cheema,Hamid Rezatofighi,Adel N. Toosi
机构: University of Melbourne(墨尔本大学); Monash University(莫纳什大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Edge computing enables data processing closer to the source, significantly reducing latency an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies including estimation based techniques and a greedy selection algorithm to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our approach through extensive experiments on real-world datasets, comparing the proposed routers against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 45% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
zh
[CV-20] Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS
【速读】:该论文旨在解决在GNSS信号受阻的城市区域中,对LiDAR点云进行精确地理配准(geo-registration)的问题。现有方法通常依赖于实时GNSS和IMU数据,需要预先校准并假设数据采集期间定位稳定,但在密集城区这一假设往往失效,导致定位误差。该论文提出的解决方案关键在于采用一种结构化的地理配准与空间校正方法,通过将3D点云与卫星图像对齐,实现帧级GNSS信息恢复和城市尺度3D地图重建,而无需依赖先验定位。该方法利用预训练的Point Transformer模型分割道路点,提取道路骨架和交叉点用于对齐,并通过交叉点进行全局刚性配准,随后使用径向基函数(RBF)插值进行局部优化,最后结合SRTM地形数据进行高程校正。
链接: https://arxiv.org/abs/2507.05999
作者: Xinyu Wang,Muhammad Ibrahim,Atif Mansoor,Ajmal Mian
机构: The University of Western Australia (西澳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to Transactions on Geoscience Remote Sensing
Abstract:Accurate geo-registration of LiDAR point clouds presents significant challenges in GNSS signal denied urban areas with high-rise buildings and bridges. Existing methods typically rely on real-time GNSS and IMU data, that require pre-calibration and assume stable positioning during data collection. However, this assumption often fails in dense urban areas, resulting in localization errors. To address this, we propose a structured geo-registration and spatial correction method that aligns 3D point clouds with satellite images, enabling frame-wise recovery of GNSS information and reconstruction of city scale 3D maps without relying on prior localization. The proposed approach employs a pre-trained Point Transformer model to segment the road points and then extracts the road skeleton and intersection points from the point cloud as well as the target map for alignment. Global rigid alignment of the two is performed using the intersection points, followed by local refinement using radial basis function (RBF) interpolation. Elevation correction is then applied to the point cloud based on terrain information from SRTM dataset to resolve vertical discrepancies. The proposed method was tested on the popular KITTI benchmark and a locally collected Perth (Western Australia) CBD dataset. On the KITTI dataset, our method achieved an average planimetric alignment standard deviation (STD) of 0.84~m across sequences with intersections, representing a 55.3% improvement over the original dataset. On the Perth dataset, which lacks GNSS information, our method achieved an average STD of 0.96~m compared to the GPS data extracted from Google Maps API. This corresponds to a 77.4% improvement from the initial alignment. Our method also resulted in elevation correlation gains of 30.5% on the KITTI dataset and 50.4% on the Perth dataset.
zh
[CV-21] Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation
【速读】:该论文试图解决深度伪造检测模型在分布外数据上的性能下降问题,即现有基于机器学习的生成式AI(Generative AI)检测模型在基准数据集上表现优异,但在面对不同分布的数据时效果显著恶化。解决方案的关键在于采用集成学习方法,通过结合多个先进且不对称的模型的预测概率,提升检测系统的泛化能力。实验结果表明,单一模型在不同场景下表现不稳定,而基于集成的预测在所有测试场景中均表现出更稳定和可靠的效果。
链接: https://arxiv.org/abs/2507.05996
作者: Haroon Wahab,Hassan Ugail,Lujain Jaleel
机构: University of Bradford(布拉德福德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets, yet their performance often deteriorates significantly when evaluated on out-of-distribution data. In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems across diverse datasets. Building on a recent open-source benchmark, we combine prediction probabilities from several state-of-the-art asymmetric models proposed at top venues. Our experiments span two distinct out-of-domain datasets and demonstrate that no single model consistently outperforms others across settings. In contrast, ensemble-based predictions provide more stable and reliable performance in all scenarios. Our results suggest that asymmetric ensembling offers a robust and scalable solution for real-world deepfake detection where prior knowledge of forgery type or quality is often unavailable.
zh
[CV-22] Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence Knowledge
【速读】:该论文旨在解决部分多标签学习(Partial Multi-Label Learning)中从不完全标注数据中提取知识的问题,其核心挑战在于准确识别标签与实例之间的模糊关系。解决方案的关键在于通过匹配标签与实例的共现模式来增强对这种关系的理解,论文提出了一种名为语义共现洞察网络(Semantic Co-occurrence Insight Network, SCINet)的新框架,其核心创新包括双主导提示模块、跨模态融合模块以及内在语义增强策略,以提升模型对数据语义的理解和实例-标签间的相互依赖关系建模能力。
链接: https://arxiv.org/abs/2507.05992
作者: Xin Wu,Fei Teng,Yue Feng,Kaibo Shi,Zhuosheng Lin,Ji Zhang,James Wang
机构: Southwest Jiaotong University (西南交通大学); Swinburne University of Technology (斯威本科技大学); School of Engineering (工程学院); School of Computing and Artificial Intelligence (计算与人工智能学院); Engineering Research Center of Sustainable Urban Intelligent Transportation (可持续城市智能交通工程研究中心); Wuyi University (五邑大学); School of Electronic and Information Engineering (电子与信息工程学院); College of Electrical Engineering (电气工程学院); Sichuan University (四川大学); School of Computer Science (计算机科学学院); Chengdu University (成都大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 10 figures, Under Review
Abstract:Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model’s understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.
zh
[CV-23] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval ACM-MM2025
【速读】:该论文试图解决传统Composed Image Retrieval (CIR)方法依赖于昂贵且手动标注的三元组数据,从而限制了模型的可扩展性和零样本能力的问题。其解决方案的关键在于提出了一种可扩展的自动三元组生成流程,并构建了一个全合成数据集CIRHS。该流程利用大语言模型生成多样化提示,控制文本到图像生成模型以创建包含相同元素的图像对,随后通过过滤和重组形成CIRHS数据集。此外,论文还引入了Hybrid Contextual Alignment (CoAlign)框架,实现了全局对齐与局部推理的结合,提升了模型的表示能力。
链接: https://arxiv.org/abs/2507.05970
作者: Haiwen Li,Delong Liu,Zhaohui Hou,Zhicheng Zhao,Fei Su
机构: Beijing University of Posts and Telecommunications(北京邮电大学); SenseTime(商汤科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was originally submitted to ACM MM 2025 on April 12, 2025
Abstract:As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
zh
[CV-24] -LoRA: Single Image Diffusion Model Customization Without Overfitting
【速读】:该论文试图解决在训练样本有限的情况下,微调扩散模型容易出现过拟合的问题,从而影响模型的泛化能力和输出多样性。其解决方案的关键在于提出T-LoRA框架,该框架通过两个核心创新实现更有效的模型个性化:一是基于扩散时间步的动态微调策略,根据时间步调整秩约束更新;二是通过正交初始化的权重参数化技术,确保适配器组件之间的独立性。
链接: https://arxiv.org/abs/2507.05964
作者: Vera Soboleva,Aibek Alanov,Andrey Kuznetsov,Konstantin Sobolev
机构: AIRI(人工智能研究院); HSE University(高等经济大学); Sber(斯贝赫); Innopolis(因诺波利斯); MSU(莫斯科国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at this https URL.
zh
[CV-25] ora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation ACM-MM25
【速读】:该论文旨在解决视频生成中多实体外观与运动定制的协同控制问题,特别是在多模态条件下的对齐与一致性挑战。其解决方案的关键在于引入了一个解耦的个性化提取器,用于生成多个开放集实体的全面个性化嵌入,从而更好地保留细粒度视觉细节;同时设计了门控自注意力机制,以整合轨迹、文本描述和视觉信息,有效减少训练过程中的多模态条件错位问题。此外,通过引入对比损失函数,实现了运动与个性化嵌入之间的显式映射,联合优化轨迹动力学与实体一致性,从而首次在视频生成中实现了外观与运动的同步多实体定制。
链接: https://arxiv.org/abs/2507.05963
作者: Zhenghao Zhang,Junchao Liao,Xiangyu Meng,Long Qin,Weizhi Wang
机构: Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM MM25 Conference Proceedings
Abstract:Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: this https URL .
zh
[CV-26] High-Fidelity and Generalizable Neural Surface Reconstruction with Sparse Feature Volumes
【速读】:该论文旨在解决神经表面重建中密集3D特征体素(dense 3D feature volume)在分辨率提升时存储和计算效率不足的问题,从而限制了重建质量的提升。其解决方案的关键在于提出一种稀疏表示方法(sparse representation),通过在训练阶段预测体素占据情况,并仅在高占据概率的体素中进行特征计算和体积渲染,从而显著提高内存效率并支持更高分辨率的重建。
链接: https://arxiv.org/abs/2507.05952
作者: Aoxiang Fan,Corentin Dumery,Nicolas Talabot,Hieu Le,Pascal Fua
机构: EPFL(洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generalizable neural surface reconstruction has become a compelling technique to reconstruct from few images without per-scene optimization, where dense 3D feature volume has proven effective as a global representation of scenes. However, the dense representation does not scale well to increasing voxel resolutions, severely limiting the reconstruction quality. We thus present a sparse representation method, that maximizes memory efficiency and enables significantly higher resolution reconstructions on standard hardware. We implement this through a two-stage approach: First training a network to predict voxel occupancies from posed images and associated depth maps, then computing features and performing volume rendering only in voxels with sufficiently high occupancy estimates. To support this sparse representation, we developed custom algorithms for efficient sampling, feature aggregation, and querying from sparse volumes-overcoming the dense-volume assumptions inherent in existing works. Experiments on public datasets demonstrate that our approach reduces storage requirements by more than 50 times without performance degradation, enabling reconstructions at 512^3 resolution compared to the typical 128^3 on similar hardware, and achieving superior reconstruction accuracy over current state-of-the-art methods.
zh
[CV-27] Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation
【速读】:该论文旨在解决视频实例分割(Video Instance Segmentation, VIS)在时间关联过程中面临的普遍挑战,如目标遮挡、运动模糊和外观变化等问题。其解决方案的关键在于引入几何感知,通过战略性地利用单目深度估计来增强VIS的鲁棒性。研究系统地探讨了三种不同的集成范式:扩展深度通道(EDC)、共享视觉Transformer(SV)和深度监督(DS),其中EDC和SV方法在基准测试中表现出显著的性能提升,尤其当使用Swin-L骨干网络时,EDC方法在OVIS基准上取得了56.2 AP的新状态最优结果。
链接: https://arxiv.org/abs/2507.05948
作者: Quanzhu Niu,Yikang Zhou,Shihao Chen,Tao Zhang,Shunping Ji
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. Expanding Depth Channel (EDC) method concatenates the depth map as input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone, shared between depth estimation and segmentation branches; Depth Supervision (DS) makes use of depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. When with Swin-L backbone, our EDC method gets 56.2 AP, which sets a new state-of-the-art result on OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
zh
[CV-28] gAug: Data Augmentation for Testing Traffic Light Detection in Autonomous Driving Systems
【速读】:该论文试图解决自动驾驶系统(ADSs)中交通灯检测模型的自动化测试问题,当前普遍依赖人工收集和标注数据,存在劳动强度大且难以覆盖多种驾驶环境的局限性。解决方案的关键是提出TigAug方法,通过构建基于天气环境、摄像头属性和交通灯属性的元变换关系和变换族,自动增强已标注的交通灯图像,从而有效测试和提升交通灯检测模型的性能。
链接: https://arxiv.org/abs/2507.05932
作者: You Lu,Dingji Wang,Kaifeng Huang,Bihuan Chen,Xin Peng
机构: Shanghai Key Laboratory of Data Science(上海市数据科学重点实验室)
类目: oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous vehicle technology has been developed in the last decades with recent advances in sensing and computing technology. There is an urgent need to ensure the reliability and robustness of autonomous driving systems (ADSs). Despite the recent achievements in testing various ADS modules, little attention has been paid on the automated testing of traffic light detection models in ADSs. A common practice is to manually collect and label traffic light data. However, it is labor-intensive, and even impossible to collect diverse data under different driving environments. To address these problems, we propose and implement TigAug to automatically augment labeled traffic light images for testing traffic light detection models in ADSs. We construct two families of metamorphic relations and three families of transformations based on a systematic understanding of weather environments, camera properties, and traffic light properties. We use augmented images to detect erroneous behaviors of traffic light detection models by transformation-specific metamorphic relations, and to improve the performance of traffic light detection models by retraining. Large-scale experiments with four state-of-the-art traffic light detection models and two traffic light datasets have demonstrated that i) TigAug is effective in testing traffic light detection models, ii) TigAug is efficient in synthesizing traffic light images, and iii) TigAug generates traffic light images with acceptable naturalness. Subjects: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.05932 [cs.SE] (or arXiv:2507.05932v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.05932 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-29] High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
【速读】:该论文旨在解决大型多模态模型(LMMs)在处理高分辨率图像时面临的挑战,即输入的大量视觉标记中存在许多与下游任务无关的信息。其解决方案的关键在于提出一种基于多轮对话框架的端到端强化学习(RL)框架——多轮定位策略优化(MGPO),该框架通过模型预测的定位坐标自动裁剪子图像,使模型能够迭代地关注关键视觉区域,从而提升模型的视觉定位能力。
链接: https://arxiv.org/abs/2507.05920
作者: Xinyu Huang,Yuhao Dong,Weiwei Tian,Bo Li,Rui Feng,Ziwei Liu
机构: Fudan University (复旦大学); S-Lab, Nanyang Technological University (南洋理工大学S-Lab)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4% improvement on in-distribution MME-Realworld and 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench. Codes are available at this https URL.
zh
[CV-30] On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification
【速读】:该论文试图解决在遥感(RS)图像场景分类任务中,直接应用为自然图像设计的可解释人工智能(xAI)方法及其评估指标可能不适用的问题。其解决方案的关键在于系统地分析和评估十种跨五类(忠实性、鲁棒性、定位性、复杂性和随机化)的解释度量,并将其应用于五种经典的特征归因方法(Occlusion、LIME、GradCAM、LRP 和 DeepLIFT),以识别解释方法和度量在 RS 场景下的关键局限性,并提供针对 RS 图像场景分类的解释方法、度量及超参数选择的指导原则。
链接: https://arxiv.org/abs/2507.05916
作者: Jonas Klotz,Tom Burgert,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The code of this work will be publicly available at this https URL
Abstract:The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
zh
[CV-31] What You Have is What You Track: Adaptive and Robust Multimodal Tracking ICCV2025
【速读】:该论文试图解决在存在时间不完整多模态数据情况下,视觉跟踪器性能显著下降的问题。现有跟踪器由于架构刚性,无法有效处理缺失模态,导致鲁棒性不足。解决方案的关键在于提出一种灵活的框架,通过动态激活计算单元以适应不同缺失率,并引入一种具有自适应复杂度的异构专家混合融合机制,结合视频级掩码策略,确保时间一致性和空间完整性,从而提升多模态跟踪的鲁棒性。
链接: https://arxiv.org/abs/2507.05899
作者: Yuedong Tan,Jiawei Shao,Eduard Zamfir,Ruanjun Li,Zhaochong An,Chao Ma,Danda Paudel,Luc Van Gool,Radu Timofte,Zongwei Wu
机构: TeleAI, China Telecom (TeleAI, 中国电信); Computer Vision Lab, CAIDAS & IFI, University of Wurzburg (计算机视觉实验室,CAIDAS & IFI,维尔茨堡大学); INSAIT, Sofia University (INSAIT,索菲亚大学); ShanghaiTech University (上海科技大学); University of Copenhagen (哥本哈根大学); AI Institute, Shanghai Jiao Tong University (人工智能研究院,上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV2025 accepted
Abstract:Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at this https URL.
zh
[CV-32] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing
【速读】:该论文试图解决现有遥感视觉-语言模型(Remote Sensing Vision-Language Models, RS-VLMs)在处理像素级任务和小目标识别场景中的能力不足,以及在处理高分辨率遥感图像时计算资源消耗过大的问题。其解决方案的关键在于提出GeoMag框架,该框架通过任务驱动的多粒度分辨率调整(Task-driven Multi-granularity Resolution Adjustment, TMRA)和提示引导的语义感知裁剪(Prompt-guided Semantic-aware Cropping, PSC)技术,动态聚焦注意力范围,从而有效提升模型对关键目标区域的感知能力,抑制背景冗余,并降低高分辨率遥感图像解析的计算成本。
链接: https://arxiv.org/abs/2507.05887
作者: Xianzhi Ma,Jianhui Li,Changhua Pei,Hao Liu
机构: Nanjing UniversitySuzhouChina; Chinese Academy of SciencesBeijingChina
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model’s perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.
zh
[CV-33] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos
【速读】:该论文试图解决动态三维高斯点云序列的高效压缩问题,旨在提升自由视角视频(FVV)的传输与存储效率。解决方案的关键在于提出一种前馈式压缩框架——动态高斯点云前馈压缩(D-FCGS),其核心创新包括引入基于帧组(GoF)的I-P帧编码结构以提取帧间运动,并通过稀疏控制点获取运动张量;同时采用双先验感知的熵模型对运动张量进行前馈压缩,结合超先验与时空先验实现精确率失真估计;在重建阶段,利用控制点引导的运动补偿和精炼网络提升视图一致性保真度。
链接: https://arxiv.org/abs/2507.05859
作者: Wenkang Zhang,Yan Zhao,Qiang Wang,Li Song,Zhengxue Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 12 pages, 9 figures, 8 tables
Abstract:Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.
zh
[CV-34] DFYP: A Dynamic Fusion Framework with Spectral Channel Attention and Adaptive Operator learning for Crop Yield Prediction
【速读】:该论文试图解决遥感作物产量预测中因复杂空间模式、异质光谱特征和动态农业条件带来的挑战,现有方法在空间建模能力、跨作物类型和年份的泛化能力方面存在不足。其解决方案的关键在于提出DFYP框架,该框架通过融合光谱通道注意力、边缘自适应空间建模和可学习融合机制,提升在多样化农业场景下的鲁棒性,具体包括三个核心组件:分辨率感知通道注意力模块、自适应算子学习网络以及具有可学习融合机制的双分支架构,以联合建模局部空间细节与全局上下文信息,实现跨分辨率和跨作物的泛化能力。
链接: https://arxiv.org/abs/2507.05849
作者: Juli Zhang,Zeyu Yan,Jing Zhang,Qiguang Miao,Quan Wang
机构: Xidian University(西安电子科技大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages
Abstract:Accurate remote sensing-based crop yield prediction remains a fundamental challenging task due to complex spatial patterns, heterogeneous spectral characteristics, and dynamic agricultural conditions. Existing methods often suffer from limited spatial modeling capacity, weak generalization across crop types and years. To address these challenges, we propose DFYP, a novel Dynamic Fusion framework for crop Yield Prediction, which combines spectral channel attention, edge-adaptive spatial modeling and a learnable fusion mechanism to improve robustness across diverse agricultural scenarios. Specifically, DFYP introduces three key components: (1) a Resolution-aware Channel Attention (RCA) module that enhances spectral representation by adaptively reweighting input channels based on resolution-specific characteristics; (2) an Adaptive Operator Learning Network (AOL-Net) that dynamically selects operators for convolutional kernels to improve edge-sensitive spatial feature extraction under varying crop and temporal conditions; and (3) a dual-branch architecture with a learnable fusion mechanism, which jointly models local spatial details and global contextual information to support cross-resolution and cross-crop generalization. Extensive experiments on multi-year datasets MODIS and multi-crop dataset Sentinel-2 demonstrate that DFYP consistently outperforms current state-of-the-art baselines in RMSE, MAE, and R2 across different spatial resolutions, crop types, and time periods, showcasing its effectiveness and robustness for real-world agricultural monitoring.
zh
[CV-35] USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining
【速读】:该论文试图解决在弱配对条件下,生成式AI (Generative AI) 在免疫组化虚拟染色 (IHC virtual staining) 任务中因相邻切片的空间异质性导致的一对多映射不准确及病理语义不一致的问题。解决方案的关键在于提出一种名为USIGAN的不平衡自信息特征传输方法,通过全局形态语义提取并去除联合边缘分布中的弱配对项,有效缓解弱配对对联合分布的影响,从而显著提升生成结果的内容一致性和病理语义一致性。此外,还设计了不平衡最优传输一致性机制(UOT-CTM)和病理自我对应机制(PC-SCM),以构建H&E与生成IHC图像间的图像级相关矩阵以及真实IHC与生成IHC图像集内的组内相关矩阵。
链接: https://arxiv.org/abs/2507.05843
作者: Yue Peng,Bing Xiong,Fuqiang Chen,De Eybo,RanRan Zhang,Wanming Hu,Jing Cai,Wenjian Qin
机构: ShenZhen Institues of Advanced Technology, university chinese academy of sciences (深圳先进技术研究院,中国科学院大学); Department of Pathology, Sun Yat-sen University Cancer Center (中山大学肿瘤防治中心病理部); State Key Laboratory of Oncology in South China (华南肿瘤学国家重点实验室); Guangdong Provincial Clinical Research Center for Cancer (广东省癌症临床研究重点实验室); Department of Health Technology and Informatics, The Hong Kong Polytechnic University (健康技术与信息学系,香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H\E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional this http URL removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H\E and generated IHC in image-level and real IHC and generated IHC image sets in intra-group level… Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.
zh
[CV-36] I2R: Inter and Intra-image Refinement in Few Shot Segmentation
【速读】:该论文旨在解决语义分割中的标注瓶颈问题,具体是通过少样本分割(Few-Shot Segmentation, FSS)方法,使模型能够在仅使用少量示例的情况下快速泛化到新类别。现有方法受限于支持图像与查询图像之间的语义差距以及支持或查询图像内部视觉相似但语义不同的区域所带来的误检问题。该论文提出的解决方案关键在于:1)利用类别特定的高层表示,从支持和查询图像中聚合全局语义线索,以实现更精确的跨图像区域定位;2)采用方向性掩码策略,抑制支持-查询像素对中特征相似但掩码冲突的不一致区域,从而缓解误检问题。
链接: https://arxiv.org/abs/2507.05838
作者: Ourui Fu,Hangzhou He,Xinliang Zhang,Lei Zhu,Shuang Zeng,ZhaoHeng Xie,Yanye Lu
机构: Peking University (北京大学); Peking University Health Science Center (北京大学医学部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation, which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images, followed by making predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) The semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) Visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. We propose a novel FSS method called \textbfI ^2 R: 1) Using category-specific high level representations which aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and address the first limitation. 2) Directional masking strategy that suppresses inconsistent support-query pixel pairs, which exhibit high feature similarity but conflicting mask, to mitigate the second issue. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9% and 2.1% in mIoU under the 1-shot setting on PASCAL-5 ^i and COCO-20 ^i benchmarks, respectively.
zh
[CV-37] Fair Domain Generalization: An Information-Theoretic View
【速读】:该论文试图解决机器学习中的两个关键挑战——领域泛化(Domain Generalization, DG)和算法公平性(Algorithmic Fairness)之间的脱节问题。传统领域泛化方法仅关注最小化未见目标领域的期望风险,而公平性方法则通常忽略领域偏移,导致训练阶段获得的公平性无法推广到未见领域。论文提出的解决方案是研究公平领域泛化(Fair Domain Generalization, FairDG)问题,其核心在于同时最小化未见目标领域的期望风险和公平性违规。关键创新在于基于互信息推导出多类分类任务中期望风险和公平性违规的上界,并通过帕累托优化框架PAFDG建模效用与公平性的权衡,从而实现更优的实用性和公平性平衡。
链接: https://arxiv.org/abs/2507.05823
作者: Tangzheng Lian,Guanyu Hu,Dimitrios Kollias,Xinyu Yang,Oya Celiktutan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.
zh
[CV-38] Video Event Reasoning and Prediction by Fusing World Knowledge from LLM s with Vision Foundation Models
【速读】:该论文试图解决当前视频理解模型在高层次认知任务(如因果推理和未来预测)上的不足,这一问题源于模型缺乏常识性世界知识。其解决方案的关键是提出一种新颖的框架,该框架协同融合强大的视觉基础模型(Vision Foundation Model, VFM)与作为知识驱动推理核心的大规模语言模型(Large Language Model, LLM)。该框架的核心技术创新是一个受Q-Former架构启发的复杂融合模块,能够将复杂的时空和以物体为中心的视觉特征提炼为简洁的语言对齐表示,从而使得LLM能够基于直接的视觉证据进行推理。
链接: https://arxiv.org/abs/2507.05822
作者: L’ea Dubois,Klaus Schmidt,Chengyu Wang,Ji-Hoon Park,Lin Wang,Santiago Munoz
机构: INRIA(法国国家信息与自动化研究所); Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所); San Francisco State University(旧金山州立大学); Seoul AI Institute (SAII)(首尔人工智能研究院); Tsinghua University(清华大学); Polytechnic University of Madrid(马德里理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 4 figures
Abstract:Current video understanding models excel at recognizing “what” is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.
zh
[CV-39] 2D Instance Editing in 3D Space
【速读】:该论文试图解决生成式模型在2D图像编辑中面临的连贯性不足和对象身份保持困难的问题(Generative models have achieved significant progress in advancing 2D image editing, demonstrating exceptional precision and realism. However, they often struggle with consistency and object identity preservation due to their inherent pixel-manipulation nature)。解决方案的关键在于引入一种“2D-3D-2D”框架,通过将2D物体提升到3D表示,使其在物理上合理的刚性约束的3D环境中进行编辑,随后将编辑后的3D物体重新投影并无缝修复回原始2D图像,从而实现更一致且保留对象身份的编辑效果。
链接: https://arxiv.org/abs/2507.05819
作者: Yuhuan Xie,Aoxuan Pan,Ming-Xian Lin,Wei Huang,Yi-Hua Huang,Xiaojuan Qi
机构: The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Generative models have achieved significant progress in advancing 2D image editing, demonstrating exceptional precision and realism. However, they often struggle with consistency and object identity preservation due to their inherent pixel-manipulation nature. To address this limitation, we introduce a novel “2D-3D-2D” framework. Our approach begins by lifting 2D objects into 3D representation, enabling edits within a physically plausible, rigidity-constrained 3D environment. The edited 3D objects are then reprojected and seamlessly inpainted back into the original 2D image. In contrast to existing 2D editing methods, such as DragGAN and DragDiffusion, our method directly manipulates objects in a 3D environment. Extensive experiments highlight that our framework surpasses previous methods in general performance, delivering highly consistent edits while robustly preserving object identity.
zh
[CV-40] Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework
【速读】:该论文旨在解决桥梁等关键交通基础设施因老化和劣化带来的检测挑战,以及传统人工检测方法效率低下的问题。现有基于3D点云技术的方案受限于真实数据的不完整性,如标签缺失和扫描遮挡。论文提出了一种系统性框架,用于生成高质量的3D桥梁数据,其关键在于能够自动生成包含部件级实例标注、高保真颜色和精确法向量的完整点云,并可扩展模拟多样且物理真实的不完整点云,以支持分割与补全网络的训练。
链接: https://arxiv.org/abs/2507.05814
作者: Wang Wang,Mingyu Shi,Jun Jiang,Wenqian Ma,Chong Liu,Yasutaka Narazaki,Xuguang Wang
机构: Zhejiang University/University of Illinois Urbana-Champaign Institute (Zhejiang University/University of Illinois Urbana-Champaign Institute); College of Civil Engineering and Architecture, Zhejiang University (College of Civil Engineering and Architecture, Zhejiang University); State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures
Abstract:As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure. Comments: 18 pages, 10 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05814 [cs.CV] (or arXiv:2507.05814v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.05814 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-41] owards Solar Altitude Guided Scene Illumination
【速读】:该论文旨在解决自动驾驶系统中高质量传感器数据生成的问题,特别是在白天光照条件变化下的合成相机数据生成。现有方法受限于标注成本、驾驶员安全协议以及场景覆盖的多样性,难以有效生成具有真实光照特性的数据。论文的关键解决方案是引入太阳高度角作为全局条件变量,该变量可从经纬度坐标和当地时间计算得出,从而避免了对大量人工标注数据的依赖。此外,论文还提出了一种定制化的归一化方法,以应对光照对高度角微小数值变化的敏感性,从而更准确地捕捉光照特征和与光照相关的图像噪声。
链接: https://arxiv.org/abs/2507.05812
作者: Samed Doğan,Maximilian Hoh,Nico Leuze,Nicolas R.-Peña,Alfred Schöttl
机构: University of Applied Sciences Munich (慕尼黑应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
Abstract:The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-word data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
zh
[CV-42] Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs
【速读】:该论文试图解决深度学习模型在面对数据集偏差时,其泛化能力(或缺乏泛化能力)的机制性理解问题。传统基于概念的可解释性方法主要关注对神经网络预测的局部解释,而本文提出了一种新颖的框架和交互工具,将这些方法扩展到机制性可解释性的领域。解决方案的关键在于通过分析高层次语义属性(称为概念)如何在模型内部组件中生成、交互和传播,实现对模型行为的全局分解。该框架系统地量化了语义概念在不同层中的表示,揭示了潜在的电路和信息流,从而增强了对模型决策机制的理解。
链接: https://arxiv.org/abs/2507.05810
作者: Sofiia Chorna,Kateryna Tarelkina,Eloïse Berthier,Gianni Franchi
机构: U2IS, ENSTA, Institut Polytechnique de Paris(乌2is,恩斯塔,巴黎综合理工学院); École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
Abstract:While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic interpretability. Our approach enables a global dissection of model behavior by analyzing how high-level semantic attributes (referred to as concepts) emerge, interact, and propagate through internal model components. Unlike prior work that isolates individual neurons or predictions, our framework systematically quantifies how semantic concepts are represented across layers, revealing latent circuits and information flow that underlie model decision-making. A key innovation is our visualization platform that we named BAGEL (for Bias Analysis with a Graph for global Explanation Layers), which presents these insights in a structured knowledge graph, allowing users to explore concept-class relationships, identify spurious correlations, and enhance model trustworthiness. Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases. The demonstration is available at this https URL.
zh
[CV-43] DREAM: Document Reconstruction via End-to-end Autoregressive Model
【速读】:该论文旨在解决文档重建中的误差传播问题以及现有端到端生成模型在保留元素布局信息方面的不足。其解决方案的关键在于提出一种专为文档重建设计的自回归模型——DREAM,该模型通过端到端的方式将文本图像转换为全面的文档重建序列,从而更有效地捕捉和整合文档元素信息。
链接: https://arxiv.org/abs/2507.05805
作者: Xin Li,Mingming Gong,Yunfei Wu,Jianxin Dai,Antai Guo,Xinghua Jiang,Haoyu Cao,Yinsong Liu,Deqiang Jiang,Xing Sun
机构: Tencent YouTu Lab(腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. A multitude of these researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these aforementioned limitations, we in this paper present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM transmutes the text image into a sequence of document reconstruction in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.
zh
[CV-44] SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning ICCV2025
【速读】:该论文旨在解决开放词汇场景图生成(open-vocabulary PSG)中由于预训练视觉-语言模型(VLM)在空间关系推理方面的固有局限性导致的关系预测性能不佳的问题,尤其是难以区分物体相对位置的问题。其解决方案的关键在于提出一种名为SPADE(SPatial-Aware Denoising-nEtwork)的框架,该框架通过两个关键步骤实现:首先,利用去噪扩散模型的逆过程进行校准,将通用预训练教师扩散模型适配为特定于PSG的去噪网络;其次,引入空间感知的上下文推理机制,以捕捉局部和长距离的上下文信息,从而生成高质量的关系查询。
链接: https://arxiv.org/abs/2507.05798
作者: Xin Hu,Ke Qin,Guiduo Duan,Ming Li,Yuan-Fang Li,Tao He
机构: UESTC(电子科技大学); Sichuan Province(四川省); Monash University(莫纳什大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (广东省人工智能与数字经济实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
Abstract:Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model’s inversion process in preserving the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork) framework – a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.
zh
[CV-45] alkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model
【速读】:该论文试图解决如何仅通过文本指令实现多功能的虚拟试穿问题,包括整体着装更换和局部编辑。以往的方法主要依赖端到端网络执行单一试穿任务,缺乏灵活性和通用性。其解决方案的关键在于提出TalkFashion,一个利用大语言模型强大理解能力的智能试穿助手,能够分析用户指令并决定执行的任务,从而激活相应的处理流程;同时引入基于指令的局部重绘模型,无需用户手动提供掩码,借助多模态模型实现完全自动化的局部编辑,提升了编辑任务的灵活性。
链接: https://arxiv.org/abs/2507.05790
作者: Yujie Hu,Xuanyu Zhang,Weiqi Li,Jian Zhang
机构: Peking University (北京大学); Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology (广东省超高清沉浸式媒体技术重点实验室); Shenzhen Graduate School, Peking University (深圳研究生院,北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures
Abstract:Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit change and local editing. Previous methods primarily relied on end-to-end networks to perform single try-on tasks, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, thereby activating different processing pipelines accordingly. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multi-modal models, this approach achieves fully automated local editings, enhancing the flexibility of editing tasks. The experimental results demonstrate better semantic consistency and visual quality compared to the current methods.
zh
[CV-46] DreamArt: Generating Interactable Articulated Objects from a Single Image
【速读】:该论文试图解决从单视角图像生成高保真、可交互的关节物体(articulated objects)的问题,现有方法在部分分解和关节建模方面存在不足,且依赖密集多视角或交互数据,限制了可扩展性。其解决方案的关键在于提出DreamArt框架,采用三阶段流程:首先通过图像到3D生成、掩码提示的3D分割和部分非可见补全重建带部分分割的完整3D网格;其次微调视频扩散模型以捕捉部件级关节先验,利用可移动部件掩码和非可见图像减少遮挡带来的歧义;最后优化由双四元数表示的关节运动,并进行全局纹理精修与重绘,以确保所有部件的连贯高质量纹理。
链接: https://arxiv.org/abs/2507.05763
作者: Ruijie Lu,Yu Liu,Jiaxiang Tang,Junfeng Ni,Yuxiang Wang,Diwen Wan,Gang Zeng,Yixin Chen,Siyuan Huang
机构: State Key Lab of General AI, Peking University(国家通用人工智能重点实验室,北京大学); Tsinghua University(清华大学); State Key Lab of General AI, BIGAI(国家通用人工智能重点实验室,BIGAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
Abstract:Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: firstly, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompt and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at this https URL.
zh
[CV-47] Normal Patch Retinex Robust Alghoritm for White Balancing in Digital Microscopy
【速读】:该论文试图解决光学显微镜中获取准确颜色和平衡图像的难题,特别是在病理形态学中常用的显微标本图像处理方面。解决方案的关键在于提出了一种完全自动化的白平衡调整机制,该机制能够有效地校正显微彩色图像,相较于数字摄影中常用的白平衡算法,该方法在使用苏木精-伊红-沙黄染色和免疫组化染色的显微图像上表现出更高的有效性。
链接: https://arxiv.org/abs/2507.05757
作者: Radoslaw Roszczyk,Artur Krupa,Izabella Antoniuk
机构: Warsaw University of Technology(华沙理工大学); Warsaw University of Life Sciences – SGGW(华沙生命科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.
zh
[CV-48] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
【速读】:该论文试图解决现有6D物体位姿估计方法在面对真实世界中光照、曝光、增益及深度传感器模式等环境变化时性能下降的问题,这些问题在现有基准数据集(如LM-O、YCB-V和T-Less)中未被充分研究。其解决方案的关键在于引入SenseShift6D数据集,该数据集通过物理上扫描多种RGB和深度传感器配置(包括13种RGB曝光、9种RGB增益、自动曝光、4种深度捕获模式和5种光照水平),提供了丰富的传感器-光照组合数据,从而支持测试阶段的传感器控制策略,以提升模型在实际环境中的鲁棒性。
链接: https://arxiv.org/abs/2507.05751
作者: Yegyu Han,Taegyoon Yoon,Dayeon Woo,Sojeong Kim,Hyung-Sin Kim
机构: Seoul National University (首尔大学); Gwangju Institute of Science and Technology (光州科学技术大学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances on 6D object-pose estimation has achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode - and the potential of test-time sensor control to mitigate such variations - largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control during test-time induces greater performance improvement over digital data augmentation, achieving performance comparable to or better than costly increases in real-world training data quantity and diversity. Adapting either RGB or depth sensors individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at: this http URL Associated scripts can be found at: this http URL
zh
[CV-49] Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study
【速读】:该论文试图解决高光谱异常检测(Hyperspectral Anomaly Detection, HAD)方法在计算复杂度、噪声敏感性和跨数据集泛化能力方面的挑战。其解决方案的关键在于对多种HAD技术进行系统性比较,包括统计模型、基于表示的方法、传统机器学习方法和深度学习模型,并通过17个基准数据集及多种性能指标(如ROC、AUC和可分性图)评估它们的检测精度、计算效率及优缺点,从而为未来研究提供指导。
链接: https://arxiv.org/abs/2507.05730
作者: Aayushma Pant,Arbind Agrahari Baniya,Tsz-Kwan Lee,Sunil Aryal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Hyperspectral images are high-dimensional datasets consisting of hundreds of contiguous spectral bands, enabling detailed material and surface analysis. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability map to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future this http URL research shows that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This study aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.
zh
[CV-50] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
【速读】:该论文试图解决在恶劣光照条件下航天器位姿估计的挑战,尤其是传统RGB成像传感器易受眩光、过曝、 blooming 和镜头眩光等成像伪影影响的问题。解决方案的关键在于引入一种将RGB传感器与事件传感器(event sensors)进行融合的传感融合方法,通过分束棱镜实现光学和时间上的精确对齐,并采用基于RANSAC的技术融合两种模态的信息,以充分利用各自的优势。此外,该方法还结合了dropout不确定性估计来检测可能影响任一通道的极端条件。
链接: https://arxiv.org/abs/2507.05698
作者: Mohsi Jawaid,Marcus Märtens,Tat-Jun Chin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, is a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further supports the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset will be released publicly.
zh
[CV-51] LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
【速读】:该论文旨在解决视频扩散模型(Video Diffusion Models, VDMs)在同时精确控制相机轨迹和物体运动时面临的不稳定融合与非线性可扩展性问题。其解决方案的关键在于提出LiON-LoRA框架,该框架通过三个核心原则重新思考LoRA融合:线性可扩展性、正交性和范数一致性。具体而言,通过分析浅层VDM中LoRA特征的正交性实现低级可控性,通过范数一致性稳定复杂相机运动组合中的融合过程,并引入可控制的token结合改进的自注意力机制以实现对相机和物体运动幅度的线性调整。
链接: https://arxiv.org/abs/2507.05678
作者: Yisu Zhang,Chenjie Cao,Chaohui Yu,Jianke Zhu
机构: Zhejiang University (浙江大学); DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团); Hupan Lab (湖畔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to driven VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to the unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: this https URL
zh
[CV-52] Integrated Structural Prompt Learning for Vision-Language Models
【速读】:该论文旨在解决预训练视觉-语言模型(VLMs)在微调过程中忽略可学习提示与模态内及模态间标记之间结构关系的问题,以及如何平衡基础类与新类性能的挑战。其解决方案的关键在于提出一种集成结构提示(Integrated Structural Prompt, ISP),通过引入自结构和跨结构提示模块,建模可学习提示与冻结标记之间的结构关系,从而增强文本和图像分支间的信息交互,同时保持特征稳定性。此外,还设计了一个样本探测模块,动态调整损失系数以提升对新类别的泛化能力。
链接: https://arxiv.org/abs/2507.05677
作者: Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo
机构: Anhui University(安徽大学); The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education(教育部智能计算与信号处理重点实验室); Information Materials and Intelligent Sensing Laboratory of Anhui Province(安徽省信息材料与智能感知实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) like CLIP for various downstream tasks. These methods adopt handcraft templates or learnable vectors to provide text or image instructions in fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing the performance of base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while preserving feature stability. Additionally, we propose a sample probing module that dynamically adjusts loss coefficients based on sample difficulty, preventing the mode from overfitting to simple samples and improving generalization ability to new classes. Extensive experiments on three widely used settings: base-to-new generalization, cross-dataset evaluation, and domain generalization demonstrate that the proposed ISP achieves competitive performance against state-of-the-art methods.
zh
[CV-53] MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
【速读】:该论文试图解决医疗视频生成领域中由于缺乏大规模、高质量数据集而导致的模型生成内容不真实或存在错误的问题。解决方案的关键在于引入了MedVideoCap-55K,这是首个针对医疗视频生成的大规模、多样化且带有丰富描述的数据库,为训练通用医疗视频生成模型提供了坚实的基础。基于该数据集,研究者开发了MedGen模型,在视觉质量和医疗准确性方面均表现出色,达到了开源模型中的领先水平,并与商业系统相媲美。
链接: https://arxiv.org/abs/2507.05675
作者: Rongsheng Wang,Junying Chen,Ke Ji,Zhenyang Cai,Shunian Chen,Yunjin Yang,Benyou Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data is available at this https URL
zh
[CV-54] R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding ACL2025
【速读】:该论文旨在解决在图形用户界面(GUI)自动化中,如何准确地对齐和定位界面元素这一关键问题。现有基于视觉的GUI代理模型在处理复杂截图时需要过滤大量无关信息,且通常采用基础的交叉熵损失函数,难以有效捕捉接地质量。论文提出的解决方案的关键在于引入R-VLM方法,该方法通过使用放大区域提案进行精确元素定位,并设计了一个基于交并比(IoU)的感知目标函数,以提升模型对高IoU预测的收敛能力,从而显著提升了跨多种GUI平台的接地准确性。
链接: https://arxiv.org/abs/2507.05673
作者: Joonhyung Park,Peng Tang,Sagnik Das,Srikar Appalaraju,Kunwar Yashraj Singh,R. Manmatha,Shabnam Ghadar
机构: AWS AI Labs(亚马逊人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2025; 17 pages
Abstract:Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
zh
[CV-55] Modeling and Reversing Brain Lesions Using Diffusion Models
【速读】:该论文试图解决现有脑病变分割方法未能区分受损组织与变形组织的问题,从而导致将两者统一标记为单一异常。其解决方案的关键在于引入一种基于扩散模型的框架,该框架首先对大脑中的异常区域进行分割,然后通过恢复移位组织至原位置来估计并逆转组织变形,从而隔离出代表初始损伤的核心病变区域,并通过修复核心病变区域以估算病前健康大脑的状态。该框架逆向了一个在生物力学研究中已建立的正向病变生长过程模型。
链接: https://arxiv.org/abs/2507.05670
作者: Omar Zamzam,Haleh Akrami,Anand Joshi,Richard Leahy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Brain lesions are abnormalities or injuries in brain tissue that are often detectable using magnetic resonance imaging (MRI), which reveals structural changes in the affected areas. This broad definition of brain lesions includes areas of the brain that are irreversibly damaged, as well as areas of brain tissue that are deformed as a result of lesion growth or swelling. Despite the importance of differentiating between damaged and deformed tissue, existing lesion segmentation methods overlook this distinction, labeling both of them as a single anomaly. In this work, we introduce a diffusion model-based framework for analyzing and reversing the brain lesion process. Our pipeline first segments abnormal regions in the brain, then estimates and reverses tissue deformations by restoring displaced tissue to its original position, isolating the core lesion area representing the initial damage. Finally, we inpaint the core lesion area to arrive at an estimation of the pre-lesion healthy brain. This proposed framework reverses a forward lesion growth process model that is well-established in biomechanical studies that model brain lesions. Our results demonstrate improved accuracy in lesion segmentation, characterization, and brain labeling compared to traditional methods, offering a robust tool for clinical and research applications in brain lesion analysis. Since pre-lesion healthy versions of abnormal brains are not available in any public dataset for validation of the reverse process, we simulate a forward model to synthesize multiple lesioned brain images.
zh
[CV-56] Dynamic Rank Adaptation for Vision-Language Models
【速读】:该论文旨在解决预训练大型视觉语言模型(VLMs)在微调过程中难以保持对未见新类别的强泛化能力的问题。现有基于提示和适配器的方法在处理所有图像和文本编码器的token时缺乏区分度,导致过度拟合于低信息特征,从而削弱了对新概念识别至关重要的通用表示。其解决方案的关键在于提出动态秩适配(Dynamic Rank Adaptation, DRA),通过根据特征重要性动态分配适配秩来保留通用知识,包括基于序列注意力进行token重要性分组、按token组重要性动态调整秩以及设计新的通道响应机制以优先保留和适应最具信息量的特征通道。
链接: https://arxiv.org/abs/2507.05668
作者: Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo
机构: Anhui University(安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenges of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant method, designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. Then, we adopt rank adaptation according to the importance of each token group dynamically by assigning higher feature ranks to the more important tokens. Also, we design a new channel response mechanism to prioritize the preservation and adaptation of feature channels identified as the most informative for each instance. In addition, a L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially on enhancing the performance of new classes on various benchmarks, including base-new classes, cross-datasets evaluation and domain generalization. The source code will be published after the paper is received.
zh
[CV-57] Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain
【速读】:该论文试图解决传统实值扩散模型在处理极化合成孔径雷达(PolSAR)数据时难以捕捉复值相位信息以及难以保持精细结构细节的问题。其解决方案的关键在于利用轮廓波变换(Contourlet transform)的多尺度和多方向表示能力,提出一种结构知识引导的复扩散模型。该模型通过在轮廓波域中分解数据为低频和高频子带,提取统计和边界特征,并设计知识引导的复扩散网络来建模低频成分的统计特性,同时利用高频系数中的结构信息指导扩散过程,以提升边缘保持能力,进而提高分类精度。
链接: https://arxiv.org/abs/2507.05666
作者: Junfei Shi,Yu Cheng,Haiyan Jin,Junhuai Li,Zhaolin Xiao,Maoguo Gong,Weisi Lin
机构: Xi’an University of Technology, Shaanxi Key Laboratory for Network Computing and Security Technology(西安理工大学,陕西省网络计算与安全技术重点实验室); China University of Mining and Technology(中国矿业大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase this http URL, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.
zh
[CV-58] 3DGS_LSR:Large_Scale Relocation for Autonomous Driving Based on 3D Gaussian Splatting
【速读】:该论文试图解决自主机器人系统在复杂城市环境中因GNSS信号遮挡和多路径效应导致的定位不可靠问题,以及传统地图构建方法在存储和计算效率上的局限性。其解决方案的关键在于提出3DGS-LSR框架,该框架利用3D Gaussian Splatting(3DGS)技术,通过单目RGB图像实现厘米级定位,结合多传感器数据构建高精度3DGS地图,并采用基于SuperPoint和SuperGlue的特征提取与匹配方法,结合迭代优化策略进行逐步渲染以提升定位精度,从而实现实时自主导航。
链接: https://arxiv.org/abs/2507.05661
作者: Haitao Lu,Haijier Chen,Haoze Liu,Shoujian Zhang,Bo Xu,Ziao Liu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages,7 figures,4 tables
Abstract:In autonomous robotic systems, precise localization is a prerequisite for safe navigation. However, in complex urban environments, GNSS positioning often suffers from signal occlusion and multipath effects, leading to unreliable absolute positioning. Traditional mapping approaches are constrained by storage requirements and computational inefficiency, limiting their applicability to resource-constrained robotic platforms. To address these challenges, we propose 3DGS-LSR: a large-scale relocalization framework leveraging 3D Gaussian Splatting (3DGS), enabling centimeter-level positioning using only a single monocular RGB image on the client side. We combine multi-sensor data to construct high-accuracy 3DGS maps in large outdoor scenes, while the robot-side localization requires just a standard camera input. Using SuperPoint and SuperGlue for feature extraction and matching, our core innovation is an iterative optimization strategy that refines localization results through step-by-step rendering, making it suitable for real-time autonomous navigation. Experimental validation on the KITTI dataset demonstrates our 3DGS-LSR achieves average positioning accuracies of 0.026m, 0.029m, and 0.081m in town roads, boulevard roads, and traffic-dense highways respectively, significantly outperforming other representative methods while requiring only monocular RGB input. This approach provides autonomous robots with reliable localization capabilities even in challenging urban environments where GNSS fails.
zh
[CV-59] OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval
【速读】:该论文试图解决生成式图像检索(Composed Image Retrieval, CIR)中的两个关键问题:一是视觉数据中主导部分与噪声部分之间的异质性被忽略,导致查询特征退化;二是图像修改过程中文本数据的优先级被忽视,导致视觉焦点偏差。解决方案的关键在于提出一种基于焦点映射的特征提取器,包含主导部分分割和双焦点映射模块,用于识别图像中的显著主导区域并引导视觉与文本特征的提取,从而减少噪声干扰;随后引入一种文本引导的焦点修订模块,利用文本中的修改需求对参考图像进行自适应焦点修订,以增强组合特征的修改焦点感知能力。
链接: https://arxiv.org/abs/2507.05631
作者: Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Xuemeng Song,Liqiang Nie
机构: Shandong University(山东大学); City University of Hong Kong(香港城市大学); Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users’ intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (\mboxOFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on this https URL
zh
[CV-60] DreamGrasp: Zero-Shot 3D Multi-Object Reconstruction from Partial-View Images for Robotic Manipulation
【速读】:该论文试图解决在杂乱、遮挡的现实场景中,从少量稀疏的RGB图像中重建3D几何结构并识别物体实例的问题(Partial-view 3D recognition)。现有方法由于依赖强对称性先验或在精心标注的数据集上进行监督学习,难以适应此类复杂环境。论文提出的解决方案——DreamGrasp,其关键在于利用大规模预训练图像生成模型的想象能力,结合粗粒度3D重建、对比学习驱动的实例分割以及文本引导的实例级优化,从而克服了传统方法的局限性,实现了在多物体复杂环境中的鲁棒3D重建。
链接: https://arxiv.org/abs/2507.05627
作者: Young Hun Kim,Seungyeon Kim,Yonghyeon Lee,Frank Chongwoo Park
机构: Seoul National University (首尔国立大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Partial-view 3D recognition – reconstructing 3D geometry and identifying object instances from a few sparse RGB images – is an exceptionally challenging yet practically essential task, particularly in cluttered, occluded real-world settings where full-view or reliable depth data are often unavailable. Existing methods, whether based on strong symmetry priors or supervised learning on curated datasets, fail to generalize to such scenarios. In this work, we introduce DreamGrasp, a framework that leverages the imagination capability of large-scale pre-trained image generative models to infer the unobserved parts of a scene. By combining coarse 3D reconstruction, instance segmentation via contrastive learning, and text-guided instance-wise refinement, DreamGrasp circumvents limitations of prior methods and enables robust 3D reconstruction in complex, multi-object environments. Our experiments show that DreamGrasp not only recovers accurate object geometry but also supports downstream tasks like sequential decluttering and target retrieval with high success rates.
zh
[CV-61] AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework
【速读】:该论文旨在解决领域特定图像生成中存在两个关键问题:现有方法将提示工程与模型适配分开处理,忽视了语义理解与视觉表征在专业领域中的内在依赖关系;同时,在内容合成过程中未能充分融入领域特定的语义约束,导致生成结果出现幻觉和语义偏差。其解决方案的关键在于提出AdaptaGen,一个分层语义优化框架,通过基于矩阵的提示优化与多视角理解相结合,从全局和局部角度捕捉全面的语义关系,并设计跨模态适应机制以减少幻觉,结合智能内容合成保持核心主题元素的同时引入多样细节。此外,在生成阶段引入两阶段的标题语义转换,以维持语义连贯性并增强视觉多样性,从而确保生成图像符合领域特定约束。
链接: https://arxiv.org/abs/2507.05621
作者: Suoxiang Zhang,Xiaxi Li,Hongrui Chang,Zhuoyan Hou,Guoxin Wu,Ronghua Ji
机构: China Agricultural University (中国农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Domain-specific image generation aims to produce high-quality visual content for specialized fields while ensuring semantic accuracy and detail fidelity. However, existing methods exhibit two critical limitations: First, current approaches address prompt engineering and model adaptation separately, overlooking the inherent dependence between semantic understanding and visual representation in specialized domains. Second, these techniques inadequately incorporate domain-specific semantic constraints during content synthesis, resulting in generation outcomes that exhibit hallucinations and semantic deviations. To tackle these issues, we propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding, capturing comprehensive semantic relationships from both global and local perspectives. To mitigate hallucinations in specialized domains, we design a cross-modal adaptation mechanism, which, when combined with intelligent content synthesis, enables preserving core thematic elements while incorporating diverse details across images. Additionally, we introduce a two-phase caption semantic transformation during the generation phase. This approach maintains semantic coherence while enhancing visual diversity, ensuring the generated images adhere to domain-specific constraints. Experimental results confirm our approach’s effectiveness, with our framework achieving superior performance across 40 categories from diverse datasets using only 16 images per category, demonstrating significant improvements in image quality, diversity, and semantic consistency.
zh
[CV-62] Generative Head-Mounted Camera Captures for Photorealistic Avatars
【速读】:该论文试图解决在虚拟和增强现实(VR/AR)中生成逼真角色动画的挑战,特别是由于难以获取面部的地面真实状态(ground truth)。传统方法依赖于分析-合成(analysis-by-synthesis)策略,但存在表达与风格解耦不完全的问题,并且需要大量配对的头戴式相机(HMC)和穹顶摄像头数据,导致数据收集成本高且不可重用。该论文提出的解决方案是Generative HMC (GenHMC),其关键在于利用大量未配对的HMC数据,直接从穹顶摄像头捕捉的条件化角色状态生成高质量的合成HMC图像,从而实现表达与外观的正确解耦,并具备对未见身份的泛化能力。
链接: https://arxiv.org/abs/2507.05620
作者: Shaojie Bai,Seunghyeon Seo,Yida Wang,Chenghui Li,Owen Wang,Te-Li Wang,Tianyang Ma,Jason Saragih,Shih-En Wei,Nojun Kwak,Hyung Jun Kim
机构: Meta Reality Labs (Meta Reality Labs); Seoul National University (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 16 figures
Abstract:Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining ground truth state of faces. It is physically impossible to obtain synchronized images from head-mounted cameras (HMC) sensing input, which has partial observations in infrared (IR), and an array of outside-in dome cameras, which have full observations that match avatars’ appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance of extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method is able to properly disentangle the input conditioning signal that specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method can generalize to unseen identities, removing the reliance on the paired captures. We demonstrate these breakthroughs by both evaluating synthetic HMC images and universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.
zh
[CV-63] Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration
【速读】:该论文旨在解决扩散模型在图像恢复任务中面临的保真度不一致和不良伪影的问题。其解决方案的关键在于引入一种名为“Kernel Density Steering (KDS)”的新型推理阶段框架,该框架通过显式局部模态搜索促进鲁棒且高保真的输出。KDS利用扩散样本的N-粒子集合,从其集体输出中计算逐块核密度估计梯度,并将每个粒子中的块引导至集合中识别出的共享高密度区域,从而实现“集体智慧”,使样本远离因独立采样或模型缺陷而产生的虚假模态,转向更稳健、高保真的结构。
链接: https://arxiv.org/abs/2507.05604
作者: Yuyang Hu,Kangfu Mei,Mojtaba Sahraee-Ardakan,Ulugbek S. Kamilov,Peyman Milanfar,Mauricio Delbracio
机构: Google(谷歌); Washington University in St. Louis(圣路易斯华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an N -particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as “collective wisdom”, steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
zh
[CV-64] Rethinking Layered Graphic Design Generation with a Top-Down Approach ICCV2025
【速读】:该论文试图解决将生成式 AI (Generative AI) 生成的非分层图像转换为可编辑的分层设计的问题,同时通过用户提示对生成的无意义文本进行语义优化。解决方案的关键在于提出 Accordion 框架,该框架基于视觉语言模型 (VLM) 在三个精心设计阶段中执行不同任务,采用自上而下的方法,利用视觉和谐参考图像作为全局指导来分解每一层,并结合多个视觉专家模型(如 SAM 和元素移除模型)以促进图形图层的创建。
链接: https://arxiv.org/abs/2507.05601
作者: Jingye Chen,Zhaowen Wang,Nanxuan Zhao,Li Zhang,Difan Liu,Jimei Yang,Qifeng Chen
机构: HKUST(香港科技大学); Adobe Research(Adobe 研究院); Runway(Runway)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.
zh
[CV-65] PaddleOCR 3.0 Technical Report
【速读】:该论文旨在解决大规模语言模型时代下文档理解的需求,特别是针对文本识别、文档结构解析和关键信息提取的问题。其解决方案的关键在于提出三个主要技术模块:PP-OCRv5用于多语言文本识别,PP-StructureV3用于分层文档解析,PP-ChatOCRv4用于关键信息抽取。这些模型在参数量少于1亿的情况下,实现了与十亿参数视觉-语言模型相当的准确性和效率。
链接: https://arxiv.org/abs/2507.05595
作者: Cheng Cui,Ting Sun,Manhui Lin,Tingquan Gao,Yubo Zhang,Jiaxuan Liu,Xueqing Wang,Zelun Zhang,Changda Zhou,Hongen Liu,Yue Zhang,Wenyu Lv,Kui Huang,Yichao Zhang,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma
机构: PaddlePaddle Team, Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
zh
[CV-66] GSVR: 2D Gaussian-based Video Representation for 800 FPS with Hybrid Deformation Field
【速读】:该论文旨在解决现有基于卷积的视频表示方法在解码速度和训练时间上的不足,以及在处理视频中相机运动与物体运动耦合问题时的局限性。其解决方案的关键在于提出一种基于2D高斯的视频表示方法GSVR,通过引入混合形变场来建模视频动态,结合三平面运动和多项式运动两种运动模式以处理运动耦合问题,并采用动态感知的时间切片策略和量化感知微调技术,从而实现高效的解码速度和压缩性能。
链接: https://arxiv.org/abs/2507.05594
作者: Zhizhuo Pang,Zhihui Ke,Xiaobo Zhou,Tie Qiu
机构: Tianjin University(天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay more attention to improving video reconstruction quality but little attention to the decoding speed. However, the high computation of convolutional network used in existing methods leads to low decoding speed. Moreover, these convolution-based video representation methods also suffer from long training time, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve the above problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny, only needing a training time of 2 seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely the tri-plane motion and the polynomial motion, to deal with the coupling of camera motion and object motion in the video. Furthermore, we propose a Dynamic-aware Time Slicing strategy to adaptively divide the video into multiple groups of pictures(GOP) based on the dynamic level of the video in order to handle large camera motion and non-rigid movements. Finally, we propose quantization-aware fine-tuning to avoid performance reduction after quantization and utilize image codecs to compress Gaussians to achieve a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and also has 10x faster decoding speed compared to other methods. Our method has comparable performance in the video interpolation task to SOTA and attains better video compression performance than NeRV.
zh
[CV-67] Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering
【速读】:该论文旨在解决工业质量检测中缺陷检测的效率低、成本高及鲁棒性不足的问题。其解决方案的关键在于提出一种基于条件扩散的半监督缺陷检测框架(DSYM),该框架采用两阶段协同训练机制和分阶段联合优化策略,通过利用标注数据进行初始训练,并借助伪标签引入未标注数据,结合条件扩散模型生成多尺度伪缺陷样本以及基于CLIP跨模态特征的噪声过滤机制,有效提升了数据效率与检测性能。
链接: https://arxiv.org/abs/2507.05588
作者: Shuai Li,Shihan Chen,Wanru Geng,Zhaohua Xu,Xiaolu Liu,Can Dong,Zhen Tian,Changlin Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the realm of industrial quality inspection, defect detection stands as a critical component, particularly in high-precision, safety-critical sectors such as automotive components aerospace, and medical devices. Traditional methods, reliant on manual inspection or early image processing algorithms, suffer from inefficiencies, high costs, and limited robustness. This paper introduces a semi-supervised defect detection framework based on conditional diffusion (DSYM), leveraging a two-stage collaborative training mechanism and a staged joint optimization strategy. The framework utilizes labeled data for initial training and subsequently incorporates unlabeled data through the generation of pseudo-labels. A conditional diffusion model synthesizes multi-scale pseudo-defect samples, while a CLIP cross-modal feature-based noise filtering mechanism mitigates label contamination. Experimental results on the NEU-DET dataset demonstrate a 78.4% mAP@0.5 with the same amount of labeled data as traditional supervised methods, and 75.1% mAP@0.5 with only 40% of the labeled data required by the original supervised model, showcasing significant advantages in data efficiency. This research provides a high-precision, low-labeling-dependent solution for defect detection in industrial quality inspection scenarios. The work of this article has been open-sourced at this https URL.
zh
[CV-68] Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions
【速读】:该论文旨在解决多模态人脸活体检测(Multi-modal Face Anti-Spoofing, FAS)中因不同模态数据分布差异大以及在推理阶段可能出现模态缺失所带来的挑战。其解决方案的关键在于提出一种跨模态过渡引导网络(Cross-modal Transition-guided Network, CTNet),通过学习活体样本间一致的跨模态特征过渡构建泛化特征空间,并利用活体与欺骗样本之间不一致的跨模态特征过渡来有效检测分布外攻击。此外,为应对模态缺失问题,CTNet还通过从RGB模态中学习互补的红外(IR)和深度特征作为辅助模态,从而提升系统的鲁棒性。
链接: https://arxiv.org/abs/2507.05575
作者: Jun-Xiong Chong,Fang-Yu Hsu,Ming-Tsung Hsu,Yi-Ting Lin,Kai-Heng Chien,Chiou-Ting Hsu,Pei-Kai Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle the challenges in the multi-modal FAS task. Our motivation stems from that, within a single modality, the visual differences between live faces are typically much smaller than those of spoof faces. Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Upon this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
zh
[CV-69] ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models
【速读】:该论文试图解决基于大语言模型(Large Language Models, LLMs)的布局生成方法在空间关系理解上的不足,这些方法未能充分解析视觉主题与设计元素之间的空间关系,导致生成的布局在结构和多样性上存在问题。解决方案的关键在于引入ReLayout方法,该方法通过关系思维链(relation-CoT)从设计概念出发,生成更合理且美学一致的布局。其核心在于增强布局标注,引入显式的关系定义(如区域、显著性及元素间的边距),并将布局分解为更结构化和递归的小布局,同时提出一种布局原型再平衡采样器,从三个维度定义布局原型特征并量化不同的布局风格,从而解决因原型分布平衡过程中数据偏差导致的生成均匀性问题。
链接: https://arxiv.org/abs/2507.05568
作者: Jiaxu Tian,Xuehui Yu,Yaoxing Wang,Pan Wang,Guangqian Guo,Shan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structural and diverse problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by fundamentally originating from design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.
zh
[CV-70] Simulating Refractive Distortions and Weather-Induced Artifacts for Resource-Constrained Autonomous Perception ICCV2025
【速读】:该论文试图解决发展中国家,特别是非洲地区在自动驾驶领域中因缺乏高质量车载数据集而导致的感知性能不足问题。其关键解决方案是提出一种程序化增强管道,通过向低成本单目行车记录仪视频中引入针对非洲复杂驾驶场景的现实折射失真和天气诱发的伪影,如低质量镜头光学效应、空气湍流、雾霾和镜头眩光等,从而提升数据集的多样性和实用性。
链接: https://arxiv.org/abs/2507.05536
作者: Moseli Mots’oehli,Feimei Chen,Hok Wai Chan,Itumeleng Tlali,Thulani Babeli,Kyungim Baek,Huaijin Chen
机构: University of Hawai‘i at Mānoa (夏威夷大学马诺阿分校); MindForge AI (MindForge AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: This paper has been submitted to the ICCV 2025 Workshop on Computer Vision for Developing Countries (CV4DC) for review
Abstract:The scarcity of autonomous vehicle datasets from developing regions, particularly across Africa’s diverse urban, rural, and unpaved roads, remains a key obstacle to robust perception in low-resource settings. We present a procedural augmentation pipeline that enhances low-cost monocular dashcam footage with realistic refractive distortions and weather-induced artifacts tailored to challenging African driving scenarios. Our refractive module simulates optical effects from low-quality lenses and air turbulence, including lens distortion, Perlin noise, Thin-Plate Spline (TPS), and divergence-free (incompressible) warps. The weather module adds homogeneous fog, heterogeneous fog, and lens flare. To establish a benchmark, we provide baseline performance using three image restoration models. To support perception research in underrepresented African contexts, without costly data collection, labeling, or simulation, we release our distortion toolkit, augmented dataset splits, and benchmark results.
zh
[CV-71] Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model
【速读】:该论文旨在解决跨模态检索系统中对高精度文本-图像匹配的需求,提出了一种统一的文本-图像检索模型llama-nemoretriever-colembed,以实现多基准测试下的最先进性能。其解决方案的关键在于基于NVIDIA Eagle2视觉语言模型(VLM)进行架构改进,将因果注意力机制替换为双向注意力,并引入类似ColBERT的晚期交互机制,从而在共享嵌入空间中实现细粒度的多模态检索。这一机制虽然提升了检索准确性,但也带来了存储和效率上的权衡。
链接: https://arxiv.org/abs/2507.05513
作者: Mengyao Xu,Gabriel Moreira,Ronay Ak,Radek Osmulski,Yauhen Babakhin,Zhiding Yu,Benedikt Schifferer,Even Oldridge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model’s retrieval capabilities. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05513 [cs.CV] (or arXiv:2507.05513v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.05513 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-72] LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving
【速读】:该论文试图解决从单张图像生成一致的多视角图像的问题,特别是由于缺乏空间一致性导致的表面重建中3D网格质量下降的问题。其解决方案的关键在于提出LoomNet,这是一种新型的多视角扩散架构,通过并行多次应用相同的扩散模型,协作构建和利用共享潜在空间以实现视角一致性。每个视角特定的推理生成一个编码,表示从给定相机姿态出发的新视角假设,并将其投影到三个正交平面上,随后对这些平面进行信息传播和缺失区域插值,最终结合所有假设形成统一且连贯的解释。
链接: https://arxiv.org/abs/2507.05499
作者: Giulio Federico,Fabio Carrara,Claudio Gennaro,Giuseppe Amato,Marco Di Benedetto
机构: University of Pisa(比萨大学); CNR-ISTI(意大利国家研究理事会-比萨研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating consistent multi-view images from a single image remains challenging. Lack of spatial consistency often degrades 3D mesh quality in surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality and coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, also showing creativity by producing diverse, plausible novel views from the same input.
zh
[CV-73] Cloud Diffusion Part 1: Theory and Motivation
【速读】:该论文试图解决传统扩散模型在图像生成中使用白噪声(white noise)作为噪声分布所带来的局限性,例如推理速度较慢、高频细节不足以及控制能力有限等问题。解决方案的关键在于引入具有尺度不变性(scale invariance)的噪声分布替代白噪声,构建所谓的“Cloud Diffusion Model”,该模型通过强调大尺度相关性并弱化小尺度相关性,从而提升生成效果和可控性。
链接: https://arxiv.org/abs/2507.05496
作者: Andrew Randono
机构: The Spin Group Research Institute( The Spin Group Research Institute)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 21 figures. Associated code: this https URL
Abstract:Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise – that is, noise based on independent normal distributions at each point whose mean and variance is independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a ``Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
zh
[CV-74] Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers From Driving Video
【速读】:该论文试图解决老年人认知功能下降(如阿尔茨海默病和轻度认知障碍)在临床诊断中常因方法耗时且成本高而被低估的问题。其解决方案的关键在于利用自然驾驶视频和大视觉模型,从真实世界驾驶行为中提取与功能退化和临床特征相关的“数字指纹”,从而实现对认知状态的识别、分类及疾病进展的预测。该方法通过将车辆作为“诊断工具”,分析驾驶行为以早期发现功能损伤的预警信号,进而支持主动干预策略的制定。
链接: https://arxiv.org/abs/2507.05463
作者: Md Zahid Hasan,Guillermo Basulto-Elias,Jun Ha Chang,Sahuna Hallmark,Matthew Rizzo,Anuj Sharma,Soumik Sarkar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures
Abstract:We introduce scenario-based cognitive status identification in older drivers from Naturalistic driving videos and large vision models. In recent times, cognitive decline, including Alzheimer’s disease (AD) and mild cognitive impairment (MCI), is often underdiagnosed due to the time-consuming and costly nature of current diagnostic methods. By analyzing real-world driving behavior captured through in-vehicle systems, this research aims to extract “digital fingerprints” that correlate with functional decline and clinical features of MCI and AD. Moreover, modern large vision models can draw meaningful insights from everyday driving patterns of older patients to early detect cognitive decline. We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status and predict disease progression. We leverage the strong relationship between real-world driving behavior as an observation of the current cognitive status of the drivers where the vehicle can be utilized as a “diagnostic tool”. Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.
zh
[CV-75] NRXR-ID: Two-Factor Authentication (2FA) in VR Using Near-Range Extended Reality and Smartphones
【速读】:该论文试图解决在虚拟现实(Virtual Reality, VR)环境中实施双因素认证(Two-Factor Authentication, 2FA)的难题,因为用户通常佩戴头戴式显示器(Head-Mounted Display, HMD),无法看到现实世界环境,从而难以完成传统的2FA流程。解决方案的关键是提出NRXR-ID技术,该技术允许用户在不摘下HMD的情况下,通过智能手机完成认证挑战,从而实现安全的身份验证。
链接: https://arxiv.org/abs/2507.05447
作者: Aiur Nanzatov,Lourdes Peña-Castillo,Oscar Meruvia-Pastor
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Two-factor authentication (2FA) has become widely adopted as an efficient and secure way to validate someone’s identity online. Two-factor authentication is difficult in virtual reality (VR) because users are usually wearing a head-mounted display (HMD) which does not allow them to see their real-world surroundings. We present NRXR-ID, a technique to implement two-factor authentication while using extended reality systems and smartphones. The proposed method allows users to complete an authentication challenge using their smartphones without removing their HMD. We performed a user study where we explored four types of challenges for users, including a novel checkers-style challenge. Users responded to these challenges under three different configurations, including a technique that uses the smartphone to support gaze-based selection without the use of VR controllers. A 4X3 within-subjects design allowed us to study all the variations proposed. We collected performance metrics and performed user experience questionnaires to collect subjective impressions from 30 participants. Results suggest that the checkers-style visual matching challenge was the most appropriate option, followed by entering a digital PIN challenge submitted via the smartphone and answered within the VR environment.
zh
[CV-76] Robotic System with AI for Real Time Weed Detection Canopy Aware Spraying and Droplet Pattern Evaluation
【速读】:该论文旨在解决现代农业生产中除草剂过度且不均匀施用导致的投入成本增加、环境污染以及抗药性杂草出现的问题。其解决方案的关键在于开发一种基于视觉引导的AI驱动变量喷洒系统,该系统能够实时检测杂草存在、估计冠层大小并动态调整喷嘴激活,通过集成轻量级YOLO11n和YOLO11n-seg深度学习模型,在NVIDIA Jetson Orin Nano上进行本地推理,并利用Arduino Uno控制电磁阀喷嘴,实现精准喷洒。
链接: https://arxiv.org/abs/2507.05432
作者: Inayat Rasool,Pappu Kumar Yadav,Amee Parmar,Hasan Mirzakhaninafchi,Rikesh Budhathoki,Zain Ul Abideen Usmani,Supriya Paudel,Ivan Perez Olivera,Eric Jone
机构: South Dakota State University (南达科他州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Uniform and excessive herbicide application in modern agriculture contributes to increased input costs, environmental pollution, and the emergence of herbicide resistant weeds. To address these challenges, we developed a vision guided, AI-driven variable rate sprayer system capable of detecting weed presence, estimating canopy size, and dynamically adjusting nozzle activation in real time. The system integrates lightweight YOLO11n and YOLO11n-seg deep learning models, deployed on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid actuated nozzles based on canopy segmentation results. Indoor trials were conducted using 15 potted Hibiscus rosa sinensis plants of varying canopy sizes to simulate a range of weed patch scenarios. The YOLO11n model achieved a mean average precision (mAP@50) of 0.98, with a precision of 0.99 and a recall close to 1.0. The YOLO11n-seg segmentation model achieved a mAP@50 of 0.48, precision of 0.55, and recall of 0.52. System performance was validated using water sensitive paper, which showed an average spray coverage of 24.22% in zones where canopy was present. An upward trend in mean spray coverage from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies, respectively, demonstrated the system’s capability to adjust spray output based on canopy size in real time. These results highlight the potential of combining real time deep learning with low-cost embedded hardware for selective herbicide application. Future work will focus on expanding the detection capabilities to include three common weed species in South Dakota: water hemp (Amaranthus tuberculatus), kochia (Bassia scoparia), and foxtail (Setaria spp.), followed by further validation in both indoor and field trials within soybean and corn production systems.
zh
[CV-77] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
【速读】:该论文旨在解决基于开放式语言提示进行目标分割的问题,即让模型能够将文本语义准确地映射到空间掩码,并处理多样且未见过的类别。其解决方案的关键在于提出OpenWorldSAM框架,通过集成轻量级视觉-语言模型(VLM)提取的多模态嵌入,扩展了SAM2的提示驱动能力以适应开放词汇场景,同时遵循统一提示、效率、实例感知和泛化四个核心原则,从而实现了在多种分割任务中的卓越性能。
链接: https://arxiv.org/abs/2507.05427
作者: Shiting Xiao,Rishabh Kabra,Yuhang Li,Donghyun Lee,Joao Carreira,Priyadarshini Panda
机构: Yale University (耶鲁大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model’s spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks, including ADE20k, PASCAL, ScanNet, and SUN-RGBD.
zh
[CV-78] Mastering Regional 3DGS: Locating Initializing and Editing with Diverse 2D Priors
【速读】:该论文试图解决3D场景局部编辑中由于3D语义分割性能不足导致的精准区域操作困难问题,从而限制了编辑的保真度。其解决方案的关键在于利用2D扩散编辑技术准确识别每个视角中的修改区域,随后通过逆渲染进行3D定位,并结合由2D基础模型预测的深度图生成一致视角和近似形状的粗略3D高斯溅射(3DGS)初始化,支持迭代且视角一致的编辑过程,逐步提升结构细节和纹理的一致性。
链接: https://arxiv.org/abs/2507.05426
作者: Lanqing Guo,Yufei Wang,Hezhen Hu,Yan Zheng,Yeying Jin,Siyu Huang,Zhangyang Wang
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); National University of Singapore (新加坡国立大学); Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many 3D scene editing tasks focus on modifying local regions rather than the entire scene, except for some global applications like style transfer, and in the context of 3D Gaussian Splatting (3DGS), where scenes are represented by a series of Gaussians, this structure allows for precise regional edits, offering enhanced control over specific areas of the scene; however, the challenge lies in the fact that 3D semantic parsing often underperforms compared to its 2D counterpart, making targeted manipulations within 3D spaces more difficult and limiting the fidelity of edits, which we address by leveraging 2D diffusion editing to accurately identify modification regions in each view, followed by inverse rendering for 3D localization, then refining the frontal view and initializing a coarse 3DGS with consistent views and approximate shapes derived from depth maps predicted by a 2D foundation model, thereby supporting an iterative, view-consistent editing process that gradually enhances structural details and textures to ensure coherence across perspectives. Experiments demonstrate that our method achieves state-of-the-art performance while delivering up to a 4\times speedup, providing a more efficient and effective approach to 3D scene local editing.
zh
[CV-79] Motion Generation: A Survey of Generative Approaches and Benchmarks
【速读】:该论文试图解决运动生成(motion generation)任务中,如何从多种条件输入中合成现实运动序列的问题,这一问题在计算机视觉、计算机图形学和机器人学中具有广泛应用。其解决方案的关键在于对基于生成方法的运动生成技术进行深入分类与综述,重点分析近年来顶级会议中发表的相关研究,涵盖架构设计、条件机制、生成设置以及评估指标和数据集,旨在为研究人员提供清晰的比较框架和未来研究方向的参考。
链接: https://arxiv.org/abs/2507.05419
作者: Aliasghar Khani,Arianna Rampini,Bruno Roy,Larasika Nadela,Noa Kaplan,Evan Atherton,Derek Cheung,Jacky Bibliowicz
机构: Autodesk Research(欧特克研究); Autodesk Research(欧特克研究); Autodesk Research(欧特克研究); Autodesk Research(欧特克研究); Autodesk Research(欧特克研究); Autodesk Research(欧特克研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2507.05419 [cs.CV] (or arXiv:2507.05419v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.05419 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-80] Neural-Driven Image Editing
【速读】:该论文旨在解决传统图像编辑依赖手动提示导致的劳动密集型和对运动控制或语言能力有限用户不友好问题。其解决方案的关键在于提出LoongX,一种基于多模态神经生理信号的手势自由图像编辑方法,通过整合跨尺度状态空间模块和动态门控融合模块,有效处理多模态信号的异质性,并利用扩散模型与微调机制将神经信号与编辑语义对齐。
链接: https://arxiv.org/abs/2507.05397
作者: Pengfei Zhou,Jie Xia,Xiaopeng Peng,Wangbo Zhao,Zilong Ye,Zekai Li,Suorong Yang,Jiadong Pan,Yuanxiang Chen,Ziqiao Wang,Kai Wang,Qian Zheng,Xiaojun Chang,Gang Pan,Shurong Dong,Kaipeng Zhang,Yang You
机构: NUS(新加坡国立大学); Zhejiang University(浙江大学); RIT(罗切斯特理工学院); NJU(南京大学); USTC(中国科学技术大学); MBZUAI(穆巴达拉人工智能研究院); Shanghai AI Lab(上海人工智能实验室); SII(新加坡资讯通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 14 figures
Abstract:Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
zh
[CV-81] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
【速读】:该论文试图解决在去中心化、异构数据环境下,如何高效适应视觉-语言模型(Vision-Language Models, VLMs)以实现个性化与泛化之间的平衡问题。现有方法在追求个性化时往往牺牲了模型的泛化能力,尤其在面对未见过的类别或领域时表现不佳。论文提出的解决方案是pFedMMA,其关键在于引入多模态适配器(multi-modal adapters),每个适配器包含特定模态的上采样和下采样层以及一个全局共享的投影层,用于对齐跨模态特征。通过非对称优化策略,客户端可以在本地适应个性化数据分布的同时,协作训练共享投影以提升全局泛化能力,且仅需交换共享组件以实现通信效率。
链接: https://arxiv.org/abs/2507.05394
作者: Sajjad Ghiasvand,Mahnoosh Alizadeh,Ramtin Pedarsani
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at this https URL.
zh
[CV-82] Enhancing Underwater Images Using Deep Learning with Subjective Image Quality Integration
【速读】:该论文试图解决 underwater image quality enhancement(水下图像质量增强)的问题,旨在通过深度学习方法提升低质量水下图像的视觉效果和客观评价指标。解决方案的关键在于将人类主观评估融入训练过程,首先利用专家标注的高质量与低质量水下图像数据集训练分类网络以区分图像质量,随后使用生成对抗网络(GANs)结合如颜色保真度和图像清晰度等增强标准对低质量图像进行优化,从而在量化指标和定性分析上均取得显著提升。
链接: https://arxiv.org/abs/2507.05393
作者: Jose M. Montero,Jose-Luis Lisani
机构: Universitat de les Illes Balears (伊比利亚群岛大学); IAC3 (IAC3)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Recent advances in deep learning, particularly neural networks, have significantly impacted a wide range of fields, including the automatic enhancement of underwater images. This paper presents a deep learning-based approach to improving underwater image quality by integrating human subjective assessments into the training process. To this end, we utilize publicly available datasets containing underwater images labeled by experts as either high or low quality. Our method involves first training a classifier network to distinguish between high- and low-quality images. Subsequently, generative adversarial networks (GANs) are trained using various enhancement criteria to refine the low-quality images. The performance of the GAN models is evaluated using quantitative metrics such as PSNR, SSIM, and UIQM, as well as through qualitative analysis. Results demonstrate that the proposed model – particularly when incorporating criteria such as color fidelity and image sharpness – achieves substantial improvements in both perceived and measured image quality.
zh
[CV-83] From General to Specialized: The Need for Foundational Models in Agriculture
【速读】:该论文试图解决农业领域中利用基础模型进行作物类型识别、作物物候估计和作物产量预测等任务的潜力尚未被充分探索的问题。解决方案的关键在于构建一个面向农业领域的理想基础模型(CropFM)的要求框架,并对现有通用基础模型在农业特定任务中的性能进行系统评估与比较,从而凸显出专门为农业设计基础模型的必要性。
链接: https://arxiv.org/abs/2507.05390
作者: Vishal Nedungadi,Xingguo Xiong,Aike Potze,Ron Van Bree,Tao Lin,Marc Rußwurm,Ioannis N. Athanasiadis
机构: Wageningen University and Research (瓦赫宁根大学与研究机构); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Food security remains a global concern as population grows and climate change intensifies, demanding innovative solutions for sustainable agricultural productivity. Recent advances in foundation models have demonstrated remarkable performance in remote sensing and climate sciences, and therefore offer new opportunities for agricultural monitoring. However, their application in challenges related to agriculture-such as crop type mapping, crop phenology estimation, and crop yield estimation-remains under-explored. In this work, we quantitatively evaluate existing foundational models to assess their effectivity for a representative set of agricultural tasks. From an agricultural domain perspective, we describe a requirements framework for an ideal agricultural foundation model (CropFM). We then survey and compare existing general-purpose foundational models in this framework and empirically evaluate two exemplary of them in three representative agriculture specific tasks. Finally, we highlight the need for a dedicated foundational model tailored specifically to agriculture.
zh
[CV-84] Foreground-aware Virtual Staining for Accurate 3D Cell Morphological Profiling ICML2025
【速读】:该论文试图解决现有虚拟染色方法在训练过程中对所有像素一视同仁导致背景噪声和伪影被复制而非关注生物上有意义信号的问题。其解决方案的关键在于引入Spotlight,该方法通过基于直方图的前景估计来掩码像素级损失,并对软阈值预测计算Dice损失,从而实现形状感知的学习,使模型更专注于相关细胞结构。
链接: https://arxiv.org/abs/2507.05383
作者: Alexandr A. Kalinin,Paula Llanos,Theresa Maria Sommer,Giovanni Sestini,Xinhai Hou,Jonathan Z. Sexton,Xiang Wan,Ivo D. Dinov,Brian D. Athey,Nicolas Rivron,Anne E. Carpenter,Beth Cimini,Shantanu Singh,Matthew J. O’Meara
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: ICML 2025 Generative AI and Biology (GenBio) Workshop
Abstract:Microscopy enables direct observation of cellular morphology in 3D, with transmitted-light methods offering low-cost, minimally invasive imaging and fluorescence microscopy providing specificity and contrast. Virtual staining combines these strengths by using machine learning to predict fluorescence images from label-free inputs. However, training of existing methods typically relies on loss functions that treat all pixels equally, thus reproducing background noise and artifacts instead of focusing on biologically meaningful signals. We introduce Spotlight, a simple yet powerful virtual staining approach that guides the model to focus on relevant cellular structures. Spotlight uses histogram-based foreground estimation to mask pixel-wise loss and to calculate a Dice loss on soft-thresholded predictions for shape-aware learning. Applied to a 3D benchmark dataset, Spotlight improves morphological representation while preserving pixel-level accuracy, resulting in virtual stains better suited for downstream tasks such as segmentation and profiling.
zh
[CV-85] YOLO-APD: Enhancing YOLOv8 for Robust Pedestrian Detection on Complex Road Geometries
【速读】:该论文旨在解决自动驾驶车辆在几何结构复杂的道路场景(如Type-S曲线路面)中行人检测的挑战,传统基于RGB相机的方法在此类场景下存在局限性。其关键解决方案是提出YOLO-APD,一种针对此问题优化的深度学习架构,通过集成参数无关的SimAM注意力机制、计算高效的C3Ghost模块、新型的SimSPPF多尺度特征池化模块、Mish激活函数以及智能聚合分发(IGD)模块,提升了特征提取与融合能力,同时结合车辆转向动力学实现自适应感兴趣区域处理,从而在保持实时性(100 FPS)的前提下实现了77.7% mAP@0.5:0.95的检测精度和超过96%的行人召回率。
链接: https://arxiv.org/abs/2507.05376
作者: Aquino Joctum,John Kandiri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the International Journal of Computer Trends and Technology (IJCTT), vol. 73, no. 6, 2024. The final version of record is available at: this https URL
Abstract:Autonomous vehicle perception systems require robust pedestrian detection, particularly on geometrically complex roadways like Type-S curved surfaces, where standard RGB camera-based methods face limitations. This paper introduces YOLO-APD, a novel deep learning architecture enhancing the YOLOv8 framework specifically for this challenge. YOLO-APD integrates several key architectural modifications: a parameter-free SimAM attention mechanism, computationally efficient C3Ghost modules, a novel SimSPPF module for enhanced multi-scale feature pooling, the Mish activation function for improved optimization, and an Intelligent Gather Distribute (IGD) module for superior feature fusion in the network’s neck. The concept of leveraging vehicle steering dynamics for adaptive region-of-interest processing is also presented. Comprehensive evaluations on a custom CARLA dataset simulating complex scenarios demonstrate that YOLO-APD achieves state-of-the-art detection accuracy, reaching 77.7% mAP@0.5:0.95 and exceptional pedestrian recall exceeding 96%, significantly outperforming baseline models, including YOLOv8. Furthermore, it maintains real-time processing capabilities at 100 FPS, showcasing a superior balance between accuracy and efficiency. Ablation studies validate the synergistic contribution of each integrated component. Evaluation on the KITTI dataset confirms the architecture’s potential while highlighting the need for domain adaptation. This research advances the development of highly accurate, efficient, and adaptable perception systems based on cost-effective sensors, contributing to enhanced safety and reliability for autonomous navigation in challenging, less-structured driving environments.
zh
[CV-86] Conditional Graph Neural Network for Predicting Soft Tissue Deformation and Forces
【速读】:该论文试图解决虚拟环境中软组织模拟的难题,特别是由于软组织高变形性所带来的挑战。现有方法依赖于分割、网格化和组织刚度属性的估计,同时需要精确的力反馈以实现更沉浸式的体验。论文提出了一种新的数据驱动模型——条件图神经网络(cGNN),其关键在于通过表面点和施加力的位置来预测点的变形及作用在它们上的力。该模型通过实验收集的软组织假体表面跟踪数据进行训练,并利用迁移学习克服数据稀缺问题,先在质量-弹簧仿真数据上预训练,再用实验数据微调,从而提升了模型的泛化能力并实现了对组织变形和相应交互力的准确预测。
链接: https://arxiv.org/abs/2507.05315
作者: Madina Kojanazarova,Florentin Bieder,Robin Sandkühler,Philippe C. Cattin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Soft tissue simulation in virtual environments is becoming increasingly important for medical applications. However, the high deformability of soft tissue poses significant challenges. Existing methods rely on segmentation, meshing and estimation of stiffness properties of tissues. In addition, the integration of haptic feedback requires precise force estimation to enable a more immersive experience. We introduce a novel data-driven model, a conditional graph neural network (cGNN) to tackle this complexity. Our model takes surface points and the location of applied forces, and is specifically designed to predict the deformation of the points and the forces exerted on them. We trained our model on experimentally collected surface tracking data of a soft tissue phantom and used transfer learning to overcome the data scarcity by initially training it with mass-spring simulations and fine-tuning it with the experimental data. This approach improves the generalisation capability of the model and enables accurate predictions of tissue deformations and corresponding interaction forces. The results demonstrate that the model can predict deformations with a distance error of 0.35 \pm 0.03 mm for deformations up to 30 mm and the force with an absolute error of 0.37 \pm 0.05 N for forces up to 7.5 N. Our data-driven approach presents a promising solution to the intricate challenge of simulating soft tissues within virtual environments. Beyond its applicability in medical simulations, this approach holds the potential to benefit various fields where realistic soft tissue simulations are required.
zh
[CV-87] Self-Attention Based Multi-Scale Graph Auto-Encoder Network of 3D Meshes
【速读】:该论文试图解决将卷积神经网络(CNN)扩展到非欧几里得数据如3D网格(3D meshes)的挑战,特别是在保持几何结构完整性的同时有效捕捉局部和全局特征的问题。其解决方案的关键在于提出一种基于图卷积网络(GCN)的新型框架——3D Geometric Mesh Network (3DGeoMeshNet),该框架采用各向异性卷积层,在空间域中直接学习全局和局部特征,而无需将网格转换为体素网格或点云等中间表示,从而实现了更精确的形状重建。
链接: https://arxiv.org/abs/2507.05304
作者: Saqib Nazir,Olivier Lézoray,Sébastien Bougleux(UNICAEN)
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D meshes are fundamental data representations for capturing complex geometric shapes in computer vision and graphics applications. While Convolutional Neural Networks (CNNs) have excelled in structured data like images, extending them to irregular 3D meshes is challenging due to the non-Euclidean nature of the data. Graph Convolutional Networks (GCNs) offer a solution by applying convolutions to graph-structured data, but many existing methods rely on isotropic filters or spectral decomposition, limiting their ability to capture both local and global mesh features. In this paper, we introduce 3D Geometric Mesh Network (3DGeoMeshNet), a novel GCN-based framework that uses anisotropic convolution layers to effectively learn both global and local features directly in the spatial domain. Unlike previous approaches that convert meshes into intermediate representations like voxel grids or point clouds, our method preserves the original polygonal mesh format throughout the reconstruction process, enabling more accurate shape reconstruction. Our architecture features a multi-scale encoder-decoder structure, where separate global and local pathways capture both large-scale geometric structures and fine-grained local details. Extensive experiments on the COMA dataset containing human faces demonstrate the efficiency of 3DGeoMeshNet in terms of reconstruction accuracy.
zh
[CV-88] CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection
【速读】:该论文试图解决面部深度伪造(deepfake)检测中存在的一系列挑战,特别是现有视觉方法在解释伪造细节方面缺乏清晰性,而多模态方法则容易受到模态间不一致的影响。其解决方案的关键在于提出一种名为CorrDetail的可视化细节增强自校正框架,该框架通过误差引导的问题来修正真实的伪造细节,以提升模型揭示伪造特征的能力,而非生成幻觉响应。同时,引入了视觉细粒度细节增强模块,提供更精确的视觉伪造信息,并设计了一种融合决策策略,通过视觉信息补偿和模型偏差整合进一步增强模型的判别能力。
链接: https://arxiv.org/abs/2507.05302
作者: Binjia Zhou,Hengrui Lou,Lizhe Chen,Haoyuan Li,Dawei Luo,Shuai Chen,Jie Lei,Zunlei Feng,Yijun Bei
机构: Zhejiang University(浙江大学); Tsinghua University(清华大学); Zhejiang University(浙江大学); Ant Group(蚂蚁集团); Zhejiang University of Technology(浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake this http URL techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of this http URL address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model’s discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias this http URL results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.
zh
[CV-89] LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models
【速读】:该论文旨在解决低剂量计算机断层扫描(LDCT)图像质量下降导致诊断准确性受限的问题。其核心解决方案是引入LangMamba框架,该框架利用视觉-语言模型(VLM)生成的语义表示,通过两阶段学习策略提升LDCT去噪效果。关键在于首先预训练一个语言引导的自编码器(LangAE),将正常剂量CT(NDCT)图像映射到富含解剖信息的语义空间;其次结合语义增强高效去噪器(SEED)和语言参与的双空间对齐(LangDA)损失函数,以确保去噪图像在感知和语义空间上与NDCT一致,从而提升细节保留和视觉保真度。
链接: https://arxiv.org/abs/2507.06140
作者: Zhihao Chen,Tao Chen,Chenhui Wang,Qi Gao,Huidong Xie,Chuang Niu,Ge Wang,Hongming Shan
机构: Fudan University (复旦大学); Yale University (耶鲁大学); Rensselaer Polytechnic Institute (伦斯勒理工学院); Shanghai Center for Brain Science and Brain-inspired Technology (上海脑科学与脑智能技术中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantic while capturing global features with efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, LangDA loss improves explainability by integrating language-guided insights into image reconstruction and offers a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available on this https URL.
zh
[CV-90] Enhancing Synthetic CT from CBCT via Multimodal Fusion and End-To-End Registration
【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)图像质量较低及与常规计算机断层扫描(Computed Tomography, CT)图像之间存在配准偏差的问题。其解决方案的关键在于通过多模态学习,结合术中CBCT数据和术前CT数据,并在sCT生成流程中引入端到端可学习的配准模块,以克服模态间的固有不对齐问题。
链接: https://arxiv.org/abs/2507.06067
作者: Maximilian Tschuchnig,Lukas Lamminger,Philipp Steininger,Michael Gadermayr
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CAIP 2025. arXiv admin note: substantial text overlap with arXiv:2506.08716
Abstract:Cone-Beam Computed Tomography (CBCT) is widely used for intraoperative imaging due to its rapid acquisition and low radiation dose. However, CBCT images typically suffer from artifacts and lower visual quality compared to conventional Computed Tomography (CT). A promising solution is synthetic CT (sCT) generation, where CBCT volumes are translated into the CT domain. In this work, we enhance sCT generation through multimodal learning by jointly leveraging intraoperative CBCT and preoperative CT data. To overcome the inherent misalignment between modalities, we introduce an end-to-end learnable registration module within the sCT pipeline. This model is evaluated on a controlled synthetic dataset, allowing precise manipulation of data quality and alignment parameters. Further, we validate its robustness and generalizability on two real-world clinical datasets. Experimental results demonstrate that integrating registration in multimodal sCT generation improves sCT quality, outperforming baseline multimodal methods in 79 out of 90 evaluation settings. Notably, the improvement is most significant in cases where CBCT quality is low and the preoperative CT is moderately misaligned.
zh
[CV-91] A novel framework for fully-automated co-registration of intravascular ultrasound and optical coherence tomography imaging data
【速读】:该论文旨在解决intravascular ultrasound (IVUS)与optical coherence tomography (OCT)图像在纵向和环向上的自动配准问题。其关键解决方案是基于深度学习(deep learning, DL)框架,通过训练模型自动提取腔内边界、侧支血管和钙化组织等特征,并结合动态时间规整算法实现纵向配准,以及通过动态规划优化环向配准,从而实现IVUS与OCT图像的全自动配准。
链接: https://arxiv.org/abs/2507.05883
作者: Xingwei He,Kit Mills Bransby,Ahmet Emir Ulutas,Thamil Kumaran,Nathan Angelo Lecaros Yap,Gonul Zeren,Hesong Zeng,Yaojun Zhang,Andreas Baumbach,James Moon,Anthony Mathur,Jouke Dijkstra,Qianni Zhang,Lorenz Raber,Christos V Bourantas
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Aims: To develop a deep-learning (DL) framework that will allow fully automated longitudinal and circumferential co-registration of intravascular ultrasound (IVUS) and optical coherence tomography (OCT) images. Methods and results: Data from 230 patients (714 vessels) with acute coronary syndrome that underwent near-infrared spectroscopy (NIRS)-IVUS and OCT imaging in their non-culprit vessels were included in the present analysis. The lumen borders annotated by expert analysts in 61,655 NIRS-IVUS and 62,334 OCT frames, and the side branches and calcific tissue identified in 10,000 NIRS-IVUS frames and 10,000 OCT frames, were used to train DL solutions for the automated extraction of these features. The trained DL solutions were used to process NIRS-IVUS and OCT images and their output was used by a dynamic time warping algorithm to co-register longitudinally the NIRS-IVUS and OCT images, while the circumferential registration of the IVUS and OCT was optimized through dynamic programming. On a test set of 77 vessels from 22 patients, the DL method showed high concordance with the expert analysts for the longitudinal and circumferential co-registration of the two imaging sets (concordance correlation coefficient 0.99 for the longitudinal and 0.90 for the circumferential co-registration). The Williams Index was 0.96 for longitudinal and 0.97 for circumferential co-registration, indicating a comparable performance to the analysts. The time needed for the DL pipeline to process imaging data from a vessel was 90s. Conclusion: The fully automated, DL-based framework introduced in this study for the co-registration of IVUS and OCT is fast and provides estimations that compare favorably to the expert analysts. These features renders it useful in research in the analysis of large-scale data collected in studies that incorporate multimodality imaging to characterize plaque composition.
zh
[CV-92] ssue Concepts v2: a Supervised Foundation Model for whole slide images
【速读】:该论文试图解决传统基础模型(Foundation Models, FMs)在计算病理学中训练资源消耗大、依赖大量标注数据的问题。其解决方案的关键在于引入了监督学习的端到端多任务学习方法,通过使用基于切片级别的标签进行训练,显著降低了训练所需的资源,并且模型能够在公开可用的数据上完成全量训练,从而提高了性能并增强了模型的可解释性。
链接: https://arxiv.org/abs/2507.05742
作者: Till Nicke,Daniela Scharcherer,Jan Raphael Schäfer,Natalia Artysh,Antje Prasse,André Homeyer,Andrea Schenk,Henning Höfener,Johannes Lotz
机构: Fraunhofer MEVIS; Fraunhofer ITEM; Hannover Medical School; University of Basel
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models (FMs) are transforming the field of computational pathology by offering new approaches to analyzing histopathology images. Typically relying on weeks of training on large databases, the creation of FMs is a resource-intensive process in many ways. In this paper, we introduce the extension of our supervised foundation model, Tissue Concepts, to whole slide images, called Tissue Concepts v2 (TCv2), a supervised foundation model for whole slide images to address the issue above. TCv2 uses supervised, end-to-end multitask learning on slide-level labels. Training TCv2 uses a fraction of the training resources compared to self-supervised training. The presented model shows superior performance compared to SSL-trained models in cancer subtyping benchmarks and is fully trained on freely available data. Furthermore, a shared trained attention module provides an additional layer of explainability across different tasks.
zh
[CV-93] ADPv2: A Hierarchical Histological Tissue Type-Annotated Dataset for Potential Biomarker Discovery of Colorectal Disease
【速读】:该论文试图解决计算病理学(Computational Pathology, CoPath)领域中公开可用且具备细粒度组织类型(Histological Tissue Type, HTT)标注的数据集稀缺问题,这一问题限制了对特定器官疾病进行深入研究的能力。其解决方案的关键在于构建ADPv2数据集,该数据集专注于胃肠道病理学,包含20,004个来自健康结肠活检切片的图像块,并依据3级层次结构的32种不同HTT进行标注。此外,研究者在该数据集上训练了一个多标签表示学习模型,采用两阶段训练流程,并利用VMamba架构实现了0.88的平均精度(mAP),从而验证了该数据集在支持器官特异性深入研究和潜在生物标志物发现方面的有效性。
链接: https://arxiv.org/abs/2507.05656
作者: Zhiyuan Yang,Kai Li,Sophia Ghamoshi Ramandi,Patricia Brassard,Hakim Khellaf,Vincent Quoc-Huy Trinh,Jennifer Zhang,Lina Chen,Corwyn Rowsell,Sonal Varma,Kostas Plataniotis,Mahdi S. Hosseini
机构: Concordia University (康考迪亚大学); University of Toronto (多伦多大学); Toronto Metropolitan University (多伦多都会大学); Université de Montréal (蒙特利尔大学); Centre de recherche du CHUM (CHUM研究中心); Université de Montréal (蒙特利尔大学); Sunnybrook Health Sciences Centre (阳光山健康科学中心); University of Toronto (多伦多大学); Queen’s University (皇后大学); Université de Montréal (蒙特利尔大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:
Abstract:Computational pathology (CoPath) leverages histopathology images to enhance diagnostic precision and reproducibility in clinical pathology. However, publicly available datasets for CoPath that are annotated with extensive histological tissue type (HTT) taxonomies at a granular level remain scarce due to the significant expertise and high annotation costs required. Existing datasets, such as the Atlas of Digital Pathology (ADP), address this by offering diverse HTT annotations generalized to multiple organs, but limit the capability for in-depth studies on specific organ diseases. Building upon this foundation, we introduce ADPv2, a novel dataset focused on gastrointestinal histopathology. Our dataset comprises 20,004 image patches derived from healthy colon biopsy slides, annotated according to a hierarchical taxonomy of 32 distinct HTTs of 3 levels. Furthermore, we train a multilabel representation learning model following a two-stage training procedure on our ADPv2 dataset. We leverage the VMamba architecture and achieving a mean average precision (mAP) of 0.88 in multilabel classification of colon HTTs. Finally, we show that our dataset is capable of an organ-specific in-depth study for potential biomarker discovery by analyzing the model’s prediction behavior on tissues affected by different colon diseases, which reveals statistical patterns that confirm the two pathological pathways of colon cancer development. Our dataset is publicly available here: Part 1 at this https URL, Part 2 at this https URL and Part 3 at this https URL
zh
[CV-94] Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions ICIP
【速读】:该论文旨在解决有限角度计算机断层扫描(Limited-Angle Computed Tomography, LACT)中由于缺失角度投影导致的sinogram不完整和重建图像严重伪影的问题。其解决方案的关键在于将LACT视为一个sinogram补全任务,并提出一种基于扩散的框架,利用均值回归随机微分方程(Mean-Reverting Stochastic Differential Equation, MR-SDE)完成缺失角度视图的补全。此外,为提升在现实噪声条件下的鲁棒性,引入了RNSD^+,这是一种新型的噪声感知校正机制,能够显式建模推理阶段的不确定性,从而实现可靠且稳健的重建。
链接: https://arxiv.org/abs/2507.05647
作者: Jiaqi Guo,Santiago López-Tapia
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2025 IEEE International Conference on Image Processing (ICIP), Workshop
Abstract:Limited-Angle Computed Tomography (LACT) is a challenging inverse problem where missing angular projections lead to incomplete sinograms and severe artifacts in the reconstructed images. While recent learning-based methods have demonstrated effectiveness, most of them assume ideal, noise-free measurements and fail to address the impact of measurement noise. To overcome this limitation, we treat LACT as a sinogram inpainting task and propose a diffusion-based framework that completes missing angular views using a Mean-Reverting Stochastic Differential Equation (MR-SDE) formulation. To improve robustness under realistic noise, we propose RNSD ^+ , a novel noise-aware rectification mechanism that explicitly models inference-time uncertainty, enabling reliable and robust reconstruction. Extensive experiments demonstrate that our method consistently surpasses baseline models in data consistency and perceptual quality, and generalizes well across varying noise intensity and acquisition scenarios.
zh
[CV-95] Learning Segmentation from Radiology Reports MICCAI2025
【速读】:该论文试图解决医学影像中肿瘤分割掩码稀缺的问题,这一问题限制了生成式AI在临床诊断、手术规划和预后评估中的应用。其关键解决方案是提出一种报告监督损失(R-Super),该方法将放射科报告转化为体素级的监督信号,从而辅助肿瘤分割模型的训练。通过利用医院中大量存在的放射科报告来补充稀缺的分割掩码,R-Super显著提升了AI模型的性能,无论训练数据中分割掩码的数量多少。
链接: https://arxiv.org/abs/2507.05582
作者: Pedro R. A. S. Bassi,Wenxuan Li,Jieneng Chen,Zheren Zhu,Tianyu Lin,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025
Abstract:Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-Report pairs (from the UCSF Hospital), and merged it with public CT-Mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation–F1 Score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50), and when many masks were available (e.g., 1.7K). Project: this https URL Comments: Accepted to MICCAI 2025 Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.05582 [eess.IV] (or arXiv:2507.05582v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2507.05582 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-96] Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging
【速读】:该论文旨在解决超声微血管成像(Ultrasound Microvascular Imaging, UMI)中因信噪比(SNR)低而导致的血管定量分析和可靠疾病诊断受限的问题,尤其是在无对比剂或深层组织场景下。其解决方案的关键在于提出了一种名为Half-Angle-to-Half-Angle (HA2HA)的自监督去噪框架,该框架通过从波束成形射频(RF)血流数据的互补角度子集构建训练对,利用血管信号在不同角度下的一致性与噪声的差异性进行训练,从而实现有效的去噪。
链接: https://arxiv.org/abs/2507.05451
作者: Lijie Huang,Jingyi Yin,Jingke Zhang,U-Wai Lok,Ryan M. DeRuiter,Jieyang Jin,Kate M. Knoll,Kendra E. Petersen,James D. Krier,Xiang-yang Zhu,Gina K. Hesley,Kathryn A. Robinson,Andrew J. Bentall,Thomas D. Atwell,Andrew D. Rule,Lilach O. Lerman,Shigao Chen,Chengwu Huang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 12 pages, 10 figures. Supplementary materials are available at this https URL
Abstract:Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs from complementary angular subsets of beamformed radio-frequency (RF) blood flow data, across which vascular signals remain consistent while noise varies. HA2HA was trained using in-vivo contrast-free pig kidney data and validated across diverse datasets, including contrast-free and contrast-enhanced data from pig kidneys, as well as human liver and kidney. An improvement exceeding 15 dB in both contrast-to-noise ratio (CNR) and SNR was observed, indicating a substantial enhancement in image quality. In addition to power Doppler imaging, denoising directly in the RF domain is also beneficial for other downstream processing such as color Doppler imaging (CDI). CDI results of human liver derived from the HA2HA-denoised signals exhibited improved microvascular flow visualization, with a suppressed noisy background. HA2HA offers a label-free, generalizable, and clinically applicable solution for robust vascular imaging in both contrast-free and contrast-enhanced UMI.
zh
[CV-97] PWD: Prior-Guided and Wavelet-Enhanced Diffusion Model for Limited-Angle CT
【速读】:该论文旨在解决有限角度计算机断层扫描(LACT)中生成式扩散模型在推理阶段需要大量采样步骤导致计算开销大,以及跳步采样策略造成细节丢失的问题。其解决方案的关键在于提出一种先验信息嵌入与小波特征融合的快速采样扩散模型(PWD),该模型通过在训练阶段将LACT图像分布映射到全采样目标图像分布,使模型学习结构对应关系,并在推理阶段利用LACT图像作为显式先验引导采样轨迹,同时在小波域进行多尺度特征融合,从而在减少采样步骤的同时保持重建精度和细节质量。
链接: https://arxiv.org/abs/2507.05317
作者: Yi Liu,Yiyang Wen,Zekun Zhou,Junqi Ma,Linghang Wang,Yucheng Yao,Liu Shi,Qiegen Liu
机构: Nanchang University (南昌大学); YOFO (Hefei) Co. Ltd. (YOFO (合肥)有限公司)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative diffusion models have received increasing attention in medical imaging, particularly in limited-angle computed tomography (LACT). Standard diffusion models achieve high-quality image reconstruction but require a large number of sampling steps during inference, resulting in substantial computational overhead. Although skip-sampling strategies have been proposed to improve efficiency, they often lead to loss of fine structural details. To address this issue, we propose a prior information embedding and wavelet feature fusion fast sampling diffusion model for LACT reconstruction. The PWD enables efficient sampling while preserving reconstruction fidelity in LACT, and effectively mitigates the degradation typically introduced by skip-sampling. Specifically, during the training phase, PWD maps the distribution of LACT images to that of fully sampled target images, enabling the model to learn structural correspondences between them. During inference, the LACT image serves as an explicit prior to guide the sampling trajectory, allowing for high-quality reconstruction with significantly fewer steps. In addition, PWD performs multi-scale feature fusion in the wavelet domain, effectively enhancing the reconstruction of fine details by leveraging both low-frequency and high-frequency information. Quantitative and qualitative evaluations on clinical dental arch CBCT and periapical datasets demonstrate that PWD outperforms existing methods under the same sampling condition. Using only 50 sampling steps, PWD achieves at least 1.7 dB improvement in PSNR and 10% gain in SSIM.
zh
[CV-98] Dual-Attention U-Net with Class-Specific Ensembles and Bayesian Hyperparameter Optimization for Precise Wound and Scale Marker Segmentation ALT
【速读】:该论文旨在解决临床图像中伤口和尺度标记的准确分割问题,这一问题对于有效的伤口管理和自动化评估至关重要。其解决方案的关键在于提出了一种融合通道注意力(SCSE)和空间注意力机制的双注意力U-Net++架构,以应对医学图像中的严重类别不平衡和变化性。此外,通过5折交叉验证选择EfficientNet-B7作为最优编码器,并采用定制化的预处理、数据增强和贝叶斯超参数调优策略独立训练两个类别特定模型,最终通过测试时增强技术提升预测可靠性。
链接: https://arxiv.org/abs/2507.05314
作者: Daniel Cieślak,Miriam Reca,Olena Onyshchenko,Jacek Rumiński
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, conference: Joint 20th Nordic-Baltic Conference on Biomedical Engineering 24th Polish Conference on Biocybernetics and Biomedical Engineering; 6 figures, 2 tables, 11 sources
Abstract:Accurate segmentation of wounds and scale markers in clinical images remainsa significant challenge, crucial for effective wound management and automatedassessment. In this study, we propose a novel dual-attention U-Net++ archi-tecture, integrating channel-wise (SCSE) and spatial attention mechanisms toaddress severe class imbalance and variability in medical images this http URL, extensive benchmarking across diverse architectures and encoders via 5-fold cross-validation identified EfficientNet-B7 as the optimal encoder this http URL, we independently trained two class-specific models with tailoredpreprocessing, extensive data augmentation, and Bayesian hyperparameter tun-ing (WandB sweeps). The final model ensemble utilized Test Time Augmentationto further enhance prediction reliability. Our approach was evaluated on a bench-mark dataset from the NBC 2025 PCBBE 2025 competition. Segmentationperformance was quantified using a weighted F1-score (75% wounds, 25% scalemarkers), calculated externally by competition organizers on undisclosed hard-ware. The proposed approach achieved an F1-score of 0.8640, underscoring itseffectiveness for complex medical segmentation tasks.
zh
[CV-99] Cross-Subject DD: A Cross-Subject Brain-Computer Interface Algorithm
【速读】:该论文试图解决脑机接口(Brain-Computer Interface, BCI)在不同个体之间适应性差的问题,即现有BCI模型由于个体间脑活动的差异,导致其泛化能力有限,难以广泛应用。解决方案的关键在于提出一种跨被试的BCI算法——Cross-Subject DD(CSDD),通过提取跨被试的共性特征来构建通用的BCI模型,具体包括为每个被试训练个性化模型、将个性化模型转换为关系谱、通过统计分析识别共性特征,并基于共性特征构建跨被试通用模型。
链接: https://arxiv.org/abs/2507.05268
作者: Xiaoyuan Li,Xinru Xue,Bohan Zhang,Ye Sun,Shoushuo Xi,Gang Liu
机构: Zhengzhou University(郑州大学); Henan Key Laboratory of Brain Science and Brain Computer Interface Technology(河南省脑科学与脑机接口技术重点实验室)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 20 pages, 9 figures
Abstract:Brain-computer interface (BCI) based on motor imagery (MI) enables direct control of external devices by decoding the electroencephalogram (EEG) generated in the brain during imagined movements. However, due to inter-individual variability in brain activity, existing BCI models exhibit poor adaptability across subjects, thereby limiting their generalizability and widespread application. To address this issue, this paper proposes a cross-subject BCI algorithm named Cross-Subject DD (CSDD), which constructs a universal BCI model by extracting common features across subjects. The specific methods include: 1) training personalized models for each subject; 2) transforming personalized models into relation spectrums; 3) identifying common features through statistical analysis; and 4) constructing a cross-subject universal model based on common features. The experiments utilized the BCIC IV 2a dataset, involving nine subjects. Eight of these subjects were selected for training and extracing the common features, and the cross-subject decoding performance of the model was validated on the remaining subject. The results demonstrate that, compared with existing similar methods, our approach achieves a 3.28% improvement in performance. This paper introduces for the first time a novel method for extracting pure common features and constructing a universal cross-subject BCI model, thereby facilitating broader applications of BCI technology.
zh
人工智能
[AI-0] EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow ICCV2025
【速读】:该论文试图解决当前语言引导的机器人操作系统在模仿学习中依赖低级动作标注数据集的问题,以及现有以物体为中心的流预测方法在处理柔性物体、遮挡和非物体位移任务时的局限性。其解决方案的关键在于提出一种名为Embodiment-Centric Flow (EC-Flow)的框架,该框架通过预测与身体相关的流直接从无动作标注的视频中学习操作策略,并结合身体固有的运动学特性,显著提升了在多样化操作场景中的泛化能力。
链接: https://arxiv.org/abs/2507.06224
作者: Yixiang Chen,Peiyan Li,Yan Huang,Jiabing Yang,Kehan Chen,Liang Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at ICCV 2025
Abstract:Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment’s inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating its state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) than prior state-of-the-art object-centric flow methods. For more information, see our project website at this https URL .
zh
[AI-1] Aligned Textual Scoring Rules
【速读】:该论文试图解决如何将人类对文本的偏好与生成式AI (Generative AI) 的概率性信息获取机制有效对齐的问题。其解决方案的关键在于设计一种对齐评分规则 (Aligned Scoring Rule, ASR),通过优化和最小化一个适当评分规则与参考评分(如人类评分)之间的均方误差,从而在保持适当性的同时提升与人类偏好的一致性。
链接: https://arxiv.org/abs/2507.06221
作者: Yuxuan Lu,Yifan Wu,Jason Hartline,Michael J. Curry
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Scoring rules elicit probabilistic predictions from a strategic agent by scoring the prediction against a ground truth state. A scoring rule is proper if, from the agent’s perspective, reporting the true belief maximizes the expected score. With the development of language models, Wu and Hartline (2024) proposes a reduction from textual information elicitation to the numerical (i.e. probabilistic) information elicitation problem, which achieves provable properness for textual elicitation. However, not all proper scoring rules are well aligned with human preference over text. Our paper designs the Aligned Scoring rule (ASR) for text by optimizing and minimizing the mean squared error between a proper scoring rule and a reference score (e.g. human score). Our experiments show that our ASR outperforms previous methods in aligning with human preference while maintaining properness.
zh
[AI-2] Is Diversity All You Need for Scalable Robotic Manipulation?
【速读】:该论文试图解决机器人操作中数据缩放的有效性问题,特别是数据多样性在机器人学习中的作用尚未被充分理解。其解决方案的关键在于重新审视数据多样性的三个关键维度——任务(what to do)、具身(which robot to use)和专家(who demonstrates),并发现任务多样性比单任务演示数量更为重要,高质量单具身数据在跨具身迁移中表现出更优的可扩展性,而专家多样性带来的速度多模态性是政策学习中的主要干扰因素。通过提出一种分布去偏方法以缓解速度歧义,显著提升了性能。
链接: https://arxiv.org/abs/2507.06219
作者: Modi Shi,Li Chen,Jin Chen,Yuxiang Lu,Chiming Liu,Guanghui Ren,Ping Luo,Di Huang,Maoqing Yao,Hongyang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at this https URL
Abstract:Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of “more diverse is better”. Throughout extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer-models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.
zh
[AI-3] Identifiability in Causal Abstractions: A Hierarchy of Criteria UAI2025
【速读】:该论文试图解决在观测数据中识别处理效应的问题,尤其是在缺乏完整已知因果图(causal diagram)的情况下。其解决方案的关键在于利用因果抽象(causal abstraction),即简化但保留部分因果信息的表示形式,并在此基础上研究因果查询的可识别性。论文提出并形式化了几种可识别性准则,并将其组织成一个结构化的层次体系,以明确不同水平的因果知识下可识别的内容。
链接: https://arxiv.org/abs/2507.06213
作者: Clément Yvernes,Emilie Devijver,Marianne Clausel,Eric Gaussier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the CAR Workshop at UAI2025
Abstract:Identifying the effect of a treatment from observational data typically requires assuming a fully specified causal diagram. However, such diagrams are rarely known in practice, especially in complex or high-dimensional settings. To overcome this limitation, recent works have explored the use of causal abstractions-simplified representations that retain partial causal information. In this paper, we consider causal abstractions formalized as collections of causal diagrams, and focus on the identifiability of causal queries within such collections. We introduce and formalize several identifiability criteria under this setting. Our main contribution is to organize these criteria into a structured hierarchy, highlighting their relationships. This hierarchical view enables a clearer understanding of what can be identified under varying levels of causal knowledge. We illustrate our framework through examples from the literature and provide tools to reason about identifiability when full causal knowledge is unavailable.
zh
[AI-4] he Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains
【速读】:该论文试图解决在缺乏强监督信号的情况下,如何提升语言模型性能的问题。其解决方案的关键在于利用由弱数据点组成的配对偏好数据,通过相对质量差异(delta)驱动学习,而非依赖单一数据点的绝对质量。研究提出了delta学习假设,验证了在控制实验和大规模场景下,通过将小型模型的输出与更小模型的输出进行配对生成有意义的相对质量差异,能够显著提升模型性能,甚至达到依赖更强监督信号的先进模型的水平。
链接: https://arxiv.org/abs/2507.06187
作者: Scott Geng,Hamish Ivison,Chun-Liang Li,Maarten Sap,Jerry Li,Ranjay Krishna,Pang Wei Koh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: COLM 2025
Abstract:Improvements in language models are often driven by improving the quality of the data we train them on, which can be limiting when strong supervision is scarce. In this work, we show that paired preference data consisting of individually weak data points can enable gains beyond the strength of each individual data point. We formulate the delta learning hypothesis to explain this phenomenon, positing that the relative quality delta between points suffices to drive learning via preference tuning–even when supervised finetuning on the weak data hurts. We validate our hypothesis in controlled experiments and at scale, where we post-train 8B models on preference data generated by pairing a small 3B model’s responses with outputs from an even smaller 1.5B model to create a meaningful delta. Strikingly, on a standard 11-benchmark evaluation suite (MATH, MMLU, etc.), our simple recipe matches the performance of Tulu 3, a state-of-the-art open model tuned from the same base model while relying on much stronger supervisors (e.g., GPT-4o). Thus, delta learning enables simpler and cheaper open recipes for state-of-the-art post-training. To better understand delta learning, we prove in logistic regression that the performance gap between two weak teacher models provides useful signal for improving a stronger student. Overall, our work shows that models can learn surprisingly well from paired data that might typically be considered weak.
zh
[AI-5] Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model
【速读】:该论文试图解决低成本机械臂在高速或接触密集操作中因缺乏力反馈而导致的遥操作性能不足问题。解决方案的关键在于利用四通道双向控制,结合精确识别的机械臂动力学模型,实现非线性项补偿、速度与外部力估计以及根据惯性变化的可变增益控制,从而在无力传感器的情况下实现具备力反馈的高速遥操作。
链接: https://arxiv.org/abs/2507.06174
作者: Koki Yamane,Yunhan Li,Masashi Konosu,Koki Inami,Junji Oaki,Sho Sakaino,Toshiaki Tsuji
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 19 pages, 8 figures, Submitted to CoRL 2025
Abstract:In recent years, the advancement of imitation learning has led to increased interest in teleoperating low-cost manipulators to collect demonstration data. However, most existing systems rely on unilateral control, which only transmits target position values. While this approach is easy to implement and suitable for slow, non-contact tasks, it struggles with fast or contact-rich operations due to the absence of force feedback. This work demonstrates that fast teleoperation with force feedback is feasible even with force-sensorless, low-cost manipulators by leveraging 4-channel bilateral control. Based on accurately identified manipulator dynamics, our method integrates nonlinear terms compensation, velocity and external force estimation, and variable gain corresponding to inertial variation. Furthermore, using data collected by 4-channel bilateral control, we show that incorporating force information into both the input and output of learned policies improves performance in imitation learning. These results highlight the practical effectiveness of our system for high-fidelity teleoperation and data collection on affordable hardware.
zh
[AI-6] A Method for Optimizing Connections in Differentiable Logic Gate Networks
【速读】:该论文试图解决深度可微逻辑门网络(Deep Differentiable Logic Gate Networks, LGNs)中连接优化的问题,旨在通过减少逻辑门数量来提升模型效率与性能。其解决方案的关键在于引入一种新的部分优化方法,该方法在每个逻辑门输入的连接子集上利用概率分布,选择具有最高优势的连接,随后确定逻辑门类型,从而实现更高效的连接优化。
链接: https://arxiv.org/abs/2507.06173
作者: Wout Mommen,Lars Keuninckx,Matthias Hartmann,Piet Wambacq
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce a novel method for partial optimization of the connections in Deep Differentiable Logic Gate Networks (LGNs). Our training method utilizes a probability distribution over a subset of connections per gate input, selecting the connection with highest merit, after which the gate-types are selected. We show that the connection-optimized LGNs outperform standard fixed-connection LGNs on the Yin-Yang, MNIST and Fashion-MNIST benchmarks, while requiring only a fraction of the number of logic gates. When training all connections, we demonstrate that 8000 simple logic gates are sufficient to achieve over 98% on the MNIST data set. Additionally, we show that our network has 24 times fewer gates, while performing better on the MNIST data set compared to standard fully connected LGNs. As such, our work shows a pathway towards fully trainable Boolean logic.
zh
[AI-7] Critical Nodes Identification in Complex Networks: A Survey
【速读】:该论文试图解决复杂网络中关键节点识别的难题,旨在克服现实网络固有的复杂性和结构性异质性所带来的挑战,特别是在动态和高阶网络中的应用。其解决方案的关键在于对现有关键节点识别技术进行系统分类与综合评述,涵盖七类主要方法:中心性、关键节点删除问题、影响力最大化、网络控制、人工智能、高阶与动态方法,并强调各类方法在不同网络类型中的优势、局限性和适用性。通过结构化的综述,论文揭示了算法通用性、动态网络实时评估、高阶结构分析以及大规模网络计算效率等核心挑战,为未来研究提供了方向。
链接: https://arxiv.org/abs/2507.06164
作者: Duxin Chen,Jiawen Chen,Xiaoyu Zhang,Qinghan Jia,Xiaolu Liu,Ye Sun,Linyuan Lv,Wenwu Yu
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:Complex networks have become essential tools for understanding diverse phenomena in social systems, traffic systems, biomolecular systems, and financial systems. Identifying critical nodes is a central theme in contemporary research, serving as a vital bridge between theoretical foundations and practical applications. Nevertheless, the intrinsic complexity and structural heterogeneity characterizing real-world networks, with particular emphasis on dynamic and higher-order networks, present substantial obstacles to the development of universal frameworks for critical node identification. This paper provides a comprehensive review of critical node identification techniques, categorizing them into seven main classes: centrality, critical nodes deletion problem, influence maximization, network control, artificial intelligence, higher-order and dynamic methods. Our review bridges the gaps in existing surveys by systematically classifying methods based on their methodological foundations and practical implications, and by highlighting their strengths, limitations, and applicability across different network types. Our work enhances the understanding of critical node research by identifying key challenges, such as algorithmic universality, real-time evaluation in dynamic networks, analysis of higher-order structures, and computational efficiency in large-scale networks. The structured synthesis consolidates current progress and highlights open questions, particularly in modeling temporal dynamics, advancing efficient algorithms, integrating machine learning approaches, and developing scalable and interpretable metrics for complex systems.
zh
[AI-8] Fast and Accurate Collision Probability Estimation for Autonomous Vehicles using Adaptive Sigma-Point Sampling
【速读】:该论文试图解决动态物体在轨迹具有不确定性的情况下碰撞概率的估计问题,其中轨迹以带有高斯分布的姿态序列形式给出。解决方案的关键是一种自适应的sigma-point采样方案,该方案最终实现了一个快速且简单的算法,能够在Intel Xeon Gold 6226R处理器上实现3.5%的中位数误差和0.21ms的中位数运行时间。此外,该算法明确考虑了碰撞概率的时间依赖性,这是以往工作中常被忽略的问题,从而避免了对碰撞概率的过高估计。
链接: https://arxiv.org/abs/2507.06149
作者: Charles Champagne Cossette,Taylor Scott Clawson,Andrew Feit
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注: 8 pages, 6 figures
Abstract:A novel algorithm is presented for the estimation of collision probabilities between dynamic objects with uncertain trajectories, where the trajectories are given as a sequence of poses with Gaussian distributions. We propose an adaptive sigma-point sampling scheme, which ultimately produces a fast, simple algorithm capable of estimating the collision probability with a median error of 3.5%, and a median runtime of 0.21ms, when measured on an Intel Xeon Gold 6226R Processor. Importantly, the algorithm explicitly accounts for the collision probability’s temporal dependence, which is often neglected in prior work and otherwise leads to an overestimation of the collision probability. Finally, the method is tested on a diverse set of relevant real-world scenarios, consisting of 400 6-second snippets of autonomous vehicle logs, where the accuracy and latency is rigorously evaluated.
zh
[AI-9] opic Modeling and Link-Prediction for Material Property Discovery
【速读】:该论文试图解决科学文献网络和知识图谱中实体间缺失链接的问题,特别是在复杂材料领域中推断隐藏的关联。其解决方案的关键是提出一种基于人工智能的分层链接预测框架,该框架整合了层次非负矩阵分解(HNMFk)、布尔矩阵分解(BNMFk)和逻辑矩阵分解(LMF),并通过自动模型选择进行优化,以构建一个三层次主题树。此方法结合了离散可解释性与概率评分,能够识别材料与主题之间的缺失或弱连接,从而生成新的跨学科研究假设。
链接: https://arxiv.org/abs/2507.06139
作者: Ryan C. Barron,Maksim E. Eren,Valentin Stanev,Cynthia Matuszek,Boian S. Alexandrov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 4 pages, 3 figures, 1 table
Abstract:Link prediction infers missing or future relations between graph nodes, based on connection patterns. Scientific literature networks and knowledge graphs are typically large, sparse, and noisy, and often contain missing links between entities. We present an AI-driven hierarchical link prediction framework that integrates matrix factorization to infer hidden associations and steer discovery in complex material domains. Our method combines Hierarchical Nonnegative Matrix Factorization (HNMFk) and Boolean matrix factorization (BNMFk) with automatic model selection, as well as Logistic matrix factorization (LMF), we use to construct a three-level topic tree from a 46,862-document corpus focused on 73 transition-metal dichalcogenides (TMDs). These materials are studied in a variety of physics fields with many current and potential applications. An ensemble BNMFk + LMF approach fuses discrete interpretability with probabilistic scoring. The resulting HNMFk clusters map each material onto coherent topics like superconductivity, energy storage, and tribology. Also, missing or weakly connected links are highlight between topics and materials, suggesting novel hypotheses for cross-disciplinary exploration. We validate our method by removing publications about superconductivity in well-known superconductors, and show the model predicts associations with the superconducting TMD clusters. This shows the method finds hidden connections in a graph of material to latent topic associations built from scientific literature, especially useful when examining a diverse corpus of scientific documents covering the same class of phenomena or materials but originating from distinct communities and perspectives. The inferred links generating new hypotheses, produced by our method, are exposed through an interactive Streamlit dashboard, designed for human-in-the-loop scientific discovery. Comments: 4 pages, 3 figures, 1 table Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE) Cite as: arXiv:2507.06139 [cs.LG] (or arXiv:2507.06139v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.06139 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-10] OpenAgentS afety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety
【速读】:该论文试图解决生成式 AI (Generative AI) 在实际应用场景中可能表现出的不安全行为问题,特别是在面对复杂、多步骤任务时的安全性评估不足。其解决方案的关键在于提出 OpenAgentSafety,这是一个全面且模块化的框架,能够针对八类关键风险进行评估。该框架通过让代理与真实工具(如网页浏览器、代码执行环境、文件系统等)交互,并支持超过350个涉及良性和对抗性用户意图的多轮多用户任务,从而更真实地反映实际场景中的潜在风险。此外,OpenAgentSafety 结合了规则分析与 LLM-as-judge 评估方法,以检测显性和隐性的不安全行为。
链接: https://arxiv.org/abs/2507.06134
作者: Sanidhya Vijayvargiya,Aditya Bharat Soni,Xuhui Zhou,Zora Zhiruo Wang,Nouha Dziri,Graham Neubig,Maarten Sap
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 10 figures
Abstract:Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, to 72.7% with o3-mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real-world deployment.
zh
[AI-11] PrefixAgent : An LLM -Powered Design Framework for Efficient Prefix Adder Optimization
【速读】:该论文试图解决前缀加法器(prefix adder)设计空间随位宽指数增长所带来的优化挑战,这些问题包括性能限制、泛化能力不足以及可扩展性问题。其解决方案的关键在于提出PrefixAgent,这是一个基于大型语言模型(LLM)的框架,通过将问题重新表述为包括核心结构合成和结构优化在内的子任务,有效缩小了搜索空间,并利用E-graph高效收集大量高质量数据和推理轨迹,从而实现LLM的有效微调。
链接: https://arxiv.org/abs/2507.06127
作者: Dongsheng Zuo,Jiadong Zhu,Yang Luo,Yuzhe Ma
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Prefix adders are fundamental arithmetic circuits, but their design space grows exponentially with bit-width, posing significant optimization challenges. Previous works face limitations in performance, generalization, and scalability. To address these challenges, we propose PrefixAgent, a large language model (LLM)-powered framework that enables efficient prefix adder optimization. Specifically, PrefixAgent reformulates the problem into subtasks including backbone synthesis and structure refinement, which effectively reduces the search space. More importantly, this new design perspective enables us to efficiently collect enormous high-quality data and reasoning traces with E-graph, which further results in an effective fine-tuning of LLM. Experimental results show that PrefixAgent synthesizes prefix adders with consistently smaller areas compared to baseline methods, while maintaining scalability and generalization in commercial EDA flows.
zh
[AI-12] Subspace-based Approximate Hessian Method for Zeroth-Order Optimization
【速读】:该论文试图解决零阶优化中梯度信息不可获取或计算成本过高的问题,以及现有方法依赖一阶近似导致收敛速度受限的问题。解决方案的关键在于提出一种基于子空间的近似Hessian(ZO-SAH)方法,通过在随机选择的二维子空间内拟合二次多项式来估计Hessian矩阵的二阶系数,从而降低函数评估的成本,并结合周期性子空间切换策略以进一步减少函数查询次数。
链接: https://arxiv.org/abs/2507.06125
作者: Dongyoon Kim,Sungjae Lee,Wonjin Lee,Kwang In Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 8 figures
Abstract:Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical applicability. We present the subspace-based approximate Hessian (ZO-SAH) method, a zeroth-order optimization algorithm that mitigates these costs by focusing on randomly selected two-dimensional subspaces. Within each subspace, ZO-SAH estimates the Hessian by fitting a quadratic polynomial to the objective function and extracting its second-order coefficients. To further reduce function-query costs, ZO-SAH employs a periodic subspace-switching strategy that reuses function evaluations across optimization steps. Experiments on eight benchmark datasets, including logistic regression and deep neural network training tasks, demonstrate that ZO-SAH achieves significantly faster convergence than existing zeroth-order methods.
zh
[AI-13] Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis
【速读】:该论文试图解决自动语音质量评估在不同粒度层级预测任务中表现差异显著的问题。其解决方案的关键在于基于自监督学习的语音模型,引入了Mixture of Experts (MoE)分类头,并利用多个商业生成模型的合成数据进行数据增强,以提升模型在不同语音质量评估任务中的性能。
链接: https://arxiv.org/abs/2507.06116
作者: Xintong Hu,Yixuan Chen,Rui Yang,Wenxiang Guo,Changhao Pan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model’s performance improvements in sentence-level prediction tasks remain limited. Our work reveals the limitations of current methods in handling sentence-level quality assessment, provides new technical pathways for the field of automatic speech quality assessment, and also delves into the fundamental causes of performance differences across different assessment granularities.
zh
[AI-14] aming Data Challenges in ML-based Security Tasks: Lessons from Integrating Generative AI
【速读】:该论文试图解决机器学习监督分类器在安全任务中因数据挑战而性能受限的问题,这些问题包括数据不足、概念漂移以及数据质量不佳等。论文提出的解决方案的关键在于利用生成式 AI (Generative AI) 技术生成合成数据,以增强训练数据集,从而提升分类器的泛化能力。研究通过多种 GenAI 方法进行验证,并引入一种名为 Nimai 的新型 GenAI 方案,实现对数据合成的高控制性,结果显示 GenAI 能显著提升分类器性能,尤其在数据严重受限的情况下效果明显。
链接: https://arxiv.org/abs/2507.06092
作者: Shravya Kanchi,Neal Mangaokar,Aravind Cheruvu,Sifat Muhammad Abdullah,Shirin Nilizadeh,Atul Prakash,Bimal Viswanath
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorithmic advancements. We argue that data challenges that negatively impact the performance of these classifiers have received limited attention. We address the following research question: Can developments in Generative AI (GenAI) address these data challenges and improve classifier performance? We propose augmenting training datasets with synthetic data generated using GenAI techniques to improve classifier generalization. We evaluate this approach across 7 diverse security tasks using 6 state-of-the-art GenAI methods and introduce a novel GenAI scheme called Nimai that enables highly controlled data synthesis. We find that GenAI techniques can significantly improve the performance of security classifiers, achieving improvements of up to 32.6% even in severely data-constrained settings (only ~180 training samples). Furthermore, we demonstrate that GenAI can facilitate rapid adaptation to concept drift post-deployment, requiring minimal labeling in the adjustment process. Despite successes, our study finds that some GenAI schemes struggle to initialize (train and produce data) on certain security tasks. We also identify characteristics of specific tasks, such as noisy labels, overlapping class distributions, and sparse feature vectors, which hinder performance boost using GenAI. We believe that our study will drive the development of future GenAI tools designed for security tasks.
zh
[AI-15] QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models
【速读】:该论文试图解决如何在资源受限的边缘计算设备上高效部署结构化状态空间模型(Structured State Space models, SSM)的问题,特别是针对模拟内存计算(analog in-memory computing, AIMC)芯片的特性。解决方案的关键在于采用量化感知训练(quantization-aware training, QAT),通过QAT显著降低SSM的复杂度,并提升其对模拟噪声的鲁棒性以及支持结构剪枝,从而实现高效的模型部署与计算效率提升。
链接: https://arxiv.org/abs/2507.06079
作者: Sebastian Siegel,Ming-Jay Yang,Younes Bouhadjar,Maxime Fabre,Emre Neftci,John Paul Strachan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Structured State Space models (SSM) have recently emerged as a new class of deep learning models, particularly well-suited for processing long sequences. Their constant memory footprint, in contrast to the linearly scaling memory demands of Transformers, makes them attractive candidates for deployment on resource-constrained edge-computing devices. While recent works have explored the effect of quantization-aware training (QAT) on SSMs, they typically do not address its implications for specialized edge hardware, for example, analog in-memory computing (AIMC) chips. In this work, we demonstrate that QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We analyze the relation between model size and numerical precision, and show that QAT enhances robustness to analog noise and enables structural pruning. Finally, we integrate these techniques to deploy SSMs on a memristive analog in-memory computing substrate and highlight the resulting benefits in terms of computational efficiency.
zh
[AI-16] AI-Based Demand Forecasting and Load Balancing for Optimising Energy use in Healthcare Systems: A real case study
【速读】:该论文试图解决医疗设施中因需求波动导致的能源管理效率低下和可持续性不足的问题。其解决方案的关键在于提出了一种结合长短期记忆网络(LSTM)、遗传算法(GA)和SHAP(Shapley Additive Explanations)的AI框架,通过LSTM实现对复杂非线性需求模式的高精度预测,利用GA优化模型参数和负载均衡策略,以及通过SHAP提升模型透明度和决策可信度,从而全面提升医疗设施的能源管理效率与可持续性。
链接: https://arxiv.org/abs/2507.06077
作者: Iman Rahimi,Isha Patel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper tackles the urgent need for efficient energy management in healthcare facilities, where fluctuating demands challenge operational efficiency and sustainability. Traditional methods often prove inadequate, causing inefficiencies and higher costs. To address this, the study presents an AI-based framework combining Long Short-Term Memory (LSTM), genetic algorithm (GA), and SHAP (Shapley Additive Explanations), specifically designed for healthcare energy management. Although LSTM is widely used for time-series forecasting, its application in healthcare energy prediction remains underexplored. The results reveal that LSTM significantly outperforms ARIMA and Prophet models in forecasting complex, non-linear demand patterns. LSTM achieves a Mean Absolute Error (MAE) of 21.69 and Root Mean Square Error (RMSE) of 29.96, far better than Prophet (MAE: 59.78, RMSE: 81.22) and ARIMA (MAE: 87.73, RMSE: 125.22), demonstrating superior performance. The genetic algorithm is applied to optimize model parameters and improve load balancing strategies, enabling adaptive responses to real-time energy fluctuations. SHAP analysis further enhances model transparency by explaining the influence of different features on predictions, fostering trust in decision-making processes. This integrated LSTM-GA-SHAP approach offers a robust solution for improving forecasting accuracy, boosting energy efficiency, and advancing sustainability in healthcare facilities. Future research may explore real-time deployment and hybridization with reinforcement learning for continuous optimization. Overall, the study establishes a solid foundation for using AI in healthcare energy management, highlighting its scalability, efficiency, and resilience potential.
zh
[AI-17] Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol
【速读】:该论文旨在解决在真实世界场景下,基于深度神经网络的音频指纹识别方法性能显著下降的问题,尤其是在噪声环境中通过移动设备麦克风采集的音频数据。其解决方案的关键在于设计了一种更贴近实际应用的评估协议,并通过改进数据增强流程,引入低通和高通滤波器以提升模型鲁棒性;此外,还提出了一种带有定制投影模块的Transformer模型,并利用语义相关领域的知识迁移来增强模型的泛化能力,从而在不同噪声水平和查询时长下均表现出优于传统卷积神经网络(CNN)模型的性能。
链接: https://arxiv.org/abs/2507.06070
作者: Christos Nikou,Theodoros Giannakopoulos
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: International Journal of Music Science, Technology and Art, 15 pages, 7 figures
Abstract:Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. In this paper, we introduce a novel evaluation protocol designed to better reflect such real-world conditions. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device’s microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks. Additionally, we highlight the critical role of the augmentation pipeline during training with contrastive loss. By introduction low pass and high pass filters in the augmentation pipeline we significantly increase the performance of both systems in our proposed evaluation. Furthermore, we develop a transformer-based model with a tailored projection module and demonstrate that transferring knowledge from a semantically relevant domain yields a more robust solution. The transformer architecture outperforms CNN-based models across all noise levels, and query durations. In low noise conditions it achieves 47.99% for 1-sec queries, and 97% for 10-sec queries in finding the correct song, surpassing by 14%, and by 18.5% the second-best performing model, respectively, Under heavy noise levels, we achieve a detection rate 56.5% for 15-second query duration. All experiments are conducted on public large-scale dataset of over 100K songs, with queries matched against a database of 56 million vectors.
zh
[AI-18] FEVO: Financial Knowledge Expansion and Reasoning Evolution for Large Language Models
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在金融领域性能不足的问题,尤其是在需要大量领域专业知识的任务中表现有限。解决方案的关键在于提出FEVO(Financial Evolution)多阶段增强框架,通过持续预训练(Continued Pre-training, CPT)扩展金融领域知识、监督微调(Supervised Fine-tuning, SFT)引入结构化推理模式以及强化学习(Reinforcement Learning, RL)融合扩展的领域知识与结构化推理,从而系统性提升LLMs在金融领域的表现。
链接: https://arxiv.org/abs/2507.06057
作者: Bo Pang,Yalu Ouyang,Hangfei Xu,Ziqi Jia,Panpan Li,Shengzhao Wen,Lu Wang,Shiyong Li,Yanpeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Advancements in reasoning for large language models (LLMs) have lead to significant performance improvements for LLMs in various fields such as mathematics and programming. However, research applying these advances to the financial domain, where considerable domain-specific knowledge is necessary to complete tasks, remains limited. To address this gap, we introduce FEVO (Financial Evolution), a multi-stage enhancement framework developed to enhance LLM performance in the financial domain. FEVO systemically enhances LLM performance by using continued pre-training (CPT) to expand financial domain knowledge, supervised fine-tuning (SFT) to instill structured, elaborate reasoning patterns, and reinforcement learning (RL) to further integrate the expanded financial domain knowledge with the learned structured reasoning. To ensure effective and efficient training, we leverage frontier reasoning models and rule-based filtering to curate FEVO-Train, high-quality datasets specifically designed for the different post-training phases. Using our framework, we train the FEVO series of models – C32B, S32B, R32B – from Qwen2.5-32B and evaluate them on seven benchmarks to assess financial and general capabilities, with results showing that FEVO-R32B achieves state-of-the-art performance on five financial benchmarks against much larger models as well as specialist models. More significantly, FEVO-R32B demonstrates markedly better performance than FEVO-R32B-0 (trained from Qwen2.5-32B-Instruct using only RL), thus validating the effectiveness of financial domain knowledge expansion and structured, logical reasoning distillation
zh
[AI-19] CAVGAN: Unifying Jailbreak and Defense of LLM s via Generative Adversarial Attacks on their Internal Representations
【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)在面对恶意查询时的安全对齐问题,特别是针对越狱攻击(jailbreak attack)的漏洞。其解决方案的关键在于分析LLM中间层嵌入的线性可分性以及越狱攻击的本质,通过生成对抗网络(Generative Adversarial Network, GAN)学习LLM内部的安全判断边界,从而实现高效的越狱攻击与防御。
链接: https://arxiv.org/abs/2507.06043
作者: Xiaohu Li,Yunfeng Ning,Zepeng Bao,Mayi Xu,Jianhao Chen,Tieyun Qian
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security The code and data are available at this https URL.
zh
[AI-20] On Lockean beliefs that are deductively closed and minimal change
【速读】:该论文试图解决Lockean信念集在经典逻辑演绎下不具有封闭性的问题,这一缺陷使得其在信念变迁理论等场景中应用受限。解决方案的关键在于提出一种概率更新方法,通过最小化对现有信念集的修改,实现信念集的演绎闭包,从而保证其在经典逻辑推理下的稳定性与一致性。
链接: https://arxiv.org/abs/2507.06042
作者: Tommaso Flaminio,Lluis Godo,Ramón Pino Pérez,Lluis Subirana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, to appear in the proceedings of JELIA 2025
Abstract:Within the formal setting of the Lockean thesis, an agent belief set is defined in terms of degrees of confidence and these are described in probabilistic terms. This approach is of established interest, notwithstanding some limitations that make its use troublesome in some contexts, like, for instance, in belief change theory. Precisely, Lockean belief sets are not generally closed under (classical) logical deduction. The aim of the present paper is twofold: on one side we provide two characterizations of those belief sets that are closed under classical logic deduction, and on the other we propose an approach to probabilistic update that allows us for a minimal revision of those beliefs, i.e., a revision obtained by making the fewest possible changes to the existing belief set while still accommodating the new information. In particular, we show how we can deductively close a belief set via a minimal revision.
zh
[AI-21] Efficient Federated Learning with Timely Update Dissemination
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于分布式数据管理导致的模型更新延迟与效率问题。其解决方案的关键在于利用额外的下行带宽资源,通过异步和同步两种框架实现更高效且及时的模型更新。在异步框架下,提出了一种考虑模型过时性的模型更新方法(FedASMU),结合了服务器端的动态模型聚合技术和设备端的自适应模型调整机制;而在同步框架下,则扩展为FedSSMU,进一步提升了模型的准确性和训练效率。
链接: https://arxiv.org/abs/2507.06031
作者: Juncheng Jia,Ji Liu,Chao Huo,Yihui Shen,Yang Zhou,Huaiyu Dai,Dejing Dou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 38 pages, to appear in Knowledge and Information Systems (KAIS)
Abstract:Federated Learning (FL) has emerged as a compelling methodology for the management of distributed data, marked by significant advancements in recent years. In this paper, we propose an efficient FL approach that capitalizes on additional downlink bandwidth resources to ensure timely update dissemination. Initially, we implement this strategy within an asynchronous framework, introducing the Asynchronous Staleness-aware Model Update (FedASMU), which integrates both server-side and device-side methodologies. On the server side, we present an asynchronous FL system model that employs a dynamic model aggregation technique, which harmonizes local model updates with the global model to enhance both accuracy and efficiency. Concurrently, on the device side, we propose an adaptive model adjustment mechanism that integrates the latest global model with local models during training to further elevate accuracy. Subsequently, we extend this approach to a synchronous context, referred to as FedSSMU. Theoretical analyses substantiate the convergence of our proposed methodologies. Extensive experiments, encompassing six models and five public datasets, demonstrate that FedASMU and FedSSMU significantly surpass baseline methods in terms of both accuracy (up to 145.87%) and efficiency (up to 97.59%).
zh
[AI-22] Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions IJCAI2025
【速读】:该论文试图解决可解释人工智能(Explainable AI, XAI)方法在生成清晰、可理解的输出方面存在的问题,特别是在缺乏领域专业知识的用户中。解决方案的关键在于提出一种后处理方法——特征引导的邻居选择(Feature-Guided Neighbor Selection, FGNS),该方法通过结合局部和全局特征重要性来选择类别代表性样本,从而提升模型解释的可理解性。
链接: https://arxiv.org/abs/2507.06029
作者: Courtney Ford,Mark T. Keane
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures, 1 table. Accepted at IJCAI 2025 Workshop on User-Aligned Assessment of Adaptive AI Systems
Abstract:Explainable AI (XAI) methods often struggle to generate clear, interpretable outputs for users without domain expertise. We introduce Feature-Guided Neighbor Selection (FGNS), a post hoc method that enhances interpretability by selecting class-representative examples using both local and global feature importance. In a user study (N = 98) evaluating Kannada script classifications, FGNS significantly improved non-experts’ ability to identify model errors while maintaining appropriate agreement with correct predictions. Participants made faster and more accurate decisions compared to those given traditional k-NN explanations. Quantitative analysis shows that FGNS selects neighbors that better reflect class characteristics rather than merely minimizing feature-space distance, leading to more consistent selection and tighter clustering around class prototypes. These results support FGNS as a step toward more human-aligned model assessment, although further work is needed to address the gap between explanation quality and perceived trust.
zh
[AI-23] CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation
【速读】:该论文试图解决将自然语言翻译为可执行SQL的问题,这是语言理解和结构化数据访问交叉领域的核心挑战。其解决方案的关键在于提出一种基于强化学习(RL)的框架CogniSQL-R1-Zero,该框架通过使用轻量级奖励信号(基于执行正确性和格式标签合规性)生成准确的SQL语句,避免了中间监督、混合流水线和复杂奖励设计,从而促进了稳定的学习并增强了与最终任务目标——生成可执行程序的一致性。
链接: https://arxiv.org/abs/2507.06013
作者: Kushal Gajjar,Harshit Sikchi,Arpit Singh Gautam,Marc Hammons,Saurabh Jha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Translating natural language into SQL (Text-to-SQL) remains a core challenge at the intersection of language understanding and structured data access. Although large language models (LLMs) have improved fluency, generating correct and executable SQL, especially for complex queries, continues to be challenging. We introduce CogniSQL-R1-Zero, a reinforcement learning (RL) framework and model that produces accurate SQL using a lightweight reward signal based on execution correctness and format-tag compliance. By avoiding intermediate supervision, hybrid pipelines and complex reward shaping, our method encourages stable learning and stronger alignment with the ultimate task objective-producing executable programs. CogniSQL-R1-Zero achieves state-of-the-art execution accuracy on Text2SQL benchmark; BIRD bench, outperforming prior supervised and instruction-tuned baselines including SFT CodeS-7B, DeepSeek-Coder 236B, and Mistral 123B-despite being trained on a significantly smaller 7B backbone. This result underscores the scalability and efficiency of our RL-based approach when trained on just four NVIDIA A100 GPUs (40 GB VRAM each). To support further research in efficient and interpretable Text-to-SQL modeling, we release two curated datasets: (i) a collection of 5,024 reasoning traces with varying context lengths, and (ii) a positive-sampled corpus of 36,356 corpus of weakly supervised queries, each annotated with six semantically diverse reasoning paths. Together, these contributions advance scalable, execution-aligned Text-to-SQL generation.
zh
[AI-24] he Impact of Event Data Partitioning on Privacy-aware Process Discovery
【速读】:该论文试图解决在保留流程发现实用性的同时对事件日志进行匿名化所面临的隐私挑战。其关键解决方案是提出一种结合匿名化与事件数据分块的管道,其中利用事件抽象(event abstraction)进行分块。通过事件抽象,可以将事件日志分割为多个部分,从而分别对每个子日志进行匿名化处理,从而在保护隐私的同时减少实用性的损失。
链接: https://arxiv.org/abs/2507.06008
作者: Jungeun Lim,Stephan A. Fahrenkrog-Petersen,Xixi Lu,Jan Mendling,Minseok Song
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Information systems support the execution of business processes. The event logs of these executions generally contain sensitive information about customers, patients, and employees. The corresponding privacy challenges can be addressed by anonymizing the event logs while still retaining utility for process discovery. However, trading off utility and privacy is difficult: the higher the complexity of event log, the higher the loss of utility by anonymization. In this work, we propose a pipeline that combines anonymization and event data partitioning, where event abstraction is utilized for partitioning. By leveraging event abstraction, event logs can be segmented into multiple parts, allowing each sub-log to be anonymized separately. This pipeline preserves privacy while mitigating the loss of utility. To validate our approach, we study the impact of event partitioning on two anonymization techniques using three real-world event logs and two process discovery techniques. Our results demonstrate that event partitioning can bring improvements in process discovery utility for directly-follows-based anonymization techniques.
zh
[AI-25] Enhancing the Interpretability of Rule-based Explanations through Information Retrieval
【速读】:该论文试图解决数据驱动的人工智能技术在医疗决策中的可解释性不足问题(the lack of transparency of data-driven Artificial Intelligence techniques),从而限制了其在医疗领域的接受度。解决方案的关键在于提出一种基于属性的分析方法,通过信息检索技术的标准指标对基于规则的预测模型中的属性进行统计分析,以计算每个属性对预测的相关性,并为用户提供关于风险因素影响的可解释信息。
链接: https://arxiv.org/abs/2507.05976
作者: Alessandro Umbrico,Guido Bologna,Luca Coraci,Francesca Fracasso,Silvia Gola,Gabriella Cortellessa
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:The lack of transparency of data-driven Artificial Intelligence techniques limits their interpretability and acceptance into healthcare decision-making processes. We propose an attribution-based approach to improve the interpretability of Explainable AI-based predictions in the specific context of arm lymphedema’s risk assessment after lymph nodal radiotherapy in breast cancer. The proposed method performs a statistical analysis of the attributes in the rule-based prediction model using standard metrics from Information Retrieval techniques. This analysis computes the relevance of each attribute to the prediction and provides users with interpretable information about the impact of risk factors. The results of a user study that compared the output generated by the proposed approach with the raw output of the Explainable AI model suggested higher levels of interpretability and usefulness in the context of predicting lymphedema risk.
zh
[AI-26] Simple Convergence Proof of Adam From a Sign-like Descent Perspective
【速读】:该论文试图解决Adam优化器在理论收敛分析上的不足,特别是现有研究将其视为带有动量的预条件随机梯度下降(SGDM)所导致的复杂性和不透明性。其解决方案的关键在于将Adam重新解释为一种类似符号的优化器,通过形式化为 \bmx_{t+1} = \bmx_t - \gamma_t \frac{|\bmm_t|}{\sqrt{\bmv_t}+\epsilon} \circ \rm Sign(\bmm_t),从而显著简化了收敛性分析。该方法在较弱的假设下证明了Adam能够达到最优收敛率 O(T1/41),并提供了关于动量作用的新见解及学习率调优的实践指导。
链接: https://arxiv.org/abs/2507.05966
作者: Hanyang Peng,Shuang Qin,Yue Yu,Fangqing Jiang,Hui Wang,Zhouchen Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 2figures
Abstract:Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as \bmx_t+1 = \bmx_t - \frac\gamma_t\sqrt\bmv_t+\epsilon \circ \bmm_t . This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as \bmx_t+1 = \bmx_t - \gamma_t \frac|\bmm_t|\sqrt\bmv_t+\epsilon \circ \rm Sign(\bmm_t) . This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of \cal O(\frac1T^\sfrac14) rather than the previous \cal O \left(\frac\ln TT^\sfrac14\right) under weak assumptions of the generalized p -affine variance and (L_0, L_1, q) -smoothness, without dependence on the model dimensionality or the numerical stability parameter \epsilon . Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
zh
[AI-27] Complexity Results of Persuasion
【速读】:该论文试图解决说服(persuasion)是否属于计算复杂性中的难解问题,即是否存在有效的算法能够在多项式时间内求解该问题。论文的核心贡献是证明了说服问题是一个NP完全问题。解决方案的关键在于将说服问题转化为一个已知的NP完全问题,并通过多项式时间归约的方式证明其计算复杂性,从而确立了该问题在计算理论中的难度地位。
链接: https://arxiv.org/abs/2507.05951
作者: Alban Grastien
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
备注:
Abstract:We prove that persuasion is an NP-complete problem.
zh
[AI-28] A Wireless Foundation Model for Multi-Task Prediction
【速读】:该论文旨在解决移动通信网络中关键系统参数(如信道状态信息、用户位置和网络流量)预测的泛化能力不足问题,传统深度学习方法在不同场景和任务间难以有效迁移。其解决方案的关键在于提出一个统一的基础模型,该模型通过单变量分解实现异构任务的统一、引入粒度编码以增强区间感知能力,并采用因果Transformer架构提升预测准确性,同时在训练中引入补丁掩码策略以支持任意输入长度。
链接: https://arxiv.org/abs/2507.05938
作者: Yucheng Sheng,Jiacheng Wang,Xingyu Zhou,Le Liang,Hao Ye,Shi Jin,Geoffrey Ye Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:With the growing complexity and dynamics of the mobile communication networks, accurately predicting key system parameters, such as channel state information (CSI), user location, and network traffic, has become essential for a wide range of physical (PHY)-layer and medium access control (MAC)-layer tasks. Although traditional deep learning (DL)-based methods have been widely applied to such prediction tasks, they often struggle to generalize across different scenarios and tasks. In response, we propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. The proposed model enforces univariate decomposition to unify heterogeneous tasks, encodes granularity for interval awareness, and uses a causal Transformer backbone for accurate predictions. Additionally, we introduce a patch masking strategy during training to support arbitrary input lengths. After trained on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and achieves zero-shot performance on new tasks that surpass traditional full-shot baselines.
zh
[AI-29] BlueLM-2.5-3B Technical Report
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)在边缘设备上部署时面临的性能与效率之间的平衡问题,同时实现对思考过程的显式控制。其关键解决方案包括通过多样化数据整理、关键数据重采样、混合异构强化学习以及高性能训练基础设施进行模型优化,从而在仅29亿参数规模下实现了卓越的多模态能力与文本处理性能。
链接: https://arxiv.org/abs/2507.05934
作者: Baojiao Xiong,Boheng Chen,Chengzhi Wang,Daxiong Luo,Dongsheng Xu,Dongyang Liu,Fan Yang,Fangyuan Li,Fei Teng,Feng Wang,Fukang Qin,Fuquan Peng,Guanxin Tan,Guozhi Wang,Haibo Yu,Haohao Gao,Heng Liu,Hongbo Yang,Hongjian Zou,Houzheng Shen,Hu Meng,Huan Li,Hui Tan,Jiali Chen,Jianzhao Chen,Jinliang Zhu,Kai Wang,Lei Wu,Liangbing Liu,Liuyang Bian,Liyan He,Long Liu,Peiwen Li,Penggang Shi,Qi Ding,Rui Hu,Shuai Cao,Shuai Ren,Shuang Peng,Teng Xie,Weiji Chen,Weilin Xiang,Weixin Wu,Xi Yin,Xiaoxin Chen,Xu Chen,Yafei Wen,Yan Hu,Yanzhou Yang,Yina Xie,Yinghao Chen,Yixuan Liao,Yu Geng,Yuanjiang Ouyang,Yuanzhuo Yang,Yuehua He,Yushuai Peng,Zhaoxiong Wang,Zheng Wang,Zhibo Zhou,Ziyang Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present BlueLM-2.5-3B, a compact and unified dense Multimodal Large Language Model (MLLM) designed for efficient edge-device deployment, offering strong general-purpose and reasoning capabilities. To the best of our knowledge, this is the first 3B-scale MLLM to support both thinking and non-thinking modes, while also enabling explicit control over thinking token budget. BlueLM-2.5-3B is developed through diversified data curation, key data resampling, hybrid heterogeneous reinforcement learning, and a high-performance training infrastructure. Our model achieves superior multimodal capacity while preserving competitive pure-text performance with only 2.9 billion parameters. We conduct comprehensive evaluations across a broad range of multimodal and text-only benchmarks. In thinking mode, BlueLM-2.5-3B achieves comparable performance to Qwen3-4B on text-only benchmarks, and trails the larger Kimi-VL-A3B-16B by only about 5% on average across multimodal evaluations. In non-thinking mode, it outperforms Qwen2.5-VL-3B on the majority of multimodal benchmarks. Additionally, BlueLM-2.5-3B exhibits exceptional data efficiency. All of the aforementioned performance is achieved with substantially less total training data than Qwen2.5-VL-3B and Qwen3-4B. We hope our work contributes to the advancement of high-performance, on-device MLLMs and provides meaningful insights to the research community.
zh
[AI-30] Differentiable Reward Optimization for LLM based TTS system
【速读】:该论文旨在解决神经编解码语言模型语音合成(TTS)系统中发音准确性和指令遵循能力不足的问题。其解决方案的关键在于提出一种新型的可微分奖励优化(Differentiable Reward Optimization, DiffRO)方法,该方法直接基于神经编解码令牌计算奖励,而非依赖合成音频,并利用Gumbel-Softmax技术使奖励函数具有可微性,从而简化了强化学习从人类反馈(RLHF)的训练过程。此外,引入多任务奖励(Multi-Task Reward, MTR)模型以提供多角度反馈,进一步增强了系统对指令的遵循能力。
链接: https://arxiv.org/abs/2507.05911
作者: Changfeng Gao,Zhihao Du,Shiliang Zhang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system’s capability to follow instructions this http URL results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.
zh
[AI-31] Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why
【速读】:该论文试图解决从示范中学习(Learning from Demonstrations, LfD)中的奖励函数结构设计与策略学习之间的关系问题。其解决方案的关键在于比较基于特征的方法与生成对抗网络(GAN)方法的优劣,并强调结构化运动表示的重要性,以实现更平滑的过渡、可控的合成以及任务集成的提升。研究认为,特征方法与GAN方法之间的二元对立正在变得复杂,选择应基于任务特定的优先级,如保真度、多样性、可解释性和适应性,从而为从示范中学习提供系统的算法权衡与设计考量。
链接: https://arxiv.org/abs/2507.05906
作者: Chenhao Li,Marco Hutter,Andreas Krause
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
备注:
Abstract:This survey provides a comparative analysis of feature-based and GAN-based approaches to learning from demonstrations, with a focus on the structure of reward functions and their implications for policy learning. Feature-based methods offer dense, interpretable rewards that excel at high-fidelity motion imitation, yet often require sophisticated representations of references and struggle with generalization in unstructured settings. GAN-based methods, in contrast, use implicit, distributional supervision that enables scalability and adaptation flexibility, but are prone to training instability and coarse reward signals. Recent advancements in both paradigms converge on the importance of structured motion representations, which enable smoother transitions, controllable synthesis, and improved task integration. We argue that the dichotomy between feature-based and GAN-based methods is increasingly nuanced: rather than one paradigm dominating the other, the choice should be guided by task-specific priorities such as fidelity, diversity, interpretability, and adaptability. This work outlines the algorithmic trade-offs and design considerations that underlie method selection, offering a framework for principled decision-making in learning from demonstrations.
zh
[AI-32] Universal Embeddings of Tabular Data VLDB2025
【速读】:该论文试图解决在工业数据库中,由于应用任务未预先定义而导致的表格式数据分析与解释难题。其解决方案的关键在于提出一种新颖的框架,用于生成与任务无关的表格式数据嵌入表示。该方法通过将表格式数据转换为图结构,利用图自编码器生成实体嵌入,并进一步聚合得到每行数据的嵌入表示,从而实现对未见样本的高效嵌入和下游任务(如回归、分类或异常检测)的执行。
链接: https://arxiv.org/abs/2507.05904
作者: Astrid Franz,Frederik Hoppe,Marianne Michaelis,Udo Göbel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at Tabular Data Analysis (TaDA) Workshop at VLDB 2025
Abstract:Tabular data in relational databases represents a significant portion of industrial data. Hence, analyzing and interpreting tabular data is of utmost importance. Application tasks on tabular data are manifold and are often not specified when setting up an industrial database. To address this, we present a novel framework for generating universal, i.e., task-independent embeddings of tabular data for performing downstream tasks without predefined targets. Our method transforms tabular data into a graph structure, leverages Graph Auto-Encoders to create entity embeddings, which are subsequently aggregated to obtain embeddings for each table row, i.e., each data sample. This two-step approach has the advantage that unseen samples, consisting of similar entities, can be embedded without additional training. Downstream tasks such as regression, classification or outlier detection, can then be performed by applying a distance-based similarity measure in the embedding space. Experiments on real-world datasets demonstrate that our method achieves superior performance compared to existing universal tabular data embedding techniques.
zh
[AI-33] Decomposing the Time Series Forecasting Pipeline: A Modular Approach for Time Series Representation Information Extraction and Projection
【速读】:该论文旨在解决时间序列预测中的挑战,包括有效的序列表示、有意义的信息提取以及精确的未来预测。其解决方案的关键在于将时间序列预测流程系统性地分解为三个核心阶段:输入序列表示、信息提取与记忆构建、最终目标投影,并在每个阶段中研究多种架构配置以评估不同模块(如卷积层和自注意力机制)在多种预测任务中的有效性。通过这种方法,模型在保持高预测精度的同时显著提升了计算效率。
链接: https://arxiv.org/abs/2507.05891
作者: Robert Leppich,Michael Stenger,André Bauer,Samuel Kounev
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the advent of Transformers, time series forecasting has seen significant advances, yet it remains challenging due to the need for effective sequence representation, memory construction, and accurate target projection. Time series forecasting remains a challenging task, demanding effective sequence representation, meaningful information extraction, and precise future projection. Each dataset and forecasting configuration constitutes a distinct task, each posing unique challenges the model must overcome to produce accurate predictions. To systematically address these task-specific difficulties, this work decomposes the time series forecasting pipeline into three core stages: input sequence representation, information extraction and memory construction, and final target projection. Within each stage, we investigate a range of architectural configurations to assess the effectiveness of various modules, such as convolutional layers for feature extraction and self-attention mechanisms for information extraction, across diverse forecasting tasks, including evaluations on seven benchmark datasets. Our models achieve state-of-the-art forecasting accuracy while greatly enhancing computational efficiency, with reduced training and inference times and a lower parameter count. The source code is available at this https URL.
zh
[AI-34] Current Practices for Building LLM -Powered Reasoning Tools Are Ad Hoc – and We Can Do Better
【速读】:该论文试图解决当前构建神经符号化自动化推理(Neurosymbolic Automated Reasoning, AR)系统时所面临的挑战,即现有的实现方式是一种缺乏理论基础的临时编程模型,未能提供传统符号算法的强保证,也未能实现神经网络与符号推理的深度同步,从而无法充分发挥由大型语言模型(LLM)驱动的推理潜力。论文提出的解决方案是引入神经符号化转移系统(Neurosymbolic Transition Systems),其关键在于将符号状态与直觉(intuition)配对,并在符号和直觉上并行进行状态转移,从而在保持符号算法强保证的同时,扩展逻辑推理的能力。
链接: https://arxiv.org/abs/2507.05886
作者: Aaron Bembenek(The University of Melbourne)
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 6 pages, 4 figures
Abstract:There is growing excitement about building software verifiers, synthesizers, and other Automated Reasoning (AR) tools by combining traditional symbolic algorithms and Large Language Models (LLMs). Unfortunately, the current practice for constructing such neurosymbolic AR systems is an ad hoc programming model that does not have the strong guarantees of traditional symbolic algorithms, nor a deep enough synchronization of neural networks and symbolic reasoning to unlock the full potential of LLM-powered reasoning. I propose Neurosymbolic Transition Systems as a principled computational model that can underlie infrastructure for building neurosymbolic AR tools. In this model, symbolic state is paired with intuition, and state transitions operate over symbols and intuition in parallel. I argue why this new paradigm can scale logical reasoning beyond current capabilities while retaining the strong guarantees of symbolic algorithms, and I sketch out how the computational model I propose can be reified in a logic programming language.
zh
[AI-35] Comparison of Path Planning Algorithms for Autonomous Vehicle Navigation Using Satellite and Airborne LiDAR Data
【速读】:该论文旨在解决在非结构化环境(如森林和山地区域)中自动驾驶车辆的路径规划问题,其核心挑战在于不规则地形和复杂的道路条件。解决方案的关键在于对主流且成熟的路径规划算法进行比较评估,包括A*、Dijkstra、RRT*以及一种改进的蚁群优化算法(NIACO),这些算法应用于从高分辨率卫星影像和机载LiDAR数据生成的加权像素级道路网络。研究分别在2D和3D道路地图上测试了这些算法,并基于路径成本、计算时间和内存消耗进行评估,结果表明Dijkstra算法在静态地形导航中表现出最稳定和高效的性能。
链接: https://arxiv.org/abs/2507.05884
作者: Chang Liu,Zhexiong Xue,Tamas Sziranyi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 67th International Symposium ELMAR-2025 15-17 September 2025 Zadar, Croatia
Abstract:Autonomous vehicle navigation in unstructured environments, such as forests and mountainous regions, presents significant challenges due to irregular terrain and complex road conditions. This work provides a comparative evaluation of mainstream and well-established path planning algorithms applied to weighted pixel-level road networks derived from high-resolution satellite imagery and airborne LiDAR data. For 2D road-map navigation, where the weights reflect road conditions and terrain difficulty, A*, Dijkstra, RRT*, and a Novel Improved Ant Colony Optimization Algorithm (NIACO) are tested on the DeepGlobe satellite dataset. For 3D road-map path planning, 3D A*, 3D Dijkstra, RRT-Connect, and NIACO are evaluated using the Hamilton airborne LiDAR dataset, which provides detailed elevation information. All algorithms are assessed under identical start and end point conditions, focusing on path cost, computation time, and memory consumption. Results demonstrate that Dijkstra consistently offers the most stable and efficient performance in both 2D and 3D scenarios, particularly when operating on dense, pixel-level geospatial road-maps. These findings highlight the reliability of Dijkstra-based planning for static terrain navigation and establish a foundation for future research on dynamic path planning under complex environmental constraints.
zh
[AI-36] CogniPlay: a work-in-progress Human-like model for General Game Playing
【速读】:该论文试图解决当前人工智能系统虽然在多种游戏中表现优异,但仍未达到真正“类人”的问题,其核心在于这些系统缺乏人类认知中基于模式的直觉决策过程。解决方案的关键在于借鉴认知心理学的研究成果和以往对类人行为建模的努力,提出一种基于这些观察的模型——CogniPlay,以更贴近人类在通用游戏环境中的行为表现。
链接: https://arxiv.org/abs/2507.05868
作者: Aloïs Rautureau,Éric Piette
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure
Abstract:While AI systems have equaled or surpassed human performance in a wide variety of games such as Chess, Go, or Dota 2, describing these systems as truly “human-like” remains far-fetched. Despite their success, they fail to replicate the pattern-based, intuitive decision-making processes observed in human cognition. This paper presents an overview of findings from cognitive psychology and previous efforts to model human-like behavior in artificial agents, discusses their applicability to General Game Playing (GGP) and introduces our work-in-progress model based on these observations: CogniPlay.
zh
[AI-37] Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing
【速读】:该论文旨在解决在资源受限的移动设备上部署深度神经网络(Deep Neural Networks, DNNs)时面临的实时性能挑战,尤其是在计算资源和电池寿命有限的情况下。现有方法主要依赖于逐层模型分割,但由于DNN操作的顺序执行导致了显著的传输瓶颈。论文提出的解决方案——Intra-DP,其关键在于采用一种基于局部算子(local operators)的并行计算技术,通过将计算分解为多个独立子操作,并利用并行执行重叠不同子操作的计算与传输,从而缓解边缘计算中的传输瓶颈,实现快速且节能的推理。
链接: https://arxiv.org/abs/2507.05829
作者: Zekai Sun,Xiuxian Guan,Zheng Lin,Zihan Fang,Xiangming Cai,Zhe Chen,Fangming Liu,Heming Cui,Jie Xiong,Wei Ni,Chau Yuen
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 19 figures
Abstract:Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while simultaneously coping with limited computational resources and battery life. While Mobile Edge Computing (MEC) offers collaborative inference with GPU servers as a promising solution, existing approaches primarily rely on layer-wise model partitioning and undergo significant transmission bottlenecks caused by the sequential execution of DNN operations. To address this challenge, we present Intra-DP, a high-performance collaborative inference system optimized for DNN inference on MEC. Intra DP employs a novel parallel computing technique based on local operators (i.e., operators whose minimum unit input is not the entire input tensor, such as the convolution kernel). By decomposing their computations (operations) into several independent sub-operations and overlapping the computation and transmission of different sub-operations through parallel execution, Intra-DP mitigates transmission bottlenecks in MEC, achieving fast and energy-efficient inference. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines, without sacrificing accuracy.
zh
[AI-38] Constella: Supporting Storywriters Interconnected Character Creation through LLM -based Multi-Agents
【速读】:该论文试图解决长篇故事创作中角色塑造过程中存在的挑战,包括创作者难以设想能够影响现有角色的新角色、平衡角色间的相似性与差异性以及细致描绘角色间关系的问题。解决方案的关键在于设计Constella,这是一个基于大语言模型(LLM)的多智能体工具,通过FRIENDS DISCOVERY功能推荐相关角色、JOURNALS功能同时揭示多个角色的内心世界,并利用COMMENTS功能通过角色间的互动展现关系,从而支持创作者更高效地进行角色构建与关系深化。
链接: https://arxiv.org/abs/2507.05820
作者: Syemin Park,Soobin Park,Youn-kyung Lim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 50 pages
Abstract:Creating a cast of characters by attending to their relational dynamics is a critical aspect of most long-form storywriting. However, our formative study (N=14) reveals that writers struggle to envision new characters that could influence existing ones, to balance similarities and differences among characters, and to intricately flesh out their relationships. Based on these observations, we designed Constella, an LLM-based multi-agent tool that supports storywriters’ interconnected character creation process. Constella suggests related characters (FRIENDS DISCOVERY feature), reveals the inner mindscapes of several characters simultaneously (JOURNALS feature), and manifests relationships through inter-character responses (COMMENTS feature). Our 7-8 day deployment study with storywriters (N=11) shows that Constella enabled the creation of expansive communities composed of related characters, facilitated the comparison of characters’ thoughts and emotions, and deepened writers’ understanding of character relationships. We conclude by discussing how multi-agent interactions can help distribute writers’ attention and effort across the character cast.
zh
[AI-39] Automated Reasoning for Vulnerability Management by Design
【速读】:该论文试图解决系统脆弱性姿态管理不足的问题,当前的脆弱性管理方法无法支持对系统设计脆弱性姿态的系统性推理。解决方案的关键在于提出一种形式化基础的自动化推理机制,该机制能够帮助系统设计者识别适用于特定系统设计的脆弱性,明确指定脆弱性缓解选项,并声明所选控制措施,从而实现对脆弱性姿态的系统性管理。
链接: https://arxiv.org/abs/2507.05794
作者: Avi Shaked,Nan Messe
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
备注:
Abstract:For securing systems, it is essential to manage their vulnerability posture and design appropriate security controls. Vulnerability management allows to proactively address vulnerabilities by incorporating pertinent security controls into systems designs. Current vulnerability management approaches do not support systematic reasoning about the vulnerability postures of systems designs. To effectively manage vulnerabilities and design security controls, we propose a formally grounded automated reasoning mechanism. We integrate the mechanism into an open-source security design tool and demonstrate its application through an illustrative example driven by real-world challenges. The automated reasoning mechanism allows system designers to identify vulnerabilities that are applicable to a specific system design, explicitly specify vulnerability mitigation options, declare selected controls, and thus systematically manage vulnerability postures.
zh
[AI-40] GTA1: GUI Test-time Scaling Agent
【速读】:该论文旨在解决图形用户界面(GUI)代理在任务规划中的模糊性以及在复杂高分辨率界面中准确定位操作目标的挑战。其解决方案的关键在于引入一种测试时扩展方法,通过在每一步采样多个候选操作提案并利用判断模型进行评估与选择,以提高决策质量;同时提出一种基于强化学习的模型,通过内在的目标对齐实现更精确的视觉元素定位。
链接: https://arxiv.org/abs/2507.05791
作者: Yan Yang,Dongxu Li,Yutong Dai,Yuhao Yang,Ziyang Luo,Zirui Zhao,Zhiyuan Hu,Junzhe Huang,Amrita Saha,Zeyuan Chen,Ran Xu,Liyuan Pan,Caiming Xiong,Junnan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05791 [cs.AI] (or arXiv:2507.05791v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.05791 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-41] Real-time monitoring of the SoH of lithium-ion batteries
【速读】:该论文试图解决微电网中电池状态健康(SoH)实时监测的问题,这一问题在传统方法受限的运行条件下尤为突出。解决方案的关键在于利用充电阶段末期放电脉冲的分析,通过等效电气模型描述电池端电压在该电流脉冲下的变化参数,进而估计电池的SoH。
链接: https://arxiv.org/abs/2507.05765
作者: Bruno Jammes(LAAS-ISGE),Edgar Hernando Sepúlveda-Oviedo(LAAS-ISGE),Corinne Alonso(LAAS-ISGE)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: in French language, Symposium de G{é}nie {É}lectrique SGE 2025, Jul 2025, Toulouse, France
Abstract:Real-time monitoring of the state of health (SoH) of batteries remains a major challenge, particularly in microgrids where operational constraints limit the use of traditional methods. As part of the 4BLife project, we propose an innovative method based on the analysis of a discharge pulse at the end of the charge phase. The parameters of the equivalent electrical model describing the voltage evolution across the battery terminals during this current pulse are then used to estimate the SoH. Based on the experimental data acquired so far, the initial results demonstrate the relevance of the proposed approach. After training using the parameters of two batteries with a capacity degradation of around 85%, we successfully predicted the degradation of two other batteries, cycled down to approximately 90% SoH, with a mean absolute error of around 1% in the worst case, and an explainability score of the estimator close to 0.9. If these performances are confirmed, this method can be easily integrated into battery management systems (BMS) and paves the way for optimized battery management under continuous operation.
zh
[AI-42] An autonomous agent for auditing and improving the reliability of clinical AI models
【速读】:该论文试图解决生成式 AI (Generative AI) 在临床实践中部署时面临的关键问题:即尽管某些 AI 模型在基准测试中表现出专家级性能,但在面对医学影像中的真实世界变化时可能会出现灾难性失败。当前的可靠性审计过程是定制化且耗时的,缺乏可访问且可解释的工具来暴露和修复隐藏的失败模式。论文提出的解决方案是 ModelAuditor,其关键在于它作为一个自省代理,能够与用户交互、选择任务特定指标,并模拟依赖于上下文的、临床上相关的分布偏移,从而生成可解释的报告,说明性能可能下降的程度,讨论具体的失败模式,并识别根本原因及缓解策略。
链接: https://arxiv.org/abs/2507.05755
作者: Lukas Kuhn,Florian Buettner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of AI models in clinical practice faces a critical challenge: models achieving expert-level performance on benchmarks can fail catastrophically when confronted with real-world variations in medical imaging. Minor shifts in scanner hardware, lighting or demographics can erode accuracy, but currently reliability auditing to identify such catastrophic failure cases before deployment is a bespoke and time-consuming process. Practitioners lack accessible and interpretable tools to expose and repair hidden failure modes. Here we introduce ModelAuditor, a self-reflective agent that converses with users, selects task-specific metrics, and simulates context-dependent, clinically relevant distribution shifts. ModelAuditor then generates interpretable reports explaining how much performance likely degrades during deployment, discussing specific likely failure modes and identifying root causes and mitigation strategies. Our comprehensive evaluation across three real-world clinical scenarios - inter-institutional variation in histopathology, demographic shifts in dermatology, and equipment heterogeneity in chest radiography - demonstrates that ModelAuditor is able correctly identify context-specific failure modes of state-of-the-art models such as the established SIIM-ISIC melanoma classifier. Its targeted recommendations recover 15-25% of performance lost under real-world distribution shift, substantially outperforming both baseline models and state-of-the-art augmentation methods. These improvements are achieved through a multi-agent architecture and execute on consumer hardware in under 10 minutes, costing less than US 0.50 per audit.
zh
[AI-43] LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving
【速读】:该论文旨在解决城市自动驾驶系统大规模部署中面临的复杂场景和边缘案例问题,现有系统在语义信息解析和交通参与者意图识别方面存在不足,导致决策与熟练驾驶员的推理模式不一致。其解决方案的关键在于提出LeAD架构,该架构融合了基于模仿学习的端到端(E2E)框架与大语言模型(LLM)增强机制,通过高频E2E子系统维持实时感知-规划-控制循环,以及低频LLM模块利用多模态感知融合高精地图并借助思维链(CoT)推理在基础规划器能力受限时生成最优决策。
链接: https://arxiv.org/abs/2507.05754
作者: Yuhang Zhang,Jiaqi Liu,Chengkai Xu,Peng Hang,Jian Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:A principal barrier to large-scale deployment of urban autonomous driving systems lies in the prevalence of complex scenarios and edge cases. Existing systems fail to effectively interpret semantic information within traffic contexts and discern intentions of other participants, consequently generating decisions misaligned with skilled drivers’ reasoning patterns. We present LeAD, a dual-rate autonomous driving architecture integrating imitation learning-based end-to-end (E2E) frameworks with large language model (LLM) augmentation. The high-frequency E2E subsystem maintains real-time perception-planning-control cycles, while the low-frequency LLM module enhances scenario comprehension through multi-modal perception fusion with HD maps and derives optimal decisions via chain-of-thought (CoT) reasoning when baseline planners encounter capability limitations. Our experimental evaluation in the CARLA Simulator demonstrates LeAD’s superior handling of unconventional scenarios, achieving 71 points on Leaderboard V1 benchmark, with a route completion of 93%.
zh
[AI-44] When Transformers Meet Recommenders: Integrating Self-Attentive Sequential Recommendation with Fine-Tuned LLM s
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在推荐系统中因缺乏领域特定知识和协同信号而导致的推荐质量不足的问题。其解决方案的关键在于提出SASRecLLM框架,该框架将自注意力序列推荐(Self-Attentive Sequential Recommendation, SASRec)作为协同编码器,并与经过低秩适应(Low-Rank Adaptation, LoRA)微调的LLM相结合,通过映射层对齐维度空间,并设计了三种针对性的训练策略以优化混合架构。
链接: https://arxiv.org/abs/2507.05733
作者: Kechen Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-Attentive Sequential Recommendation (SASRec) effectively captures long-term user preferences by applying attention mechanisms to historical interactions. Concurrently, the rise of Large Language Models (LLMs) has motivated research into LLM-based recommendation, which leverages their powerful generalization and language understanding capabilities. However, LLMs often lack the domain-specific knowledge and collaborative signals essential for high-quality recommendations when relying solely on textual prompts. To address this limitation, this study proposes SASRecLLM, a novel framework that integrates SASRec as a collaborative encoder with an LLM fine-tuned using Low-Rank Adaptation (LoRA). The components are connected via a mapping layer to align their dimensional spaces, and three targeted training strategies are designed to optimize the hybrid architecture. Extensive experiments on multiple datasets demonstrate that SASRecLLM achieves robust and consistent improvements over strong baselines in both cold-start and warm-start scenarios. This work advances the field of LLM-based recommendation by presenting a modular and effective paradigm for fusing structured collaborative filtering with the semantic power of fine-tuned LLMs. The implementation is available on GitHub: this https URL
zh
[AI-45] A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation
【速读】:该论文试图解决在低地球轨道(LEO)卫星网络中部署生成式AI(Generative AI)模型进行实时地球观测应用时面临的挑战,包括卫星快速运动、卫星与地面站(GS)通信窗口短暂以及图像数据量大导致的数据下载难题。解决方案的关键在于设计SpaceVerse系统,通过在卫星上部署轻量级LVLM以处理轻量任务,而在地面站运行常规LVLM以处理计算密集型任务,并采用计算与通信协同设计框架,包括渐进置信度网络和基于注意力的多尺度预处理方法,以识别卫星端推理数据并减少传输前的数据冗余,从而提升精度并降低延迟。
链接: https://arxiv.org/abs/2507.05731
作者: Yuxin Zhang,Jiahao Yang,Zhe Chen,Wenjun Zhu,Jin Zhao,Yue Gao
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 12 figures
Abstract:Recently, large vision-language models (LVLMs) unleash powerful analysis capabilities for low Earth orbit (LEO) satellite Earth observation images in the data center. However, fast satellite motion, brief satellite-ground station (GS) contact windows, and large size of the images pose a data download challenge. To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we should explore how to deploy LVLM in LEO satellite networks, and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. To this end, firstly, we deploy compact LVLMs on satellites for lightweight tasks, whereas regular LVLMs operate on GSs to handle computationally intensive tasks. Then, we propose a computing and communication co-design framework comprised of a progressive confidence network and an attention-based multi-scale preprocessing, used to identify on-satellite inferring data, and reduce data redundancy before satellite-GS transmission, separately. We implement and evaluate SpaceVerse on real-world LEO satellite constellations and datasets, achieving a 31.2% average gain in accuracy and a 51.2% reduction in latency compared to state-of-the-art baselines.
zh
[AI-46] Divergent Realities: A Comparative Analysis of Human Expert vs. Artificial Intelligence Based Generation and Evaluation of Treatment Plans in Dermatology
【速读】:该论文试图解决在临床治疗计划生成中,如何评估人工智能(AI)生成方案的质量问题,特别是在AI从诊断扩展到治疗规划的背景下,新出现的推理模型带来的评估挑战。其解决方案的关键在于通过对比人类专家与两种AI模型(通用型AI和推理型AI)生成的治疗计划,并由人类同行和一个更高级的AI评判者进行评分,以揭示不同评估者对同一方案质量判断的显著差异。研究发现,人类专家倾向于给予同行生成的计划更高评分,而高级AI评判者则相反,显示出经验驱动的临床直觉与数据驱动的算法逻辑之间的深刻分歧。
链接: https://arxiv.org/abs/2507.05716
作者: Dipayan Sengupta,Saumya Panda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 3 tables
Abstract:Background: Evaluating AI-generated treatment plans is a key challenge as AI expands beyond diagnostics, especially with new reasoning models. This study compares plans from human experts and two AI models (a generalist and a reasoner), assessed by both human peers and a superior AI judge. Methods: Ten dermatologists, a generalist AI (GPT-4o), and a reasoning AI (o3) generated treatment plans for five complex dermatology cases. The anonymized, normalized plans were scored in two phases: 1) by the ten human experts, and 2) by a superior AI judge (Gemini 2.5 Pro) using an identical rubric. Results: A profound ‘evaluator effect’ was observed. Human experts scored peer-generated plans significantly higher than AI plans (mean 7.62 vs. 7.16; p=0.0313), ranking GPT-4o 6th (mean 7.38) and the reasoning model, o3, 11th (mean 6.97). Conversely, the AI judge produced a complete inversion, scoring AI plans significantly higher than human plans (mean 7.75 vs. 6.79; p=0.0313). It ranked o3 1st (mean 8.20) and GPT-4o 2nd, placing all human experts lower. Conclusions: The perceived quality of a clinical plan is fundamentally dependent on the evaluator’s nature. An advanced reasoning AI, ranked poorly by human experts, was judged as superior by a sophisticated AI, revealing a deep gap between experience-based clinical heuristics and data-driven algorithmic logic. This paradox presents a critical challenge for AI integration, suggesting the future requires synergistic, explainable human-AI systems that bridge this reasoning gap to augment clinical care. Comments: 13 pages, 3 tables Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05716 [cs.AI] (or arXiv:2507.05716v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.05716 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Dipayan Sengupta [view email] [v1] Tue, 8 Jul 2025 06:59:58 UTC (215 KB)
zh
[AI-47] Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach
【速读】:该论文旨在解决在联邦学习(Federated Learning, FL)框架下训练大规模人工神经网络(Large-Scale Artificial Intelligence Models, LAMs)时所面临的系统级挑战,尤其是如何高效地协调异构客户端资源与大量专业专家之间的动态对齐问题。其解决方案的关键在于提出一种智能客户端-专家对齐的概念性系统设计,该设计包含动态适应性评分、全局专家负载监控和客户端能力分析,从而实现更高效、可扩展且鲁棒的训练机制,减少通信轮次以提升边缘计算环境中的通信效率。
链接: https://arxiv.org/abs/2507.05685
作者: Xiaobing Chen,Boyang Zhang,Xiangwei Zhou,Mingxuan Sun,Shuai Zhang,Songyang Zhang,Geoffrey Ye Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:The integration of Federated Learning (FL) and Mixture-of-Experts (MoE) presents a compelling pathway for training more powerful, large-scale artificial intelligence models (LAMs) on decentralized data while preserving privacy. However, efficient federated training of these complex MoE-structured LAMs is hindered by significant system-level challenges, particularly in managing the interplay between heterogeneous client resources and the sophisticated coordination required for numerous specialized experts. This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment that holistically considers varying client capacities and the imperative for system-wise load balancing. Specifically, we propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling. By tackling these systemic issues, we can unlock more scalable, efficient, and robust training mechanisms with fewer communication rounds for convergence, paving the way for the widespread deployment of large-scale federated MoE-structured LAMs in edge computing with ultra-high communication efficiency.
zh
[AI-48] GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks
【速读】:该论文试图解决高性能VLSI系统中时钟网格(clock mesh)分析的难题,包括重构路径、多源驱动和输入网格缓冲器时序偏差等问题。传统SPICE仿真虽然准确但速度慢,而简化模型则忽略了关键因素如边沿速率和输入偏移。论文提出的解决方案是GATMesh,其关键在于利用图神经网络(Graph Neural Network, GNN)将时钟网格建模为包含增强结构和物理特征的图,通过SPICE数据训练实现高精度的时序分析,平均延迟误差仅为5.27ps,并在速度上相比多线程SPICE仿真提升了47146倍。
链接: https://arxiv.org/abs/2507.05681
作者: Muhammad Hadir Khan,Matthew Guthaus
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Clock meshes are essential in high-performance VLSI systems for minimizing skew and handling PVT variations, but analyzing them is difficult due to reconvergent paths, multi-source driving, and input mesh buffer skew. SPICE simulations are accurate but slow; yet simplified models miss key effects like slew and input skew. We propose GATMesh, a Graph Neural Network (GNN)-based framework that models the clock mesh as a graph with augmented structural and physical features. Trained on SPICE data, GATMesh achieves high accuracy with average delay error of 5.27ps on unseen benchmarks, while achieving speed-ups of 47146x over multi-threaded SPICE simulation.
zh
[AI-49] City-Level Foreign Direct Investment Prediction with Tabular Learning on Judicial Data IJCAI2025
【速读】:该论文试图解决基于经济数据(如GDP)进行城市层面外国直接投资(FDI)预测时存在的可靠性问题,因为这些数据可能容易被操纵。解决方案的关键在于利用大规模司法数据,该数据反映了司法绩效对地方投资安全和回报的影响,从而构建一个用于评估司法绩效的指标体系,并提出一种新的基于司法数据的表格学习方法(TLJD),通过整合行数据和列数据并对不同指标的权重进行区域差异调整,以提高城市层面FDI预测的准确性。
链接: https://arxiv.org/abs/2507.05651
作者: Tianxing Wu,Lizhe Cao,Shuang Wang,Jiming Wang,Shutong Zhu,Yerong Wu,Yuqing Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, accepted by IJCAI 2025
Abstract:To advance the United Nations Sustainable Development Goal on promoting sustained, inclusive, and sustainable economic growth, foreign direct investment (FDI) plays a crucial role in catalyzing economic expansion and fostering innovation. Precise city-level FDI prediction is quite important for local government and is commonly studied based on economic data (e.g., GDP). However, such economic data could be prone to manipulation, making predictions less reliable. To address this issue, we try to leverage large-scale judicial data which reflects judicial performance influencing local investment security and returns, for city-level FDI prediction. Based on this, we first build an index system for the evaluation of judicial performance over twelve million publicly available adjudication documents according to which a tabular dataset is reformulated. We then propose a new Tabular Learning method on Judicial Data (TLJD) for city-level FDI prediction. TLJD integrates row data and column data in our built tabular dataset for judicial performance indicator encoding, and utilizes a mixture of experts model to adjust the weights of different indicators considering regional variations. To validate the effectiveness of TLJD, we design cross-city and cross-time tasks for city-level FDI predictions. Extensive experiments on both tasks demonstrate the superiority of TLJD (reach to at least 0.92 R2) over the other ten state-of-the-art baselines in different evaluation metrics.
zh
[AI-50] DESIGN: Encrypted GNN Inference via Server-Side Input Graph Pruning NEURIPS2025
【速读】:该论文试图解决在全同态加密(FHE)环境下进行图神经网络(GNN)推理时计算开销过大、难以实现实时隐私保护的问题。解决方案的关键在于提出了一种名为DESIGN的框架,其核心是通过一种分层优化策略,在服务器端完全基于加密数据进行节点重要性评分计算,并利用这些评分引导同态分区过程,生成多级重要性掩码,从而实现输入图剪枝和自适应多项式激活机制,有效提升FHE下GNN推理的效率。
链接: https://arxiv.org/abs/2507.05649
作者: Kaixiang Zhao,Joseph Yousry Attalla,Qian Lou,Yushun Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review in Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph-based learning tasks. However, enabling privacy-preserving GNNs in encrypted domains, such as under Fully Homomorphic Encryption (FHE), typically incurs substantial computational overhead, rendering real-time and privacy-preserving inference impractical. In this work, we propose DESIGN (EncrypteD GNN Inference via sErver-Side Input Graph pruNing), a novel framework for efficient encrypted GNN inference. DESIGN tackles the critical efficiency limitations of existing FHE GNN approaches, which often overlook input data redundancy and apply uniform computational strategies. Our framework achieves significant performance gains through a hierarchical optimization strategy executed entirely on the server: first, FHE-compatible node importance scores (based on encrypted degree statistics) are computed from the encrypted graph. These scores then guide a homomorphic partitioning process, generating multi-level importance masks directly under FHE. This dynamically generated mask facilitates both input graph pruning (by logically removing unimportant elements) and a novel adaptive polynomial activation scheme, where activation complexity is tailored to node importance levels. Empirical evaluations demonstrate that DESIGN substantially accelerates FHE GNN inference compared to state-of-the-art methods while maintaining competitive model accuracy, presenting a robust solution for secure graph analytics.
zh
[AI-51] FACT: the Features At Convergence Theorem for neural networks
【速读】:该论文试图解决深度学习理论中的核心问题,即理解神经网络如何学习和表示特征。其解决方案的关键是提出了“收敛特征定理”(Features at Convergence Theorem, FACT),该定理给出了在使用非零权重衰减训练时,神经网络权重在收敛时满足的自洽方程。该方程将权重矩阵 $ W $ 的“特征矩阵”$ W^\top W $ 与前向传播中输入向量的集合以及反向传播中通过该矩阵的损失梯度联系起来。通过验证这一关系,作者证明了神经特征确实在收敛时满足 FACT,并基于此提出了新的学习算法 FACT-RFM,该算法在表格数据上表现出色,并能捕捉神经网络训练中的多种特征学习行为。
链接: https://arxiv.org/abs/2507.05644
作者: Enric Boix-Adsera,Neil Mallinar,James B. Simon,Mikhail Belkin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix W , this equation relates the “feature matrix” W^\top W to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the “Recursive Feature Machines” of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.
zh
[AI-52] LLM s are Introvert
【速读】:该论文试图解决当前基于大型语言模型(Large Language Models, LLM)的信息传播模拟中存在的人类心理和行为动态刻画不足的问题,特别是立场识别和心理现实性方面的差距。其解决方案的关键在于引入基于社会信息处理理论的思维链(Social Information Processing-based Chain of Thought, SIP-CoT)机制,并结合情感引导的记忆模块,以提升对社会线索的解释能力、目标个性化以及反馈评估的准确性,从而增强LLM代理的社会智能与行为真实性。
链接: https://arxiv.org/abs/2507.05638
作者: Litian Zhang,Xiaoming Zhang,Bingyu Yan,Ziyi Zhou,Bo Zhang,Zhenyu Guan,Xi Zhang,Chaozhuo Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation. Understanding information propagation dynamics and developing effective control strategies is essential to mitigate harmful content. Traditional models, such as SIR, provide basic insights but inadequately capture the complexities of online interactions. Advanced methods, including attention mechanisms and graph neural networks, enhance accuracy but typically overlook user psychology and behavioral dynamics. Large language models (LLMs), with their human-like reasoning, offer new potential for simulating psychological aspects of information spread. We introduce an LLM-based simulation environment capturing agents’ evolving attitudes, emotions, and responses. Initial experiments, however, revealed significant gaps between LLM-generated behaviors and authentic human dynamics, especially in stance detection and psychological realism. A detailed evaluation through Social Information Processing Theory identified major discrepancies in goal-setting and feedback evaluation, stemming from the lack of emotional processing in standard LLM training. To address these issues, we propose the Social Information Processing-based Chain of Thought (SIP-CoT) mechanism enhanced by emotion-guided memory. This method improves the interpretation of social cues, personalization of goals, and evaluation of feedback. Experimental results confirm that SIP-CoT-enhanced LLM agents more effectively process social information, demonstrating behaviors, attitudes, and emotions closer to real human interactions. In summary, this research highlights critical limitations in current LLM-based propagation simulations and demonstrates how integrating SIP-CoT and emotional memory significantly enhances the social intelligence and realism of LLM agents.
zh
[AI-53] Graph Learning
【速读】:该论文试图解决图学习在实际应用中面临的一系列挑战,包括可扩展性、泛化能力、异构性、可解释性和可信度等问题。其解决方案的关键在于综述和分析当前图学习领域的前沿技术,涵盖可扩展图学习、时序图学习、多模态图学习、生成式AI、可解释AI(XAI)和负责任的AI等关键维度,旨在提供高效处理大规模图数据、捕捉动态依赖关系、融合异构数据模态、生成新颖图样本以及提升模型可解释性的方法,从而推动图学习在复杂现实场景中的广泛应用与可靠部署。
链接: https://arxiv.org/abs/2507.05636
作者: Feng Xia,Ciyuan Peng,Jing Ren,Falih Gozi Febrinanto,Renqiang Luo,Vidya Saikrishna,Shuo Yu,Xiangjie Kong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 178 pages
Abstract:Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.
zh
[AI-54] How Not to Detect Prompt Injections with an LLM
【速读】:该论文试图解决生成式 AI (Generative AI) 应用和代理中因提示注入攻击(prompt injection attacks)而导致的安全问题,这类攻击通过在看似无害的用户输入中嵌入恶意指令来操控大语言模型(LLM)的行为。论文提出的解决方案关键在于针对已有的基于已知答案检测(Known-Answer Detection, KAD)的防御机制进行分析,并揭示其设计中的结构漏洞。研究者设计了一种系统性的自适应攻击方法 DataFlip,能够以极低的检测率(低至 1.5%)规避 KAD 防御,同时以高达 88% 的成功率诱导恶意行为,且无需白盒访问 LLM 或进行任何优化过程。
链接: https://arxiv.org/abs/2507.05630
作者: Sarthak Choudhary,Divyam Anshumaan,Nils Palumbo,Somesh Jha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which adversaries embed malicious instructions within seemingly benign user inputs to manipulate the LLM’s intended behavior. Recent defenses based on \textitknown-answer detection (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise. We design a methodical adaptive attack, \textitDataFlip , to exploit this fundamental weakness. It consistently evades KAD defenses with detection rates as low as 1.5% while reliably inducing malicious behavior with success rates of up to 88% , without needing white-box access to the LLM or any optimization procedures.
zh
[AI-55] Enhancing Student Learning with LLM -Generated Retrieval Practice Questions: An Empirical Study in Data Science Courses
【速读】:该论文试图解决生成高质量检索练习问题耗时且劳动密集的问题,特别是在快速发展的技术学科中。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)自动生成多选题形式的检索练习问题,从而提升学生的学习效果和知识保留率。实验结果表明,使用LLM生成的检索练习问题能够显著提高学生的知识保留率,平均准确率达到89%,优于未使用此类练习的对照周的73%。这表明LLM生成的检索问题在支持学生学习方面具有有效性,并可能为实时教学中的检索练习整合提供可扩展的解决方案。然而,研究也指出LLM生成问题的质量可能存在差异,因此教师仍需手动验证和修改生成的问题后再发布给学生。
链接: https://arxiv.org/abs/2507.05629
作者: Yuan An,John Liu,Niyam Acharya,Ruhma Hashmi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval practice is a well-established pedagogical technique known to significantly enhance student learning and knowledge retention. However, generating high-quality retrieval practice questions is often time-consuming and labor intensive for instructors, especially in rapidly evolving technical subjects. Large Language Models (LLMs) offer the potential to automate this process by generating questions in response to prompts, yet the effectiveness of LLM-generated retrieval practice on student learning remains to be established. In this study, we conducted an empirical study involving two college-level data science courses, with approximately 60 students. We compared learning outcomes during one week in which students received LLM-generated multiple-choice retrieval practice questions to those from a week in which no such questions were provided. Results indicate that students exposed to LLM-generated retrieval practice achieved significantly higher knowledge retention, with an average accuracy of 89%, compared to 73% in the week without such practice. These findings suggest that LLM-generated retrieval questions can effectively support student learning and may provide a scalable solution for integrating retrieval practice into real-time teaching. However, despite these encouraging outcomes and the potential time-saving benefits, cautions must be taken, as the quality of LLM-generated questions can vary. Instructors must still manually verify and revise the generated questions before releasing them to students.
zh
[AI-56] ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion
【速读】:该论文旨在解决多模态情感与意图识别中因传感器故障或数据不完整导致的模态缺失问题。传统方法在尝试重建缺失信息时往往存在过度耦合和生成过程不精确的问题,从而导致次优结果。其解决方案的关键在于引入基于注意力的扩散模型(ADMC),该模型通过独立训练各模态的特征提取网络,保留各自独特特征并避免过度耦合,同时利用基于注意力的扩散网络(ADN)生成与真实多模态分布高度一致的缺失模态特征,从而提升在各种缺失模态场景下的性能。
链接: https://arxiv.org/abs/2507.05624
作者: Wei Zhang,Juan Chen,Yanbo J. Wang,En Zhu,Xuan Yang,Yiduo Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal emotion and intent recognition is essential for automated human-computer interaction, It aims to analyze users’ speech, text, and visual information to predict their emotions or intent. One of the significant challenges is that missing modalities due to sensor malfunctions or incomplete data. Traditional methods that attempt to reconstruct missing information often suffer from over-coupling and imprecise generation processes, leading to suboptimal outcomes. To address these issues, we introduce an Attention-based Diffusion model for Missing Modalities feature Completion (ADMC). Our framework independently trains feature extraction networks for each modality, preserving their unique characteristics and avoiding over-coupling. The Attention-based Diffusion Network (ADN) generates missing modality features that closely align with authentic multimodal distribution, enhancing performance across all missing-modality scenarios. Moreover, ADN’s cross-modal generation offers improved recognition even in full-modality contexts. Our approach achieves state-of-the-art results on the IEMOCAP and MIntRec benchmarks, demonstrating its effectiveness in both missing and complete modality scenarios.
zh
[AI-57] DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective
【速读】:该论文试图解决深度学习模型训练数据集的透明性问题,即如何判断一个可疑模型是否使用了特定的数据集进行训练。其解决方案的关键在于从对抗视角出发,对现有的数据集审计方法进行系统性评估,并提出相应的对抗攻击策略,以揭示现有方法在面对恶意攻击时的脆弱性。通过构建新的基准DATABench,该研究展示了当前审计方法在对抗环境下的不足,强调了开发更安全、可靠的审计机制的紧迫性。
链接: https://arxiv.org/abs/2507.05622
作者: Shuo Shao,Yiming Li,Mengren Zheng,Zhiyang Hu,Yukun Chen,Boheng Li,Yu He,Junfeng Guo,Tianwei Zhang,Dacheng Tao,Zhan Qin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The widespread application of Deep Learning across diverse domains hinges critically on the quality and composition of training datasets. However, the common lack of disclosure regarding their usage raises significant privacy and copyright concerns. Dataset auditing techniques, which aim to determine if a specific dataset was used to train a given suspicious model, provide promising solutions to addressing these transparency gaps. While prior work has developed various auditing methods, their resilience against dedicated adversarial attacks remains largely unexplored. To bridge the gap, this paper initiates a comprehensive study evaluating dataset auditing from an adversarial perspective. We start with introducing a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). Subsequently, we formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. These formulations and strategies lead to our new benchmark, DATABench, comprising 17 evasion attacks, 5 forgery attacks, and 9 representative auditing methods. Extensive evaluations using DATABench reveal that none of the evaluated auditing methods are sufficiently robust or distinctive under adversarial settings. These findings underscore the urgent need for developing a more secure and reliable dataset auditing method capable of withstanding sophisticated adversarial manipulation. Code is available at this https URL.
zh
[AI-58] Domain adaptation of large language models for geotechnical applications
【速读】:该论文试图解决如何将大型语言模型(LLMs)有效适配并应用于岩土工程领域的问题,以提升其在地质解释、地下特征描述、场地规划、设计计算、数值模拟、安全与风险评估及教育辅导等任务中的性能。解决方案的关键在于通过特定的领域适应方法,包括提示工程、检索增强生成、领域自适应预训练和微调,使通用LLMs能够更好地满足岩土工程的专业需求。
链接: https://arxiv.org/abs/2507.05613
作者: Lei Fan,Fangxue Liu,Cheng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent developments in large language models (LLMs) are opening up new opportunities in geotechnical engineering and engineering geology. While general-purpose LLMs possess broad capabilities, effective application in geotechnics often requires domain-specific adaptation. Such tailored LLMs are increasingly employed to streamline geotechnical workflows. This paper presents the first survey of the adaptation and application of LLMs in geotechnical engineering. It outlines key methodologies for adaptation to geotechnical domain, including prompt engineering, retrieval-augmented generation, domain-adaptive pretraining, and fine-tuning. The survey examines the state-of-the-art applications of geotechnical-adapted LLMs, including geological interpretation, subsurface characterization, site planning, design calculations, numerical modeling, safety and risk assessment, and educational tutoring. It also analyzes benefits and limitations of geotechnical-adapted LLMs, and identifies promising directions for future research in this interdisciplinary discipline. The findings serve as a valuable resource for practitioners seeking to integrate LLMs into geotechnical practice, while also providing a foundation to stimulate further investigation within the academic community.
zh
[AI-59] MLlm -DR: Towards Explainable Depression Recognition with MultiModal Large Language Models
【速读】:该论文试图解决自动化抑郁症诊断中缺乏明确评分依据以及现有大语言模型(Large Language Model, LLM)在处理访谈数据时表现不佳的问题。其解决方案的关键在于提出一种新型多模态大语言模型(Multimodal Large Language Model for Depression Recognition, MLLm-DR),该模型通过集成小型LLM和轻量级查询模块(Lightweight Query-former)来实现可解释的抑郁症诊断。其中,小型LLM用于生成抑郁评分及其评估依据,而LQ-former则负责从语音和视觉数据中提取与抑郁相关特征,从而提升模型对多模态信息的处理能力。
链接: https://arxiv.org/abs/2507.05591
作者: Wei Zhang,Juan Chen,En Zhu,Wenhong Cheng,YunPeng Li,Yanbo J. Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automated depression diagnosis aims to analyze multimodal information from interview videos to predict participants’ depression scores. Previous studies often lack clear explanations of how these scores were determined, limiting their adoption in clinical practice. While the advent of LLMs provides a possible pathway for explainable depression diagnosis, current LLMs capable of processing multimodal data lack training on interview data, resulting in poor diagnostic performance when used directly. In this paper, we propose a novel multimodal large language model (MLlm-DR) that can understand multimodal information inputs and supports explainable depression diagnosis. MLlm-DR integrates a smaller LLMs and a lightweight query module (LQ-former). Specifically, the smaller LLMs is designed to generate depression scores and corresponding evaluation rationales. To enhance its logical reasoning for domain-specific tasks while maintaining practicality, we constructed a robust training dataset to fine-tune it. Meanwhile, the LQ-former captures depression-related features from speech and visual data, aiding the model’s ability to process multimodal information, to achieve comprehensive depression diagnosis. Our approach achieves state-of-the-art results on two interview-based benchmark datasets, CMDC and E-DAIC-WOZ, demonstrating its effectiveness and superiority.
zh
[AI-60] owards Measurement Theory for Artificial Intelligence
【速读】:该论文试图解决如何对人工智能(Artificial Intelligence, AI)进行形式化测量的问题,旨在为研究人员、实践者和监管机构提供一种系统的方法来比较不同系统及其评估方法,并将前沿AI评估与工程和安全科学中的定量风险分析技术相联系。论文提出的关键解决方案是构建一个分层的测量架构,区分直接可观测量与间接可观测量,并通过这些要素建立一个统一且可校准的AI现象分类体系。
链接: https://arxiv.org/abs/2507.05587
作者: Elija Perrier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review for Iliad Conference 2025
Abstract:We motivate and outline a programme for a formal theory of measurement of artificial intelligence. We argue that formalising measurement for AI will allow researchers, practitioners, and regulators to: (i) make comparisons between systems and the evaluation methods applied to them; (ii) connect frontier AI evaluations with established quantitative risk analysis techniques drawn from engineering and safety science; and (iii) foreground how what counts as AI capability is contingent upon the measurement operations and scales we elect to use. We sketch a layered measurement stack, distinguish direct from indirect observables, and signpost how these ingredients provide a pathway toward a unified, calibratable taxonomy of AI phenomena.
zh
[AI-61] he Fourier Spectral Transformer Networks For Efficient and Generalizable Nonlinear PDEs Prediction
【速读】:该论文试图解决复杂动力系统实时预测与控制的问题,特别是针对流体动力学中的偏微分方程(PDEs)求解。解决方案的关键在于提出一种统一的傅里叶谱变换器网络,该网络结合了经典谱方法与基于注意力机制的神经网络架构的优势。通过将原始PDEs转换为谱常微分方程,利用高精度数值求解器生成训练数据,并使用Transformer网络建模谱系数的演化,从而实现对复杂动力系统的高精度长期预测。
链接: https://arxiv.org/abs/2507.05584
作者: Beibei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work we propose a unified Fourier Spectral Transformer network that integrates the strengths of classical spectral methods and attention based neural architectures. By transforming the original PDEs into spectral ordinary differential equations, we use high precision numerical solvers to generate training data and use a Transformer network to model the evolution of the spectral coefficients. We demonstrate the effectiveness of our approach on the two dimensional incompressible Navier-Stokes equations and the one dimensional Burgers’ equation. The results show that our spectral Transformer can achieve highly accurate long term predictions even with limited training data, better than traditional numerical methods and machine learning methods in forecasting future flow dynamics. The proposed framework generalizes well to unseen data, bringing a promising paradigm for real time prediction and control of complex dynamical systems.
zh
[AI-62] Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models
【速读】:该论文试图解决生成式 AI(Generative AI)在业务应用中因底层大语言模型(LLMs)快速演变而导致的提示(prompt)一致性不足问题,进而引发的应用行为不可预测和可靠性下降的问题。解决方案的关键在于提出“提示迁移”(prompt migration)的概念,通过系统化的方法对提示进行重新设计,并构建迁移测试环境,以恢复应用的一致性和可靠性。
链接: https://arxiv.org/abs/2507.05573
作者: Shivani Tripathi,Pushpanjali Nema,Aditya Halder,Shi Qiao,Alekh Jindal
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Generative AI is transforming business applications by enabling natural language interfaces and intelligent automation. However, the underlying large language models (LLMs) are evolving rapidly and so prompting them consistently is a challenge. This leads to inconsistent and unpredictable application behavior, undermining the reliability that businesses require for mission-critical workflows. In this paper, we introduce the concept of prompt migration as a systematic approach to stabilizing GenAI applications amid changing LLMs. Using the Tursio enterprise search application as a case study, we analyze the impact of successive GPT model upgrades, detail our migration framework including prompt redesign and a migration testbed, and demonstrate how these techniques restore application consistency. Our results show that structured prompt migration can fully recover the application reliability that was lost due to model drift. We conclude with practical lessons learned, emphasizing the need for prompt lifecycle management and robust testing to ensure dependable GenAI-powered business applications.
zh
[AI-63] SingLoRA: Low Rank Adaptation Using a Single Matrix
【速读】:该论文旨在解决低秩适配(LoRA)在微调大规模预训练模型时因两个矩阵尺度差异导致的训练动态不稳定问题,进而影响模型性能。其解决方案的关键在于提出SingLoRA,通过将权重更新重构为一个低秩矩阵与其转置的乘积,从而消除矩阵间尺度冲突,确保优化过程稳定,并将参数量大约减少一半。
链接: https://arxiv.org/abs/2507.05566
作者: David Bensaïd,Noam Rotstein,Roy Velich,Daniel Bensaïd,Ron Kimmel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weights update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLama 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.
zh
[AI-64] Search-based Selection of Metamorphic Relations for Optimized Robustness Testing of Large Language Models
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在鲁棒性测试中对可扩展的Metamorphic Relations (MRs)数量需求高,以及现有方法在测试空间覆盖和故障检测效率方面的不足。其解决方案的关键在于提出一种基于搜索的方法,通过优化MR组的选择来最大化故障检测并最小化LLM执行成本,同时涵盖MR中的组合扰动以扩展测试空间。研究实现了四种搜索算法(Single-GA、NSGA-II、SPEA2和MOEA/D)并采用新颖编码方式,以解决LLM鲁棒性测试中的MR选择问题。
链接: https://arxiv.org/abs/2507.05565
作者: Sangwon Hyun,Shaukat Ali,M. Ali Babar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Assessing the trustworthiness of Large Language Models (LLMs), such as robustness, has garnered significant attention. Recently, metamorphic testing that defines Metamorphic Relations (MRs) has been widely applied to evaluate the robustness of LLM executions. However, the MR-based robustness testing still requires a scalable number of MRs, thereby necessitating the optimization of selecting MRs. Most extant LLM testing studies are limited to automatically generating test cases (i.e., MRs) to enhance failure detection. Additionally, most studies only considered a limited test space of single perturbation MRs in their evaluation of LLMs. In contrast, our paper proposes a search-based approach for optimizing the MR groups to maximize failure detection and minimize the LLM execution cost. Moreover, our approach covers the combinatorial perturbations in MRs, facilitating the expansion of test space in the robustness assessment. We have developed a search process and implemented four search algorithms: Single-GA, NSGA-II, SPEA2, and MOEA/D with novel encoding to solve the MR selection problem in the LLM robustness testing. We conducted comparative experiments on the four search algorithms along with a random search, using two major LLMs with primary Text-to-Text tasks. Our statistical and empirical investigation revealed two key findings: (1) the MOEA/D algorithm performed the best in optimizing the MR space for LLM robustness testing, and (2) we identified silver bullet MRs for the LLM robustness testing, which demonstrated dominant capabilities in confusing LLMs across different Text-to-Text tasks. In LLM robustness assessment, our research sheds light on the fundamental problem for optimized testing and provides insights into search-based solutions.
zh
[AI-65] AI Agent Smart Contract Exploit Generation
【速读】:该论文试图解决智能合约中的漏洞自动检测与利用生成问题,旨在通过自动化手段提升攻击者在区块链环境中发现并利用漏洞的效率。其解决方案的关键在于提出A1系统,该系统基于代理执行驱动机制,将任意大语言模型(LLM)转化为端到端的漏洞利用生成器,无需依赖人工设计的启发式规则,而是通过六种领域特定工具实现自主的漏洞发现、策略生成、测试及优化,从而有效减少误报并提高攻击成功率。
链接: https://arxiv.org/abs/2507.05558
作者: Arthur Gervais,Liyi Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present A1, an agentic execution driven system that transforms any LLM into an end-to-end exploit generator. A1 has no hand-crafted heuristics and provides the agent with six domain-specific tools that enable autonomous vulnerability discovery. The agent can flexibly leverage these tools to understand smart contract behavior, generate exploit strategies, test them on blockchain states, and refine approaches based on execution feedback. All outputs are concretely validated to eliminate false positives. The evaluation across 36 real-world vulnerable contracts on Ethereum and Binance Smart Chain demonstrates a 62.96% (17 out of 27) success rate on the VERITE benchmark. Beyond the VERITE dataset, A1 identified 9 additional vulnerable contracts, with 5 cases occurring after the strongest model’s training cutoff date. Across all 26 successful cases, A1 extracts up to 8.59 million USD per case and 9.33 million USD total. Through 432 experiments across six LLMs, we analyze iteration-wise performance showing diminishing returns with average marginal gains of +9.7%, +3.7%, +5.1%, and +2.8% for iterations 2-5 respectively, with per-experiment costs ranging 0.01- 3.59. A Monte Carlo analysis of 19 historical attacks shows success probabilities of 85.9%-88.8% without detection delays. We investigate whether an attacker or a defender benefits most from deploying A1 as a continuous on-chain scanning system. Our model shows that OpenAI’s o3-pro maintains profitability up to a 30.0 days scanning delay at 0.100% vulnerability incidence rates, while faster models require =1.000% rates to break-even. The findings exposes a troubling asymmetry: at 0.1% vulnerability rates, attackers achieve an on-chain scanning profitability at a 6000 exploit value, while defenders require 60000, raising fundamental questions about whether AI agents inevitably favor exploitation over defense. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05558 [cs.CR] (or arXiv:2507.05558v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2507.05558 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-66] he Ethical Implications of AI in Creative Industries: A Focus on AI-Generated Art
【速读】:该论文试图解决生成式AI艺术在伦理层面引发的多重问题,包括环境影响、名人形象再现、知识产权侵权、深度伪造以及艺术家失业等。研究指出,生成式AI艺术导致了碳排放增加、虚假信息传播、版权侵犯、非法图像生成和职业替代等问题。解决方案的关键在于对生成式AI艺术进行正确的立法和监管,以应对这些问题并引导其健康发展。
链接: https://arxiv.org/abs/2507.05549
作者: Prerana Khatiwada,Joshua Washington,Tyler Walsh,Ahmed Saif Hamed,Lokesh Bhatta
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages
Abstract:As Artificial Intelligence (AI) continues to grow daily, more exciting (and somewhat controversial) technology emerges every other day. As we see the advancements in AI, we see more and more people becoming skeptical of it. This paper explores the complications and confusion around the ethics of generative AI art. We delve deep into the ethical side of AI, specifically generative art. We step back from the excitement and observe the impossible conundrums that this impressive technology produces. Covering environmental consequences, celebrity representation, intellectual property, deep fakes, and artist displacement. Our research found that generative AI art is responsible for increased carbon emissions, spreading misinformation, copyright infringement, unlawful depiction, and job displacement. In light of this, we propose multiple possible solutions for these problems. We address each situation’s history, cause, and consequences and offer different viewpoints. At the root of it all, though, the central theme is that generative AI Art needs to be correctly legislated and regulated.
zh
[AI-67] SenseCF: LLM -Prompted Counterfactuals for Intervention and Sensor Data Augmentation
【速读】:该论文旨在解决机器学习模型预测结果的可解释性问题,通过生成反事实解释(Counterfactual Explanations, CFs)来提供人类可理解的洞察,从而帮助理解和干预模型决策。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs),特别是GPT-4o-mini,在零样本和三样本设置下生成高质量的CFs,相较于传统方法在可解释性、有效性和稀疏性方面表现出色,并且能够通过作为增强数据提升下游分类器的性能。
链接: https://arxiv.org/abs/2507.05541
作者: Shovito Barua Soumma,Asiful Arefeen,Stephanie M. Carpenter,Melanie Hingle,Hassan Ghasemzadeh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: In review
Abstract:Counterfactual explanations (CFs) offer human-centric insights into machine learning predictions by highlighting minimal changes required to alter an outcome. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. In this work, we explore large language models (LLMs), specifically GPT-4o-mini, for generating CFs in a zero-shot and three-shot setting. We evaluate our approach on two datasets: the AI-Readi flagship dataset for stress prediction and a public dataset for heart disease detection. Compared to traditional methods such as DiCE, CFNOW, and NICE, our few-shot LLM-based approach achieves high plausibility (up to 99%), strong validity (up to 0.99), and competitive sparsity. Moreover, using LLM-generated CFs as augmented samples improves downstream classifier performance (an average accuracy gain of 5%), especially in low-data regimes. This demonstrates the potential of prompt-based generative techniques to enhance explainability and robustness in clinical and physiological prediction tasks. Code base: this http URL.
zh
[AI-68] Robust Learning on Noisy Graphs via Latent Space Constraints with External Knowledge
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在面对噪声边时表现不佳的问题。其解决方案的关键在于引入潜在空间约束,通过训练两个编码器——一个在包含外部“干净”链接的完整图上进行训练,另一个在排除目标图潜在噪声边的正则化图上进行训练,并对两者潜在表示之间的差异进行惩罚,从而引导模型避免过拟合虚假边。
链接: https://arxiv.org/abs/2507.05540
作者: Chunhui Gu,Mohammad Sadegh Nasr,James P. Long,Kim-Anh Do,Ehsan Irajizad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) often struggle with noisy edges. We propose Latent Space Constrained Graph Neural Networks (LSC-GNN) to incorporate external “clean” links and guide embeddings of a noisy target graph. We train two encoders–one on the full graph (target plus external edges) and another on a regularization graph excluding the target’s potentially noisy links–then penalize discrepancies between their latent representations. This constraint steers the model away from overfitting spurious edges. Experiments on benchmark datasets show LSC-GNN outperforms standard and noise-resilient GNNs in graphs subjected to moderate noise. We extend LSC-GNN to heterogeneous graphs and validate it on a small protein-metabolite network, where metabolite-protein interactions reduce noise in protein co-occurrence data. Our results highlight LSC-GNN’s potential to boost predictive performance and interpretability in settings with noisy relational structures.
zh
[AI-69] Red Teaming AI Red Teaming
【速读】:该论文试图解决当前AI红队测试(red teaming)在实践中的局限性,即其过度关注模型层面的漏洞,而忽视了更广泛的社会技术系统和模型、用户与环境之间复杂交互所产生的涌现行为。解决方案的关键在于提出一个分两个层次的全面框架:宏观层面的系统红队测试,覆盖整个AI开发生命周期;微观层面的模型红队测试。该框架结合网络安全经验和系统理论,强调有效的AI红队测试需要多职能团队对涌现风险、系统性脆弱性和技术与社会因素之间的相互作用进行综合评估。
链接: https://arxiv.org/abs/2507.05538
作者: Subhabrata Majumdar,Brian Pendleton,Abhishek Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:
Abstract:Red teaming has evolved from its origins in military applications to become a widely adopted methodology in cybersecurity and AI. In this paper, we take a critical look at the practice of AI red teaming. We argue that despite its current popularity in AI governance, there exists a significant gap between red teaming’s original intent as a critical thinking exercise and its narrow focus on discovering model-level flaws in the context of generative AI. Current AI red teaming efforts focus predominantly on individual model vulnerabilities while overlooking the broader sociotechnical systems and emergent behaviors that arise from complex interactions between models, users, and environments. To address this deficiency, we propose a comprehensive framework operationalizing red teaming in AI systems at two levels: macro-level system red teaming spanning the entire AI development lifecycle, and micro-level model red teaming. Drawing on cybersecurity experience and systems theory, we further propose a set of recommendations. In these, we emphasize that effective AI red teaming requires multifunctional teams that examine emergent risks, systemic vulnerabilities, and the interplay between technical and social factors.
zh
[AI-70] Mitigating Shortcut Learning with InterpoLated Learning ACL2025
【速读】:该论文试图解决经验风险最小化(Empirical Risk Minimization, ERM)导致模型依赖训练数据中的捷径(shortcut),即输入属性与标签之间的虚假相关性,从而影响模型在少数样本上的泛化能力。解决方案的关键在于提出一种名为InterpoLated Learning (InterpoLL)的方法,通过将多数样本的表示与同类少数样本中具有缓解捷径模式的特征进行插值,削弱捷径的影响,使模型能够学习到在多数和少数样本上都具有预测性的特征。
链接: https://arxiv.org/abs/2507.05527
作者: Michalis Korakakis,Andreas Vlachos,Adrian Weller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 (Main)
Abstract:Empirical risk minimization (ERM) incentivizes models to exploit shortcuts, i.e., spurious correlations between input attributes and labels that are prevalent in the majority of the training data but unrelated to the task at hand. This reliance hinders generalization on minority examples, where such correlations do not hold. Existing shortcut mitigation approaches are model-specific, difficult to tune, computationally expensive, and fail to improve learned representations. To address these issues, we propose InterpoLated Learning (InterpoLL) which interpolates the representations of majority examples to include features from intra-class minority examples with shortcut-mitigating patterns. This weakens shortcut influence, enabling models to acquire features predictive across both minority and majority examples. Experimental results on multiple natural language understanding tasks demonstrate that InterpoLL improves minority generalization over both ERM and state-of-the-art shortcut mitigation methods, without compromising accuracy on majority examples. Notably, these gains persist across encoder, encoder-decoder, and decoder-only architectures, demonstrating the method’s broad applicability.
zh
[AI-71] Cultivating Multimodal Intelligence: Interpretive Reasoning and Agent ic RAG Approaches to Dermatological Diagnosis
【速读】:该论文旨在解决多模态皮肤科问答与分割问题,特别是在真实患者查询和图像背景下进行闭合视觉问答(CVQA)任务,即根据用户提交的图像和伴随的症状描述选择正确的多选临床问题答案。解决方案的关键在于三个核心组件的结合:首先是对Qwen、Gemma和LLaMA系列开源多模态模型在竞赛数据集上的微调;其次是一个结构化推理层,用于协调和裁定候选模型输出;最后是引入代理检索增强生成(agentic RAG),通过整合美国皮肤病学会的症状和疾病数据库信息来补充患者上下文的不足。
链接: https://arxiv.org/abs/2507.05520
作者: Karishma Thakrar,Shreyas Basavatia,Akshay Daftardar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 2025 ImageCLEF MEDIQA-MAGIC Challenge
Abstract:The second edition of the 2025 ImageCLEF MEDIQA-MAGIC challenge, co-organized by researchers from Microsoft, Stanford University, and the Hospital Clinic of Barcelona, focuses on multimodal dermatology question answering and segmentation, using real-world patient queries and images. This work addresses the Closed Visual Question Answering (CVQA) task, where the goal is to select the correct answer to multiple-choice clinical questions based on both user-submitted images and accompanying symptom descriptions. The proposed approach combines three core components: (1) fine-tuning open-source multimodal models from the Qwen, Gemma, and LLaMA families on the competition dataset, (2) introducing a structured reasoning layer that reconciles and adjudicates between candidate model outputs, and (3) incorporating agentic retrieval-augmented generation (agentic RAG), which adds relevant information from the American Academy of Dermatology’s symptom and condition database to fill in gaps in patient context. The team achieved second place with a submission that scored sixth, demonstrating competitive performance and high accuracy. Beyond competitive benchmarks, this research addresses a practical challenge in telemedicine: diagnostic decisions must often be made asynchronously, with limited input and with high accuracy and interpretability. By emulating the systematic reasoning patterns employed by dermatologists when evaluating skin conditions, this architecture provided a pathway toward more reliable automated diagnostic support systems.
zh
[AI-72] Modeling (Deontic) Modal Operators With the s(CASP) Goal-directed Predicated Answer Set Programming System
【速读】:该论文试图解决如何在回答集编程(Answer Set Programming, ASP)中实现道义模态逻辑(deontic modal logic)的问题。其解决方案的关键在于利用ASP中的默认否定(negation-as-failure)和强否定(strong negation)来优雅地表达道义模态算子,并通过ASP的全局约束来表示道义模态逻辑中的义务和禁止性陈述,从而巧妙地解决了道义模态逻辑中的各种悖论。
链接: https://arxiv.org/abs/2507.05519
作者: Gopal Gupta,Abhiramon Rajasekharan,Alexis R. Tudor,Elmer Salazar,Joaquín Arias
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be expressed elegantly using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations and impermissibilities of deontic modal logic. We show that our proposed representation results in the various paradoxes of deontic modal logic being elegantly resolved.
zh
[AI-73] Disappearing Ink: Obfuscation Breaks N-gram Code Watermarks in Theory and Practice
【速读】:该论文试图解决生成式 AI (Generative AI) 生成代码与人工编写代码之间的水印技术在面对代码混淆等复杂变换时的鲁棒性不足问题。其关键在于通过形式化建模代码混淆,并基于分布一致性假设证明了基于 N-gram 的水印技术在面对代码混淆时的鲁棒性不可行,实验结果表明大多数水印检测器在混淆代码上的检测能力接近随机猜测(AUROC 紧密围绕 0.5)。
链接: https://arxiv.org/abs/2507.05512
作者: Gehao Zhang,Eugene Bagdasarian,Juan Zhai,Shiqing Ma
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Distinguishing AI-generated code from human-written code is becoming crucial for tasks such as authorship attribution, content tracking, and misuse detection. Based on this, N-gram-based watermarking schemes have emerged as prominent, which inject secret watermarks to be detected during the generation. However, their robustness in code content remains insufficiently evaluated. Most claims rely solely on defenses against simple code transformations or code optimizations as a simulation of attack, creating a questionable sense of robustness. In contrast, more sophisticated schemes already exist in the software engineering world, e.g., code obfuscation, which significantly alters code while preserving functionality. Although obfuscation is commonly used to protect intellectual property or evade software scanners, the robustness of code watermarking techniques against such transformations remains largely unexplored. In this work, we formally model the code obfuscation and prove the impossibility of N-gram-based watermarking’s robustness with only one intuitive and experimentally verified assumption, distribution consistency, satisfied. Given the original false positive rate of the watermarking detection, the ratio that the detector failed on the watermarked code after obfuscation will increase to 1 - fpr. The experiments have been performed on three SOTA watermarking schemes, two LLMs, two programming languages, four code benchmarks, and four obfuscators. Among them, all watermarking detectors show coin-flipping detection abilities on obfuscated codes (AUROC tightly surrounds 0.5). Among all models, watermarking schemes, and datasets, both programming languages own obfuscators that can achieve attack effects with no detection AUROC higher than 0.6 after the attack. Based on the theoretical and practical observations, we also proposed a potential path of robust code watermarking. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.05512 [cs.CR] (or arXiv:2507.05512v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2507.05512 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gehao Zhang [view email] [v1] Mon, 7 Jul 2025 22:18:19 UTC (1,576 KB)
zh
[AI-74] Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN)
【速读】:该论文试图解决从复杂数据集中高效发现可解释且准确的显式表达式的问题。其解决方案的关键在于提出了一种名为Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN) 的新方法,该方法结合了符号回归与一种精确、节省资源、快速、可分离且可扩展的神经网络架构,通过嵌入可分离性检查器的两步算法来实现对数据中隐含规律的高效捕捉和表达。
链接: https://arxiv.org/abs/2507.05498
作者: Reza T. Batley,Chanwook Park,Wing Kam Liu,Sourav Saha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data-driven science and computation have advanced immensely to construct complex functional relationships using trainable parameters. However, efficiently discovering interpretable and accurate closed-form expressions from complex dataset remains a challenge. The article presents a novel approach called Explainable Hierarchical Deep Learning Neural Networks or Ex-HiDeNN that uses an accurate, frugal, fast, separable, and scalable neural architecture with symbolic regression to discover closed-form expressions from limited observation. The article presents the two-step Ex-HiDeNN algorithm with a separability checker embedded in it. The accuracy and efficiency of Ex-HiDeNN are tested on several benchmark problems, including discerning a dynamical system from data, and the outcomes are reported. Ex-HiDeNN generally shows outstanding approximation capability in these benchmarks, producing orders of magnitude smaller errors compared to reference data and traditional symbolic regression. Later, Ex-HiDeNN is applied to three engineering applications: a) discovering a closed-form fatigue equation, b) identification of hardness from micro-indentation test data, and c) discovering the expression for the yield surface with data. In every case, Ex-HiDeNN outperformed the reference methods used in the literature. The proposed method is built upon the foundation and published works of the authors on Hierarchical Deep Learning Neural Network (HiDeNN) and Convolutional HiDeNN. The article also provides a clear idea about the current limitations and future extensions of Ex-HiDeNN.
zh
[AI-75] Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
【速读】:该论文试图解决如何有效评估深度研究代理(deep research agents)的问题,特别是针对长篇报告的评估以及对其生成过程中间步骤的详细反馈。解决方案的关键在于引入Deep Research Comparator平台,该平台提供了一个全面的框架,用于深度研究代理的托管、并排比较、细粒度的人工反馈收集和排名计算。通过该平台,用户可以同时查看两个不同代理生成的最终报告及其生成过程中的中间步骤,并由标注者进行整体质量评估及对中间步骤或最终报告中特定文本片段的详细反馈。此外,论文还开发了Simple Deepresearch,一个端到端的代理模板,作为基线以简化多种大语言模型向深度研究代理的转换与集成。
链接: https://arxiv.org/abs/2507.05495
作者: Prahaladh Chandrahasan,Jiahe Jin,Zhihan Zhang,Tevin Wang,Andy Tang,Lucy Mo,Morteza Ziyadi,Leonardo F.R. Ribeiro,Zimeng Qiu,Markus Dreyer,Akari Asai,Chenyan Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform’s utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at this https URL.
zh
[AI-76] OLG: A Semantic Extension of Obligation Logic Graph
【速读】:该论文试图解决在市政和跨司法管辖区背景下对监管和法律规则进行建模的问题,其解决方案的关键在于提出OLG++,这是一种语义扩展的义务逻辑图(Obligation Logic Graph),通过引入更丰富的节点和边类型,包括空间、时间、当事人组、可反驳性和逻辑分组结构,实现了对法律义务、例外情况和层级关系的细致表示。该模型支持基于上下文条件、优先级和复杂触发器的结构化推理,并通过食品业务法规中的示例展示了其表达能力。
链接: https://arxiv.org/abs/2507.05488
作者: Subhasis Dasgupta,Jon Stephens,Amarnath Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interjurisdictional contexts. OLG++ introduces richer node and edge types, including spatial, temporal, party group, defeasibility, and logical grouping constructs, enabling nuanced representations of legal obligations, exceptions, and hierarchies. The model supports structured reasoning over rules with contextual conditions, precedence, and complex triggers. We demonstrate its expressiveness through examples from food business regulations, showing how OLG++ supports legal question answering using property graph queries. OLG++ also improves over LegalRuleML by providing native support for subClassOf, spatial constraints, and reified exception structures. Our examples show that OLG++ is more expressive than prior graph-based models for legal knowledge representation.
zh
[AI-77] Epistemically-guided forward-backward exploration
【速读】:该论文试图解决在缺乏明确奖励信号的情况下,如何高效提取最优策略以实现快速适应未来问题设置的零样本强化学习问题。其解决方案的关键在于将前向-后向表示(Forward-backward representations, FB)用于探索过程,通过设计从FB表示中自然产生的探索策略,以最小化FB表示的后验方差,从而减少其认知不确定性。这种方法显著提高了FB算法的样本效率。
链接: https://arxiv.org/abs/2507.05477
作者: Núria Armengol Urpí,Marin Vlastelica,Georg Martius,Stelian Coros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Zero-shot reinforcement learning is necessary for extracting optimal policies in absence of concrete rewards for fast adaptation to future problem settings. Forward-backward representations (FB) have emerged as a promising method for learning optimal policies in absence of rewards via a factorization of the policy occupancy measure. However, up until now, FB and many similar zero-shot reinforcement learning algorithms have been decoupled from the exploration problem, generally relying on other exploration algorithms for data collection. We argue that FB representations should fundamentally be used for exploration in order to learn more efficiently. With this goal in mind, we design exploration policies that arise naturally from the FB representation that minimize the posterior variance of the FB representation, hence minimizing its epistemic uncertainty. We empirically demonstrate that such principled exploration strategies improve sample complexity of the FB algorithm considerably in comparison to other exploration methods. Code is publicly available at this https URL.
zh
[AI-78] Inaugural MOASEI Competition at AAMAS2025: A Technical Report AAMAS’2025
【速读】:该论文旨在解决开放世界条件下多智能体系统决策能力的评估问题,其核心挑战在于如何在动态、部分可观测且具有实体开放性的环境中实现有效的协作与适应。解决方案的关键在于构建一个名为MOASEI的多智能体AI基准测试平台,该平台基于free-range-zoo环境套件,支持动态变化的领域设置,并引入了任务和代理的开放性特性。通过2025年比赛中的三个赛道(野火、拼车和网络安全),研究者探索了不同维度的开放性和协调复杂性,参赛团队采用了包括图神经网络、卷积架构、预测建模以及大语言模型驱动的元优化等多种方法,以提升系统的预期效用、抗干扰能力和环境响应性。
链接: https://arxiv.org/abs/2507.05469
作者: Ceferino Patino,Tyler J. Billings,Alireza Saleh Abadi,Daniel Redder,Adam Eck,Prashant Doshi,Leen-Kiat Soh
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Report from the MOASEI’2025 Competition held at AAMAS’2025
Abstract:We present the Methods for Open Agent Systems Evaluation Initiative (MOASEI) Competition, a multi-agent AI benchmarking event designed to evaluate decision-making under open-world conditions. Built on the free-range-zoo environment suite, MOASEI introduced dynamic, partially observable domains with agent and task openness–settings where entities may appear, disappear, or change behavior over time. The 2025 competition featured three tracks–Wildfire, Rideshare, and Cybersecurity–each highlighting distinct dimensions of openness and coordination complexity. Eleven teams from international institutions participated, with four of those teams submitting diverse solutions including graph neural networks, convolutional architectures, predictive modeling, and large language model–driven meta–optimization. Evaluation metrics centered on expected utility, robustness to perturbations, and responsiveness to environmental change. The results reveal promising strategies for generalization and adaptation in open environments, offering both empirical insight and infrastructure for future research. This report details the competition’s design, findings, and contributions to the open-agent systems research community.
zh
[AI-79] 2048: Reinforcement Learning in a Delayed Reward Environment
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中延迟且稀疏奖励带来的挑战,这类问题使得智能体难以准确评估那些在多个步骤之后才产生收益的动作。论文提出的解决方案的关键在于引入一种统一的分布多步强化学习框架,该框架通过整合分布学习、dueling架构、噪声网络、优先级经验回放等技术,直接优化长时域性能。实验结果表明,所提出的Horizon-DQN(H-DQN)在2048游戏中表现显著优于传统方法,能够达到更高的得分和更高级别的方块。
链接: https://arxiv.org/abs/2507.05465
作者: Prady Saligram,Tanvir Bhathal,Robby Manihani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Delayed and sparse rewards present a fundamental obstacle for reinforcement-learning (RL) agents, which struggle to assign credit for actions whose benefits emerge many steps later. The sliding-tile game 2048 epitomizes this challenge: although frequent small score changes yield immediate feedback, they often mislead agents into locally optimal but globally suboptimal strategies. In this work, we introduce a unified, distributional multi-step RL framework designed to directly optimize long-horizon performance. Using the open source Gym-2048 environment we develop and compare four agent variants: standard DQN, PPO, QR-DQN (Quantile Regression DQN), and a novel Horizon-DQN (H-DQN) that integrates distributional learning, dueling architectures, noisy networks, prioritized replay, and more. Empirical evaluation reveals a clear hierarchy in effectiveness: max episode scores improve from 3.988K (DQN) to 5.756K (PPO), 8.66K (QR-DQN), and 18.21K (H-DQN), with H-DQN reaching the 2048 tile. Upon scaling H-DQN it reaches a max score 41.828K and a 4096 tile. These results demonstrate that distributional, multi-step targets substantially enhance performance in sparse-reward domains, and they suggest promising avenues for further gains through model-based planning and curriculum learning.
zh
[AI-80] EmissionNet: Air Quality Pollution Forecasting for Agriculture
【速读】:该论文试图解决农业源氮氧化物(N₂O)排放的准确预测问题,这一问题在环境和公共健康挑战中具有重要影响但常被忽视。传统空气质量管理模型依赖于物理基础方法,难以捕捉污染物之间复杂的非线性相互作用。论文的关键解决方案是引入两种新颖的深度学习架构——EmissionNet (ENV) 和 EmissionNet-Transformer (ENT),它们利用卷积和基于Transformer的结构从高分辨率排放数据中提取时空依赖性,从而提升预测精度。
链接: https://arxiv.org/abs/2507.05416
作者: Prady Saligram,Tanvir Bhathal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Air pollution from agricultural emissions is a significant yet often overlooked contributor to environmental and public health challenges. Traditional air quality forecasting models rely on physics-based approaches, which struggle to capture complex, nonlinear pollutant interactions. In this work, we explore forecasting N _2 O agricultural emissions through evaluating popular architectures, and proposing two novel deep learning architectures, EmissionNet (ENV) and EmissionNet-Transformer (ENT). These models leverage convolutional and transformer-based architectures to extract spatial-temporal dependencies from high-resolution emissions data
zh
[AI-81] Probabilistically Tightened Linear Relaxation-based Perturbation Analysis for Neural Network Verification
【速读】:该论文旨在解决神经网络形式化验证中的计算成本高且保证不足的问题,特别是针对复杂竞赛条目中现有方法失效的情况。其解决方案的关键在于提出一种名为PT-LiRPA的新框架,该框架结合了基于LiRPA的方法中的过近似技术与采样方法,以计算紧致的中间可达集合,从而在几乎不增加计算开销的情况下显著收紧神经网络输出的线性边界,提高了验证的准确性与效率,并提供了概率保证。
链接: https://arxiv.org/abs/2507.05405
作者: Luca Marzari,Ferdinando Cicalese,Alessandro Farinelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present \textbfP robabilistically \textbfT ightened \textbfLi near \textbfR elaxation-based \textbfP erturbation \textbfA nalysis ( \textttPT-LiRPA ), a novel framework that combines over-approximation techniques from LiRPA-based approaches with a sampling-based method to compute tight intermediate reachable sets. In detail, we show that with negligible computational overhead, \textttPT-LiRPA exploiting the estimated reachable sets, significantly tightens the lower and upper linear bounds of a neural network’s output, reducing the computational cost of formal verification tools while providing probabilistic guarantees on verification soundness. Extensive experiments on standard formal verification benchmarks, including the International Verification of Neural Networks Competition, show that our \textttPT-LiRPA -based verifier improves robustness certificates by up to 3.31X and 2.26X compared to related work. Importantly, our probabilistic approach results in a valuable solution for challenging competition entries where state-of-the-art formal verification methods fail, allowing us to provide answers with high confidence (i.e., at least 99%).
zh
[AI-82] Causal Foundation Models: Disentangling Physics from Instrument Properties ICML2025
【速读】:该论文试图解决结构化时间序列数据中基础模型面临的挑战,即观测数据常将真实的物理现象与测量仪器引入的系统性扭曲混杂在一起,从而限制了模型的泛化能力,尤其是在异质或多仪器设置中。解决方案的关键在于提出一种基于因果动机的基础模型,该模型采用双编码器架构,并通过结构化对比学习进行训练,以显式地解耦物理因素和仪器因素。该方法利用自然发生的观测三元组(即同一目标在不同条件下测量,以及不同目标在相同条件下测量的情况),学习到对底层物理信号和仪器效应的独立潜在表示。
链接: https://arxiv.org/abs/2507.05333
作者: Jeroen Audenaert,Daniel Muthukrishna,Paul F. Gregory,David W. Hogg,V. Ashley Villar
机构: 未知
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures. Accepted to the ICML 2025 Foundation Models for Structured Data Workshop and accepted to the Machine Learning for Astrophysics Workshop 2025
Abstract:Foundation models for structured time series data must contend with a fundamental challenge: observations often conflate the true underlying physical phenomena with systematic distortions introduced by measurement instruments. This entanglement limits model generalization, especially in heterogeneous or multi-instrument settings. We present a causally-motivated foundation model that explicitly disentangles physical and instrumental factors using a dual-encoder architecture trained with structured contrastive learning. Leveraging naturally occurring observational triplets (i.e., where the same target is measured under varying conditions, and distinct targets are measured under shared conditions) our model learns separate latent representations for the underlying physical signal and instrument effects. Evaluated on simulated astronomical time series designed to resemble the complexity of variable stars observed by missions like NASA’s Transiting Exoplanet Survey Satellite (TESS), our method significantly outperforms traditional single-latent space foundation models on downstream prediction tasks, particularly in low-data regimes. These results demonstrate that our model supports key capabilities of foundation models, including few-shot generalization and efficient adaptation, and highlight the importance of encoding causal structure into representation learning for structured data.
zh
[AI-83] Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
【速读】:该论文试图解决在强化学习(Reinforcement Learning, RL)中,如何有效利用人类先验知识所形成的启发式奖励(heuristic rewards)以提升任务性能的问题,同时避免因启发式奖励非最优而导致的人力与计算资源浪费。传统方法依赖于策略不变性(policy invariance)理论,但实践中难以实现策略改进,且表现不佳。论文提出的解决方案关键在于引入Heuristic Enhanced Policy Optimization (HEPO)框架,该框架基于最大化策略改进的实际目标,而非单纯追求策略不变性,从而更有效地利用启发式奖励并减轻奖励设计的人工负担。
链接: https://arxiv.org/abs/2507.05328
作者: Chi-Chang Lee,Zhang-Wei Hong,Pulkit Agrawal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In many reinforcement learning (RL) applications, augmenting the task rewards with heuristic rewards that encode human priors about how a task should be solved is crucial for achieving desirable performance. However, because such heuristics are usually not optimal, much human effort and computational resources are wasted in carefully balancing tasks and heuristic rewards. Theoretically rigorous ways of incorporating heuristics rely on the idea of \textitpolicy invariance, which guarantees that the performance of a policy obtained by maximizing heuristic rewards is the same as the optimal policy with respect to the task reward. However, in practice, policy invariance doesn’t result in policy improvement, and such methods are known to empirically perform poorly. We propose a new paradigm to mitigate reward hacking and effectively use heuristics based on the practical goal of maximizing policy improvement instead of policy improvement. Our framework, Heuristic Enhanced Policy Optimization (HEPO), effectively leverages heuristics while avoiding the pitfall of prior methods for mitigating reward hacking. HEPO achieves superior performance on standard benchmarks with well-engineered reward functions. More surprisingly, HEPO allows policy optimization to achieve good performance even when heuristics are not well-engineered and designed by non-expert humans, showcasing HEPO’s ability to reduce human effort in reward design. % HEPO is a plug-and-play optimization method for leveraging heuristics in reinforcement learning. Code is available at this https URL.
zh
[AI-84] AGACCI : Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts ICML2025
【速读】:该论文试图解决现有基于视觉-语言模型(VLM)的教育评估方法在处理复杂教学材料(如包含可执行组件和可测量输出的编程任务)时存在的结构性推理不足和评估标准对齐困难的问题。解决方案的关键在于引入AGACCI,这是一个多智能体系统,通过将专业评估角色分配给协作代理,以提升代码导向评估的准确性、可解释性和一致性。
链接: https://arxiv.org/abs/2507.05321
作者: Kwangsuk Park,Jiwoong Yang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2025 Workshop on Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures (MAS)
Abstract:Recent advances in AI-assisted education have encouraged the integration of vision-language models (VLMs) into academic assessment, particularly for tasks that require both quantitative and qualitative evaluation. However, existing VLM based approaches struggle with complex educational artifacts, such as programming tasks with executable components and measurable outputs, that require structured reasoning and alignment with clearly defined evaluation criteria. We introduce AGACCI, a multi-agent system that distributes specialized evaluation roles across collaborative agents to improve accuracy, interpretability, and consistency in code-oriented assessment. To evaluate the framework, we collected 360 graduate-level code-based assignments from 60 participants, each annotated by domain experts with binary rubric scores and qualitative feedback. Experimental results demonstrate that AGACCI outperforms a single GPT-based baseline in terms of rubric and feedback accuracy, relevance, consistency, and coherence, while preserving the instructional intent and evaluative depth of expert assessments. Although performance varies across task types, AGACCI highlights the potential of multi-agent systems for scalable and context-aware educational evaluation.
zh
[AI-85] OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models
【速读】:该论文试图解决API文档信息在在线环境中通常以非结构化、自由格式的HTML形式呈现,导致外部用户需要耗费大量时间手动转换为结构化格式的问题。解决方案的关键在于提出OASBuilder框架,该框架通过一个精心设计的处理流程,结合大型语言模型和基于规则的算法,并利用对文档网页结构的领域知识进行指导,将长且多样的API文档页面转化为一致且机器可读的API规范。
链接: https://arxiv.org/abs/2507.05316
作者: Koren Lazar,Matan Vetzler,Kiran Kate,Jason Tsay,David Boaz Himanshu Gupta,Avraham Shinnar,Rohith D Vallam,David Amid Esther Goldbraich,Guy Uziel,Jim Laredo,Ateret Anaby Tavor
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents and business automation tools interacting with external web services require standardized, machine-readable information about their APIs in the form of API specifications. However, the information about APIs available online is often presented as unstructured, free-form HTML documentation, requiring external users to spend significant time manually converting it into a structured format. To address this, we introduce OASBuilder, a novel framework that transforms long and diverse API documentation pages into consistent, machine-readable API specifications. This is achieved through a carefully crafted pipeline that integrates large language models and rule-based algorithms which are guided by domain knowledge of the structure of documentation webpages. Our experiments demonstrate that OASBuilder generalizes well across hundreds of APIs, and produces valid OpenAPI specifications that encapsulate most of the information from the original documentation. OASBuilder has been successfully implemented in an enterprise environment, saving thousands of hours of manual effort and making hundreds of complex enterprise APIs accessible as tools for LLMs.
zh
[AI-86] PLACE: Prompt Learning for Attributed Community Search
【速读】:该论文试图解决属性社区搜索(Attribute Community Search, ACS)问题,旨在从图结构中识别与特定查询相关的结构紧密性和属性相似性模式。其解决方案的关键在于提出PLACE(Prompt Learning for Attributed Community Search),该框架通过将结构信息与可学习的提示标记(prompt tokens)整合到图中,形成一个查询依赖的增强图结构,从而提升图神经网络(GNN)对目标社区的识别能力。
链接: https://arxiv.org/abs/2507.05311
作者: Shuheng Fang,Kangfei Zhao,Rener Zhang,Yu Rong,Jeffrey Xu Yu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 pages, 9 figures
Abstract:In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for ACS. Enlightened by prompt-tuning in Natural Language Processing (NLP), where learnable prompt tokens are inserted to contextualize NLP queries, PLACE integrates structural and learnable prompt tokens into the graph as a query-dependent refinement mechanism, forming a prompt-augmented graph. Within this prompt-augmented graph structure, the learned prompt tokens serve as a bridge that strengthens connections between graph nodes for the query, enabling the GNN to more effectively identify patterns of structural cohesiveness and attribute similarity related to the specific query. We employ an alternating training paradigm to optimize both the prompt parameters and the GNN jointly. Moreover, we design a divide-and-conquer strategy to enhance scalability, supporting the model to handle million-scale graphs. Extensive experiments on 9 real-world graphs demonstrate the effectiveness of PLACE for three types of ACS queries, where PLACE achieves higher F1 scores by 22% compared to the state-of-the-arts on average.
zh
[AI-87] Neural Velocity for hyperparameter tuning IJCNN2025
【速读】:该论文试图解决传统超参数调优方法中依赖验证损失来调整学习率衰减和定义停止准则的问题。其解决方案的关键在于引入“神经速度”(neural velocity)这一新概念,神经速度衡量每个神经元转移函数的变化速率,是模型收敛的指标,可以通过在网络中前向传播噪声进行采样,从而减少对保留数据集的依赖。
链接: https://arxiv.org/abs/2507.05309
作者: Gianluca Dalmasso,Andrea Bragagnolo,Enzo Tartaglione,Attilio Fiandrotti,Marco Grangetto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IJCNN 2025 (International Joint Conference on Neural Networks). 8 pages, 13 figures
Abstract:Hyperparameter tuning, such as learning rate decay and defining a stopping criterion, often relies on monitoring the validation loss. This paper presents NeVe, a dynamic training approach that adjusts the learning rate and defines the stop criterion based on the novel notion of “neural velocity”. The neural velocity measures the rate of change of each neuron’s transfer function and is an indicator of model convergence: sampling neural velocity can be performed even by forwarding noise in the network, reducing the need for a held-out dataset. Our findings show the potential of neural velocity as a key metric for optimizing neural network training efficiently
zh
[AI-88] ASSURE: Metamorphic Testing for AI-powered Browser Extensions
【速读】:该论文旨在解决AI驱动的浏览器扩展在测试与可靠性保障方面面临的独特挑战,包括非确定性行为、上下文敏感性以及复杂的网络环境集成问题。现有测试方法未能有效应对这些挑战,缺乏针对浏览器特定上下文的评估框架。论文提出的解决方案是ASSURE,其关键在于三个核心组件:模块化的测试用例生成引擎、自动化执行框架以及可配置的验证流水线,通过系统化评估行为一致性与安全不变量,而非依赖精确输出匹配,从而显著提升了测试效率与问题检测能力。
链接: https://arxiv.org/abs/2507.05307
作者: Xuanqi Gao,Juan Zhai,Shiqing Ma,Siyi Xie,Chao Shen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE’s effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
zh
[AI-89] Fuzzy Classification Aggregation for a Continuum of Agents
【速读】:该论文试图解决在对至少3个对象进行分类且分类类型数介于2到m之间的连续个体分类情况下,如何构造一个最优、独立且无一致同意的模糊分类聚合函数的问题。解决方案的关键在于证明此类最优聚合函数必须为加权算术平均。
链接: https://arxiv.org/abs/2507.05297
作者: Zijun Meng
机构: 未知
类目: Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注:
Abstract:We prove that any optimal, independent, and zero unanimous fuzzy classification aggregation function of a continuum of individual classifications of m\ge 3 objects into 2\le p\le m types must be a weighted arithmetic mean.
zh
[AI-90] Integrating Generative AI in BIM Education: Insights from Classroom Implementation
【速读】:该论文试图解决在建筑信息模型(BIM)课程中整合生成式AI(Generative AI)进行规则检查的实践问题,特别是在缺乏相关研究背景的情况下探索其教学应用效果。解决方案的关键在于设计基于课堂的实验性教学流程,包括教授提示工程(prompt engineering)和利用大语言模型(LLM)进行AI驱动的规则检查,并通过实际任务让学生使用Autodesk Revit识别设计中的规范违规情况。
链接: https://arxiv.org/abs/2507.05296
作者: Islem Sahraoui,Kinam Kim,Lu Gao,Zia Din,Ahmed Senouci
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This study evaluates the implementation of a Generative AI-powered rule checking workflow within a graduate-level Building Information Modeling (BIM) course at a U.S. university. Over two semesters, 55 students participated in a classroom-based pilot exploring the use of GenAI for BIM compliance tasks, an area with limited prior research. The instructional design included lectures on prompt engineering and AI-driven rule checking, followed by an assignment where students used a large language model (LLM) to identify code violations in designs using Autodesk Revit. Surveys and interviews were conducted to assess student workload, learning effectiveness, and overall experience, using the NASA-TLX scale and regression analysis. Findings indicate students generally achieved learning objectives but faced challenges such as difficulties debugging AI-generated code and inconsistent tool performance, probably due to their limited prompt engineering experience. These issues increased cognitive and emotional strain, especially among students with minimal programming backgrounds. Despite these challenges, students expressed strong interest in future GenAI applications, particularly with clear instructional support.
zh
[AI-91] Enhancing Learning Path Recommendation via Multi-task Learning
【速读】:该论文旨在解决个性化学习中的学习路径推荐问题,即如何根据学习者的个体需求和历史交互数据,序列化地推荐个性化的学习内容。其解决方案的关键在于提出一种多任务长短期记忆网络(multi-task LSTM)模型,该模型通过共享信息提升学习路径推荐的效果,将学习路径推荐重新建模为序列到序列(Seq2Seq)预测问题,并利用共享的LSTM层捕捉学习路径推荐与深度知识追踪的共性特征,同时引入任务特定的LSTM层以满足不同目标的需求,此外还设计了非重复损失函数以避免推荐内容的重复。
链接: https://arxiv.org/abs/2507.05295
作者: Afsana Nasrin,Lijun Qian,Pamela Obiomon,Xishuang Dong
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Personalized learning is a student-centered educational approach that adapts content, pace, and assessment to meet each learner’s unique needs. As the key technique to implement the personalized learning, learning path recommendation sequentially recommends personalized learning items such as lectures and exercises. Advances in deep learning, particularly deep reinforcement learning, have made modeling such recommendations more practical and effective. This paper proposes a multi-task LSTM model that enhances learning path recommendation by leveraging shared information across tasks. The approach reframes learning path recommendation as a sequence-to-sequence (Seq2Seq) prediction problem, generating personalized learning paths from a learner’s historical interactions. The model uses a shared LSTM layer to capture common features for both learning path recommendation and deep knowledge tracing, along with task-specific LSTM layers for each objective. To avoid redundant recommendations, a non-repeat loss penalizes repeated items within the recommended learning path. Experiments on the ASSIST09 dataset show that the proposed model significantly outperforms baseline methods for the learning path recommendation.
zh
[AI-92] Physics-Informed Graph Neural Networks to Reconstruct Local Fields Considering Finite Strain Hyperelasticity
【速读】:该论文试图解决在多尺度仿真中,基于周期性微观结构网格和宏观平均应力值,重建微观尺度局部应力场的问题。解决方案的关键在于提出一种物理信息图神经网络框架(P-DivGNN),该框架将周期性微观结构表示为图结构,并结合消息传递图神经网络进行学习,同时在训练过程中引入物理约束以保证局部应力场的平衡状态,并利用周期图表示来满足周期性边界条件。
链接: https://arxiv.org/abs/2507.05291
作者: Manuel Ricardo Guevara Garban,Yves Chemisky,Étienne Prulière,Michaël Clément
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 28 pages, 17 figures, pre-print
Abstract:We propose a physics-informed machine learning framework called P-DivGNN to reconstruct local stress fields at the micro-scale, in the context of multi-scale simulation given a periodic micro-structure mesh and mean, macro-scale, stress values. This method is based in representing a periodic micro-structure as a graph, combined with a message passing graph neural network. We are able to retrieve local stress field distributions, providing average stress values produced by a mean field reduced order model (ROM) or Finite Element (FE) simulation at the macro-scale. The prediction of local stress fields are of utmost importance considering fracture analysis or the definition of local fatigue criteria. Our model incorporates physical constraints during training to constraint local stress field equilibrium state and employs a periodic graph representation to enforce periodic boundary conditions. The benefits of the proposed physics-informed GNN are evaluated considering linear and non linear hyperelastic responses applied to varying geometries. In the non-linear hyperelastic case, the proposed method achieves significant computational speed-ups compared to FE simulation, making it particularly attractive for large-scale applications.
zh
[AI-93] Compressing Deep Neural Networks Using Explainable AI
【速读】:该论文试图解决深度神经网络(DNN)在计算成本和内存占用上的高消耗问题,以实现模型的高效压缩并使其适用于资源受限的边缘设备。其解决方案的关键在于利用可解释人工智能(XAI)方法,特别是基于梯度的层相关性传播(LRP)技术,计算DNN参数的重要性得分,并据此进行剪枝和混合精度量化,从而在几乎不损失准确率的情况下显著减小模型规模。
链接: https://arxiv.org/abs/2507.05286
作者: Kimia Soroush,Mohsen Raji,Behnam Ghavami
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep neural networks (DNNs) have demonstrated remarkable performance in many tasks but it often comes at a high computational cost and memory usage. Compression techniques, such as pruning and quantization, are applied to reduce the memory footprint of DNNs and make it possible to accommodate them on resource-constrained edge devices. Recently, explainable artificial intelligence (XAI) methods have been introduced with the purpose of understanding and explaining AI methods. XAI can be utilized to get to know the inner functioning of DNNs, such as the importance of different neurons and features in the overall performance of DNNs. In this paper, a novel DNN compression approach using XAI is proposed to efficiently reduce the DNN model size with negligible accuracy loss. In the proposed approach, the importance score of DNN parameters (i.e. weights) are computed using a gradient-based XAI technique called Layer-wise Relevance Propagation (LRP). Then, the scores are used to compress the DNN as follows: 1) the parameters with the negative or zero importance scores are pruned and removed from the model, 2) mixed-precision quantization is applied to quantize the weights with higher/lower score with higher/lower number of bits. The experimental results show that, the proposed compression approach reduces the model size by 64% while the accuracy is improved by 42% compared to the state-of-the-art XAI-based compression method.
zh
[AI-94] Exploring LLM Capabilities in Extracting DCAT-Compatible Metadata for Data Cataloging
【速读】:该论文试图解决数据消费者在面对数据量指数增长、异构性和分布性时,耗费大量时间寻找合适数据的问题,其核心在于通过自动化手段提升数据探索效率。解决方案的关键是利用大语言模型(Large Language Models, LLMs)自动维护文本数据的元数据,并生成符合DCAT标准的高质量元数据。研究验证了LLMs在生成标题、关键词等元数据任务中的有效性,同时发现模型规模、微调以及少样本提示策略对性能有显著影响。
链接: https://arxiv.org/abs/2507.05282
作者: Lennart Busch,Daniel Tebernum,Gissel Velarde
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Efficient data exploration is crucial as data becomes increasingly important for accelerating processes, improving forecasts and developing new business models. Data consumers often spend 25-98 % of their time searching for suitable data due to the exponential growth, heterogeneity and distribution of data. Data catalogs can support and accelerate data exploration by using metadata to answer user queries. However, as metadata creation and maintenance is often a manual process, it is time-consuming and requires expertise. This study investigates whether LLMs can automate metadata maintenance of text-based data and generate high-quality DCAT-compatible metadata. We tested zero-shot and few-shot prompting strategies with LLMs from different vendors for generating metadata such as titles and keywords, along with a fine-tuned model for classification. Our results show that LLMs can generate metadata comparable to human-created content, particularly on tasks that require advanced semantic understanding. Larger models outperformed smaller ones, and fine-tuning significantly improves classification accuracy, while few-shot prompting yields better results in most cases. Although LLMs offer a faster and reliable way to create metadata, a successful application requires careful consideration of task-specific criteria and domain context.
zh
[AI-95] Hungary and AI: efforts and opportunities in comparison with Singapore
【速读】:该论文试图评估匈牙利国家人工智能战略(National AI Strategy)的制定与实施情况,并通过对比新加坡的国家人工智能战略(NAIS 1.0和NAIS 2.0)提出改进建议。其解决方案的关键在于从概念、治理、时间安排和财务四个维度对匈牙利战略的22项目标进行分析,识别出资金分配不均、执行碎片化及缺乏定期审查机制等实施问题,并基于新加坡框架提出针对性改进措施,如适应大语言模型时代、重构三螺旋网络以促进更有效的对话与倡导,以及将匈牙利定位为东西方汽车AI实验的桥梁。
链接: https://arxiv.org/abs/2507.05280
作者: András Ferenczy
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 39 pages
Abstract:The study assesses Hungary’s National AI Strategy and its implementation through the analysis of strategic documents, publicly available financial records, and expert interviews with the Hungarian AI Coalition President and Chief Strategic Advisor to the Government Commissioner for AI. 22 goals from Hungary’s strategy were evaluated through conceptual, governance, temporal, and financial dimensions before being benchmarked against Singapore’s National AI Strategies (NAIS 1.0 and NAIS 2.0). Key findings include an estimated total of EUR 4.65 billion in AI-related public investment in Hungary. Openly available financial data was found for only half of the evaluated goals, and just three projects made up 98% of all documented funding. The research also reveals Hungary’s implementation challenges, including fragmented execution following ministerial reorganizations and the absence of designated biennial reviews since 2020. Furthermore, the paper provides targeted recommendations for Hungary’s forthcoming AI strategy, drawing on Singapore’s framework as a reference point. These include adapting to the era of large language models, restructuring the existing triple helix network to foster more effective dialogue and advocacy, and positioning the country as an East-West bridge for automotive AI experimentation.
zh
[AI-96] A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation
【速读】:该论文试图解决在临床情景培训中协助医学生进行临床推理(Clinical Reasoning, CR)的持续性挑战。其解决方案的关键在于设计并实现了一个名为模糊监督代理(Fuzzy Supervisor Agent, FSA)的新组件,该组件基于模糊推理系统(Fuzzy Inference System, FIS),通过预定义的模糊规则库对学生的互动行为进行持续分析,从而提供适应性、上下文感知的反馈,以在学生遇到困难时给予精准支持。
链接: https://arxiv.org/abs/2507.05275
作者: Weibing Zheng,Laurah Turner,Jess Kropczynski,Murat Ozer,Seth Overla,Shane Halse
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: 6 pages, 3 figures, 1 table. 2025 IFSA World Congress NAFIPS Annual Meeting
Abstract:Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and provides assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. More detailed design and implementation is~\hrefthis https URLopen sourced here.
zh
[AI-97] FuzzFeed: An Automatic Approach to Weakest Precondition Generation using LLM s and Fuzzing
【速读】:该论文试图解决如何有效生成程序的最弱前条件(Weakest Precondition, WP)的问题,该问题在程序验证和运行时错误检查等领域具有重要应用。论文提出的解决方案的关键在于结合大型语言模型(Large Language Models, LLMs)与模糊测试(fuzz testing),并通过引入一种称为“模糊测试引导”(Fuzzing Guidance, FG)的技术,利用程序执行反馈指导LLMs生成正确的WP。FG通过模糊测试近似验证候选WP的有效性和薄弱性,并将这些信息作为上下文优化反馈给LLMs,从而提升其生成可行WP的能力。
链接: https://arxiv.org/abs/2507.05272
作者: Daragh King,Vasileios Koutavas,Laura Kovacs
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:The weakest precondition (WP) of a program describes the largest set of initial states from which all terminating executions of the program satisfy a given postcondition. The generation of WPs is an important task with practical applications in areas ranging from verification to run-time error checking. This paper proposes the combination of Large Language Models (LLMs) and fuzz testing for generating WPs. In pursuit of this goal, we introduce Fuzzing Guidance (FG); FG acts as a means of directing LLMs towards correct WPs using program execution feedback. FG utilises fuzz testing for approximately checking the validity and weakness of candidate WPs, this information is then fed back to the LLM as a means of context refinement. We demonstrate the effectiveness of our approach on a comprehensive benchmark set of deterministic array programs in Java. Our experiments indicate that LLMs are capable of producing viable candidate WPs, and that this ability can be practically enhanced through FG. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO) Cite as: arXiv:2507.05272 [cs.SE] (or arXiv:2507.05272v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.05272 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-98] CORE: Benchmarking LLM s Code Reasoning Capabilities through Static Analysis Tasks
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在程序语义推理能力方面研究不足的问题,尤其是在数据依赖、控制依赖和信息流等静态分析任务中的表现。现有基准主要评估端到端的结果,如代码是否被正确生成或修复,而忽略了模型对程序语义的理解能力。解决方案的关键是提出CoRe基准,这是一个高质量、经过人工验证的基准,用于评估LLMs在基础静态分析任务上的表现,并采用语义感知的多样化采样策略以确保任务的语义多样性和推理复杂性。
链接: https://arxiv.org/abs/2507.05269
作者: Danning Xie,Mingwei Zheng,Xuwei Liu,Jiannan Wang,Chengpeng Wang,Lin Tan,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have been widely adopted across diverse software engineering domains, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models ability for program semantic reasoning underexplored. This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 mainstream LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi-step reasoning. We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs code reasoning capabilities.
zh
[AI-99] Strongly Solving 7 times 6 Connect-Four on Consumer Grade Hardware
【速读】:该论文试图解决Connect-Four游戏在标准7×6棋盘上生成一个大规模的查找表以实现强解的问题。传统上认为这种查找表的构建在计算上是不可行的,但本文通过基于二进制决策图(Binary Decision Diagram, BDD)的符号搜索方法,实现了高效的解决方案。其关键在于对符号搜索方法的高效实现,使得在单个CPU核心和128GB内存环境下,能够在47小时内生成大小为89.6GB的查找表。此外,作者还在开源工具中集成了alpha-beta搜索,用于找到最快胜利或最慢失败的走法。
链接: https://arxiv.org/abs/2507.05267
作者: Markus Böck
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While the game Connect-Four has been solved mathematically and the best move can be effectively computed with search based methods, a strong solution in the form of a look-up table was believed to be infeasible. In this paper, we revisit a symbolic search method based on binary decision diagrams to produce strong solutions. With our efficient implementation we were able to produce a 89.6 GB large look-up table in 47 hours on a single CPU core with 128 GB main memory for the standard 7 \times 6 board size. In addition to this win-draw-loss evaluation, we include an alpha-beta search in our open source artifact to find the move which achieves the fastest win or slowest loss.
zh
[AI-100] Rethinking Over-Smoothing in Graph Neural Networks: A Perspective from Anderson Localization
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在深度增加时出现的过平滑(over-smoothing)问题,该问题导致节点表示失去区分性。解决方案的关键在于通过类比安德森局域化(Anderson localization)机制,引入参与度(participation degree)作为量化过平滑现象的指标,并提出通过减少信息传播中的无序性来缓解过平滑问题。
链接: https://arxiv.org/abs/2507.05263
作者: Kaichen Ouyang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 17 pages, 4 figures
Abstract:Graph Neural Networks (GNNs) have shown great potential in graph data analysis due to their powerful representation capabilities. However, as the network depth increases, the issue of over-smoothing becomes more severe, causing node representations to lose their distinctiveness. This paper analyzes the mechanism of over-smoothing through the analogy to Anderson localization and introduces participation degree as a metric to quantify this phenomenon. Specifically, as the depth of the GNN increases, node features homogenize after multiple layers of message passing, leading to a loss of distinctiveness, similar to the behavior of vibration modes in disordered systems. In this context, over-smoothing in GNNs can be understood as the expansion of low-frequency modes (increased participation degree) and the localization of high-frequency modes (decreased participation degree). Based on this, we systematically reviewed the potential connection between the Anderson localization behavior in disordered systems and the over-smoothing behavior in Graph Neural Networks. A theoretical analysis was conducted, and we proposed the potential of alleviating over-smoothing by reducing the disorder in information propagation.
zh
[AI-101] Challenges Opportunities with LLM -Assisted Visualization Retargeting
【速读】:该论文试图解决将现有的定制图表实现重新定位到新数据集时所面临的困难、耗时和繁琐的问题(visualization retargeting)。其关键解决方案是利用大型语言模型(Large Language Models, LLMs)从高层用户提示中自动适应代码,从而降低可视化重新定位的门槛。研究比较了两种方法:一种是直接将代码作为文本输入让LLM生成和适应代码,另一种是通过LLM提供结构信息(如视觉编码)来引导代码构建过程,以更受约束的方式进行程序合成。
链接: https://arxiv.org/abs/2507.01436
作者: Luke S. Snyder,Chenglong Wang,Steven M. Drucker
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, 1 table
Abstract:Despite the ubiquity of visualization examples published on the web, retargeting existing custom chart implementations to new datasets remains difficult, time-intensive, and tedious. The adaptation process assumes author familiarity with both the implementation of the example as well as how the new dataset might need to be transformed to fit into the example code. With recent advances in Large Language Models (LLMs), automatic adaptation of code can be achieved from high-level user prompts, reducing the barrier for visualization retargeting. To better understand how LLMs can assist retargeting and its potential limitations, we characterize and evaluate the performance of LLM assistance across multiple datasets and charts of varying complexity, categorizing failures according to type and severity. In our evaluation, we compare two approaches: (1) directly instructing the LLM model to fully generate and adapt code by treating code as text inputs and (2) a more constrained program synthesis pipeline where the LLM guides the code construction process by providing structural information (e.g., visual encodings) based on properties of the example code and data. We find that both approaches struggle when new data has not been appropriately transformed, and discuss important design recommendations for future retargeting systems.
zh
[AI-102] A Survey on Model Extraction Attacks and Defenses for Large Language Models
【速读】:该论文试图解决语言模型(Language Model, LM)在部署环境中面临的模型提取攻击(Model Extraction Attacks)问题,这类攻击可能威胁知识产权和用户隐私。其解决方案的关键在于提出一种针对大语言模型(Large Language Model, LLM)的全面分类体系,涵盖功能提取、训练数据提取和针对提示的攻击,并分析相应的防御机制,包括模型保护、数据隐私保护和针对提示的策略。论文还提出了专门的评估指标,以衡量攻击效果和防御性能,旨在推动更安全且保持模型实用性的研究方向。
链接: https://arxiv.org/abs/2506.22521
作者: Kaixiang Zhao,Lincan Li,Kaize Ding,Neil Zhenqiang Gong,Yue Zhao,Yushun Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.
zh
[AI-103] he Problem of Algorithmic Collisions: Mitigating Unforeseen Risks in a Connected World
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)及其他自主算法系统部署所带来的新型系统性风险问题,特别是算法系统之间相互作用所引发的不可预见且迅速恶化的负面后果。解决方案的关键在于通过分阶段系统注册、部署许可框架以及增强监控能力来提升透明度和问责性,以应对当前治理框架在复杂算法交互生态系统中缺乏可见性的不足。
链接: https://arxiv.org/abs/2505.20181
作者: Maurice Chiodo,Dennis Müller
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); History and Overview (math.HO)
备注: 27 pages. This is an early concept paper, and we plan to add further content to it over time. Please get in touch if you want to be part of its further development. Keywords: algorithmic collision, AI agents, algorithmic ecosystem, flash crash, multiagent systems
Abstract:The increasing deployment of Artificial Intelligence (AI) and other autonomous algorithmic systems presents the world with new systemic risks. While focus often lies on the function of individual algorithms, a critical and underestimated danger arises from their interactions, particularly when algorithmic systems operate without awareness of each other, or when those deploying them are unaware of the full algorithmic ecosystem deployment is occurring in. These interactions can lead to unforeseen, rapidly escalating negative outcomes - from market crashes and energy supply disruptions to potential physical accidents and erosion of public trust - often exceeding the human capacity for effective monitoring and the legal capacities for proper intervention. Current governance frameworks are inadequate as they lack visibility into this complex ecosystem of interactions. This paper outlines the nature of this challenge and proposes some initial policy suggestions centered on increasing transparency and accountability through phased system registration, a licensing framework for deployment, and enhanced monitoring capabilities.
zh
[AI-104] Formalising Human-in-the-Loop: Computational Reductions Failure Modes and Legal-Moral Responsibility ACL
【速读】:该论文试图解决不同Human-in-the-loop (HITL) 设定在法律合规性和安全性方面的差异问题,以及如何在这些设定之间进行选择。其解决方案的关键在于通过可计算性理论中的Oracle机器概念对HITL设定进行形式化,区分了简单的人员监控、单点人类操作和高度参与的人机交互,并将其对应为总函数、多对一约简和图灵约简。这一分类有助于识别HITL失效模式,并揭示现有英国和欧盟法律框架在评估不同HITL设定效果上的不足,从而提出法律应更准确地认可不同HITL设定的有效性并合理分配责任。
链接: https://arxiv.org/abs/2505.10426
作者: Maurice Chiodo,Dennis Müller,Paul Siewert,Jean-Luc Wetherall,Zoya Yasmine,John Burden
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); History and Overview (math.HO)
备注: 12 pages. Keywords: Human-in-the-loop, Artificial Intelligence, Oracle Machines, Liability, AI Safety, AI Regulations, Turing Reduction
Abstract:The legal compliance and safety of different Human-in-the-loop (HITL) setups for AI can vary greatly. This manuscript aims to identify new ways of choosing between such setups, and shows that there is an unavoidable trade-off between the attribution of legal responsibility and the technical explainability of AI. We begin by using the notion of oracle machines from computability theory to formalise different HITL setups, distinguishing between trivial human monitoring, single endpoint human action, and highly involved interaction between the human(s) and the AI. These correspond to total functions, many-one reductions, and Turing reductions respectively. A taxonomy categorising HITL failure modes is then presented, highlighting the limitations on what any HITL setup can actually achieve. Our approach then identifies oversights from UK and EU legal frameworks, which focus on certain HITL setups which may not always achieve the desired ethical, legal, and sociotechnical outcomes. We suggest areas where the law should recognise the effectiveness of different HITL setups and assign responsibility in these contexts, avoiding unnecessary and unproductive human “scapegoating”. Overall, we show how HITL setups involve many technical design decisions, and can be prone to failures which are often out of the humans’ control. This opens up a new analytic perspective on the challenges arising in the creation of HITL setups, helping inform AI developers and lawmakers on designing HITL to better achieve their desired outcomes.
zh
[AI-105] Integrators at War: Mediating in AI-assisted Resort-to-Force Decisions
【速读】:该论文试图解决将人工智能(AI)系统整合到使用武力决策(RTF)过程中的挑战与不足,特别是关注在社会技术系统中被忽视的集成者群体所面临的问题。其解决方案的关键在于通过三步方法进行分析:首先,将不同行动者与AI系统之间的关系概念化为社会技术系统;其次,识别该系统中人类与机器协作在RTF决策中的挑战,包括技术本身、集成者角色以及人机交互方面的障碍;最后,提出政策建议以应对这些整合问题。
链接: https://arxiv.org/abs/2501.06861
作者: Dennis Müller,Maurice Chiodo,Mitja Sienknecht
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); History and Overview (math.HO)
备注: 32 pages. Keywords: education, artificial intelligence, AI integrators, resort to force, sociotechnical system, systems engineering
Abstract:The integration of AI systems into the military domain is changing the way war-related decisions are made. It binds together three disparate groups of actors - developers, integrators, users - and creates a relationship between these groups and the machine, embedded in the (pre-)existing organisational and system structures. In this article, we focus on the important, but often neglected, group of integrators within such a sociotechnical system. In complex human-machine configurations, integrators carry responsibility for linking the disparate groups of developers and users in the political and military system. To act as the mediating group requires a deep understanding of the other groups’ activities, perspectives and norms. We thus ask which challenges and shortcomings emerge from integrating AI systems into resort-to-force (RTF) decision-making processes, and how to address them. To answer this, we proceed in three steps. First, we conceptualise the relationship between different groups of actors and AI systems as a sociotechnical system. Second, we identify challenges within such systems for human-machine teaming in RTF decisions. We focus on challenges that arise a) from the technology itself, b) from the integrators’ role in the sociotechnical system, c) from the human-machine interaction. Third, we provide policy recommendations to address these shortcomings when integrating AI systems into RTF decision-making structures.
zh
[AI-106] Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain
【速读】:该论文试图解决传统神经皮层层次结构模型在解释感觉和运动信息处理机制时的局限性,特别是针对解剖学连接与标准层次解释不一致以及层次区域有时以并行方式响应的问题。论文提出的解决方案之关键在于“千脑理论”(Thousand Brains Theory),该理论认为每个皮层柱都是一个感觉运动学习系统,通过整合传感器在多次运动中的输入来学习,从而使得初级和次级区域也能学习和识别完整的三维物体。
链接: https://arxiv.org/abs/2507.05888
作者: Jeff Hawkins(1),Niels Leadholm(1),Viviane Clay(1) ((1) Thousand Brains Project)
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures
Abstract:In the traditional understanding of the neocortex, sensory information flows up a hierarchy of regions, with each level processing increasingly complex features. Information also flows down the hierarchy via a different set of connections. Although the hierarchical model has significant support, many anatomical connections do not conform to the standard hierarchical interpretation. In addition, hierarchically arranged regions sometimes respond in parallel, not sequentially as would occur in a hierarchy. This and other evidence suggests that two regions can act in parallel and hierarchically at the same time. Given this flexibility, the word “heterarchy” might be a more suitable term to describe neocortical organization. This paper proposes a new interpretation of how sensory and motor information is processed in the neocortex. The key to our proposal is what we call the “Thousand Brains Theory”, which posits that every cortical column is a sensorimotor learning system. Columns learn by integrating sensory input over multiple movements of a sensor. In this view, even primary and secondary regions, such as V1 and V2, can learn and recognize complete 3D objects. This suggests that the hierarchical connections between regions are used to learn the compositional structure of parent objects composed of smaller child objects. We explain the theory by examining the different types of long-range connections between cortical regions and between the neocortex and thalamus. We describe these connections, and then suggest the specific roles they play in the context of a heterarchy of sensorimotor regions. We also suggest that the thalamus plays an essential role in transforming the pose between objects and sensors. The novel perspective we argue for here has broad implications for both neuroscience and artificial intelligence.
zh
[AI-107] MP-ALOE: An r2SCAN dataset for universal machine learning interatomic potentials
【速读】:该论文试图解决生成高精度、广泛适用的机器学习原子间势函数的难题,以提升材料模拟的效率与准确性。其解决方案的关键在于构建了MP-ALOE数据集,这是一个包含近100万条基于r2SCAN泛函的密度泛函理论(DFT)计算结果的数据库,通过主动学习方法生成,并主要包含非平衡结构,从而增强了模型对复杂物理条件下的适应能力。
链接: https://arxiv.org/abs/2507.05559
作者: Matthew C. Kuner,Aaron D. Kaplan,Kristin A. Persson,Mark Asta,Daryl C. Chrzan
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: To download the dataset and associated files, see this https URL
Abstract:We present MP-ALOE, a dataset of nearly 1 million DFT calculations using the accurate r2SCAN meta-generalized gradient approximation. Covering 89 elements, MP-ALOE was created using active learning and primarily consists of off-equilibrium structures. We benchmark a machine learning interatomic potential trained on MP-ALOE, and evaluate its performance on a series of benchmarks, including predicting the thermochemical properties of equilibrium structures; predicting forces of far-from-equilibrium structures; maintaining physical soundness under static extreme deformations; and molecular dynamic stability under extreme temperatures and pressures. MP-ALOE shows strong performance on all of these benchmarks, and is made public for the broader community to utilize.
zh
[AI-108] Solar Flare Prediction Using LSTM and DLSTM with Sliding Window Pattern Recognition
【速读】:该论文试图解决太阳耀斑发生预测的问题,特别是在面对太阳复杂且自组织临界性驱动行为所带来的长期预测挑战时。其解决方案的关键在于结合长短期记忆网络(LSTM)与分解-LSTM(DLSTM)模型,并引入集成算法,同时采用滑动窗口技术处理不规则和正则化的时间序列数据。通过正则化降低复杂度并增强大耀斑活动的捕捉能力,以及应用重采样方法缓解类别不平衡问题,最终在正则化时间序列上使用集成DLSTM模型取得了优于其他模型的预测性能,表现为更高的TS S(0.74)、召回率(0.95)和AUC(0.87)。
链接: https://arxiv.org/abs/2507.05313
作者: Zeinab Hassani,Davud Mohammadpur,Hossein Safari
机构: 未知
类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in the Astrophysical Journal Supplement Series, volume 279, 2025, DOI: https://doi.org/10.3847/1538-4365/addc73
Abstract:We investigate the use of Long Short-Term Memory (LSTM) and Decomposition-LSTM (DLSTM) networks, combined with an ensemble algorithm, to predict solar flare occurrences using time-series data from the GOES catalog. The dataset spans from 2003 to 2023 and includes 151,071 flare events. Among approximately possible patterns, 7,552 yearly pattern windows are identified, highlighting the challenge of long-term forecasting due to the Sun’s complex, self-organized criticality-driven behavior. A sliding window technique is employed to detect temporal quasi-patterns in both irregular and regularized flare time series. Regularization reduces complexity, enhances large flare activity, and captures active days more effectively. To address class imbalance, resampling methods are applied. LSTM and DLSTM models are trained on sequences of peak fluxes and waiting times from irregular time series, while LSTM and DLSTM, integrated with an ensemble approach, are applied to sliding windows of regularized time series with a 3-hour interval. Performance metrics, particularly TSS (0.74), recall (0.95) and the area under the curve (AUC=0.87) in the receiver operating characteristic (ROC), indicate that DLSTM with an ensemble approach on regularized time series outperforms other models, offering more accurate large-flare forecasts with fewer false errors compared to models trained on irregular time series. The superior performance of DLSTM is attributed to its ability to decompose time series into trend and seasonal components, effectively isolating random noise. This study underscores the potential of advanced machine learning techniques for solar flare prediction and highlights the importance of incorporating various solar cycle phases and resampling strategies to enhance forecasting reliability.
zh
[AI-109] Enjoying Non-linearity in Multinomial Logistic Bandits
【速读】:该论文试图解决多类别逻辑回归老虎机(multinomial logistic bandit)问题,旨在通过利用概率反馈来最大化预期奖励。其解决方案的关键在于扩展了二元设置中问题相关的常数 κ∗ 的定义至多类别情形,并提出一种高效算法,该算法利用了问题的非线性特性。通过引入这一扩展的 κ∗,该方法在 T 轮次内实现了问题相关的遗憾界为 \smash\widetilde\mathcalO( Kd \sqrt{T}/\kappa_* ),相较于现有最优结果 \smash\widetilde\mathcalO( Kd \sqrt{T} ) 有所改进,且证明了对 κ∗ 的依赖是紧致的。
链接: https://arxiv.org/abs/2507.05306
作者: Pierre Boudart,Pierre Gaillard,Alessandro Rudi(PSL, DI-ENS, Inria)
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:
Abstract:We consider the multinomial logistic bandit problem, a variant of generalized linear bandits where a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant \kappa_* , that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over T rounds from \smashO(d\sqrtT) to \smashO(d\sqrtT/\kappa_) , where d is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To achieve this, we extend the definition of \kappa_ to the multinomial setting and propose an efficient algorithm that leverages the problem’s non-linearity. Our method yields a problem-dependent regret bound of order \smash\widetilde\mathcalO( Kd \sqrtT/\kappa_) , where K is the number of actions and \kappa_ \ge 1 . This improves upon the best existing guarantees of order \smash\widetilde\mathcalO( Kd \sqrtT ) . Moreover, we provide a \smash \Omega(d\sqrtT/\kappa_) lower-bound, showing that our dependence on \kappa_ is optimal.
zh
机器学习
[LG-0] Deep Learning Optimization of Two-State Pinching Antennas Systems
链接: https://arxiv.org/abs/2507.06222
作者: Odysseas G. Karagiannidis,Victoria E. Galanopoulou,Panagiotis D. Diamantoulakis,Zhiguo Ding,Octavia Dobre
类目: Machine Learning (cs.LG)
*备注:
Abstract:The evolution of wireless communication systems requires flexible, energy-efficient, and cost-effective antenna technologies. Pinching antennas (PAs), which can dynamically control electromagnetic wave propagation through binary activation states, have recently emerged as a promising candidate. In this work, we investigate the problem of optimally selecting a subset of fixed-position PAs to activate in a waveguide, when the aim is to maximize the communication rate at a user terminal. Due to the complex interplay between antenna activation, waveguide-induced phase shifts, and power division, this problem is formulated as a combinatorial fractional 0-1 quadratic program. To efficiently solve this challenging problem, we use neural network architectures of varying complexity to learn activation policies directly from data, leveraging spatial features and signal structure. Furthermore, we incorporate user location uncertainty into our training and evaluation pipeline to simulate realistic deployment conditions. Simulation results demonstrate the effectiveness and robustness of the proposed models.
[LG-1] Modern Methods in Associative Memory ICML2025
链接: https://arxiv.org/abs/2507.06211
作者: Dmitry Krotov,Benjamin Hoover,Parikshit Ram,Bao Pham
类目: Machine Learning (cs.LG)
*备注: Tutorial at ICML 2025
Abstract:Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
[LG-2] Conservative approximation-based feedforward neural network for WENO schemes
链接: https://arxiv.org/abs/2507.06190
作者: Kwanghyuk Park,Jiaxi Gu,Jae-Hun Jung
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:In this work, we present the feedforward neural network based on the conservative approximation to the derivative from point values, for the weighted essentially non-oscillatory (WENO) schemes in solving hyperbolic conservation laws. The feedforward neural network, whose inputs are point values from the three-point stencil and outputs are two nonlinear weights, takes the place of the classical WENO weighting procedure. For the training phase, we employ the supervised learning and create a new labeled dataset for one-dimensional conservative approximation, where we construct a numerical flux function from the given point values such that the flux difference approximates the derivative to high-order accuracy. The symmetric-balancing term is introduced for the loss function so that it propels the neural network to match the conservative approximation to the derivative and satisfy the symmetric property that WENO3-JS and WENO3-Z have in common. The consequent WENO schemes, WENO3-CADNNs, demonstrate robust generalization across various benchmark scenarios and resolutions, where they outperform WENO3-Z and achieve accuracy comparable to WENO5-JS.
[LG-3] Aliasing in Convnets: A Frame-Theoretic Perspective
链接: https://arxiv.org/abs/2507.06152
作者: Daniel Haider,Vincent Lostanlen,Martin Ehler,Nicki Holighaus,Peter Balazs
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Using a stride in a convolutional layer inherently introduces aliasing, which has implications for numerical stability and statistical generalization. While techniques such as the parametrizations via paraunitary systems have been used to promote orthogonal convolution and thus ensure Parseval stability, a general analysis of aliasing and its effects on the stability has not been done in this context. In this article, we adapt a frame-theoretic approach to describe aliasing in convolutional layers with 1D kernels, leading to practical estimates for stability bounds and characterizations of Parseval stability, that are tailored to take short kernel sizes into account. From this, we derive two computationally very efficient optimization objectives that promote Parseval stability via systematically suppressing aliasing. Finally, for layers with random kernels, we derive closed-form expressions for the expected value and variance of the terms that describe the aliasing effects, revealing fundamental insights into the aliasing behavior at initialization.
[LG-4] Safe Domain Randomization via Uncertainty-Aware Out-of-Distribution Detection and Policy Adaptation
链接: https://arxiv.org/abs/2507.06111
作者: Mohamad H. Danesh,Maxime Wabartha,Stanley Wu,Joelle Pineau,Hsiu-Chin Lin
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Deploying reinforcement learning (RL) policies in real-world involves significant challenges, including distribution shifts, safety concerns, and the impracticality of direct interactions during policy refinement. Existing methods, such as domain randomization (DR) and off-dynamics RL, enhance policy robustness by direct interaction with the target domain, an inherently unsafe practice. We propose Uncertainty-Aware RL (UARL), a novel framework that prioritizes safety during training by addressing Out-Of-Distribution (OOD) detection and policy adaptation without requiring direct interactions in target domain. UARL employs an ensemble of critics to quantify policy uncertainty and incorporates progressive environmental randomization to prepare the policy for diverse real-world conditions. By iteratively refining over high-uncertainty regions of the state space in simulated environments, UARL enhances robust generalization to the target domain without explicitly training on it. We evaluate UARL on MuJoCo benchmarks and a quadrupedal robot, demonstrating its effectiveness in reliable OOD detection, improved performance, and enhanced sample efficiency compared to baselines.
[LG-5] CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs
链接: https://arxiv.org/abs/2507.06087
作者: Haoxi Li,Sikai Bai,Jie Zhang,Song Guo
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures
Abstract:Large reasoning models (LRMs) have demonstrated impressive capabilities in domains like mathematics and program synthesis. Despite their strong performance, LRMs often exhibit overthinking – excessive and redundant reasoning steps that introduce inefficiencies during inference. This phenomenon raises an important question for LRM self-evaluation: How can a model autonomously assess the correctness of its own reasoning trajectory without external labels? To address this, we propose Chain-of-Reasoning Embedding (CoRE), a series of hidden states in latent space to enable label-free self-evaluation on intermediate reasoning steps of LRMs, so as to enhance metacognition abilities for improved reasoning efficiency. By analyzing the geometric properties of the CoRE trajectories, we reveal that redundant reasoning usually presents cyclical fluctuations, which correspond to repetitive and unconscious reflection/exploration. Leveraging this insight, we further introduce a training-free, label-free self-evaluation framework, CoRE-Eval, to detect such patterns and dynamically determine whether to terminate reasoning early. Extensive experiments on mathematical reasoning benchmarks (GSM8K, MATH-500, and AIME) and across model sizes from 7B to 32B demonstrate that CoRE-Eval reduces chain-of-thought length by 13.7% to 33.2% while improving answer accuracy by around 10%, achieving 70.0% accuracy on the challenging AIME benchmark with the 32B model.
[LG-6] Few-Shot Learning by Explicit Physics Integration: An Application to Groundwater Heat Transport
链接: https://arxiv.org/abs/2507.06062
作者: Julia Pelzer,Corné Verburg,Alexander Heinlein,Miriam Schulte
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Machine learning methods often struggle with real-world applications in science and engineering due to limited or low-quality training data. In this work, the example of groundwater flow with heat transport is considered; this corresponds to an advection-diffusion process under heterogeneous flow conditions, that is, spatially distributed material parameters and heat sources. Classical numerical simulations are costly and challenging due to high spatio-temporal resolution requirements and large domains. While often computationally more efficient, purely data-driven surrogate models face difficulties, particularly in predicting the advection process, which is highly sensitive to input variations and involves long-range spatial interactions. Therefore, in this work, a Local-Global Convolutional Neural Network (LGCNN) approach is introduced. It combines a lightweight numerical surrogate for the transport process (global) with convolutional neural networks for the groundwater velocity and heat diffusion processes (local). With the LGCNN, a city-wide subsurface temperature field is modeled, involving a heterogeneous groundwater flow field and one hundred groundwater heat pump injection points forming interacting heat plumes over long distances. The model is first systematically analyzed based on random subsurface input fields. Then, the model is trained on a handful of cut-outs from a real-world subsurface map of the Munich region in Germany, and it scales to larger cut-outs without retraining. All datasets, our code, and trained models are published for reproducibility.
[LG-7] EdgeCodec: Onboard Lightweight High Fidelity Neural Compressor with Residual Vector Quantization DATE
链接: https://arxiv.org/abs/2507.06040
作者: Benjamin Hodo(1),Tommaso Polonelli(1),Amirhossein Moallemi(2),Luca Benini(1),Michele Magno(1) ((1) D-ITET, ETH Zürich, Switzerland, (2) RTDT Laboratories, Switzerland)
类目: Machine Learning (cs.LG)
*备注: 7 Pages, 1 Figure, Accepted for presentation at the International Workshop on Advances in Sensors and Interfaces (IWASI), Italy 2025. \c{opyright} IEEE. DOI to be updated upon publication
Abstract:We present EdgeCodec, an end-to-end neural compressor for barometric data collected from wind turbine blades. EdgeCodec leverages a heavily asymmetric autoencoder architecture, trained with a discriminator and enhanced by a Residual Vector Quantizer to maximize compression efficiency. It achieves compression rates between 2’560:1 and 10’240:1 while maintaining a reconstruction error below 3%, and operates in real time on the GAP9 microcontroller with bitrates ranging from 11.25 to 45 bits per second. Bitrates can be selected on a sample-by-sample basis, enabling on-the-fly adaptation to varying network conditions. In its highest compression mode, EdgeCodec reduces the energy consumption of wireless data transmission by up to 2.9x, significantly extending the operational lifetime of deployed sensor units.
[LG-8] Fredholm Neural Networks for forward and inverse problems in elliptic PDEs
链接: https://arxiv.org/abs/2507.06038
作者: Kyriakos Georgiou,Constantinos Siettos,Athanasios N. Yannacopoulos
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Building on our previous work introducing Fredholm Neural Networks (Fredholm NNs/ FNNs) for solving integral equations, we extend the framework to tackle forward and inverse problems for linear and semi-linear elliptic partial differential equations. The proposed scheme consists of a deep neural network (DNN) which is designed to represent the iterative process of fixed-point iterations for the solution of elliptic PDEs using the boundary integral method within the framework of potential theory. The number of layers, weights, biases and hyperparameters are computed in an explainable manner based on the iterative scheme, and we therefore refer to this as the Potential Fredholm Neural Network (PFNN). We show that this approach ensures both accuracy and explainability, achieving small errors in the interior of the domain, and near machine-precision on the boundary. We provide a constructive proof for the consistency of the scheme and provide explicit error bounds for both the interior and boundary of the domain, reflected in the layers of the PFNN. These error bounds depend on the approximation of the boundary function and the integral discretization scheme, both of which directly correspond to components of the Fredholm NN architecture. In this way, we provide an explainable scheme that explicitly respects the boundary conditions. We assess the performance of the proposed scheme for the solution of both the forward and inverse problem for linear and semi-linear elliptic PDEs in two dimensions.
[LG-9] Multi-view mid fusion: a universal approach for learning in an HDLSS setting
链接: https://arxiv.org/abs/2507.06026
作者: Lynn Houthuys
类目: Machine Learning (cs.LG)
*备注:
Abstract:The high-dimensional low-sample-size (HDLSS) setting presents significant challenges in various applications where the feature dimension far exceeds the number of available samples. This paper introduces a universal approach for learning in HDLSS setting using multi-view mid fusion techniques. It shows how existing mid fusion multi-view methods perform well in an HDLSS setting even if no inherent views are provided. Three view construction methods are proposed that split the high-dimensional feature vectors into smaller subsets, each representing a different view. Extensive experimental validation across model-types and learning tasks confirm the effectiveness and generalization of the approach. We believe the work in this paper lays the foundation for further research into the universal benefits of multi-view mid fusion learning.
[LG-10] Kamae: Bridging Spark and Keras for Seamless ML Preprocessing
链接: https://arxiv.org/abs/2507.06021
作者: George Barrowclough,Marian Andrecki,James Shinner,Daniele Donghi
类目: Machine Learning (cs.LG)
*备注:
Abstract:In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. Framework’s utility is illustrated on real-world use cases, including MovieLens dataset and Expedia’s Learning-to-Rank pipelines. The code is available at this https URL.
[LG-11] KnowIt: Deep Time Series Modeling and Interpretation
链接: https://arxiv.org/abs/2507.06009
作者: M.W. Theunissen,R. Rabe,M.H. Davel
类目: Machine Learning (cs.LG)
*备注:
Abstract:KnowIt (Knowledge discovery in time series data) is a flexible framework for building deep time series models and interpreting them. It is implemented as a Python toolkit, with source code and documentation available from this https URL. It imposes minimal assumptions about task specifications and decouples the definition of dataset, deep neural network architecture, and interpretability technique through well defined interfaces. This ensures the ease of importing new datasets, custom architectures, and the definition of different interpretability paradigms while maintaining on-the-fly modeling and interpretation of different aspects of a user’s own time series data. KnowIt aims to provide an environment where users can perform knowledge discovery on their own complex time series data through building powerful deep learning models and explaining their behavior. With ongoing development, collaboration and application our goal is to make this a platform to progress this underexplored field and produce a trusted tool for deep time series modeling.
[LG-12] Robust Speech-Workload Estimation for Intelligent Human-Robot Systems
链接: https://arxiv.org/abs/2507.05985
作者: Julian Fortune,Julie A. Adams,Jamison Heard
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Demanding task environments (e.g., supervising a remotely piloted aircraft) require performing tasks quickly and accurately; however, periods of low and high operator workload can decrease task performance. Intelligent modulation of the system’s demands and interaction modality in response to changes in operator workload state may increase performance by avoiding undesirable workload states. This system requires real-time estimation of each workload component (i.e., cognitive, physical, visual, speech, and auditory) to adapt the correct modality. Existing workload systems estimate multiple workload components post-hoc, but few estimate speech workload, or function in real-time. An algorithm to estimate speech workload and mitigate undesirable workload states in real-time is presented. An analysis of the algorithm’s accuracy is presented, along with the results demonstrating the algorithm’s generalizability across individuals and human-machine teaming paradigms. Real-time speech workload estimation is a crucial element towards developing adaptive human-machine systems.
[LG-13] Generalized and Unified Equivalences between Hardness and Pseudoentropy
链接: https://arxiv.org/abs/2507.05972
作者: Lunjia Hu,Salil Vadhan
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:
Abstract:Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.
[LG-14] Improving AI-Based Canine Heart Disease Diagnosis with Expert-Consensus Auscultation Labeling
链接: https://arxiv.org/abs/2507.05950
作者: Pinar Bisgin,Tom Strube,Niklas Tschorn,Michael Pantförder,Maximilian Fecke,Ingrid Ljungvall,Jens Häggström,Gerhard Wess,Christoph Schummer,Sven Meister,Falk M. Howar
类目: Machine Learning (cs.LG)
*备注: Accepted to IEEE Engineering in Medicine and Biology Conference (EMBC) 2025
Abstract:Noisy labels pose significant challenges for AI model training in veterinary medicine. This study examines expert assessment ambiguity in canine auscultation data, highlights the negative impact of label noise on classification performance, and introduces methods for label noise reduction. To evaluate whether label noise can be minimized by incorporating multiple expert opinions, a dataset of 140 heart sound recordings (HSR) was annotated regarding the intensity of holosystolic heart murmurs caused by Myxomatous Mitral Valve Disease (MMVD). The expert opinions facilitated the selection of 70 high-quality HSR, resulting in a noise-reduced dataset. By leveraging individual heart cycles, the training data was expanded and classification robustness was enhanced. The investigation encompassed training and evaluating three classification algorithms: AdaBoost, XGBoost, and Random Forest. While AdaBoost and Random Forest exhibited reasonable performances, XGBoost demonstrated notable improvements in classification accuracy. All algorithms showed significant improvements in classification accuracy due to the applied label noise reduction, most notably XGBoost. Specifically, for the detection of mild heart murmurs, sensitivity increased from 37.71% to 90.98% and specificity from 76.70% to 93.69%. For the moderate category, sensitivity rose from 30.23% to 55.81% and specificity from 64.56% to 97.19%. In the loud/thrilling category, sensitivity and specificity increased from 58.28% to 95.09% and from 84.84% to 89.69%, respectively. These results highlight the importance of minimizing label noise to improve classification algorithms for the detection of canine heart murmurs. Index Terms: AI diagnosis, canine heart disease, heart sound classification, label noise reduction, machine learning, XGBoost, veterinary cardiology, MMVD.
[LG-15] Diffusion Dataset Condensation: Training Your Diffusion Model Faster with Less Data
链接: https://arxiv.org/abs/2507.05914
作者: Rui Huang,Shitong Shao,Zikai Zhou,Pukun Zhao,Hangyu Guo,Tian Ye,Lichen Bai,Shuo Yang,Zeke Xie
类目: Machine Learning (cs.LG)
*备注: Iintroduces D2C: a novel framework for diffusion dataset condensation
Abstract:Diffusion models have achieved remarkable success in various generative tasks, but training them remains highly resource-intensive, often requiring millions of images and many days of GPU computation. From a data-centric perspective addressing this limitation, we study diffusion dataset condensation as a new and challenging problem setting. The goal is to construct a “synthetic” sub-dataset with significantly fewer samples than the original dataset, enabling high-quality diffusion model training with greatly reduced cost. To the best of our knowledge, we are the first to formally investigate dataset condensation for diffusion models, whereas prior work focused on training discriminative models. To tackle this new challenge, we propose a novel Diffusion Dataset Condensation (D2C) framework, which consists of two phases: Select and Attach. The Select phase identifies a compact and diverse subset using a diffusion difficulty score and interval sampling. The Attach phase enhances the selected subset by attaching rich semantic and visual representations to strengthen the conditional signals. Extensive experiments across various dataset sizes, model architectures, and resolutions show that our D2C framework enables significantly faster diffusion model training with dramatically fewer data, while preserving high visual quality. Notably, for the SiT-XL/2 architecture, D2C achieves a 100x training speed-up, reaching a FID score of 4.3 in just 40k steps using only 0.8% of the training data.
[LG-16] Stable Acoustic Relay Assignment with High Throughput via Lase Chaos-based Reinforcement Learning
链接: https://arxiv.org/abs/2507.05900
作者: Zengjing Chen,Lu Wang,Chengzhi Xing
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Optimization and Control (math.OC)
*备注:
Abstract:This study addresses the problem of stable acoustic relay assignment in an underwater acoustic network. Unlike the objectives of most existing literature, two distinct objectives, namely classical stable arrangement and ambiguous stable arrangement, are considered. To achieve these stable arrangements, a laser chaos-based multi-processing learning (LC-ML) method is introduced to efficiently obtain high throughput and rapidly attain stability. In order to sufficiently explore the relay’s decision-making, this method uses random numbers generated by laser chaos to learn the assignment of relays to multiple source nodes. This study finds that the laser chaos-based random number and multi-processing in the exchange process have a positive effect on higher throughput and strong adaptability with environmental changing over time. Meanwhile, ambiguous cognitions result in the stable configuration with less volatility compared to accurate ones. This provides a practical and useful method and can be the basis for relay selection in complex underwater environments.
[LG-17] Robust Power System State Estimation using Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2507.05874
作者: Solon Falas,Markos Asprou,Charalambos Konstantinou,Maria K. Michael
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Modern power systems face significant challenges in state estimation and real-time monitoring, particularly regarding response speed and accuracy under faulty conditions or cyber-attacks. This paper proposes a hybrid approach using physics-informed neural networks (PINNs) to enhance the accuracy and robustness, of power system state estimation. By embedding physical laws into the neural network architecture, PINNs improve estimation accuracy for transmission grid applications under both normal and faulty conditions, while also showing potential in addressing security concerns such as data manipulation attacks. Experimental results show that the proposed approach outperforms traditional machine learning models, achieving up to 83% higher accuracy on unseen subsets of the training dataset and 65% better performance on entirely new, unrelated datasets. Experiments also show that during a data manipulation attack against a critical bus in a system, the PINN can be up to 93% more accurate than an equivalent neural network.
[LG-18] Communication-Efficient Module-Wise Federated Learning for Grasp Pose Detection in Cluttered Environments
链接: https://arxiv.org/abs/2507.05861
作者: Woonsang Kang,Joohyung Lee,Seungjun Kim,Jungchan Cho,Yoonseon Oh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures. Submitted to IEEE Robotics and Automation Letters (RA-L)
Abstract:Grasp pose detection (GPD) is a fundamental capability for robotic autonomy, but its reliance on large, diverse datasets creates significant data privacy and centralization challenges. Federated Learning (FL) offers a privacy-preserving solution, but its application to GPD is hindered by the substantial communication overhead of large models, a key issue for resource-constrained robots. To address this, we propose a novel module-wise FL framework that begins by analyzing the learning dynamics of the GPD model’s functional components. This analysis identifies slower-converging modules, to which our framework then allocates additional communication effort. This is realized through a two-phase process: a standard full-model training phase is followed by a communication-efficient phase where only the identified subset of slower-converging modules is trained and their partial updates are aggregated. Extensive experiments on the GraspNet-1B dataset demonstrate that our method outperforms standard FedAvg and other baselines, achieving higher accuracy for a given communication budget. Furthermore, real-world experiments on a physical robot validate our approach, showing a superior grasp success rate compared to baseline methods in cluttered scenes. Our work presents a communication-efficient framework for training robust, generalized GPD models in a decentralized manner, effectively improving the trade-off between communication cost and model performance.
[LG-19] Prototype-Guided and Lightweight Adapters for Inherent Interpretation and Generalisation in Federated Learning MICCAI2025
链接: https://arxiv.org/abs/2507.05852
作者: Samuel Ofosu Mensah,Kerol Djoumessi,Philipp Berens
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 4 figures, submitted to MICCAI 2025, used llncs document class
Abstract:Federated learning (FL) provides a promising paradigm for collaboratively training machine learning models across distributed data sources while maintaining privacy. Nevertheless, real-world FL often faces major challenges including communication overhead during the transfer of large model parameters and statistical heterogeneity, arising from non-identical independent data distributions across clients. In this work, we propose an FL framework that 1) provides inherent interpretations using prototypes, and 2) tackles statistical heterogeneity by utilising lightweight adapter modules to act as compressed surrogates of local models and guide clients to achieve generalisation despite varying client distribution. Each client locally refines its model by aligning class embeddings toward prototype representations and simultaneously adjust the lightweight adapter. Our approach replaces the need to communicate entire model weights with prototypes and lightweight adapters. This design ensures that each client’s model aligns with a globally shared structure while minimising communication load and providing inherent interpretations. Moreover, we conducted our experiments on a real-world retinal fundus image dataset, which provides clinical-site information. We demonstrate inherent interpretable capabilities and perform a classification task, which shows improvements in accuracy over baseline algorithms.
[LG-20] Improving Robustness of Foundation Models in Domain Adaptation with Soup-Adapters
链接: https://arxiv.org/abs/2507.05807
作者: Marco Roschkowski
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we tackle two fundamental problems in few-shot domain adaptation of foundation models. First, hyperparameter tuning is often impractical due to the lack of large validation datasets. Second, model robustness under distribution shifts where test time data deviates slightly from training distributions, remains a concern. We show that by training multiple independent adapters and averaging their outputs, the new model has a higher performance and is more robust to distribution shifts compared to any individual adapter. This improvement holds even when the adapters are trained with diverse hyperparameters sampled from a wide range, resulting in varied individual performance. Consequently, our method addresses both of the problems described above. The ensemble is also significantly less sensitive to the residual ratio, a critical hyperparameter of CLIP-Adapter. Since the ensemble can be reparameterized to a single adapter again using a principled concatenation of the parameters, we refer to our method as Soup-Adapter. This is also the first study to explore CLIP adapter-style techniques for DINOv2 and to directly compare them with CLIP in this setting.
[LG-21] Predicting Graph Structure via Adapted Flux Balance Analysis
链接: https://arxiv.org/abs/2507.05806
作者: Sevvandi Kandanaarachchi,Ziqi Xu,Stefan Westerlund,Conrad Sanderson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: extended and revised version of arXiv:2401.04280
Abstract:Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not to change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.
[LG-22] Robust Bandwidth Estimation for Real-Time Communication with Offline Reinforcement Learning
链接: https://arxiv.org/abs/2507.05785
作者: Jian Kai,Tianwei Zhang,Zihan Ling,Yang Cao,Can Shen
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Accurate bandwidth estimation (BWE) is critical for real-time communication (RTC) systems. Traditional heuristic approaches offer limited adaptability under dynamic networks, while online reinforcement learning (RL) suffers from high exploration costs and potential service disruptions. Offline RL, which leverages high-quality data collected from real-world environments, offers a promising alternative. However, challenges such as out-of-distribution (OOD) actions, policy extraction from behaviorally diverse datasets, and reliable deployment in production systems remain unsolved. We propose RBWE, a robust bandwidth estimation framework based on offline RL that integrates Q-ensemble (an ensemble of Q-functions) with a Gaussian mixture policy to mitigate OOD risks and enhance policy learning. A fallback mechanism ensures deployment stability by switching to heuristic methods under high uncertainty. Experimental results show that RBWE reduces overestimation errors by 18% and improves the 10th percentile Quality of Experience (QoE) by 18.6%, demonstrating its practical effectiveness in real-world RTC applications.
[LG-23] From Motion to Meaning: Biomechanics-Informed Neural Network for Explainable Cardiovascular Disease Identification
链接: https://arxiv.org/abs/2507.05783
作者: Comte Valentin,Gemma Piella,Mario Ceresa,Miguel A. Gonzalez Ballester
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cardiac diseases are among the leading causes of morbidity and mortality worldwide, which requires accurate and timely diagnostic strategies. In this study, we introduce an innovative approach that combines deep learning image registration with physics-informed regularization to predict the biomechanical properties of moving cardiac tissues and extract features for disease classification. We utilize the energy strain formulation of Neo-Hookean material to model cardiac tissue deformations, optimizing the deformation field while ensuring its physical and biomechanical coherence. This explainable approach not only improves image registration accuracy, but also provides insights into the underlying biomechanical processes of the cardiac tissues. Evaluation on the Automated Cardiac Diagnosis Challenge (ACDC) dataset achieved Dice scores of 0.945 for the left ventricular cavity, 0.908 for the right ventricular cavity, and 0.905 for the myocardium. Subsequently, we estimate the local strains within the moving heart and extract a detailed set of features used for cardiovascular disease classification. We evaluated five classification algorithms, Logistic Regression, Multi-Layer Perceptron, Support Vector Classifier, Random Forest, and Nearest Neighbour, and identified the most relevant features using a feature selection algorithm. The best performing classifier obtained a classification accuracy of 98% in the training set and 100% in the test set of the ACDC dataset. By integrating explainable artificial intelligence, this method empowers clinicians with a transparent understanding of the model’s predictions based on cardiac mechanics, while also significantly improving the accuracy and reliability of cardiac disease diagnosis, paving the way for more personalized and effective patient care.
[LG-24] Jigsaw: Training Multi-Billion-Parameter AI Weather Models with Optimized Model Parallelism
链接: https://arxiv.org/abs/2507.05753
作者: Deifilia Kieckhefen,Markus Götz,Lars H. Heyen,Achim Streit,Charlotte Debus
类目: Machine Learning (cs.LG)
*备注: 12 pages, 10 figures
Abstract:AI-based methods have revolutionized atmospheric forecasting, with recent successes in medium-range forecasting spurring the development of climate foundation models. Accurate modeling of complex atmospheric dynamics at high spatial resolutions and longer lead times requires large neural networks and gigabyte-sized data samples, making accelerator memory and I/O-bandwidth the bottlenecks for model training. We introduce WeatherMixer, a multi-layer-perceptron-based architecture whose workload scales linearly with input size, allowing the model to learn global weather phenomena at accuracies similar to numerical weather prediction. To cope with the computational demand, we propose Jigsaw, a novel model parallelization scheme that employs both domain and tensor parallelism, eliminating memory redundancy. Jigsaw exceeds state-of-the-art performance in strong scaling in compute-communication-limited systems and achieves superscalar weak scaling in I/O-bandwidth-limited systems. We scale training to 256 GPUs, reaching peak performances of 9 and 11 PFLOPs, 23% and 28% of theoretical peaks, achieving 68% and 72% scaling efficiency versus 51% without model parallelism.
[LG-25] Hierarchical Task Offloading for UAV-Assisted Vehicular Edge Computing via Deep Reinforcement Learning
链接: https://arxiv.org/abs/2507.05722
作者: Hongbao Li,Ziye Jia,Sijie He,Kun Guo,Qihui Wu
类目: Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, conference
Abstract:With the emergence of compute-intensive and delay-sensitive applications in vehicular networks, unmanned aerial vehicles (UAVs) have emerged as a promising complement for vehicular edge computing due to the high mobility and flexible deployment. However, the existing UAV-assisted offloading strategies are insufficient in coordinating heterogeneous computing resources and adapting to dynamic network conditions. Hence, this paper proposes a dual-layer UAV-assisted edge computing architecture based on partial offloading, composed of the relay capability of high-altitude UAVs and the computing support of low-altitude UAVs. The proposed architecture enables efficient integration and coordination of heterogeneous resources. A joint optimization problem is formulated to minimize the system delay and energy consumption while ensuring the task completion rate. To solve the high-dimensional decision problem, we reformulate the problem as a Markov decision process and propose a hierarchical offloading scheme based on the soft actor-critic algorithm. The method decouples global and local decisions, where the global decisions integrate offloading ratios and trajectory planning into continuous actions, while the local scheduling is handled via designing a priority-based mechanism. Simulations are conducted and demonstrate that the proposed approach outperforms several baselines in task completion rate, system efficiency, and convergence speed, showing strong robustness and applicability in dynamic vehicular environments.
[LG-26] Canine Clinical Gait Analysis for Orthopedic and Neurological Disorders: An Inertial Deep-Learning Approach
链接: https://arxiv.org/abs/2507.05671
作者: Netta Palez,Léonie Straß,Sebastian Meller,Holger Volk,Anna Zamansky,Itzik Klein
类目: Machine Learning (cs.LG)
*备注: 20 pages, 11 figures (one combine 2 images), 7 tables, 41 references
Abstract:Canine gait analysis using wearable inertial sensors is gaining attention in veterinary clinical settings, as it provides valuable insights into a range of mobility impairments. Neurological and orthopedic conditions cannot always be easily distinguished even by experienced clinicians. The current study explored and developed a deep learning approach using inertial sensor readings to assess whether neurological and orthopedic gait could facilitate gait analysis. Our investigation focused on optimizing both performance and generalizability in distinguishing between these gait abnormalities. Variations in sensor configurations, assessment protocols, and enhancements to deep learning model architectures were further suggested. Using a dataset of 29 dogs, our proposed approach achieved 96% accuracy in the multiclass classification task (healthy/orthopedic/neurological) and 82% accuracy in the binary classification task (healthy/non-healthy) when generalizing to unseen dogs. Our results demonstrate the potential of inertial-based deep learning models to serve as a practical and objective diagnostic and clinical aid to differentiate gait assessment in orthopedic and neurological conditions.
[LG-27] Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study
链接: https://arxiv.org/abs/2507.05619
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Reward hacking in Reinforcement Learning (RL) systems poses a critical threat to the deployment of autonomous agents, where agents exploit flaws in reward functions to achieve high scores without fulfilling intended objectives. Despite growing awareness of this problem, systematic detection and mitigation approaches remain limited. This paper presents a large-scale empirical study of reward hacking across diverse RL environments and algorithms. We analyze 15,247 training episodes across 15 RL environments (Atari, MuJoCo, custom domains) and 5 algorithms (PPO, SAC, DQN, A3C, Rainbow), implementing automated detection algorithms for six categories of reward hacking: specification gaming, reward tampering, proxy optimization, objective misalignment, exploitation patterns, and wireheading. Our detection framework achieves 78.4% precision and 81.7% recall across environments, with computational overhead under 5%. Through controlled experiments varying reward function properties, we demonstrate that reward density and alignment with true objectives significantly impact hacking frequency ( p 0.001 , Cohen’s d = 1.24 ). We validate our approach through three simulated application studies representing recommendation systems, competitive gaming, and robotic control scenarios. Our mitigation techniques reduce hacking frequency by up to 54.6% in controlled scenarios, though we find these trade-offs are more challenging in practice due to concept drift, false positive costs, and adversarial adaptation. All detection algorithms, datasets, and experimental protocols are publicly available to support reproducible research in RL safety.
[LG-28] Model-free Optical Processors using In Situ Reinforcement Learning with Proximal Policy Optimization
链接: https://arxiv.org/abs/2507.05583
作者: Yuhang Li,Shiqi Chen,Tingyu Gong,Aydogan Ozcan
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph); Optics (physics.optics)
*备注: 19 Pages, 7 Figures
Abstract:Optical computing holds promise for high-speed, energy-efficient information processing, with diffractive optical networks emerging as a flexible platform for implementing task-specific transformations. A challenge, however, is the effective optimization and alignment of the diffractive layers, which is hindered by the difficulty of accurately modeling physical systems with their inherent hardware imperfections, noise, and misalignments. While existing in situ optimization methods offer the advantage of direct training on the physical system without explicit system modeling, they are often limited by slow convergence and unstable performance due to inefficient use of limited measurement data. Here, we introduce a model-free reinforcement learning approach utilizing Proximal Policy Optimization (PPO) for the in situ training of diffractive optical processors. PPO efficiently reuses in situ measurement data and constrains policy updates to ensure more stable and faster convergence. We experimentally validated our method across a range of in situ learning tasks, including targeted energy focusing through a random diffuser, holographic image generation, aberration correction, and optical image classification, demonstrating in each task better convergence and performance. Our strategy operates directly on the physical system and naturally accounts for unknown real-world imperfections, eliminating the need for prior system knowledge or modeling. By enabling faster and more accurate training under realistic experimental constraints, this in situ reinforcement learning approach could offer a scalable framework for various optical and physical systems governed by complex, feedback-driven dynamics.
[LG-29] Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines
链接: https://arxiv.org/abs/2507.05561
作者: Wilka Carvalho,Sam Hall-McMaster,Honglak Lee,Samuel J. Gershman
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Humans can pursue a near-infinite variety of tasks, but typically can only pursue a small number at the same time. We hypothesize that humans leverage experience on one task to preemptively learn solutions to other tasks that were accessible but not pursued. We formalize this idea as Multitask Preplay, a novel algorithm that replays experience on one task as the starting point for “preplay” – counterfactual simulation of an accessible but unpursued task. Preplay is used to learn a predictive representation that can support fast, adaptive task performance later on. We first show that, compared to traditional planning and predictive representation methods, multitask preplay better predicts how humans generalize to tasks that were accessible but not pursued in a small grid-world, even when people didn’t know they would need to generalize to these tasks. We then show these predictions generalize to Craftax, a partially observable 2D Minecraft environment. Finally, we show that Multitask Preplay enables artificial agents to learn behaviors that transfer to novel Craftax worlds sharing task co-occurrence structure. These findings demonstrate that Multitask Preplay is a scalable theory of how humans counterfactually learn and generalize across multiple tasks; endowing artificial agents with the same capacity can significantly improve their performance in challenging multitask environments.
[LG-30] Gait-Based Hand Load Estimation via Deep Latent Variable Models with Auxiliary Information
链接: https://arxiv.org/abs/2507.05544
作者: Jingyi Gao,Sol Lim,Seokhyun Chung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning methods are increasingly applied to ergonomic risk assessment in manual material handling, particularly for estimating carried load from gait motion data collected from wearable sensors. However, existing approaches often rely on direct mappings from loaded gait to hand load, limiting generalization and predictive accuracy. In this study, we propose an enhanced load estimation framework that incorporates auxiliary information, including baseline gait patterns during unloaded walking and carrying style. While baseline gait can be automatically captured by wearable sensors and is thus readily available at inference time, carrying style typically requires manual labeling and is often unavailable during deployment. Our model integrates deep latent variable modeling with temporal convolutional networks and bi-directional cross-attention to capture gait dynamics and fuse loaded and unloaded gait patterns. Guided by domain knowledge, the model is designed to estimate load magnitude conditioned on carrying style, while eliminating the need for carrying style labels at inference time. Experiments using real-world data collected from inertial measurement units attached to participants demonstrate substantial accuracy gains from incorporating auxiliary information and highlight the importance of explicit fusion mechanisms over naive feature concatenation.
[LG-31] heoretical Learning Performance of Graph Neural Networks: The Impact of Jumping Connections and Layer-wise Sparsification
链接: https://arxiv.org/abs/2507.05533
作者: Jiawei Sun,Hongkang Li,Meng Wang
类目: Machine Learning (cs.LG)
*备注: TMLR
Abstract:Jumping connections enable Graph Convolutional Networks (GCNs) to overcome over-smoothing, while graph sparsification reduces computational demands by selecting a sub-matrix of the graph adjacency matrix during neighborhood aggregation. Learning GCNs with graph sparsification has shown empirical success across various applications, but a theoretical understanding of the generalization guarantees remains limited, with existing analyses ignoring either graph sparsification or jumping connections. This paper presents the first learning dynamics and generalization analysis of GCNs with jumping connections using graph sparsification. Our analysis demonstrates that the generalization accuracy of the learned model closely approximates the highest achievable accuracy within a broad class of target functions dependent on the proposed sparse effective adjacency matrix A^* . Thus, graph sparsification maintains generalization performance when A^* preserves the essential edges that support meaningful message propagation. We reveal that jumping connections lead to different sparsification requirements across layers. In a two-hidden-layer GCN, the generalization is more affected by the sparsified matrix deviations from A^* of the first layer than the second layer. To the best of our knowledge, this marks the first theoretical characterization of jumping connections’ role in sparsification requirements. We validate our theoretical results on benchmark datasets in deep GCNs.
[LG-32] Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search
链接: https://arxiv.org/abs/2507.05531
作者: Sanaz Kazemi Abharian,Sai Manoj Pudukotai Dinakarrao
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Graph Neural Networks (GNNs) have emerged as a powerful machine learning method for graph-structured data. A plethora of hardware accelerators has been introduced to meet the performance demands of GNNs in real-world applications. However, security challenges of hardware-based attacks have been generally overlooked. In this paper, we investigate the vulnerability of GNN models to hardware-based fault attack, wherein an attacker attempts to misclassify output by modifying trained weight parameters through fault injection in a memory device. Thus, we propose Gradual Bit-Flip Fault Attack (GBFA), a layer-aware bit-flip fault attack, selecting a vulnerable bit in each selected weight gradually to compromise the GNN’s performance by flipping a minimal number of bits. To achieve this, GBFA operates in two steps. First, a Markov model is created to predict the execution sequence of layers based on features extracted from memory access patterns, enabling the launch of the attack within a specific layer. Subsequently, GBFA identifies vulnerable bits within the selected weights using gradient ranking through an in-layer search. We evaluate the effectiveness of the proposed GBFA attack on various GNN models for node classification tasks using the Cora and PubMed datasets. Our findings show that GBFA significantly degrades prediction accuracy, and the variation in its impact across different layers highlights the importance of adopting a layer-aware attack strategy in GNNs. For example, GBFA degrades GraphSAGE’s prediction accuracy by 17% on the Cora dataset with only a single bit flip in the last layer.
[LG-33] Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning
链接: https://arxiv.org/abs/2507.05526
作者: Anish Dhir,Cristiana Diaconu,Valentinian Mihai Lungu,James Requeima,Richard E. Turner,Mark van der Wilk
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:In scientific domains – from biology to the social sciences – many questions boil down to \textitWhat effect will we observe if we intervene on a particular variable? If the causal relationships (e.g.~a causal graph) are known, it is possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference, that can be scaled to increasingly challenging settings in the future.
[LG-34] Deep Learning of Continuous and Structured Policies for Aggregated Heterogeneous Treatment Effects
链接: https://arxiv.org/abs/2507.05511
作者: Jennifer Y. Zhang,Shuyang Du,Will Y. Zou
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 10 pages
Abstract:As estimation of Heterogeneous Treatment Effect (HTE) is increasingly adopted across a wide range of scientific and industrial applications, the treatment action space can naturally expand, from a binary treatment variable to a structured treatment policy. This policy may include several policy factors such as a continuous treatment intensity variable, or discrete treatment assignments. From first principles, we derive the formulation for incorporating multiple treatment policy variables into the functional forms of individual and average treatment effects. Building on this, we develop a methodology to directly rank subjects using aggregated HTE functions. In particular, we construct a Neural-Augmented Naive Bayes layer within a deep learning framework to incorporate an arbitrary number of factors that satisfies the Naive Bayes assumption. The factored layer is then applied with continuous treatment variables, treatment assignment, and direct ranking of aggregated treatment effect functions. Together, these algorithms build towards a generic framework for deep learning of heterogeneous treatment policies, and we show their power to improve performance with public datasets.
[LG-35] Heterogeneous Causal Learning for Optimizing Aggregated Functions in User Growth
链接: https://arxiv.org/abs/2507.05510
作者: Shuyang Du,Jennifer Zhang,Will Y. Zou
类目: Machine Learning (cs.LG)
*备注: 11 pages. arXiv admin note: text overlap with arXiv:2004.09702
Abstract:User growth is a major strategy for consumer internet companies. To optimize costly marketing campaigns and maximize user engagement, we propose a novel treatment effect optimization methodology to enhance user growth marketing. By leveraging deep learning, our algorithm learns from past experiments to optimize user selection and reward allocation, maximizing campaign impact while minimizing costs. Unlike traditional prediction methods, our model directly models uplifts in key business metrics. Further, our deep learning model can jointly optimize parameters for an aggregated loss function using softmax gating. Our approach surpasses traditional methods by directly targeting desired business metrics and demonstrates superior algorithmic flexibility in handling complex business constraints. Comprehensive evaluations, including comparisons with state-of-the-art techniques such as R-learner and Causal Forest, validate the effectiveness of our model. We experimentally demonstrate that our proposed constrained and direct optimization algorithms significantly outperform state-of-the-art methods by over 20% , proving their cost-efficiency and real-world impact. The versatile methods can be applied to various product scenarios, including optimal treatment allocation. Its effectiveness has also been validated through successful worldwide production deployments.
[LG-36] Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning ICML2025
链接: https://arxiv.org/abs/2507.05508
作者: Ze’ev Zukerman,Bassel Hamoud,Kfir Y. Levy
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
Abstract:Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to popular compressors, like Top- k and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed deep learning tasks.
[LG-37] Dynamic Campus Origin-Destination Mobility Prediction using Graph Convolutional Neural Network on WiFi Logs
链接: https://arxiv.org/abs/2507.05507
作者: Godwin Badu-Marfo,Bilal Farooq
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present an integrated graph-based neural networks architecture for predicting campus buildings occupancy and inter-buildings movement at dynamic temporal resolution that learns traffic flow patterns from Wi-Fi logs combined with the usage schedules within the buildings. The relative traffic flows are directly estimated from the WiFi data without assuming the occupant behaviour or preferences while maintaining individual privacy. We formulate the problem as a data-driven graph structure represented by a set of nodes (representing buildings), connected through a route of edges or links using a novel Graph Convolution plus LSTM Neural Network (GCLSTM) which has shown remarkable success in modelling complex patterns. We describe the formulation, model estimation, interpretability and examine the relative performance of our proposed model. We also present an illustrative architecture of the models and apply on real-world WiFi logs collected at the Toronto Metropolitan University campus. The results of the experiments show that the integrated GCLSTM models significantly outperform traditional pedestrian flow estimators like the Multi Layer Perceptron (MLP) and Linear Regression.
[LG-38] Navigating Sparse Molecular Data with Stein Diffusion Guidance
链接: https://arxiv.org/abs/2507.05482
作者: Van Khoa Nguyen,Lionel Blondé,Alexandros Kalousis
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Stochastic optimal control (SOC) has recently emerged as a principled framework for fine-tuning diffusion models. However, its dependence on computationally intensive simulations makes it impractical for fast sampling. In parallel, a class of training-free approaches has been developed that guides diffusion models using off-the-shelf classifiers on predicted clean samples, bypassing the need to train classifiers on noisy data. These methods can be interpreted as approximate SOC schemes, using Tweedie’s formula to estimate diffusion posteriors. In practice, however, such direct approximations can introduce significant errors, leading to unreliable guidance. In this work, we unify the strengths of both paradigms by proposing a novel training-free diffusion guidance framework based on a surrogate stochastic optimal control objective. We derive a new theoretical bound on the value function that reveals the necessity of correcting the approximate posteriors to remain faithful to the true diffusion posterior. To this end, we connect the problem with Stein variational inference, which seeks the steepest descent direction that minimizes the Kullback-Leibler discrepancy between the two posteriors. Our method, which we refer to as Stein Diffusion Guidance (SDG), introduces a principled correction mechanism and incorporates a novel running cost functional to enable effective guidance in low-density regions. Experiments on challenging molecular generation tasks demonstrate that SDG significantly outperforms standard training-free guidance methods, highlighting its potential for broader applications.
[LG-39] Dynamic Regret Reduces to Kernelized Static Regret
链接: https://arxiv.org/abs/2507.05478
作者: Andrew Jacobsen,Alessandro Rudi,Francesco Orabona,Nicolo Cesa-Bianchi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages, 2 figures
Abstract:We study dynamic regret in online convex optimization, where the objective is to achieve low cumulative loss relative to an arbitrary benchmark sequence. By observing that competing with an arbitrary sequence of comparators u_1,\ldots,u_T in \mathcalW\subseteq\mathbbR^d is equivalent to competing with a fixed comparator function u:[1,T]\to \mathcalW , we frame dynamic regret minimization as a static regret problem in a function space. By carefully constructing a suitable function space in the form of a Reproducing Kernel Hilbert Space (RKHS), our reduction enables us to recover the optimal R_T(u_1,\ldots,u_T) = \mathcalO(\sqrt\sum_t|u_t-u_t-1|T) dynamic regret guarantee in the setting of linear losses, and yields new scale-free and directionally-adaptive dynamic regret guarantees. Moreover, unlike prior dynamic-to-static reductions – which are valid only for linear losses – our reduction holds for any sequence of losses, allowing us to recover \mathcalO\big(|u|^2+d_\mathrmeff(\lambda)\ln T\big) bounds in exp-concave and improper linear regression settings, where d_\mathrmeff(\lambda) is a measure of complexity of the RKHS. Despite working in an infinite-dimensional space, the resulting reduction leads to algorithms that are computable in practice, due to the reproducing property of RKHSs.
[LG-40] Adversarial Machine Learning Attacks on Financial Reporting via Maximum Violated Multi-Objective Attack KDD
链接: https://arxiv.org/abs/2507.05441
作者: Edward Raff,Karen Kukla,Michel Benaroch,Joseph Comprix
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: KDD Workshop on Machine Learning in Finance
Abstract:Bad actors, primarily distressed firms, have the incentive and desire to manipulate their financial reports to hide their distress and derive personal gains. As attackers, these firms are motivated by potentially millions of dollars and the availability of many publicly disclosed and used financial modeling frameworks. Existing attack methods do not work on this data due to anti-correlated objectives that must both be satisfied for the attacker to succeed. We introduce Maximum Violated Multi-Objective (MVMO) attacks that adapt the attacker’s search direction to find 20\times more satisfying attacks compared to standard attacks. The result is that in \approx50% of cases, a company could inflate their earnings by 100-200%, while simultaneously reducing their fraud scores by 15%. By working with lawyers and professional accountants, we ensure our threat model is realistic to how such frauds are performed in practice.
[LG-41] Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift
链接: https://arxiv.org/abs/2507.05412
作者: Gautam Sreekumar,Vishnu Naresh Boddeti
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We consider the problem of learning robust discriminative representations of causally-related latent variables. In addition to observational data, the training dataset also includes interventional data obtained through targeted interventions on some of these latent variables to learn representations robust against the resulting interventional distribution shifts. Existing approaches treat interventional data like observational data, even when the underlying causal model is known, and ignore the independence relations that arise from these interventions. Since these approaches do not fully exploit the causal relational information resulting from interventions, they learn representations that produce large disparities in predictive performance on observational and interventional data, which worsens when the number of interventional training samples is limited. In this paper, (1) we first identify a strong correlation between this performance disparity and adherence of the representations to the independence conditions induced by the interventional causal model. (2) For linear models, we derive sufficient conditions on the proportion of interventional data in the training dataset, for which enforcing interventional independence between representations corresponding to the intervened node and its non-descendants lowers the error on interventional data. Combining these insights, (3) we propose RepLIn, a training algorithm to explicitly enforce this statistical independence during interventions. We demonstrate the utility of RepLIn on a synthetic dataset and on real image and text datasets on facial attribute classification and toxicity detection, respectively. Our experiments show that RepLIn is scalable with the number of nodes in the causal graph and is suitable to improve the robust representations against interventional distribution shifts of both continuous and discrete latent variables.
[LG-42] AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
链接: https://arxiv.org/abs/2507.05411
作者: Mark Lee,Tom Gunter,Chang Lan,John Peebles,Hanzhi Zhou,Kelvin Zou,Sneha Bangalore,Chung-Cheng Chiu,Nan Du,Xianzhi Du,Philipp Dufter,Ruixuan Hou,Haoshuo Huang,Dongseong Hwang,Xiang Kong,Jinhao Lei,Tao Lei,Meng Li,Li Li,Jiarui Lu,Zhiyun Lu,Yiping Ma,David Qiu,Vivek Rathod,Senyu Tong,Zhucheng Tu,Jianyu Wang,Yongqiang Wang,Zirui Wang,Floris Weers,Sam Wiseman,Guoli Yin,Bowen Zhang,Xiyou Zhou,Danyang Zhuo,Cheng Leong,Ruoming Pang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn’s internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundred of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
[LG-43] Dataless Neural Networks for Resource-Constrained Project Scheduling
链接: https://arxiv.org/abs/2507.05322
作者: Marc Bara
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 9 pages, 1 figure. Introduces dataless neural networks for resource-constrained project scheduling
Abstract:Dataless neural networks represent a paradigm shift in applying neural architectures to combinatorial optimization problems, eliminating the need for training datasets by encoding problem instances directly into network parameters. Despite the pioneering work of Alkhouri et al. (2022) demonstrating the viability of dataless approaches for the Maximum Independent Set problem, our comprehensive literature review reveals that no published work has extended these methods to the Resource-Constrained Project Scheduling Problem (RCPSP). This paper addresses this gap by presenting the first dataless neural network approach for RCPSP, providing a complete mathematical framework that transforms discrete scheduling constraints into differentiable objectives suitable for gradient-based optimization. Our approach leverages smooth relaxations and automatic differentiation to unlock GPU parallelization for project scheduling, traditionally a domain of sequential algorithms. We detail the mathematical formulation for both precedence and renewable resource constraints, including a memory-efficient dense time-grid representation. Implementation and comprehensive experiments on PSPLIB benchmark instances (J30, J60, and J120) are currently underway, with empirical results to be reported in an updated version of this paper.
[LG-44] High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction
链接: https://arxiv.org/abs/2507.05308
作者: Zehuan Chen,Xiangwei Lai
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Predicting Quality of Service (QoS) data crucial for cloud service selection, where user privacy is a critical concern. Federated Graph Neural Networks (FGNNs) can perform QoS data prediction as well as maintaining user privacy. However, existing FGNN-based QoS predictors commonly implement on-device training on scattered explicit user-service graphs, thereby failing to utilize the implicit user-user interactions. To address this issue, this study proposes a high order collaboration-oriented federated graph neural network (HC-FGNN) to obtain accurate QoS prediction with privacy preservation. Concretely, it magnifies the explicit user-service graphs following the principle of attention mechanism to obtain the high order collaboration, which reflects the implicit user-user interactions. Moreover, it utilizes a lightweight-based message aggregation way to improve the computational efficiency. The extensive experiments on two QoS datasets from real application indicate that the proposed HC-FGNN possesses the advantages of high prediction accurate and privacy protection.
[LG-45] mporal Window Smoothing of Exogenous Variables for Improved Time Series Prediction IJCNN2025
链接: https://arxiv.org/abs/2507.05284
作者: Mustafa Kamal,Niyaz Bin Hashem,Robin Krambroeckers,Nabeel Mohammed,Shafin Rahman
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCNN 2025
Abstract:Although most transformer-based time series forecasting models primarily depend on endogenous inputs, recent state-of-the-art approaches have significantly improved performance by incorporating external information through exogenous inputs. However, these methods face challenges, such as redundancy when endogenous and exogenous inputs originate from the same source and limited ability to capture long-term dependencies due to fixed look-back windows. In this paper, we propose a method that whitens the exogenous input to reduce redundancy that may persist within the data based on global statistics. Additionally, our approach helps the exogenous input to be more aware of patterns and trends over extended periods. By introducing this refined, globally context-aware exogenous input to the endogenous input without increasing the lookback window length, our approach guides the model towards improved forecasting. Our approach achieves state-of-the-art performance in four benchmark datasets, consistently outperforming 11 baseline models. These results establish our method as a robust and effective alternative for using exogenous inputs in time series forecasting.
[LG-46] Bridging Prediction and Intervention Problems in Social Systems
链接: https://arxiv.org/abs/2507.05216
作者: Lydia T. Liu,Inioluwa Deborah Raji,Angela Zhou,Luke Guerdan,Jessica Hullman,Daniel Malinsky,Bryan Wilder,Simone Zhang,Hammaad Adam,Amanda Coston,Ben Laufer,Ezinne Nwankwo,Michael Zanger-Tishler,Eli Ben-Michael,Solon Barocas,Avi Feller,Marissa Gerchick,Talia Gillis,Shion Guha,Daniel Ho,Lily Hu,Kosuke Imai,Sayash Kapoor,Joshua Loftus,Razieh Nabi,Arvind Narayanan,Ben Recht,Juan Carlos Perdomo,Matthew Salganik,Mark Sendak,Alexander Tolbert,Berk Ustun,Suresh Venkatasubramanian,Angelina Wang,Ashia Wilson
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Many automated decision systems (ADS) are designed to solve prediction problems – where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
[LG-47] What ZTF Saw Where Rubin Looked: Anomaly Hunting in DR23
链接: https://arxiv.org/abs/2507.06217
作者: Maria V. Pruzhinskaya,Anastasia D. Lavrukhina,Timofey A. Semenikhi,Alina A. Volnova,Sreevarsha Sreejith,Vadim V. Krushinsky,Emmanuel Gangler,Emille E. O. Ishida,Matwey V. Kornilov,Konstantin L. Malanchev
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures
Abstract:We present results from the SNAD VIII Workshop, during which we conducted the first systematic anomaly search in the ZTF fields also observed by LSSTComCam during Rubin Scientific Pipeline commissioning. Using the PineForest active anomaly detection algorithm, we analysed four selected fields (two galactic and two extragalactic) and visually inspected 400 candidates. As a result, we discovered six previously uncatalogued variable stars, including RS~CVn, BY Draconis, ellipsoidal, and solar-type variables, and refined classifications and periods for six known objects. These results demonstrate the effectiveness of the SNAD anomaly detection pipeline and provide a preview of the discovery potential in the upcoming LSST data.
[LG-48] Estimating prevalence with precision and accuracy
链接: https://arxiv.org/abs/2507.06061
作者: Aime Bienfait Igiraneza,Christophe Fraser,Robert Hinch
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Unlike classification, whose goal is to estimate the class of each data point in a dataset, prevalence estimation or quantification is a task that aims to estimate the distribution of classes in a dataset. The two main tasks in prevalence estimation are to adjust for bias, due to the prevalence in the training dataset, and to quantify the uncertainty in the estimate. The standard methods used to quantify uncertainty in prevalence estimates are bootstrapping and Bayesian quantification methods. It is not clear which approach is ideal in terms of precision (i.e. the width of confidence intervals) and coverage (i.e. the confidence intervals being well-calibrated). Here, we propose Precise Quantifier (PQ), a Bayesian quantifier that is more precise than existing quantifiers and with well-calibrated coverage. We discuss the theory behind PQ and present experiments based on simulated and real-world datasets. Through these experiments, we establish the factors which influence quantification precision: the discriminatory power of the underlying classifier; the size of the labeled dataset used to train the quantifier; and the size of the unlabeled dataset for which prevalence is estimated. Our analysis provides deep insights into uncertainty quantification for quantification learning.
[LG-49] Kernel Trace Distance: Quantum Statistical Metric between Measures through RKHS Density Operators
链接: https://arxiv.org/abs/2507.06055
作者: Arturo Castellanos,Anna Korba,Pavlo Mozharovskyi,Hicham Janati
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Distances between probability distributions are a key component of many statistical machine learning tasks, from two-sample testing to generative modeling, among others. We introduce a novel distance between measures that compares them through a Schatten norm of their kernel covariance operators. We show that this new distance is an integral probability metric that can be framed between a Maximum Mean Discrepancy (MMD) and a Wasserstein distance. In particular, we show that it avoids some pitfalls of MMD, by being more discriminative and robust to the choice of hyperparameters. Moreover, it benefits from some compelling properties of kernel methods, that can avoid the curse of dimensionality for their sample complexity. We provide an algorithm to compute the distance in practice by introducing an extension of kernel matrix for difference of distributions that could be of independent interest. Those advantages are illustrated by robust approximate Bayesian computation under contamination as well as particle flow simulations.
[LG-50] Minimal Deterministic Echo State Networks Outperform Random Reservoirs in Learning Chaotic Dynamics
链接: https://arxiv.org/abs/2507.06050
作者: Francesco Martinuzzi
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) is widely used to model chaotic systems. Among ML approaches, echo state networks (ESNs) have received considerable attention due to their simple construction and fast training. However, ESN performance is highly sensitive to hyperparameter choices and to its random initialization. In this work, we demonstrate that ESNs constructed using deterministic rules and simple topologies (MESNs) outperform standard ESNs in the task of chaotic attractor reconstruction. We use a dataset of more than 90 chaotic systems to benchmark 10 different minimal deterministic reservoir initializations. We find that MESNs obtain up to a 41% reduction in error compared to standard ESNs. Furthermore, we show that the MESNs are more robust, exhibiting less inter-run variation, and have the ability to reuse hyperparameters across different systems. Our results illustrate how structured simplicity in ESN design can outperform stochastic complexity in learning chaotic dynamics.
[LG-51] Instance-Optimal Quantum State Certification with Entangled Measurements
链接: https://arxiv.org/abs/2507.06010
作者: Ryan O’Donnell,Chirag Wadhwa
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 27 pages
Abstract:We consider the task of quantum state certification: given a description of a hypothesis state \sigma and multiple copies of an unknown state \rho , a tester aims to determine whether the two states are equal or \epsilon -far in trace distance. It is known that \Theta(d/\epsilon^2) copies of \rho are necessary and sufficient for this task, assuming the tester can make entangled measurements over all copies [CHW07,OW15,BOW19]. However, these bounds are for a worst-case \sigma , and it is not known what the optimal copy complexity is for this problem on an instance-by-instance basis. While such instance-optimal bounds have previously been shown for quantum state certification when the tester is limited to measurements unentangled across copies [CLO22,CLHL22], they remained open when testers are unrestricted in the kind of measurements they can perform. We address this open question by proving nearly instance-optimal bounds for quantum state certification when the tester can perform fully entangled measurements. Analogously to the unentangled setting, we show that the optimal copy complexity for certifying \sigma is given by the worst-case complexity times the fidelity between \sigma and the maximally mixed state. We prove our lower bounds using a novel quantum analogue of the Ingster-Suslina method, which is likely to be of independent interest. This method also allows us to recover the \Omega(d/\epsilon^2) lower bound for mixedness testing [OW15], i.e., certification of the maximally mixed state, with a surprisingly simple proof. Comments: 27 pages Subjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2507.06010 [quant-ph] (or arXiv:2507.06010v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2507.06010 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-52] Beating the Best Constant Rebalancing Portfolio in Long-Term Investment: A Generalization of the Kelly Criterion and Universal Learning Algorithm for Markets with Serial Dependence
链接: https://arxiv.org/abs/2507.05994
作者: Duy Khanh Lam
类目: Portfolio Management (q-fin.PM); Information Theory (cs.IT); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 19 pages, 7 figures. Working paper (1st full draft); typos may exist
Abstract:In the online portfolio optimization framework, existing learning algorithms generate strategies that yield significantly poorer cumulative wealth compared to the best constant rebalancing portfolio in hindsight, despite being consistent in asymptotic growth rate. While this unappealing performance can be improved by incorporating more side information, it raises difficulties in feature selection and high-dimensional settings. Instead, the inherent serial dependence of assets’ returns, such as day-of-the-week and other calendar effects, can be leveraged. Although latent serial dependence patterns are commonly detected using large training datasets, this paper proposes an algorithm that learns such dependence using only gradually revealed data, without any assumption on their distribution, to form a strategy that eventually exceeds the cumulative wealth of the best constant rebalancing portfolio. Moreover, the classical Kelly criterion, which requires independent assets’ returns, is generalized to accommodate serial dependence in a market modeled as an independent and identically distributed process of random matrices. In such a stochastic market, where existing learning algorithms designed for stationary processes fail to apply, the proposed learning algorithm still generates a strategy that asymptotically grows to the highest rate among all strategies, matching that of the optimal strategy constructed under the generalized Kelly criterion. The experimental results with real market data demonstrate the theoretical guarantees of the algorithm and its performance as expected, as long as serial dependence is significant, regardless of the validity of the generalized Kelly criterion in the experimental market. This further affirms the broad applicability of the algorithm in general contexts. Comments: 19 pages, 7 figures. Working paper (1st full draft); typos may exist Subjects: Portfolio Management (q-fin.PM); Information Theory (cs.IT); Machine Learning (cs.LG); Computational Finance (q-fin.CP) Cite as: arXiv:2507.05994 [q-fin.PM] (or arXiv:2507.05994v1 [q-fin.PM] for this version) https://doi.org/10.48550/arXiv.2507.05994 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-53] Online Regularized Learning Algorithms in RKHS with β- and ϕ-Mixing Sequences
链接: https://arxiv.org/abs/2507.05929
作者: Priyanka Roy,Susanne Saminger-Platz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: arXiv admin note: substantial text overlap with arXiv:2502.03551
Abstract:In this paper, we study an online regularized learning algorithm in a reproducing kernel Hilbert spaces (RKHS) based on a class of dependent processes. We choose such a process where the degree of dependence is measured by mixing coefficients. As a representative example, we analyze a strictly stationary Markov chain, where the dependence structure is characterized by the (\phi)- and (\beta)-mixing coefficients. Under these assumptions, we derive probabilistic upper bounds as well as convergence rates for both the exponential and polynomial decay of the mixing coefficients.
[LG-54] Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis ICML
链接: https://arxiv.org/abs/2507.05913
作者: Gholamali Aminian,Idan Shenfeld,Amir R. Asadi,Ahmad Beirami,Youssef Mroueh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Workshop on Efficient Systems for Foundation Models at iCML
Abstract:A simple yet effective method for inference-time alignment of generative models is Best-of- N (BoN), where N outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy reward is low.
[LG-55] Property Elicitation on Imprecise Probabilities
链接: https://arxiv.org/abs/2507.05857
作者: James Bailie,Rabanus Derr
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with \Gamma -maximin risk minimization over an IP. We provide necessary conditions for elicitability of a IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs – the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution.
[LG-56] Just Say Better or Worse: A Human-AI Collaborative Framework for Medical Image Segmentation Without Manual Annotations
链接: https://arxiv.org/abs/2507.05815
作者: Yizhe Zhang
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures
Abstract:Manual annotation of medical images is a labor-intensive and time-consuming process, posing a significant bottleneck in the development and deployment of robust medical imaging AI systems. This paper introduces a novel Human-AI collaborative framework for medical image segmentation that substantially reduces the annotation burden by eliminating the need for explicit manual pixel-level labeling. The core innovation lies in a preference learning paradigm, where human experts provide minimal, intuitive feedback – simply indicating whether an AI-generated segmentation is better or worse than a previous version. The framework comprises four key components: (1) an adaptable foundation model (FM) for feature extraction, (2) label propagation based on feature similarity, (3) a clicking agent that learns from human better-or-worse feedback to decide where to click and with which label, and (4) a multi-round segmentation learning procedure that trains a state-of-the-art segmentation network using pseudo-labels generated by the clicking agent and FM-based label propagation. Experiments on three public datasets demonstrate that the proposed approach achieves competitive segmentation performance using only binary preference feedback, without requiring experts to directly manually annotate the images.
[LG-57] PSAT: Pediatric Segmentation Approaches via Adult Augmentations and Transfer Learning
链接: https://arxiv.org/abs/2507.05764
作者: Tristan Kirscher(ICube, ICANS),Sylvain Faisan(ICube),Xavier Coubez(ICANS),Loris Barrier(ICANS),Philippe Meyer(ICube, ICANS)
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
Abstract:Pediatric medical imaging presents unique challenges due to significant anatomical and developmental differences compared to adults. Direct application of segmentation models trained on adult data often yields suboptimal performance, particularly for small or rapidly evolving structures. To address these challenges, several strategies leveraging the nnU-Net framework have been proposed, differing along four key axes: (i) the fingerprint dataset (adult, pediatric, or a combination thereof) from which the Training Plan -including the network architecture-is derived; (ii) the Learning Set (adult, pediatric, or mixed), (iii) Data Augmentation parameters, and (iv) the Transfer learning method (finetuning versus continual learning). In this work, we introduce PSAT (Pediatric Segmentation Approaches via Adult Augmentations and Transfer learning), a systematic study that investigates the impact of these axes on segmentation performance. We benchmark the derived strategies on two pediatric CT datasets and compare them with state-of-theart methods, including a commercial radiotherapy solution. PSAT highlights key pitfalls and provides actionable insights for improving pediatric segmentation. Our experiments reveal that a training plan based on an adult fingerprint dataset is misaligned with pediatric anatomy-resulting in significant performance degradation, especially when segmenting fine structures-and that continual learning strategies mitigate institutional shifts, thus enhancing generalization across diverse pediatric datasets. The code is available at this https URL.
[LG-58] HRRRCast: a data-driven emulator for regional weather forecasting at convection allowing scales
链接: https://arxiv.org/abs/2507.05658
作者: Daniel Abdi,Isidora Jankov,Paul Madden,Vanderlei Vargas,Timothy A. Smith,Sergey Frolov,Montgomery Flora,Corey Potvin
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:The High-Resolution Rapid Refresh (HRRR) model is a convection-allowing model used in operational weather forecasting across the contiguous United States (CONUS). To provide a computationally efficient alternative, we introduce HRRRCast, a data-driven emulator built with advanced machine learning techniques. HRRRCast includes two architectures: a ResNet-based model (ResHRRR) and a Graph Neural Network-based model (GraphHRRR). ResHRRR uses convolutional neural networks enhanced with squeeze-and-excitation blocks and Feature-wise Linear Modulation, and supports probabilistic forecasting via the Denoising Diffusion Implicit Model (DDIM). To better handle longer lead times, we train a single model to predict multiple lead times (1h, 3h, and 6h), then use a greedy rollout strategy during inference. When evaluated on composite reflectivity over the full CONUS domain using ensembles of 3 to 10 members, ResHRRR outperforms HRRR forecast at light rainfall threshold (20 dBZ) and achieves competitive performance at moderate thresholds (30 dBZ). Our work advances the StormCast model of Pathak et al. [21] by: a) training on the full CONUS domain, b) using multiple lead times to improve long-range skill, c) training on analysis data instead of the +1h post-analysis data inadvertently used in StormCast, and d) incorporating future GFS states as inputs, enabling downscaling that improves long-lead accuracy. Grid-, neighborhood-, and object-based metrics confirm better storm placement, lower frequency bias, and higher success ratios than HRRR. HRRRCast ensemble forecasts also maintain sharper spatial detail, with power spectra more closely matching HRRR analysis. While GraphHRRR underperforms in its current form, it lays groundwork for future graph-based forecasting. HRRRCast represents a step toward efficient, data-driven regional weather prediction with competitive accuracy and ensemble capability.
[LG-59] Learnable quantum spectral filters for hybrid graph neural networks
链接: https://arxiv.org/abs/2507.05640
作者: Ammar Daskin
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: The simulation code and results used for this paper is publicly available at: this https URL
Abstract:In this paper, we describe a parameterized quantum circuit that can be considered as convolutional and pooling layers for graph neural networks. The circuit incorporates the parameterized quantum Fourier circuit where the qubit connections for the controlled gates derived from the Laplacian operator. Specifically, we show that the eigenspace of the Laplacian operator of a graph can be approximated by using QFT based circuit whose connections are determined from the adjacency matrix. For an N\times N Laplacian, this approach yields an approximate polynomial-depth circuit requiring only n=log(N) qubits. These types of circuits can eliminate the expensive classical computations for approximating the learnable functions of the Laplacian through Chebyshev polynomial or Taylor expansions. Using this circuit as a convolutional layer provides an n- dimensional probability vector that can be considered as the filtered and compressed graph signal. Therefore, the circuit along with the measurement can be considered a very efficient convolution plus pooling layer that transforms an N -dimensional signal input into n- dimensional signal with an exponential compression. We then apply a classical neural network prediction head to the output of the circuit to construct a complete graph neural network. Since the circuit incorporates geometric structure through its graph connection-based approach, we present graph classification results for the benchmark datasets listed in TUDataset library. Using only [1-100] learnable parameters for the quantum circuit and minimal classical layers (1000-5000 parameters) in a generic setting, the obtained results are comparable to and in some cases better than many of the baseline results, particularly for the cases when geometric structure plays a significant role. Comments: The simulation code and results used for this paper is publicly available at: this https URL Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG) Cite as: arXiv:2507.05640 [quant-ph] (or arXiv:2507.05640v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2507.05640 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-60] On the Inherent Privacy of Zeroth Order Projected Gradient Descent AISTATS’25
链接: https://arxiv.org/abs/2507.05610
作者: Devansh Gupta,Meisam Razaviyayn,Vatsal Sharan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AISTATS’25
Abstract:Differentially private zeroth-order optimization methods have recently gained popularity in private fine tuning of machine learning models due to their reduced memory requirements. Current approaches for privatizing zeroth-order methods rely on adding Gaussian noise to the estimated zeroth-order gradients. However, since the search direction in the zeroth-order methods is inherently random, researchers including Tang et al. (2024) and Zhang et al. (2024a) have raised an important question: is the inherent noise in zeroth-order estimators sufficient to ensure the overall differential privacy of the algorithm? This work settles this question for a class of oracle-based optimization algorithms where the oracle returns zeroth-order gradient estimates. In particular, we show that for a fixed initialization, there exist strongly convex objective functions such that running (Projected) Zeroth-Order Gradient Descent (ZO-GD) is not differentially private. Furthermore, we show that even with random initialization and without revealing (initial and) intermediate iterates, the privacy loss in ZO-GD can grow superlinearly with the number of iterations when minimizing convex objective functions.
[LG-61] Exact and efficient basis pursuit denoising via differential inclusions and a selection principle
链接: https://arxiv.org/abs/2507.05562
作者: Gabriel P. Langlois,Jérôme Darbon
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 50 pages, 2 figures, submitted
Abstract:Basis pursuit denoising (BPDN) is a cornerstone of compressive sensing, statistics and machine learning. While various algorithms for BPDN have been proposed, they invariably suffer from drawbacks and must either favor efficiency at the expense of accuracy or vice versa. As such, state-of-the-art algorithms remain ineffective for high-dimensional applications that require accurate solutions within a reasonable amount of computational time. In this work, we address this issue and propose an exact and efficient algorithm for BPDN using differential inclusions. Specifically, we prove that a selection principle from the theory of differential inclusions turns the dual problem of BPDN into calculating the trajectory of an \emphintegrable projected dynamical system, that is, whose trajectory and asymptotic limit can be computed exactly. Our analysis naturally yields an exact algorithm, numerically up to machine precision, that is amenable to computing regularization paths and very fast. Numerical experiments confirm that our algorithm outperforms the state-of-the-art algorithms in both accuracy and efficiency. Moreover, we show that the global continuation of solutions (in terms of the hyperparameter and data) of the projected dynamical system yields a rigorous homotopy algorithm for BPDN, as well as a novel greedy algorithm for computing feasible solutions to basis pursuit in strongly polynomial time. Beyond this work, we expect that our results and analysis can be adapted to compute exact or approximate solutions to a broader class of polyhedral-constrained optimization problems.
[LG-62] A Malliavin calculus approach to score functions in diffusion generative models
链接: https://arxiv.org/abs/2507.05550
作者: Ehsan Mirafzali,Frank Proske,Utkarsh Gupta,Daniele Venturi,Razvan Marinescu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Score-based diffusion generative models have recently emerged as a powerful tool for modelling complex data distributions. These models aim at learning the score function, which defines a map from a known probability distribution to the target data distribution via deterministic or stochastic differential equations (SDEs). The score function is typically estimated from data using a variety of approximation techniques, such as denoising or sliced score matching, Hyvärien’s method, or Schrödinger bridges. In this paper, we derive an exact, closed form, expression for the score function for a broad class of nonlinear diffusion generative models. Our approach combines modern stochastic analysis tools such as Malliavin derivatives and their adjoint operators (Skorokhod integrals or Malliavin Divergence) with a new Bismut-type formula. The resulting expression for the score function can be written entirely in terms of the first and second variation processes, with all Malliavin derivatives systematically eliminated, thereby enhancing its practical applicability. The theoretical framework presented in this work offers a principled foundation for advancing score estimation methods in generative modelling, enabling the design of new sampling algorithms for complex probability distributions. Our results can be extended to broader classes of stochastic differential equations, opening new directions for the development of score-based diffusion generative models.
[LG-63] Special-Unitary Parameterization for Trainable Variational Quantum Circuits
链接: https://arxiv.org/abs/2507.05535
作者: Kuan-Cheng Chen,Huan-Hsin Tseng,Samuel Yen-Chi Chen,Chen-Yu Liu,Kin K. Leung
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:We propose SUN-VQC, a variational-circuit architecture whose elementary layers are single exponentials of a symmetry-restricted Lie subgroup, \mathrmSU(2^k) \subset \mathrmSU(2^n) with k \ll n . Confining the evolution to this compact subspace reduces the dynamical Lie-algebra dimension from \mathcalO(4^n) to \mathcalO(4^k) , ensuring only polynomial suppression of gradient variance and circumventing barren plateaus that plague hardware-efficient ansätze. Exact, hardware-compatible gradients are obtained using a generalized parameter-shift rule, avoiding ancillary qubits and finite-difference bias. Numerical experiments on quantum auto-encoding and classification show that SUN-VQCs sustain order-of-magnitude larger gradient signals, converge 2–3 \times faster, and reach higher final fidelities than depth-matched Pauli-rotation or hardware-efficient circuits. These results demonstrate that Lie-subalgebra engineering provides a principled, scalable route to barren-plateau-resilient VQAs compatible with near-term quantum processors.
[LG-64] Predicting mutational effects on protein binding from folding energy
链接: https://arxiv.org/abs/2507.05502
作者: Arthur Deng,Karsten Householder,Fang Wu,Sebastian Thrun,K. Christopher Garcia,Brian Trippe
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Code: this https URL
Abstract:Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method FoldX, while offering an over 1,000x speed-up.
[LG-65] mporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting
链接: https://arxiv.org/abs/2507.05470
作者: Agnideep Aich,Ashit Baran Aich,Dipak C. Jain
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose Temporal Conformal Prediction (TCP), a novel framework for constructing prediction intervals in financial time-series with guaranteed finite-sample validity. TCP integrates quantile regression with a conformal calibration layer that adapts online via a decaying learning rate. This hybrid design bridges statistical and machine learning paradigms, enabling TCP to accommodate non-stationarity, volatility clustering, and regime shifts which are hallmarks of real-world asset returns, without relying on rigid parametric assumptions. We benchmark TCP against established methods including GARCH, Historical Simulation, and static Quantile Regression across equities (SP 500), cryptocurrency (Bitcoin), and commodities (Gold). Empirical results show that TCP consistently delivers sharper intervals with competitive or superior coverage, particularly in high-volatility regimes. Our study underscores TCP’s strength in navigating the coverage-sharpness tradeoff, a central challenge in modern risk forecasting. Overall, TCP offers a distribution-free, adaptive, and interpretable alternative for financial uncertainty quantification, advancing the interface between statistical inference and machine learning in finance.
[LG-66] he Neural Networks with Tensor Weights and the Corresponding Fermionic Quantum Field Theory
链接: https://arxiv.org/abs/2507.05303
作者: Guojun Huang,Kai Zhou
类目: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注:
Abstract:In this paper, we establish a theoretical connection between complex-valued neural networks (CVNNs) and fermionic quantum field theory (QFT), bridging a fundamental gap in the emerging framework of neural network quantum field theory (NN-QFT). While prior NN-QFT works have linked real-valued architectures to bosonic fields, we demonstrate that CVNNs equipped with tensor-valued weights intrinsically generate fermionic quantum fields. By promoting hidden-to-output weights to Clifford algebra-valued tensors, we induce anticommutation relations essential for fermionic statistics. Through analytical study of the generating functional, we obtain the exact quantum state in the infinite-width limit, revealing that the parameters between the input layer and the last hidden layer correspond to the eigenvalues of the quantum system, and the tensor weighting parameters in the hidden-to-output layer map to dynamical fermionic fields. The continuum limit reproduces free fermion correlators, with diagrammatic expansions confirming anticommutation. The work provides the first explicit mapping from neural architectures to fermionic QFT at the level of correlation functions and generating functional. It extends NN-QFT beyond bosonic theories and opens avenues for encoding fermionic symmetries into machine learning models, with potential applications in quantum simulation and lattice field theory.
[LG-67] BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
链接: https://arxiv.org/abs/2507.05265
作者: Hongyang Li,Sanjoy Dey,Bum Chul Kwon,Michael Danziger,Michal Rosen-Tzvi,Jianying Hu,James Kozloski,Ching-Huei Tsou,Bharath Dandala,Pablo Meyer
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as “words” that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT, GENA-LM have achieved high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM), namely, BMFM-DNA-REF in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome and BMFM-DNA-SNP in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions as seen in improvements on all fine-tuning tasks. To explore the model’s practical utility, we experimented with various strategies for SNP imputation on promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at this https URL
信息检索
[IR-0] Unconditional Diffusion for Generative Sequential Recommendation
链接: https://arxiv.org/abs/2507.06121
作者: Yimeng Bai,Yang Zhang,Sihao Ding,Shaohui Ruan,Han Yao,Danhui Guan,Fuli Feng,Tat-Seng Chua
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Diffusion models, known for their generative ability to simulate data creation through noise-adding and denoising processes, have emerged as a promising approach for building generative recommenders. To incorporate user history for personalization, existing methods typically adopt a conditional diffusion framework, where the reverse denoising process of reconstructing items from noise is modified to be conditioned on the user history. However, this design may fail to fully utilize historical information, as it gets distracted by the need to model the “item \leftrightarrow noise” translation. This motivates us to reformulate the diffusion process for sequential recommendation in an unconditional manner, treating user history (instead of noise) as the endpoint of the forward diffusion process (i.e., the starting point of the reverse process), rather than as a conditional input. This formulation allows for exclusive focus on modeling the “item \leftrightarrow history” translation. To this end, we introduce Brownian Bridge Diffusion Recommendation (BBDRec). By leveraging a Brownian bridge process, BBDRec enforces a structured noise addition and denoising mechanism, ensuring that the trajectories are constrained towards a specific endpoint – user history, rather than noise. Extensive experiments demonstrate BBDRec’s effectiveness in enhancing sequential recommendation performance. The source code is available at this https URL.
[IR-1] Hierarchical Interaction Summarization and Contrastive Prompting for Explainable Recommendations
链接: https://arxiv.org/abs/2507.06044
作者: Yibin Liu,Ang Li,Shijian Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Explainable recommendations, which use the information of user and item with interaction to generate a explanation for why the user would interact with the item, are crucial for improving user trust and decision transparency to the recommender system. Existing methods primarily rely on encoding features of users and items to embeddings, which often leads to information loss due to dimensionality reduction, sparse interactions, and so on. With the advancements of large language models (LLMs) in language comprehension, some methods use embeddings as LLM inputs for explanation generation. However, since embeddings lack inherent semantics, LLMs must adjust or extend their parameters to interpret them, a process that inevitably incurs information loss. To address this issue, we propose a novel approach combining profile generation via hierarchical interaction summarization (PGHIS), which leverages a pretrained LLM to hierarchically summarize user-item interactions, generating structured textual profiles as explicit representations of user and item characteristics. Additionally, we propose contrastive prompting for explanation generation (CPEG) which employs contrastive learning to guide another reasoning language models in producing high-quality ground truth recommendation explanations. Finally, we use the textual profiles of user and item as input and high-quality explanation as output to fine-tune a LLM for generating explanations. Experimental results on multiple datasets demonstrate that our approach outperforms existing state-of-the-art methods, achieving a great improvement on metrics about explainability (e.g., 5% on GPTScore) and text quality. Furthermore, our generated ground truth explanations achieve a significantly higher win rate compared to user-written reviews and those produced by other methods, demonstrating the effectiveness of CPEG in generating high-quality ground truths.
[IR-2] RecRankerEval: A Flexible and Extensible Framework for Top-k LLM -based Recommendation
链接: https://arxiv.org/abs/2507.05880
作者: Zeyuan Meng,Zixuan Yi,Iadh Ounis
类目: Information Retrieval (cs.IR)
*备注:
Abstract:A recent Large language model (LLM)-based recommendation model, called RecRanker, has demonstrated a superior performance in the top-k recommendation task compared to other models. In particular, RecRanker samples users via clustering, generates an initial ranking list using an initial recommendation model, and fine-tunes an LLM through hybrid instruction tuning to infer user preferences. However, the contribution of each core component remains underexplored. In this work, we inspect the reproducibility of RecRanker, and study the impact and role of its various components. We begin by reproducing the RecRanker pipeline through the implementation of all its key components. Our reproduction shows that the pairwise and listwise methods achieve a performance comparable to that reported in the original paper. For the pointwise method, while we are also able to reproduce the original paper’s results, further analysis shows that the performance is abnormally high due to data leakage from the inclusion of ground-truth information in the prompts. To enable a fair and comprehensive evaluation of LLM-based top-k recommendations, we propose RecRankerEval, an extensible framework that covers five key dimensions: user sampling strategy, initial recommendation model, LLM backbone, dataset selection, and instruction tuning method. Using the RecRankerEval framework, we show that the original results of RecRanker can be reproduced on the ML-100K and ML-1M datasets, as well as the additional Amazon-Music dataset, but not on BookCrossing due to the lack of timestamp information in the original RecRanker paper. Furthermore, we demonstrate that RecRanker’s performance can be improved by employing alternative user sampling methods, stronger initial recommenders, and more capable LLMs.
[IR-3] On the Costs and Benefits of Learned Indexing for Dynamic High-Dimensional Data: Extended Version
链接: https://arxiv.org/abs/2507.05865
作者: Terézia Slanináková,Jaroslav Olha,David Procházka,Matej Antol,Vlastislav Dohnal
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注:
Abstract:One of the main challenges within the growing research area of learned indexing is the lack of adaptability to dynamically expanding datasets. This paper explores the dynamization of a static learned index for complex data through operations such as node splitting and broadening, enabling efficient adaptation to new data. Furthermore, we evaluate the trade-offs between static and dynamic approaches by introducing an amortized cost model to assess query performance in tandem with the build costs of the index structure, enabling experimental determination of when a dynamic learned index outperforms its static counterpart. We apply the dynamization method to a static learned index and demonstrate that its superior scaling quickly surpasses the static implementation in terms of overall costs as the database grows. This is an extended version of the paper presented at DAWAK 2025.
[IR-4] KERAG _R: Knowledge-Enhanced Retrieval-Augmented Generation for Recommendation
链接: https://arxiv.org/abs/2507.05863
作者: Zeyuan Meng,Zixuan Yi,Iadh Ounis
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Large Language Models (LLMs) have shown strong potential in recommender systems due to their contextual learning and generalisation capabilities. Existing LLM-based recommendation approaches typically formulate the recommendation task using specialised prompts designed to leverage their contextual abilities, and aligning their outputs closely with human preferences to yield an improved recommendation performance. However, the use of LLMs for recommendation tasks is limited by the absence of domain-specific knowledge. This lack of relevant relational knowledge about the items to be recommended in the LLM’s pre-training corpus can lead to inaccuracies or hallucinations, resulting in incorrect or misleading recommendations. Moreover, directly using information from the knowledge graph introduces redundant and noisy information, which can affect the LLM’s reasoning process or exceed its input context length, thereby reducing the performance of LLM-based recommendations. To address the lack of domain-specific knowledge, we propose a novel model called Knowledge-Enhanced Retrieval-Augmented Generation for Recommendation (KERAG_R). Specifically, we leverage a graph retrieval-augmented generation (GraphRAG) component to integrate additional information from a knowledge graph (KG) into instructions, enabling the LLM to collaboratively exploit recommendation signals from both text-based user interactions and the knowledge graph to better estimate the users’ preferences in a recommendation context. In particular, we perform graph RAG by pre-training a graph attention network (GAT) to select the most relevant triple for the target users for the used LLM, thereby enhancing the LLM while reducing redundant and noisy information. Our extensive experiments on three public datasets show that our proposed KERAG_R model significantly outperforms ten existing state-of-the-art recommendation methods.
[IR-5] Vers un cadre ontologique pour la gestion des compétences : à des fins de formation de recrutement de métier ou de recherches associées
链接: https://arxiv.org/abs/2507.05767
作者: Ngoc Luyen Le(Heudiasyc),Marie-Hélène Abel(Heudiasyc),Bertrand Laforge(LPNHE)
类目: Information Retrieval (cs.IR)
*备注: in French language. 36es Journ{é}es francophones d’Ing{é}nierie des Connaissances (IC 2025) @ Plate-Forme Intelligence Artificielle (PFIA 2025), Jul 2025, Dijon, France
Abstract:The rapid transformation of the labor market, driven by technological advancements and the digital economy, requires continuous competence development and constant adaptation. In this context, traditional competence management systems lack interoperability, adaptability, and semantic understanding, making it difficult to align individual competencies with labor market needs and training programs. This paper proposes an ontology-based framework for competence management, enabling a structured representation of competencies, occupations, and training programs. By leveraging ontological models and semantic reasoning, this framework aims to enhance the automation of competence-to-job matching, the personalization of learning recommendations, and career planning. This study discusses the design, implementation, and potential applications of the framework, focusing on competence training programs, job searching, and finding competent individuals.
[IR-6] From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation ACM-MM’25
链接: https://arxiv.org/abs/2507.05715
作者: Guohao Li,Li Jing,Jia Wu,Xuefei Li,Kai Zhu,Yue He
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: ACM MM’25 (Experimental supplementary version)
Abstract:Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recommendation. Therefore, this paper systematically deconstruct the pros and cons of ID features: (i) they provide initial embedding but lack semantic richness, (ii) they provide a unique identifier for each user and item but hinder generalization to untrained data, and (iii) they assist in aligning and fusing multimodal features but may lead to representation shift. Based on these insights, this paper proposes IDFREE, an ID-free multimodal collaborative Filtering REcommEndation baseline. IDFREE replaces ID features with multimodal features and positional encodings to generate semantically meaningful ID-free embeddings. For ID-free multimodal collaborative filtering, it further proposes an adaptive similarity graph module to construct dynamic user-user and item-item graphs based on multimodal features. Then, an augmented user-item graph encoder is proposed to construct more effective user and item encoding. Finally, IDFREE achieves inter-multimodal alignment based on the contrastive learning and uses Softmax loss as recommendation loss. Basic experiments on three public datasets demonstrate that IDFREE outperforms existing ID-based MCFRec methods, achieving an average performance gain of 72.24% across standard metrics (Recall@5, 10, 20, 50 and NDCG@5, 10, 20, 50). Exploratory and extended experiments further validate our findings on the limitations of ID features in MCFRec. The code is released at this https URL.
[IR-7] Information Needs and Practices Supported by ChatGPT
链接: https://arxiv.org/abs/2507.05537
作者: Tim Gorichanaz
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: To be presented at the 2025 ASIST virtual satellite meeting, December 2025
Abstract:This study considers ChatGPT as an information source, investigating the information needs that people come to ChatGPT with and the information practices that ChatGPT supports, through a qualitative content analysis of 205 user vignettes. The findings show that ChatGPT is used in a range of life domains (home/family, work, leisure, etc.) and for a range of human needs (writing/editing, learning, simple programming tasks, etc.), constituting the information needs that people use ChatGPT to address. Related to these information needs, the findings show six categories of information practices that ChatGPT supports: Writing, Deciding, Identifying, Ideating, Talking, and Critiquing. This work suggests that, in the AI age, information need should be conceptualized not just as a matter of “getting questions answered” or even “making sense,” but as skillfully coping in the world, a notion that includes both understanding and action. This study leads to numerous opportunities for future work at the junction of generative AI and information needs, seeking, use and experience.