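Arxiv Daily Papers | 2025-07-18

This post lists the latest papers retrieved from Arxiv.org on 2025-07-18. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.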

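[Quick Read]: This paper addresses the redundancy of visual tokens in vision-language models (VLMs) on practical tasks: most general visual question answering (VQA) tasks do not need the large number of visual tokens that high-resolution images produce. The key to its solution is VisionThink, a new paradigm for visual token compression that processes different samples at different resolutions and decides whether a higher-resolution image is needed, so that compression is applied on demand. This makes it more flexible and targeted than existing efficient VLM methods that rely on fixed pruning ratios or thresholds.

Link: https://arxiv.org/abs/2507.13348
Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Institutions: CUHK (The Chinese University of Hong Kong); HKU (The University of Hong Kong); HKUST (The Hong Kong University of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code and models are available at this https URL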

Abstract:Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at this https URL.
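
The request-on-demand loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token string `<request_high_res>`, the `Image` type, the `toy_model` policy, and the choice to halve each side (leaving 1/4 of the pixels) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical marker; the abstract only says the model emits "a special token".
REQUEST_HIGH_RES = "<request_high_res>"


@dataclass
class Image:
    width: int
    height: int


def downsample(image: Image, factor: int = 2) -> Image:
    """Shrink each side by `factor`; factor=2 leaves 1/4 of the pixels."""
    return Image(max(1, image.width // factor), max(1, image.height // factor))


def answer_with_on_demand_resolution(
    model: Callable[[Image, str], str], image: Image, question: str
) -> Tuple[str, bool]:
    """Try the low-resolution image first; use the full image only if asked."""
    reply = model(downsample(image), question)
    if REQUEST_HIGH_RES in reply:
        return model(image, question), True
    return reply, False


def toy_model(image: Image, question: str) -> str:
    # Stand-in policy (assumption): text-reading questions need >= 1000 px width.
    if "text" in question and image.width < 1000:
        return REQUEST_HIGH_RES
    return f"answer at {image.width}x{image.height}"
```

The boolean in the return value records whether the high-resolution image was requested, which is the quantity a resize-call ratio would be computed from.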

[NLP-1] Comparing Apples to Oranges: A Dataset Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

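[Quick Read]: This paper investigates whether the ability of Large Language Models (LLMs) to explain humour depends on the form of humour, and in particular whether models can produce accurate and comprehensive explanations for each form. The key to its approach is a dataset of 600 jokes across four types (heterographic puns, homographic puns, contemporary internet humour, and topical jokes involving real-world entities and events), each with a manually written high-quality explanation. Using this dataset, the authors compare the zero-shot ability of a range of LLMs to explain the different joke types, identifying key research gaps in the task of humour explanation.

Link: https://arxiv.org/abs/2507.13335
Authors: Tyler Loakman, William Thorne, Chenghua Lin
Institutions: University of Sheffield; University of Manchester
Categories: Computation and Language (cs.CL)
Comments: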

Abstract:Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

[NLP-2] A Survey of Context Engineering for Large Language Models

[Quick Read]: This survey addresses how the performance of Large Language Models (LLMs) depends on the contextual information supplied during inference, and argues for Context Engineering, the systematic optimization of that information. Its core contribution is a taxonomy of foundational components (context retrieval and generation, context processing, and context management) and of the system implementations that integrate them: retrieval-augmented generation (RAG), memory systems, tool-integrated reasoning, and multi-agent systems. Together these form a technical roadmap for context optimization.

Link: https://arxiv.org/abs/2507.13334
Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
Institutions: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; The University of Queensland; Peking University; Tsinghua University; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: ongoing work; 165 pages, 1401 citations

Abstract:The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

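[NLP-3] The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

[Quick Read]: This paper addresses length generalization, a core challenge for Transformer-based large language models (LLMs): models perform poorly on problems with longer sequences than those seen during training. The key to its solution is Turing MAchine Imitation Learning (TAIL), which synthesizes chain-of-thought (CoT) data that imitates the execution of a Turing Machine. The data linearly expands reasoning steps into atomic states to alleviate shortcut learning, and uses an explicit memory fetch mechanism to ease dynamic and long-range data access in elementary operations.

Link: https://arxiv.org/abs/2507.13332
Authors: Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen
Institutions: Shanghai AI Laboratory; Fudan University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: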

Abstract:Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve, thus can be solved by the Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL synthesizes chain-of-thoughts (CoT) data that imitate the execution process of a Turing Machine by computer programs, which linearly expands the reasoning steps into atomic states to alleviate shortcut learning and explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in their attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
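
To make the idea of a step-by-step execution trace concrete, here is a small sketch of a Turing machine simulator that logs one line per atomic step, including the explicit read of the current tape cell. The machine (binary increment), the trace format, and all names are assumptions chosen for illustration; the abstract does not specify how TAIL formats its synthetic data.

```python
from typing import Dict, List, Tuple

# (state, symbol) -> (new_state, symbol_to_write, head_move)
Transition = Dict[Tuple[str, str], Tuple[str, str, int]]

# Toy machine: add 1 to a binary number, head starting on the last digit.
INCREMENT: Transition = {
    ("carry", "1"): ("carry", "0", -1),  # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("halt", "1", 0),    # 0 + carry = 1, done
    ("carry", "_"): ("halt", "1", 0),    # ran off the left edge: new digit
}


def run_with_trace(tape_str: str, rules: Transition, state: str = "carry",
                   max_steps: int = 1000) -> Tuple[str, List[str]]:
    """Run the machine, logging one line per atomic step."""
    tape = dict(enumerate(tape_str))
    head = len(tape_str) - 1
    trace: List[str] = []
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")  # explicit read of the current cell
        new_state, write, move = rules[(state, symbol)]
        trace.append(
            f"state={state} pos={head} read={symbol} "
            f"write={write} move={move} next={new_state}"
        )
        tape[head] = write
        head += move
        state = new_state
    result = "".join(tape[i] for i in sorted(tape))
    return result, trace
```

Because every step has the same shape regardless of input length, a longer input only produces a longer trace of identical atomic steps, which is the property a length-generalization setup would rely on.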

[NLP-4] Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

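[Quick Read]: This paper asks whether vision-and-language (VL) training changes the linguistic representations of language models in meaningful ways. It starts from the hypothesis that VL training could have a significant effect on lexical-conceptual knowledge, in particular its taxonomic organization. Comparing text-only language models with their VL-trained counterparts on tasks that require taxonomic understanding, the study finds that the VL models often perform better, yet their taxonomic knowledge itself does not differ significantly. The difference lies in how that knowledge is deployed in a specific task, even when the task is purely linguistic.

Link: https://arxiv.org/abs/2507.13328
Authors: Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim
Institutions: Boston University; University of Amsterdam; Umeå University; Vrije Universiteit Amsterdam; Toyota Technological Institute at Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: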

Abstract:Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

[NLP-5] Social and Political Framing in Search Engine Results

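[Quick Read]: This paper examines how search engines and ideologically motivated user queries jointly contribute to bias in search results. The study analyzes the outputs of major search engines on political and social topics and finds that search engines prioritize content in ways that reflect underlying biases, and that ideologically driven queries exacerbate these biases, amplifying specific narratives. It also finds significant differences across search engines in the sources they prioritize.

Link: https://arxiv.org/abs/2507.13325
Authors: Amrit Poudel, Tim Weninger
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICWSM 2026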

Abstract:Search engines play a crucial role in shaping public discourse by influencing how information is accessed and framed. While prior research has extensively examined various dimensions of search bias – such as content prioritization, indexical bias, political polarization, and sources of bias – an important question remains underexplored: how do search engines and ideologically-motivated user queries contribute to bias in search results. This study analyzes the outputs of major search engines using a dataset of political and social topics. The findings reveal that search engines not only prioritize content in ways that reflect underlying biases but also that ideologically-driven user queries exacerbate these biases, resulting in the amplification of specific narratives. Moreover, significant differences were observed across search engines in terms of the sources they prioritize. These results suggest that search engines may play a pivotal role in shaping public perceptions by reinforcing ideological divides, thereby contributing to the broader issue of information polarization.

[NLP-6] HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

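[Quick Read]: This paper addresses how to match user descriptions of haptic signals to the corresponding vibration signals. The main challenges are the lack of large haptic datasets annotated with textual descriptions and the limited ability of existing models to describe vibration signals in text. The key to its solution is HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs that describe sensory, emotional, and associative attributes of vibrations. On top of it, the authors propose a haptic-caption retrieval task based on a supervised contrastive learning framework, where combining the language model T5 with the audio model AST performs best.

Link: https://arxiv.org/abs/2507.13318
Authors: Guimin Hu, Daniel Hershcovich, Hasti Seifi
Institutions: University of Copenhagen; Arizona State University
Categories: Computation and Language (cs.CL)
Comments: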

Abstract:Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and task, of matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) lack of large haptic vibration datasets annotated with textual descriptions as collecting haptic descriptions is time-consuming, and (2) limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together text representations within specific categories and vibrations. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category.

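[NLP-7] The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

[Quick Read]: This paper addresses two limitations in evaluating Large Language Models (LLMs): automated benchmarks correlate poorly with human judgment, and traditional human evaluation studies are limited in scale and cost. The key to its solution is GEA (Generative Energy Arena), a public arena that includes information on each model's energy consumption in the evaluation process. Users compare and rank models while aware of their energy consumption, which lets the authors study how energy awareness influences model choice.

Link: https://arxiv.org/abs/2507.13302
Authors: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: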

Abstract:The evaluation of large language models is a complex task, in which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions of different topics. However, this method has certain limitations, being the most concerning, the poor correlation with the humans. An alternative approach, is to have humans evaluate the LLMs. This poses scalability issues as there is a large and growing number of models to evaluate making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then elaborated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.

[NLP-8] AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research ACL2025

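[Quick Read]: This paper addresses the evaluation of how well Large Language Models (LLMs) design ablation studies for scientific research, and the unreliability of existing automated evaluation methods. The key to its solution is AbGen, a benchmark of 1,500 expert-annotated examples for evaluating how well LLMs generate detailed ablation study designs, together with AbGen-Eval, a meta-evaluation benchmark for assessing the reliability of commonly used automated evaluation systems. The results offer direction for future research on more effective and reliable LLM-based evaluation systems.

Link: https://arxiv.org/abs/2507.13300
Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
Institutions: Yale NLP Lab; TCS Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025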

Abstract:We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

[NLP-9] Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis ACL

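[Quick Read]: This paper addresses the automated generation of high-quality media presentations, which requires robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts that fall short of professional standards. The key to its solution is RCPS (Reflective Coherent Presentation Synthesis), a framework that integrates three core components: deep structured narrative planning, adaptive layout generation, and an iterative optimization loop.

Link: https://arxiv.org/abs/2507.13285
Authors: Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao
Institutions: Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Changzhou University
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 7 figures, 3 tables. Submitted to an ACL-style conference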

Abstract:Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.

[NLP-10] Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management

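[Quick Read]: This paper addresses the reliability and fairness of intelligent systems for Human Capital Management, and in particular the lack of public evaluation benchmarks for skill and job title intelligence. The key to its solution is TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. It comprises two tasks, Multilingual Job Title Matching (Task A) and Job Title-Based Skill Prediction (Task B), with datasets built from real job applications, anonymized, and manually annotated to reflect the complexity and diversity of the labor market. The evaluation covers monolingual and cross-lingual scenarios as well as gender bias, providing the first public benchmark in this field and encouraging robust, fair, and transferable language technologies.

Link: https://arxiv.org/abs/2507.13275
Authors: Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: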

Abstract:Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market. 
Cite as: arXiv:2507.13275v1 [cs.CL], https://doi.org/10.48550/arXiv.2507.13275. Submitted by Luis Gasco, [v1] Thu, 17 Jul 2025 16:33:57 UTC (909 KB).

[NLP-11] QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

[Quick Read]: This paper addresses the limited effectiveness of reinforcement learning (RL) in improving the multi-step reasoning of large language reasoning models (LLMs), particularly on hard problems. The key to its solution is QuestA, a strategy that introduces partial solutions during training to reduce problem difficulty and provide more informative learning signals, thereby strengthening the model's reasoning ability.

Link: https://arxiv.org/abs/2507.13266
Authors: Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 8 figures

Abstract:Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
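
The augmentation step itself, adding part of a reference solution to the question, might look like the sketch below. The abstract only says that partial solutions are introduced during training; the prompt layout, the `hint_fraction` parameter, and the function name are assumptions made for this example.

```python
from typing import List


def augment_question(question: str, solution_steps: List[str],
                     hint_fraction: float) -> str:
    """Append the first `hint_fraction` of a reference solution as a hint."""
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    n_hint = int(len(solution_steps) * hint_fraction)
    if n_hint == 0:
        return question  # no hint: the original, full-difficulty question
    hint = "\n".join(solution_steps[:n_hint])
    return f"{question}\n\nPartial solution:\n{hint}"
```

With `hint_fraction=0.0` the question is unchanged, so one plausible use is to lower the fraction over training until the model faces the original problems.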

[NLP-12] Automating Steering for Safe Multimodal Large Language Models

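[Quick Read]: This paper addresses the safety risks of Multimodal Large Language Models (MLLMs) facing adversarial multimodal inputs, in particular the risk of harmful outputs during inference. The key to its solution is AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer has three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober that estimates the likelihood of harmful outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes in generation when safety risks are detected.

Link: https://arxiv.org/abs/2507.13255
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Institutions: Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph; National University of Singapore, NUS-NCS Joint Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Working in progress. 22 pages (8+ for main); 25 figures; 1 table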

Abstract:Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
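
The probe-then-intervene pattern from components (2) and (3) can be illustrated with a toy gate. This is a simplified sketch under stated assumptions: the prober is modelled as a plain logistic classifier over one hidden vector, and the intervention simply replaces the output with a fixed refusal string, whereas the abstract describes a Refusal Head that modulates generation. All names, weights, and the threshold are hypothetical.

```python
import math
from typing import Callable, Sequence


def probe_toxicity(hidden: Sequence[float], weights: Sequence[float],
                   bias: float) -> float:
    """Logistic probe: estimated probability that the output will be toxic."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def generate_with_guard(hidden: Sequence[float], weights: Sequence[float],
                        bias: float, generate: Callable[[], str],
                        threshold: float = 0.5,
                        refusal: str = "I can't help with that.") -> str:
    """Call `generate` only when the probed risk stays below `threshold`."""
    risk = probe_toxicity(hidden, weights, bias)
    return refusal if risk >= threshold else generate()
```

Since the probe reads an intermediate representation and the gate sits outside the model, nothing in this pattern requires changing the model's weights.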

[NLP-13] HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

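[Quick Read]: This paper addresses the under-evaluation of the reasoning abilities of Large Language Models (LLMs) in Indic languages such as Hindi, which limits our understanding of whether these models generalize across languages. The key to its solution is a new Hindi Analogy Test Set (HATS), together with a grounded Chain of Thought approach based on cognitive theories of analogical reasoning, which improves model performance on Hindi analogy questions. Experiments show that models perform best with English prompts, irrespective of the prompting strategy.

Link: https://arxiv.org/abs/2507.13238
Authors: Ashray Gupta, Rohan Joseph, Sunny Rai
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: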

Abstract:Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

[NLP-14] Enhancing Cross-task Transfer of Large Language Models via Activation Steering

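[Quick Read]: This paper addresses the poor performance of Large Language Models (LLMs) on unseen tasks, particularly in data-scarce scenarios, and the robustness, scalability, and efficiency challenges of cross-task in-context learning. The key to its solution is achieving cross-task transfer through latent space steering, without parameter updates or input expansion. Based on an analysis of activation patterns in the latent space of LLMs, the method selects influential and diverse samples from high-resource tasks and uses their contrastive representation-enhanced activations to adapt the model to low-resource tasks.

Link: https://arxiv.org/abs/2507.13236
Authors: Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; Department of Computer Science, National University of Singapore; Ant Group
Categories: Computation and Language (cs.CL)
Comments: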

Abstract:Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model’s internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
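
As a rough illustration of activation steering in general, the sketch below builds a steering direction as a difference of mean activations and adds it to a hidden state. This is the generic difference-of-means recipe, not CAST's actual procedure: the abstract does not specify how samples are selected or how the contrastive activations are formed, and the function names and the `alpha` scale are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(with_examples: Sequence[Sequence[float]],
                    without_examples: Sequence[Sequence[float]]) -> Vector:
    """Contrast: mean activation with in-context examples minus mean without."""
    with_mean = mean_vector(with_examples)
    without_mean = mean_vector(without_examples)
    return [a - b for a, b in zip(with_mean, without_mean)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

The direction is computed once from the high-resource task and then reused at inference on the low-resource task, so the prompt does not grow and no parameters are updated.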

[NLP-15] Automatically assessing oral narratives of Afrikaans and isiXhosa children

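[Quick Read]: This paper addresses the difficulty teachers in large preschool classrooms have in accurately identifying children who require intervention. The key to its solution is an automatic assessment system that applies automatic speech recognition to preschool children's oral narratives and then uses a machine learning scoring model to predict narrative and comprehension scores. The LLM-based system outperforms the linear model in most cases, although the linear model remains competitive, and the LLM-based system is comparable to a human expert in flagging children who require intervention.

Link: https://arxiv.org/abs/2507.13205
Authors: R. Louw, E. Sharratt, F. de Wet, C. Jacobs, A. Smith, H. Kamper
Institutions: Stellenbosch University
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to SLaTE 2025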

Abstract:Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children’s learning.

[NLP-16] GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems EMNLP

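[Quick Read]: This paper addresses the shortcomings of evaluating multi-agent systems on collaborative reasoning tasks by final-output correctness alone, which overlooks how inefficient communication and poor coordination lead to redundant reasoning and higher computational costs. The key to its solution is GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. It proposes two process-level metrics: the Information Diversity Score (IDS), which measures semantic variation in inter-agent messages, and the Unnecessary Path Ratio (UPR), which quantifies redundant reasoning paths.

Link: https://arxiv.org/abs/2507.13190
Authors: Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma
Institutions: Seoul National University; Sogang University; Kwangwoon University; LG Electronics, Toronto AI Lab
Categories: Computation and Language (cs.CL)
Comments: 4 figures, 1 algorithm, 2 tables, 6 pages, under review at EMNLP Industry track 2025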

点击查看摘要

Abstract:Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.

[NLP-17] Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

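[Quick read]: This paper asks how features of children's oral narratives can be used to identify language development that calls for intervention, as a basis for early assessment in multilingual settings. The key to the solution is applying simple machine learning methods to oral narratives from four- and five-year-old Afrikaans- and isiXhosa-speaking children, which identifies features associated with typical development, such as lexical diversity (unique words) and mean utterance length, and finds that the use of specific verbs and auxiliaries is correlated with a reduced need for intervention.

Link: https://arxiv.org/abs/2507.13164
Authors: Emma Sharratt,Annelien Smith,Retief Louw,Daleen Klop,Febe de Wet,Herman Kamper
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: SLaTE 2025 in Nijmegen, Netherlands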

Abstract:Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.
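The two features named in the abstract, unique words and mean utterance length, are simple to compute. The sketch below assumes whitespace tokenization and a list of already-transcribed utterances; the function name and the sample story are hypothetical and not taken from the paper.

```python
def narrative_features(utterances):
    """Count unique words and mean utterance length (in tokens) for one story."""
    # Lowercase and split on whitespace; a real system would use a proper tokenizer.
    tokenized = [u.lower().split() for u in utterances if u.strip()]
    all_tokens = [token for tokens in tokenized for token in tokens]
    mean_length = len(all_tokens) / len(tokenized) if tokenized else 0.0
    return {
        "unique_words": len(set(all_tokens)),
        "mean_utterance_length": mean_length,
    }

story = ["the dog ran away", "the boy looked for the dog", "he found it"]
features = narrative_features(story)
```

Features like these could then feed any simple classifier that predicts whether a child was flagged for intervention.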

[NLP-18] Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

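[Quick read]: This paper addresses the alignment problem for Large Language Models (LLMs): how to make LLM behavior consistent with human values and intent so as to improve reliability, controllability, and capability. The key to the solution is using inverse reinforcement learning (IRL) to build neural reward models from human data, from which the model's behavior policy is inferred and optimized. The review emphasizes how RL techniques used in LLM alignment differ from those in conventional RL tasks and discusses the formal and practical implications of this paradigm shift.

Link: https://arxiv.org/abs/2507.13158
Authors: Hao Sun,Mihaela van der Schaar
Affiliations: University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: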

Abstract:In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

[NLP-19] From Roots to Rewards: Dynamic Tree Reasoning with RL

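[Quick read]: This paper addresses the limited dynamic adaptability and computational efficiency of traditional tree-structured reasoning methods, in particular the ProbTree framework, where the reasoning tree is fixed at construction time and every node requires exhaustive evaluation of all solution strategies. The key to the solution is a dynamic reinforcement learning framework that builds the reasoning tree incrementally from real-time confidence estimates and learns an optimal policy for action selection (decomposition, retrieval, or aggregation), improving solution quality and computational efficiency while keeping probabilistic rigor.

Link: https://arxiv.org/abs/2507.13142
Authors: Ahmed Bahloul,Simon Malberg
Affiliations: Technical University of Munich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: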

Abstract:Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.

[NLP-20] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

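[Quick read]: This paper investigates the sources of annotation variability, particularly possible demographic bias in tasks such as sexism detection. The core questions are how much annotators' demographic features influence labeling decisions and how reliable Generative AI (GenAI) models are as annotators. The key to the solution is a Generalized Linear Mixed Model that quantifies the influence of demographic factors, showing that they account for only 8% of the observed variance while text content is the dominant factor. The study also shows that prompting with demographic personas does not reliably improve the agreement of GenAI models with human judgments, and that model predictions rely mainly on content-specific tokens related to sexism rather than on demographic characteristics. The paper therefore argues for pursuing fairness through content-driven explanations and robust annotation protocols rather than persona simulation.

Link: https://arxiv.org/abs/2507.13138
Authors: Hadi Mohammadi,Tina Shahedi,Pablo Mosteiro,Massimo Poesio,Ayoub Bagheri,Anastasia Giachanou
Affiliations: Utrecht University, The Netherlands; Queen Mary University of London, London, United Kingdom
Categories: Computation and Language (cs.CL)
Comments: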

Abstract:Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than potentially persona simulation.

[NLP-21] A Computational Framework to Identify Self-Aspects in Text ACL

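[Quick read]: This paper addresses the identification of Self-aspects in text within natural language processing (NLP). Although the Self is a multifaceted construct that is reflected in language and aligns with psychological and other well-researched phenomena, it still lacks systematic analysis in NLP. The key to the solution is to build an ontology of Self-aspects and a gold-standard annotated dataset, and on that basis to develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency.

Link: https://arxiv.org/abs/2507.13115
Authors: Jaya Caporusso,Matthew Purver,Senja Pollak
Affiliations: Jožef Stefan Institute; Jožef Stefan International Postgraduate School; Queen Mary University of London
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL SRW 2025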

Abstract:This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.

[NLP-22] SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

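[Quick read]: This paper addresses the problem of capturing semantic similarity when learning embeddings of scientific texts, where traditional citation-based methods do not necessarily reflect semantic similarity. The key to the solution is to use LLM-generated summaries of scientific abstracts as the training signal for contrastive learning, so that semantically related summaries lie closer together in the embedding space, which strengthens semantic separation in that space.

Link: https://arxiv.org/abs/2507.13105
Authors: Marc Brinner,Sina Zarriess
Affiliations: Bielefeld University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: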

Abstract:We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model’s ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.
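A contrastive objective of the kind described can be sketched as follows. The abstract does not name the loss, so this assumes the widely used InfoNCE formulation with in-batch negatives; the temperature value and the toy 2-D embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: positives[i] pairs with anchors[i]; other rows are negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Embeddings of two summaries per abstract (toy values).
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is near zero when matching summaries are already close and large when they are mismatched, which is the pressure that pulls related summaries together during training.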

[NLP-23] Formalizing Attack Scenario Description: A Proposed Model

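[Quick read]: This paper addresses how organizations facing an ever-changing threat landscape can protect their assets through increased cybersecurity automation, and in particular the need to formalize input data for processes that take attack scenarios as input. The key to the solution is a new formal model that covers the description of an attack's context and its scenario, abstracted as a UML class diagram. The model can support an upstream attack analysis process as well as the automatic generation of attack scripts for cybersecurity training.

Link: https://arxiv.org/abs/2507.13076
Authors: Quentin Goux(CEDRIC - ISID),Nadira Lammari(CEDRIC - ISID)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: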

Abstract:Organizations face an ever-changing threat landscape. They must continuously dedicate significant efforts to protect their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. Through this paper, we address this need for processes that use attack scenarios as input. Among these processes, one can mention both the generation of scripts for attack simulation and training purposes, as well as the analysis of attacks. Therefore, the paper’s main research contribution is a novel formal model that encompasses the attack’s context description and its scenario. It is abstracted using a UML class model. Once our model has been described, we show how it could serve an upstream attack analysis process. We also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this research work.

[NLP-24] Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities ICCV2025

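[Quick read]: This paper addresses how well Vision-and-Language Navigation (VLN) methods adapt to deployment on physical robots: current VLN models perform well under idealized assumptions, but their performance drops markedly in physical settings because of limited observation space, variations in environmental lighting, and physical challenges such as collisions and falls. The key to the solution is the VLN-PE platform, which supports humanoid, quadruped, and wheeled robots and provides the first systematic evaluation of several ego-centric VLN methods in physical robotic settings, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning.

Link: https://arxiv.org/abs/2507.13019
Authors: Liuyi Wang,Xinyuan Xia,Hui Zhao,Hanqing Wang,Tai Wang,Yilun Chen,Chengju Liu,Qijun Chen,Jiangmiao Pang
Affiliations: Shanghai AI Laboratory; Tongji University; Shanghai Jiao Tong University; State Key Laboratory of Autonomous Intelligent Unmanned Systems
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025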

Abstract:Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment’s overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at this https URL.

[NLP-25] Teach Old SAEs New Domain Tricks with Boosting

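[Quick read]: This paper addresses the difficulty Sparse Autoencoders (SAEs) have, when used to interpret the internal representations of Large Language Models (LLMs), in capturing domain-specific features that are uncommon in their training corpora. The key to the solution is a residual learning approach: a secondary SAE is trained to model the reconstruction error of a pretrained SAE on domain-specific texts, capturing features the primary model misses. At inference time the outputs of the two models are summed, which yields significant improvements in LLM cross-entropy and explained variance across multiple specialized domains while maintaining performance on general tasks.

Link: https://arxiv.org/abs/2507.12990
Authors: Nikita Koriagin,Yaroslav Aksenov,Daniil Laptev,Gleb Gerasimov,Nikita Balagansky,Daniil Gavrilov
Affiliations: T-Tech; HSE University; Moscow Institute of Physics and Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: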

Abstract:Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
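The residual scheme can be shown with a toy example. This sketch assumes that both SAEs read the same activation vector and that the secondary SAE's training target is the primary's reconstruction error; the weights are set by hand instead of being trained, and `TinySAE` is a hypothetical stand-in, not the authors' code.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinySAE:
    """Bias-free ReLU autoencoder: x_hat = W_dec @ relu(W_enc @ x)."""

    def __init__(self, w_enc, w_dec):
        self.w_enc = w_enc
        self.w_dec = w_dec

    def reconstruct(self, x):
        return matvec(self.w_dec, relu(matvec(self.w_enc, x)))

def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# The primary SAE only captures dimension 0 (a "general" feature).
primary = TinySAE(w_enc=[[1.0, 0.0]], w_dec=[[1.0], [0.0]])
# Hand-set stand-in for a secondary SAE trained on the primary's residual.
secondary = TinySAE(w_enc=[[0.0, 1.0]], w_dec=[[0.0], [1.0]])

x = [2.0, 3.0]  # activation with a domain-specific component in dimension 1
primary_out = primary.reconstruct(x)
residual = [a - b for a, b in zip(x, primary_out)]  # secondary's training target
combined = [a + b for a, b in zip(primary_out, secondary.reconstruct(x))]
```

Because the primary SAE is left untouched, its behavior on general text is preserved, and the secondary model only has to account for what the primary misses.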

[NLP-26] MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps

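[Quick read]: This paper addresses the Spanish table question answering task PRESTA (Preguntas y Respuestas sobre Tablas en Español), which requires extracting accurate answers from tabular data. The key to the solution is using generative AI to produce Python code that filters and processes the table in order to answer the user's question. The approach consists of several steps (analyzing the table content, selecting useful columns, generating natural-language instructions, translating them to code, running the code, and handling errors or exceptions) and relies on open-source LLMs with fine-grained, optimized prompts.

Link: https://arxiv.org/abs/2507.12981
Authors: Maximiliano Hormazábal Lagos,Álvaro Bueno Sáez,Héctor Cerezo-Costas,Pedro Alonso Doval,Jorge Alcalde Vesteiro
Affiliations: Fundación Centro Tecnolóxico de Telecomunicacións de Galicia (GRADIANT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as an official challenge paper in the PRESTA: Questions and Answers over Tabular Data shared task at IberLEF 2025, colocated with the 41st SEPLN Conference in Zaragoza, Spain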

Abstract:This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Español (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the SemEval 2025 related task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85% in the task.
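The last steps of that pipeline (run the generated code, then handle errors) can be sketched as a retry loop. Everything here is an assumption made for illustration: `generate_code` stands in for an LLM call, `fake_llm` is a scripted stub, and the convention that generated code stores its answer in `result` is hypothetical.

```python
def answer_with_retries(table, question, generate_code, max_attempts=3):
    """Run generated code against the table, feeding any error back to the generator."""
    error = None
    for _ in range(max_attempts):
        code = generate_code(question, list(table[0].keys()), error)
        scope = {"table": table}
        try:
            # Generated code is untrusted; a real system should run it in a sandbox.
            exec(code, scope)
            return scope["result"]
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    return None  # every attempt failed

def fake_llm(question, columns, error):
    """Scripted stub: the first attempt has a bug, the retry fixes it."""
    if error is None:
        return "result = sum(row['precio'] for row in tabla)"  # misspelled variable
    return "result = sum(row['precio'] for row in table)"

rows = [{"producto": "pan", "precio": 2}, {"producto": "leche", "precio": 3}]
answer = answer_with_retries(rows, "¿Cuál es el precio total?", fake_llm)
```

Passing the error message back into the next generation call is one simple way to implement the error-handling step; the paper's actual prompts are not given in the abstract.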

[NLP-27] Probabilistic Soundness Guarantees in LLM Reasoning Chains

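[Quick read]: This paper addresses the problem that initial errors in reasoning chains generated by Large Language Models (LLMs) propagate and make the final conclusion unreliable. Existing LLM-based error detection methods often miss propagated errors because they do not account for how earlier errors corrupt judgments of later reasoning steps. The proposed solution is Autoregressive Reasoning Entailment Stability (ARES), whose key idea is to judge each claim based only on premises previously assessed as sound, which prevents error propagation. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of soundness rather than a brittle binary label.

Link: https://arxiv.org/abs/2507.12948
Authors: Weiqiu You,Anton Xue,Shreya Havaldar,Delip Rao,Helen Jin,Chris Callison-Burch,Eric Wong
Affiliations: University of Pennsylvania
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: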

Abstract:In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
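The core idea, judging each claim only against steps already assessed as sound, can be sketched in a deliberately simplified form. This version uses a hard threshold and a toy entailment function; it does not reproduce the probabilistic scoring or the certified guarantees described in the abstract, and all names are hypothetical.

```python
def assess_chain(premises, claims, entails, threshold=0.5):
    """Score claims in order; only claims scoring at or above threshold join the context."""
    sound = list(premises)
    scores = []
    for claim in claims:
        score = entails(sound, claim)
        scores.append(score)
        if score >= threshold:
            sound.append(claim)  # unsound claims never become premises
    return scores

def toy_entails(sound, claim):
    """Stand-in for an LLM entailment judge: 1.0 only if the step is valid and its inputs are sound."""
    name, depends_on, step_is_valid = claim
    known = {s if isinstance(s, str) else s[0] for s in sound}
    return 1.0 if step_is_valid and all(d in known for d in depends_on) else 0.0

premises = ["p1"]
# Claim "b" is wrong; claim "c" is locally valid but builds on "b".
claims = [("a", ["p1"], True), ("b", ["a"], False), ("c", ["b"], True)]
scores = assess_chain(premises, claims, toy_entails)
```

Claim "c" receives a low score even though its own step is valid, because the erroneous claim "b" was never admitted as a premise; this is the propagated-error case the abstract highlights.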

[NLP-28] Making Language Model a Hierarchical Classifier and Generator

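[Quick read]: This paper addresses the limitation that conventional decoder-only models decode only at the last layer, with the aim of improving performance on complex tasks. The key to the solution is a hierarchical decoder architecture: the language heads of the last layer are copied to selected intermediate layers and fine-tuned with different task inputs, so that multiple layers decode simultaneously, which improves performance on hierarchical text classification, classification-guided generation, and hierarchical text generation.

Link: https://arxiv.org/abs/2507.12930
Authors: Yihong Wang,Zhonglin Jiang,Ningyuan Xi,Yue Zhao,Qingqing Gu,Xiyuan Chen,Hao Wu,Sheng Xu,Hange Zhou,Yong Chen,Luo Ji
Affiliations: Geely AI Lab; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: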

Abstract:Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans’ hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers could be adapted to speak meaningful and reasonable contents, and this paradigm of hierarchical decoder can obtain state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretraining from scratch.

[NLP-29] Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?

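[Quick read]: This paper addresses cross-lingual consistency in multilingual models, with the aim of assessing cross-lingual transferability, maintaining the factuality of model knowledge across languages, and preserving parity of language model performance. The key to the solution is to study cross-lingual knowledge consistency through code-mixed coreferential statements and to use interpretability methods to analyze model behavior in cross-lingual contexts, which shows that multilingual models differ in consistency depending on language family, linguistic factors, and a cross-lingual consistency bottleneck at a particular layer. The paper also evaluates common strategies for improving multilingual performance and finds that code-switching training and cross-lingual word alignment objectives give the most promising gains in knowledge consistency, underlining the importance of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency.

Link: https://arxiv.org/abs/2507.12838
Authors: Xi Ai,Mahardika Krisna Ihsani,Min-Yen Kan
Affiliations: School of Computing, National University of Singapore; Department of Natural Language Processing, MBZUAI
Categories: Computation and Language (cs.CL)
Comments: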

Abstract:Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements that convey identical knowledge across languages to study cross-lingual knowledge consistency. We use some interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families, linguistic factors, and a bottleneck in cross-lingual consistency on a particular layer. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the noteworthiness of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency enhancement.

[NLP-30] Emotional Support with LLM-based Empathetic Dialogue Generation

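[Quick read]: This paper addresses how Emotional Support Conversation (ESC) can provide empathetic and effective psychological support through dialogue, in response to the growing demand for mental health support. The key to the solution is using large-scale language models enhanced with prompt engineering and fine-tuning, exploring both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning to improve the model's ability to generate supportive and contextually appropriate responses.

Link: https://arxiv.org/abs/2507.12820
Authors: Shiquan Wang,Ruiyu Fang,Zhongjiang He,Shuangyong Song,Yongxiang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: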

Abstract:Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model’s ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
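For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA update, in which a frozen weight W is augmented with a scaled low-rank product (alpha / r) * B @ A. It illustrates the general technique only; the matrix sizes and values are toy assumptions and say nothing about the authors' actual configuration.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
a = [[1.0, 2.0, 3.0]]                                      # trainable, rank r = 1
b_init = [[0.0], [0.0], [0.0]]                             # zero init: no change at start
b_trained = [[1.0], [0.0], [0.0]]                          # toy "trained" values

at_start = lora_weight(w, a, b_init, alpha=2.0)
after_training = lora_weight(w, a, b_trained, alpha=2.0)
```

Here the adapter has 6 trainable numbers against 9 in the full matrix; at realistic layer sizes the gap is far larger, which is the trade-off against full-parameter fine-tuning that the paper explores.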

[NLP-31] Large Language Models' Internal Perception of Symbolic Music

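[Quick read]: This paper explores the extent to which Large Language Models (LLMs) implicitly model symbolic music, specifically how LLMs generate symbolic music data from text prompts and how useful that data is for music recognition and generation tasks. The key to the solution is to have LLMs generate a dataset of MIDI files from textual descriptions without relying on explicit musical training, and then to train neural networks on this dataset for genre and style classification and melody completion, in order to assess the potential and limitations of LLMs in the music domain.

Link: https://arxiv.org/abs/2507.12808
Authors: Andrew Shin,Kunitake Kaneko
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: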

Abstract:Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.

[NLP-32] MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

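[Quick read]: This paper addresses the lack of robust and scalable evaluation frameworks for intelligent agents driven by Large Language Models (LLMs). Existing methods rely on static benchmarks and labor-intensive data collection, which limits practical assessment. The key to the solution is MCPEval, an open-source framework based on the Model Context Protocol (MCP) that automates end-to-end task generation and deep evaluation, standardizes metrics, and integrates seamlessly with native agent tools, eliminating the manual effort of building evaluation pipelines.

Link: https://arxiv.org/abs/2507.12806
Authors: Zhiwei Liu,Jielin Qiu,Shiyu Wang,Jianguo Zhang,Zuxin Liu,Roshan Ram,Haolin Chen,Weiran Yao,Huan Wang,Shelby Heinecke,Silvio Savarese,Caiming Xiong
Affiliations: Salesforce AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: this https URL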

Abstract:The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at this https URL to promote reproducible and standardized LLM agent evaluation.

[NLP-33] PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database KDD-25

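[Quick read]: This paper addresses three main problems of learning-based lossless compressors for the backup, storage, transmission, and management of large-scale genomic databases: inadequate compression ratio, low compression/decompression throughput, and poor compression robustness. The key to the solution is a new compression framework, the Parallel Multi-Knowledge Learning-based Compressor (PMKLC), with four key designs: 1) an automated multi-knowledge learning-based compression framework that improves compression ratio and robustness; 2) a GPU-accelerated (s, k)-mer encoder that optimizes throughput and computing resource usage; 3) data block partitioning and a Step-wise Model Passing (SMP) mechanism for parallel acceleration; 4) two compression modes, PMKLC-S and PMKLC-M, for different application scenarios.

Link: https://arxiv.org/abs/2507.12805
Authors: Hui Sun,Yanfeng Ding,Liping Yi,Huidong Ma,Gang Wang,Xiaoguang Liu,Cheng Zhong,Wentong Cai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted via KDD-25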

Abstract:Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression/decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve these challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor’s backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated (s, k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480% and average throughput improvements of up to 3.036× and 10.710×, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating their greater stability against datasets with different probability distribution perturbations and their strong ability to run on memory-constrained devices.

[NLP-34] Learning Robust Negation Text Representations

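[Quick read]: This paper addresses the weakness of text encoders in handling negation, which affects downstream applications that rely on text embeddings. The key to the solution is improving the negation robustness of text encoders by distilling data with diverse patterns of negation and hedging from large language models. The study fine-tunes a strong BERT-based model with a standard contrastive learning strategy, substantially improving negation understanding while maintaining competitive performance on general benchmarks. The method can also be adapted to LLMs, improving their performance on negation benchmarks.

Link: https://arxiv.org/abs/2507.12782
Authors: Thinh Hung Truong,Karin Verspoor,Trevor Cohn,Timothy Baldwin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: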

Abstract:Despite rapid adoption of autoregressive large language models, smaller text encoders still play an important role in text understanding tasks that require rich contextualized representations. Negation is an important semantic function that is still not properly captured by such methods, affecting many downstream applications relying on text embeddings. We propose a strategy to improve negation robustness of text encoders, by distilling data from large language models using diverse patterns of negation and hedging. We adopt a standard contrastive learning strategy to finetune a strong BERT-based model, and observe large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. In addition, we also show that our method can be adapted to LLMs, leading to improved performance on negation benchmarks.

[NLP-35] A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

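TL;DR: This paper addresses the unique challenges that electronic health records (EHRs) pose for AI modeling, including data heterogeneity, temporal irregularity, and domain specificity. The key to its solution is a unified taxonomy covering data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems, which it uses to systematically review recent advances in deep learning and LLMs for EHR modeling.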

Link: https://arxiv.org/abs/2507.12774
Authors: Weijieying Ren,Jingxi Zhu,Zehao Liu,Tianxiang Zhao,Vasant Honavar
Affiliations: The Pennsylvania State University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to this https URL.

[NLP-36] Synergy: End-to-end Concept Model

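TL;DR: This paper addresses the dependence of conventional language models on tokenizers and the question of how to bridge different levels of abstraction end to end. The key to its solution is Synergy, an architecture built on a learned routing mechanism. Trained as a byte-level language model, it spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers at comparable performance. The study also finds that the middle part of the model (the higher-abstraction part) performs better when positional encodings are removed, suggesting the emergence of position-independent concepts and supporting the feasibility of tokenizer-free architectures.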

Link: https://arxiv.org/abs/2507.12769
Authors: Keli Zheng,Zerong Xie
Affiliations: Institute of Software, Chinese Academy of Sciences; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.

[NLP-37] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

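TL;DR: This paper asks how to elicit long chain-of-thought (CoT) reasoning from a large language model (LLM) without additional training, and how to strengthen that ability further. The key to its solution is ThinkLogit, a decoding-time method that uses logits arithmetic, with a much smaller model acting as guider, to steer the large target model toward long reasoning. Training the guider with preference optimization over correct/incorrect reasoning pairs improves performance further; this variant is called ThinkLogit-DPO.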

Link: https://arxiv.org/abs/2507.12759
Authors: Yunxiang Zhang,Muhammad Khalifa,Lechen Zhang,Xin Liu,Ayoung Lee,Xinliang Frederick Zhang,Farima Fatahi Bayat,Lu Wang
Affiliations: University of Michigan, Ann Arbor
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model – a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in pass@1 by 26% and 29%, respectively, over four mathematical datasets using the Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B – a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by 13% relative compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.
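
The logits arithmetic mentioned in the abstract can be illustrated with a toy example. The sketch below assumes the proxy-tuning style combination `target + alpha * (guider - guider_base)`, where `guider_base` stands for the small model before it acquired long-reasoning behavior. The function names, the `alpha` weight, and the three-token vocabulary are hypothetical, and the paper's exact formulation may differ.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_logits(target, guider, guider_base, alpha=1.0):
    """Shift the large model's next-token logits by the small models' difference.

    target:      logits from the large target model
    guider:      logits from the small reasoning model (the guider)
    guider_base: logits from the small model without reasoning training (assumed)
    alpha:       guidance strength (hypothetical knob)
    """
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]

# Toy vocabulary of three tokens; the guider strongly prefers token 1.
target = [2.0, 1.0, 0.0]
guider = [0.0, 2.0, 0.0]
guider_base = [0.0, 0.0, 0.0]

combined = combine_logits(target, guider, guider_base)
probs = softmax(combined)
```

On its own the target model prefers token 0. After the shift, token 1 has the highest probability, which is how a small guider can steer decoding without updating the large model's weights.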

[NLP-38] Strategy Adaptation in Large Language Model Werewolf Agents

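TL;DR: This paper addresses the lack of adaptability of Werewolf agents in dynamically changing game situations: prior prompt-engineering methods define effective strategies only implicitly and cannot respond flexibly to change. The key to its solution is a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players.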

Link: https://arxiv.org/abs/2507.12732
Authors: Fuya Nakamori,Yin Jou Huang,Fei Cheng
Affiliations: nlp.ist.i.kyoto-u.ac.jp
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures

Abstract:This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior works of Werewolf agents using prompt engineering have employed methods where effective strategies are implicitly defined, they cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.

[NLP-39] TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

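TL;DR: This paper addresses the evaluation and ranking of machine translation quality, aiming for a system that reasons about its judgments and gives fine-grained evaluations. The key to its solution is a prompting-based approach that evaluates translations along a subset of the Multidimensional Quality Metrics, using LLMs such as Anthropic’s Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as evaluators. It produces evaluations that human raters find highly acceptable, together with numerical scores that correlate well with human scores.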

Link: https://arxiv.org/abs/2507.12724
Authors: Richard Sproat,Tianyu Zhao,Llion Jones
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (this https URL), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic’s Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system – as well as MT-Ranker – to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system’s evaluation and reasoning, human assessments, as well as code is released.

[NLP-40] FLEXITOKENS: Flexible Tokenization for Evolving Language Models

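TL;DR: This paper addresses the inefficiency caused by rigid subword tokenizers when language models are adapted to new data distributions, in particular the over-fragmentation that arises on out-of-distribution domains, unseen languages, or scripts. The key to its solution is a byte-level language model (LM) with a learnable tokenizer: a submodule learns to predict boundaries in the input byte sequence, making tokenization adaptive. Unlike existing tokenizer-free methods, which train with an auxiliary loss that enforces a fixed compression rate, the proposed FLEXITOKENS uses a simplified training objective that allows significantly greater flexibility during adaptation.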

Link: https://arxiv.org/abs/2507.12720
Authors: Abraham Toluase Owodunni,Orevaoghene Ahia,Sachin Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries between the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at this https URL

[NLP-41] AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

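TL;DR: This paper targets two key problems in current speech evaluation: the need for, and difficulty of, designing specialized systems for individual audio characteristics, and the poor correlation between automatic evaluation methods and human preferences. The key to its solution is using a Large Audio Model (LAM) as the judge, called AudioJudge. The authors systematically study it on audio characteristic detection and system-level human preference simulation, improve performance with a prompting strategy that combines audio concatenation with in-context learning, and introduce a multi-aspect ensemble AudioJudge for general-purpose multi-aspect audio evaluation, yielding a unified evaluation framework.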

Link: https://arxiv.org/abs/2507.12705
Authors: Potsawee Manakul,Woody Haosheng Gan,Michael J. Ryan,Ali Sartaz Khan,Warit Sirichotedumrong,Kunat Pipatanakul,William Held,Diyi Yang
Affiliations: SCB 10X, SCBX Group; University of Southern California; Stanford University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
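
The abstract reports agreement with human preferences as a Spearman correlation on a system ranking benchmark. As a reminder of what that number measures, here is a minimal sketch of Spearman's rank correlation for data without ties, using the textbook formula $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$. The scores below are made up for illustration and are not taken from the paper.

```python
def ranks(values):
    """Assign ranks 1..n by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for four speech systems.
human = [4.5, 3.0, 2.0, 1.0]   # mean human preference per system
judge = [0.9, 0.5, 0.7, 0.1]   # automatic judge's win rate per system

rho = spearman(human, judge)
```

The judge swaps the two middle systems, so the correlation is high but below 1. Real evaluations with tied scores would need a tie-aware implementation.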

[NLP-42] AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

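TL;DR: This paper addresses sentiment classification and aspect term extraction in Multimodal Aspect-Based Sentiment Analysis (MABSA), in particular how to capture cross-modal relationships in data that combines text and images. The key to its solution is the AdaptiSent framework, which uses adaptive cross-modal attention, combining dynamic modality weighting with context-adaptive attention, to improve the extraction of sentiment and aspect-related information and to identify complex interactions between textual cues and visual context more accurately.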

Link: https://arxiv.org/abs/2507.12695
Authors: S M Rafiuddin,Sadia Kamal,Mohammed Rakib,Arunkumar Bagavathi,Atriya Sen
Affiliations: Oklahoma State University
Categories: Computation and Language (cs.CL)
Comments: 12 pages (including references), 2 figures (Fig. 1 overview, Fig. 2 hyperparameter sensitivity with two subplots), 6 tables (performance, ablation, dataset stats, case studies, etc.), accepted at ASONAM 2025 (Social Network Analysis and Mining)

Abstract:We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model’s ability to adjust its focus dynamically based on the context’s relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.

[NLP-43] Improving Drug Identification in Overdose Death Surveillance using Large Language Models

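TL;DR: This paper addresses the need for timely and accurate surveillance of rising drug-related deaths in the United States, which are largely driven by fentanyl. Critical data are often buried in the free text of coroner reports, causing delays and information loss when coded into ICD-10 classifications. The key to its solution is using natural language processing (NLP) models to classify drug involvement from death certificate text automatically. Fine-tuned clinical-domain models such as BioClinicalBERT achieve near-perfect performance on the internal test set (macro F1 = 0.998) and remain robust on external validation (macro F1 = 0.966), outperforming traditional machine learning methods and general-domain models.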

Link: https://arxiv.org/abs/2507.12679
Authors: Arthur J. Funnell,Panayiotis Petousis,Fabrice Harel-Canada,Ruby Romero,Alex A. T. Bui,Adam Koncsol,Hritika Chaturvedi,Chelsea Shover,David Goodman-Meza
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 30 pages, 1 figure, 4 tables, 2 supplemental figures, 4 supplemental tables, submitted to Journal of Forensic Sciences (JFS)

Abstract:The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into ICD (International Classification of Disease)-10 classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple U.S. jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3,335 records from 2023-2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores =0.998 on the internal test set. External validation confirmed robustness (macro F1=0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only large language models. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
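
Model performance in this abstract is reported as macro-averaged F1. The sketch below shows how that metric is computed: F1 is calculated per class and then averaged with equal weight, so rare drug classes count as much as common ones. It assumes a simplified single-label setting (the paper also evaluates multi-label classifiers), and the labels and predictions are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label setting)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels and model predictions for four death records.
y_true = ["fentanyl", "fentanyl", "heroin", "none"]
y_pred = ["fentanyl", "heroin", "heroin", "none"]

score = macro_f1(y_true, y_pred)
```

Here one fentanyl record is misclassified as heroin, which lowers the F1 of both classes to 2/3 while the third class stays at 1, giving a macro F1 of 7/9.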

[NLP-44] The first open machine translation system for the Chechen language

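TL;DR: This paper addresses translation between the vulnerable Chechen language and Russian, releasing an open-source model and the dataset collected to train and evaluate it. The key to its solution is exploring how to fine-tune the large multilingual translation model NLLB-200 to include a new language, enabling effective translation between the two languages.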

Link: https://arxiv.org/abs/2507.12672
Authors: Abu-Viskhan A. Umishov,Vladislav A. Grigorian
Affiliations: Southern Federal University
Categories: Computation and Language (cs.CL)
Comments: 7 pages

Abstract:We introduce the first open-source model for translation between the vulnerable Chechen language and Russian, and the dataset collected to train and evaluate it. We explore fine-tuning capabilities for including a new language into a large language model system for multilingual translation NLLB-200. The BLEU / ChrF++ scores for our model are 8.34 / 34.69 and 20.89 / 44.55 for translation from Russian to Chechen and reverse direction, respectively. The release of the translation models is accompanied by the distribution of parallel words, phrases and sentences corpora and multilingual sentence encoder adapted to the Chechen language.

[NLP-45] A Fuzzy Approach to Project Success: Measuring What Matters

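TL;DR: This paper addresses the failure of traditional Likert-scale measures to capture the context-dependent and multifaceted nature of project success. The key to its solution is a hierarchical Type-1 Mamdani fuzzy system that prioritizes sustained positive impact for end-users over secondary outcomes such as stakeholder satisfaction and internal project success. Using fuzzy logic to describe the complexity of project success may yield a more accurate measure.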

Link: https://arxiv.org/abs/2507.12653
Authors: João Granja-Correia,Remedios Hernández-Linares,Luca Ferranti,Arménio Rego
Affiliations: Universidad de Cantabria; Universidad de Extremadura; Aalto University; Católica Porto Business School
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 3 pages, 1 figure, presented at FUZZ-IEEE 2025

Abstract:This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.

[NLP-46] Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

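TL;DR: This paper targets the unstable optimization and catastrophic forgetting of monolithic Multimodal Large Language Models (MLLMs). The key to its solution is embedding a new visual parameter space into a pre-trained language model, so that visual knowledge can be learned stably from noisy data via delta tuning. On this principle the authors propose Mono-InternVL and introduce Endogenous Visual Pre-training (EViP) to improve its visual capabilities. The follow-up Mono-InternVL-1.5 further improves EViP (EViP++), raising the model's efficiency and performance.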

Link: https://arxiv.org/abs/2507.12566
Authors: Gen Luo,Wenhan Dou,Wenhao Li,Zhaokai Wang,Xue Yang,Changyao Tian,Hao Li,Weiyun Wang,Wenhai Wang,Xizhou Zhu,Yu Qiao,Jifeng Dai
Affiliations: Shanghai Artificial Intelligence Laboratory; Tsinghua University; The Chinese University of Hong Kong; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at this https URL.

[NLP-47] Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

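TL;DR: This paper examines whether language models (LMs) can accurately categorize sentences by modality (for example possible, impossible, or completely nonsensical). The key to its solution is identifying linear representations that discriminate between modal categories, called modal difference vectors. Analysis of these vectors shows that LMs have access to more reliable modal categorization judgments than previously reported, and that the vectors emerge in a consistent order over the course of training. The vectors can also be used to model fine-grained human categorization behavior, offering a new view into how people distinguish between modal categories.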

Link: https://arxiv.org/abs/2507.12553
Authors: Michael A. Lepori,Jennifer Hu,Ishita Dasgupta,Roma Patel,Thomas Serre,Ellie Pavlick
Affiliations: Brown University; Johns Hopkins University; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants’ ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.

[NLP-48] Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

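TL;DR: This paper asks how a machine facing a novel situation can bring a wide range of background knowledge to bear and reason and predict coherently. The key to its solution is a “Model Synthesis Architecture” (MSA) that uses language models for global relevance-based retrieval and model synthesis, and probabilistic programs for bespoke, coherent world models, thereby better capturing human reasoning in open-ended domains.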

Link: https://arxiv.org/abs/2507.12547
Authors: Lionel Wong,Katherine M. Collins,Lance Ying,Cedegao E. Zhang,Adrian Weller,Tobias Gersternberg,Timothy O’Donnell,Alexander K. Lew,Jacob D. Andreas,Joshua B. Tenenbaum,Tyler Brooke-Wilson
Affiliations: Stanford University; MIT; University of Cambridge; Harvard University; McGill University; Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: Presented at CogSci 2025

Abstract:When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea – a “Model Synthesis Architecture” (MSA) – using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset – built around a Model Olympics domain of sports vignettes – tests models’ capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people’s ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.

[NLP-49] Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

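TL;DR: This paper asks how prolonged reinforcement learning can improve a small language model across diverse reasoning domains. The key to its solution is the combination of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and techniques that improve training stability and generalization, including controlled KL regularization, clipping ratio, and periodic reference policy resets, which together unlock long-term performance gains.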

Link: https://arxiv.org/abs/2507.12507
Authors: Mingjie Liu,Shizhe Diao,Jian Hu,Ximing Lu,Xin Dong,Hao Zhang,Alexander Bukharin,Shaokun Zhang,Jiaqi Zeng,Makesh Narsimhan Sreedhar,Gerald Shen,David Mosallanezhad,Di Zhang,Jonas Yang,June Yang,Oleksii Kuchaiev,Guilin Liu,Zhiding Yu,Pavlo Molchanov,Yejin Choi,Jan Kautz,Yi Dong
Affiliations: NVIDIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 7 figures

Abstract:Recent advancements in reasoning-focused language models such as OpenAI’s O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
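
For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative advantage in its basic form, assuming binary verifiable rewards and population standard deviation. The function name and the `eps` constant are hypothetical, and the paper's enhancements (KL control, clipping, reference policy resets) are not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math problem; a verifier marks each 1 (correct) or 0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive an advantage near +1 and incorrect ones near -1. If every response in a group gets the same reward, all advantages are zero and the group contributes no gradient.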

[NLP-50] Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering

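TL;DR: This paper addresses the limited transparency and reproducibility of model decisions in Document Visual Question Answering (DocVQA), while improving performance without additional fine-tuning. The key to its solution is the EaGERS pipeline, which generates natural language rationales with a vision language model, grounds them to spatial sub-regions of the image by computing multimodal embedding similarities with majority voting over a grid, and then generates the answer only from the selected regions, improving explainability and reliability.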

Link: https://arxiv.org/abs/2507.12490
Authors: Maximiliano Hormazábal Lagos,Héctor Cerezo-Costas,Dimosthenis Karatzas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This work has been accepted for presentation at the 16th Conference and Labs of the Evaluation Forum (CLEF 2025) and will be published in the proceedings by Springer in the Lecture Notes in Computer Science (LNCS) series. Please cite the published version when available

Abstract:We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts the generation of responses only from the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
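
Step (2) of the pipeline, grounding a rationale to grid cells by majority voting, can be pictured with a small example. The sketch below is only one plausible reading of the abstract: it assumes several embedding models each score every grid cell against the rationale, each model votes for its `top_k` cells, and a cell is kept when more than half of the models vote for it. The function name, `top_k`, and the scores are hypothetical, and the paper's actual voting rule may differ.

```python
from collections import Counter

def select_regions(scores_per_model, top_k=2):
    """Keep grid cells that a majority of embedding models rank in their top-k.

    scores_per_model: one list of similarity scores per model,
                      indexed by grid cell (row-major order).
    """
    votes = Counter()
    for scores in scores_per_model:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        votes.update(ranked[:top_k])
    majority = len(scores_per_model) / 2
    return sorted(cell for cell, n in votes.items() if n > majority)

# Hypothetical rationale-to-cell similarities on a 2x2 grid from three models.
scores_per_model = [
    [0.9, 0.1, 0.8, 0.2],
    [0.7, 0.3, 0.6, 0.1],
    [0.2, 0.9, 0.8, 0.1],
]

selected = select_regions(scores_per_model)
```

Cells 0 and 2 win the vote here; in the pipeline described above, the remaining cells would be masked before the model generates its answer.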

[NLP-51] A Survey of AIOps in the Era of Large Language Models

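TL;DR: This paper addresses the fact that understanding of the impact, potential, and limitations of large language models (LLMs) in Artificial Intelligence for IT Operations (AIOps) is still in its infancy. Its key contribution is a systematic analysis of 183 research papers published between January 2020 and December 2024, organized around four research questions: the diversity of failure data sources, the evolution of AIOps tasks, LLM-based methods, and evaluation methodologies suited to them. Together these give a full picture of the current state and future directions of LLMs in AIOps.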

Link: https://arxiv.org/abs/2507.12472
Authors: Lingzhe Zhang,Tong Jia,Mengxi Jia,Yifan Wu,Aiwei Liu,Yong Yang,Zhonghai Wu,Xuming Hu,Philip S. Yu,Ying Li
Affiliations: Peking University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); University of Illinois Chicago
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted By CSUR, an extended version of “A Survey of AIOps for Failure Management in the Era of Large Language Models” [ arXiv:2406.11213 ]

Abstract:As large language models (LLMs) grow increasingly sophisticated and pervasive, their application to various Artificial Intelligence for IT Operations (AIOps) tasks has garnered significant attention. However, a comprehensive understanding of the impact, potential, and limitations of LLMs in AIOps remains in its infancy. To address this gap, we conducted a detailed survey of LLM4AIOps, focusing on how LLMs can optimize processes and improve outcomes in this domain. We analyzed 183 research papers published between January 2020 and December 2024 to answer four key research questions (RQs). In RQ1, we examine the diverse failure data sources utilized, including advanced LLM-based processing techniques for legacy data and the incorporation of new data sources enabled by LLMs. RQ2 explores the evolution of AIOps tasks, highlighting the emergence of novel tasks and the publication trends across these tasks. RQ3 investigates the various LLM-based methods applied to address AIOps challenges. Finally, RQ4 reviews evaluation methodologies tailored to assess LLM-integrated AIOps approaches. Based on our findings, we discuss the state-of-the-art advancements and trends, identify gaps in existing research, and propose promising directions for future exploration.

[NLP-52] Perfect diffusion is $\mathsf{TC}^0$ – Bad diffusion is Turing-complete

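Quick read: This paper studies the computational complexity of diffusion models for language modeling, in particular what they can and cannot do on tasks that need sequential computation. The key is an analysis of how the quality of the score-matching network affects computational power: if the network exactly computes the score function of the initial distribution, the model can only perform language modeling within the $\mathsf{TC}^0$ complexity class; if the network is not required to match any score function, diffusion modeling can, in a certain sense, simulate any Turing machine. This dichotomy offers a theoretical lens on the capability limits of diffusion models.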

Link: https://arxiv.org/abs/2507.12469
Authors: Yuxi Liu
Affiliations: Unknown
Categories: Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:This paper explores the computational complexity of diffusion-based language modeling. We prove a dichotomy based on the quality of the score-matching network in a diffusion model. In one direction, a network that exactly computes the score function of some initial distribution can only perform language modeling within the $\mathsf{TC}^0$ complexity class, reflecting limitations tied to rapid convergence. In the other direction, we show that if there is no requirement for the network to match any score function, then diffusion modeling can simulate any Turing machine in a certain sense. This dichotomy provides a theoretical lens on the capabilities and limitations of diffusion models, particularly concerning tasks requiring sequential computation. We conjecture extensions of our theoretical results, including for the case where the diffusion model is not perfect, but merely good. We also discuss the wider context and practical implications, and hypothesize that a machine learning architecture that can interpolate between sequential and parallel modes of operation would be superior to both Transformers and diffusion models.

[NLP-53] UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

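Quick read: This paper addresses the reliance of conventional spoken language understanding (SLU) methods on separate model architectures, which raises system complexity, limits cross-task interaction, and leaves heterogeneous multi-task datasets underused. The key is UniSLU, a unified framework that builds a single representation for diverse SLU tasks so that heterogeneous data across tasks can be fully used, and on top of it a unified generative method that jointly models automatic speech recognition (ASR), spoken named entity recognition (NER), and spoken sentiment analysis (SA), strengthening task interaction and easing integration with large language models.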

Link: https://arxiv.org/abs/2507.12951
Authors: Zhichao Sheng,Shilin Zhou,Chen Gong,Zhenghua Li
Affiliations: Soochow University(苏州大学)
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: 13 pages, 3 figures

Abstract:Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models at github to facilitate future research.

Computer Vision

[CV-0] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

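Quick read: This paper addresses imprecise frame selection in complex long-video understanding scenarios, where existing methods fall short. The key is Instructed Temporal Grounding for Videos (VideoITG), built around the VidThinker pipeline, an automated annotation framework that mimics the human annotation process: it generates fine-grained clip-level captions, retrieves relevant video segments through instruction-guided reasoning, and performs fine-grained frame selection, yielding customized frame sampling aligned with user instructions.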

Link: https://arxiv.org/abs/2507.13353
Authors: Shihao Wang,Guo Chen,De-an Huang,Zhiqi Li,Minghan Li,Guilin Li,Jose M. Alvarez,Lei Zhang,Zhiding Yu
Affiliations: The Hong Kong Polytechnic Univ. (香港理工大学); Nanjing Univ. (南京大学); NVIDIA (英伟达); Harvard Univ. (哈佛大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report

Abstract:Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, substantially adopt unsupervised learning paradigms, whereas they struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of visual language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potentials for video understanding.

[CV-1] Hierarchical Rectified Flow Matching with Mini-Batch Couplings

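Quick read: This paper addresses the fact that, in hierarchical flow matching for modeling multi-modal velocity distributions, the complexity of the modeled distribution stays the same across levels of the hierarchy. The key is to use mini-batch couplings to gradually adjust the complexity of the distributions at different levels, capturing multi-modality in the data more flexibly. Experiments on synthetic and imaging data show clear benefits.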

Link: https://arxiv.org/abs/2507.13350
Authors: Yichi Zhang,Yici Yan,Alex Schwing,Zhizhen Zhao
Affiliations: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution just like vanilla flow matching is capable of modeling a multi-modal data distribution. While this hierarchy enables to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at this https URL.
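
As a rough illustration of what a mini-batch coupling can look like, the sketch below pairs noise samples with data samples inside one small batch so that the total squared distance is minimal. The function name `minibatch_coupling`, the squared-distance cost, and the brute-force search over permutations are all assumptions made for illustration; the abstract does not say which coupling the authors use or how it is computed.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_coupling(noise, data):
    """Pair each noise sample with one data sample inside a mini-batch.

    Hypothetical helper: it tries every permutation, so it only suits
    tiny batches. Returns (pairs, total_cost).
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(data))):
        cost = sum(squared_distance(noise[i], data[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    pairs = [(noise[i], data[j]) for i, j in enumerate(best_perm)]
    return pairs, best_cost


# Toy 2D batch (made-up numbers).
noise_batch = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
data_batch = [(2.1, 2.0), (0.1, 0.0), (1.0, 1.2)]

pairs, coupled_cost = minibatch_coupling(noise_batch, data_batch)
# Cost of pairing samples in the order they happen to arrive (no coupling).
independent_cost = sum(squared_distance(n, d) for n, d in zip(noise_batch, data_batch))
```

Within a batch, the coupled pairs are never farther apart in total than the arbitrary pairing, which is the property such couplings are generally used for.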

[CV-2] $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning

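Quick read: This paper addresses the instability and failures caused by the reliance of conventional visual geometry reconstruction methods on a fixed reference view. Earlier methods usually anchor the reconstruction to a designated viewpoint, an inductive bias that can hurt performance when the reference view is suboptimal. The key is a fully permutation-equivariant architecture, $\pi^3$, which predicts affine-invariant camera poses and scale-invariant local point maps without any reference frame, making the model inherently robust to input ordering and highly scalable.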

Link: https://arxiv.org/abs/2507.13347
Authors: Yifan Wang,Jianjun Zhou,Haoyi Zhu,Wenzheng Chang,Yang Zhou,Zizun Li,Junyi Chen,Jiangmiao Pang,Chunhua Shen,Tong He
Affiliations: Shanghai AI Lab(上海人工智能实验室); ZJU(浙江大学); SII(智能研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
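
Permutation equivariance means that reordering the input views reorders the outputs in the same way and changes nothing else. The toy layer below shows the property on scalar features. It is only a sketch of the definition: `equivariant_layer` and the mean-pooling design are assumptions for illustration and are not taken from the paper's architecture.

```python
def equivariant_layer(views):
    """Toy layer: each view's output is its own feature plus the set mean.

    The set mean does not depend on input order, so permuting the inputs
    permutes the outputs in the same way (permutation equivariance).
    """
    mean = sum(views) / len(views)
    return [v + mean for v in views]


def permute(items, order):
    """Reorder items so that position k holds items[order[k]]."""
    return [items[i] for i in order]


features = [1.0, 4.0, 7.0]   # one scalar feature per input view (toy data)
order = [2, 0, 1]            # an arbitrary reordering of the views

out_then_permute = permute(equivariant_layer(features), order)
permute_then_out = equivariant_layer(permute(features, order))
```

Both paths give the same list, so no view plays the role of a privileged reference.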

[CV-3] AutoPartGen: Autogressive 3D Part Generation and Discovery

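Quick read: This paper addresses how to autoregressively generate objects composed of multiple 3D parts from inputs such as an image, 2D masks, or an existing 3D object. The key is 3DShape2VecSet, a latent 3D representation with strong geometric expressiveness and marked compositional properties. It lets the model generate an object's parts one at a time, conditioning on previously generated parts and the other inputs, so that the type and number of parts are determined automatically and the result is a coherent object or scene reconstruction that needs no additional optimization.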

Link: https://arxiv.org/abs/2507.13346
Authors: Minghao Chen,Jianyuan Wang,Roman Shapovalov,Tom Monnier,Hyunyoung Jung,Dilin Wang,Rakesh Ranjan,Iro Laina,Andrea Vedaldi
Affiliations: University of Oxford (牛津大学); Meta AI (Meta人工智能)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object’s parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.

[CV-4] Imbalance in Balance: Online Concept Balancing in Generation Models ICCV2025

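Quick read: This paper addresses the instability and error-proneness of responses to, and combinations of, complex concepts in visual generation tasks, an area that remains under-explored. The key is a concept-wise equalization loss (IMBA loss) that works online, removing the need for offline dataset processing, and needs only minimal code changes. It significantly improves the concept response capability of baseline models and yields competitive results on the newly proposed complex-concept benchmark Inert-CompBench and two other public test sets.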

Link: https://arxiv.org/abs/2507.13345
Authors: Yukai Shi,Jiarong Ou,Rui Chen,Haotian Yang,Jiahao Wang,Xin Tao,Pengfei Wan,Di Zhang,Kun Gai
Affiliations: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV2025

Abstract:In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. In our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few codes.

[CV-5] Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

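Quick read: This paper addresses high-fidelity human view synthesis from sparse-view video input. Existing methods use 4D diffusion models to generate videos at novel viewpoints, but the generated videos often lack spatio-temporal consistency, which degrades view synthesis quality. The key is a novel sliding iterative denoising process that alternately denoises a latent grid along the spatial and temporal dimensions with a sliding window, improving the spatio-temporal consistency of the 4D diffusion model. Information flows sufficiently across the latent grid, giving the diffusion model a larger receptive field and better 4D consistency while keeping GPU memory consumption affordable.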

Link: https://arxiv.org/abs/2507.13344
Authors: Yudong Jin,Sida Peng,Xuan Wang,Tao Xie,Zhen Xu,Yifan Yang,Yujun Shen,Hujun Bao,Xiaowei Zhou
Affiliations: Zhejiang University (浙江大学); Ant Research (蚂蚁研究)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoising the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: this https URL .
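
The alternating sliding-window schedule described in the abstract can be pictured with a toy grid of scalar "latents" indexed by view and timestamp. The sketch below is only an illustration of the scheduling idea: `toy_denoise` is a stand-in that pulls values toward the window mean, not a diffusion model, and the window size, stride, and number of rounds are assumed values that the abstract does not give.

```python
def sliding_windows(length, size, stride):
    """Start/end index pairs covering a 1D axis with overlapping windows."""
    starts = list(range(0, max(length - size, 0) + 1, stride))
    if starts[-1] + size < length:       # make sure the tail is covered
        starts.append(length - size)
    return [(s, min(s + size, length)) for s in starts]


def toy_denoise(window):
    """Placeholder for one denoising step: pull values toward the window mean."""
    mean = sum(window) / len(window)
    return [0.5 * v + 0.5 * mean for v in window]


def alternate_denoise(grid, size, stride, rounds):
    """grid[v][t] is the latent for view v at timestamp t (scalars here)."""
    n_views, n_times = len(grid), len(grid[0])
    for r in range(rounds):
        if r % 2 == 0:   # slide along the view (spatial) axis, one timestamp at a time
            for t in range(n_times):
                for s, e in sliding_windows(n_views, size, stride):
                    new = toy_denoise([grid[v][t] for v in range(s, e)])
                    for v, val in zip(range(s, e), new):
                        grid[v][t] = val
        else:            # slide along the time axis, one view at a time
            for v in range(n_views):
                for s, e in sliding_windows(n_times, size, stride):
                    grid[v][s:e] = toy_denoise(grid[v][s:e])
    return grid


def spread(grid):
    """Difference between the largest and smallest value in the grid."""
    values = [v for row in grid for v in row]
    return max(values) - min(values)


latents = [[float(10 * v + t) for t in range(6)] for v in range(4)]  # 4 views x 6 timestamps
spread_before = spread(latents)
alternate_denoise(latents, size=3, stride=2, rounds=2)
spread_after = spread(latents)
```

Because neighbouring windows overlap and the two axes alternate, a change in one cell can reach cells outside its own window after a few rounds, while each step only ever touches `size` latents at once.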

[CV-6] Taming Diffusion Transformer for Real-Time Mobile Video Generation

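Quick read: This paper addresses the high computational cost of Diffusion Transformers (DiT) in video generation, which makes real-time generation on resource-constrained devices such as smartphones difficult. The key components are: a highly compressed variational autoencoder (VAE) that reduces the dimensionality of the input data without sacrificing visual quality; a knowledge-distillation (KD)-guided, sensitivity-aware tri-level pruning strategy that shrinks the model while preserving critical performance characteristics; and an adversarial step distillation technique tailored for DiT that cuts the number of inference steps to four. Together these optimizations reach over 10 frames per second (FPS) on an iPhone 16 Pro Max, demonstrating that real-time, high-quality video generation on mobile devices is feasible.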

Link: https://arxiv.org/abs/2507.13343
Authors: Yushu Wu,Yanyu Li,Anil Kag,Ivan Skorokhodov,Willi Menapace,Ke Ma,Arpit Sahni,Ju Hu,Aliaksandr Siarohin,Dhritiman Sagar,Yanzhi Wang,Sergey Tulyakov
Affiliations: Snap Inc.; Northeastern University(东北大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 pages, 4 figures, 5 tables

Abstract:Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platform while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

[CV-7] A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

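Quick read: This paper addresses hand-object interaction detection in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. The key is an efficient approach for streaming egocentric vision built on a cascaded architecture that combines an action recognition module with an object detection module: the action recognition module confirms an interaction and activates the object detection module, which then detects and classifies the active object on the relevant frame.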

Link: https://arxiv.org/abs/2507.13326
Authors: Antonio Finocchiaro,Alessandro Sebastiano Catinello,Michele Mazzamuto,Rosario Leonardi,Antonino Furnari,Giovanni Maria Farinella
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, In International Conference on Image Analysis and Processing

Abstract:Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-objects interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
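
The cascade described above, where the object detector runs only after the action model reports a contact state, can be sketched with two stub functions. Everything named here (`run_cascade`, `fake_action_model`, `fake_detector`, the frame dictionaries, the "contact" label string) is hypothetical; in the paper the two stages are a Mamba-based action recognizer and a fine-tuned YOLOWorld detector, whose interfaces the abstract does not describe.

```python
def run_cascade(frames, recognize_action, detect_objects):
    """Run the detector only on frames where the action model reports contact."""
    results = []
    for index, frame in enumerate(frames):
        state = recognize_action(frame)
        if state == "contact":
            results.append((index, detect_objects(frame)))
    return results


# Hypothetical stand-ins for the two models; real modules would be neural networks.
def fake_action_model(frame):
    return "contact" if frame["hand_near_object"] else "no_contact"


detector_calls = []   # records which frames reached the (expensive) detector


def fake_detector(frame):
    detector_calls.append(frame["id"])
    return frame["objects"]


stream = [
    {"id": 0, "hand_near_object": False, "objects": []},
    {"id": 1, "hand_near_object": True, "objects": ["screwdriver"]},
    {"id": 2, "hand_near_object": False, "objects": []},
    {"id": 3, "hand_near_object": True, "objects": ["pliers"]},
]
detections = run_cascade(stream, fake_action_model, fake_detector)
```

Gating the detector this way keeps the heavier model idle on frames without interaction, which is one plausible reason a cascade helps with real-time budgets.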

[CV-8] Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

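Quick read: This paper addresses reproducibility and benchmark-quality problems in the reasoning-based pose estimation (RPE) benchmark, which hinder fair and consistent quantitative evaluation of pose-aware multimodal large language models (MLLMs). The key is to refine the ground-truth (GT) annotations through careful visual matching and release the refined annotations as an open-source resource, reducing manual effort and improving the consistency and reliability of evaluation.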

Link: https://arxiv.org/abs/2507.13314
Authors: Junsu Kim,Naeun Kim,Jaeho Lee,Incheol Park,Dongyoon Han,Seungryul Baek
Affiliations: UNIST; NVIDIA Foundation Models Lab, MODULABS; Naver AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To be presented as a poster at MMFM 2025

Abstract:The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.
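
For readers unfamiliar with the MPJPE metric named in the abstract, it is the mean Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it on made-up 3D joints; the function name and the sample coordinates are illustrative assumptions. PA-MPJPE applies a Procrustes (rigid plus scale) alignment to the prediction before the same average, which is omitted here.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(predicted) != len(ground_truth):
        raise ValueError("joint counts differ")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Three made-up joints; units are whatever the annotations use (often millimetres).
gt_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
pred_joints = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
error = mpjpe(pred_joints, gt_joints)   # per-joint distances are 5, 0 and 1
```

The metric only makes sense when each prediction is compared with the ground truth of the same image, which is why mismatched image indices are a practical obstacle.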

[CV-9] FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization

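Quick read: This paper addresses realistic and controllable garment visualization for fashion e-commerce, where users expect personalized previews under different poses and lighting conditions. Existing methods usually rely on predefined poses, which limits semantic flexibility and illumination adaptability. The key is FashionPose, a unified text-to-pose-to-relighting generation framework in which a natural language description drives pose prediction, high-fidelity person image generation, and a lightweight relighting module, achieving accurate pose alignment, faithful garment rendering, and flexible lighting control.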

Link: https://arxiv.org/abs/2507.13311
Authors: Chuancheng Shi,Yixiang Chen,Burong Lei,Jichao Chen
Affiliations: The University of Sydney (悉尼大学); Shenyang Aerospace University (沈阳航空航天大学); Zhonghuan Information College, Tianjin University of Technology (天津理工大学中环信息学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

[CV-10] DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation

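Quick read: This paper addresses the drop in age estimation and face recognition accuracy caused by facial makeup, which can be used to fool both humans and machines and gain unauthorized access. The key is DiffClean, which erases makeup traces with a text-guided diffusion model and thereby improves the accuracy of age estimation and face verification.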

Link: https://arxiv.org/abs/2507.13292
Authors: Ekta Balkrishna Gavas,Chinmay Hegde,Nasir Memon,Sudipta Banerjee
Affiliations: New York University (纽约大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup that can induce changes to alter perceived identity and age to fool both humans and machines. In this work, we propose DiffClean which erases makeup traces using a text-guided diffusion model to defend against makeup attacks. DiffClean improves age estimation (minor vs. adult accuracy by 4.8%) and face verification (TMR by 8.9% at FMR=0.01%) over competing baselines on digitally simulated and real makeup images.

[CV-11] Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy ICCV2025

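Quick read: This paper addresses a limitation in parameter-efficient fine-tuning of pre-trained Vision Transformers (ViT): the low-rank adaptation matrices (such as the down/up-projection matrices) lack the approximate orthogonality found in the pre-trained backbone parameters, which limits generalization. The key is an Approximately Orthogonal Fine-Tuning (AOFT) strategy that uses a single learnable vector to generate a set of approximately orthogonal vectors forming the down/up-projection matrices, so that their properties match those of the backbone and the generalization of the fine-tuned ViT improves.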

Link: https://arxiv.org/abs/2507.13260
Authors: Yiting Yang,Hao Luo,Yuan Sun,Qingsen Yan,Haokui Zhang,Wei Dong,Guoqing Wang,Peng Wang,Yang Yang,Hengtao Shen
Affiliations: Xi’an University of Architecture and Technology (西安建筑科技大学); University of Electronic Science and Technology of China (电子科技大学); Northwestern Polytechnical University (西北工业大学); TongJi University (同济大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by ICCV 2025

Abstract:A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model’s generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
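
One classical way to turn a single vector into a whole set of mutually orthogonal vectors is a Householder reflection, H = I - 2vv^T / (v·v). The sketch below builds H from one vector and keeps a few rows as a toy down-projection. This is an assumption chosen for illustration: the abstract does not say how AOFT generates its vectors, and a Householder matrix is exactly orthogonal rather than approximately so.

```python
def householder_rows(v):
    """Rows of H = I - 2 v v^T / (v . v), an orthogonal matrix built from one vector."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0:
        raise ValueError("v must be non-zero")
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vector = [0.5, -1.0, 2.0, 0.25]   # stands in for the single learnable vector
rows = householder_rows(vector)
down_projection = rows[:2]        # keep 2 rows as a rank-2 projection (toy sizes)

# Largest inner product between two different rows; zero up to rounding.
max_off_diagonal = max(
    abs(dot(rows[i], rows[j])) for i in range(4) for j in range(4) if i != j
)
```

The appeal of such a construction is parameter count: the whole matrix is determined by the entries of one vector, while its rows stay orthogonal however that vector is updated.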

[CV-12] VITA: Vision-to-Action Flow Matching Policy

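Quick read: This paper addresses the complex conditioning mechanisms, high computational overhead, and limited generative capability of conventional flow matching and diffusion policies for vision-to-action control. The key is a new paradigm that treats latent images as the source of the flow and directly learns an inherent mapping from vision to action, removing separate conditioning modules while preserving generative modeling capability. An autoencoder builds a structured action latent space, and supervision through flow latent decoding enables effective end-to-end learning and better inference efficiency.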

Link: https://arxiv.org/abs/2507.13231
Authors: Dechen Gao,Boqi Zhao,Andrew Lee,Ian Chuang,Hanchu Zhou,Hang Wang,Zhe Zhao,Junshan Zhang,Iman Soltani
Affiliations: University of California, Davis; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.

[CV-13] S$^2$M$^2$: Scalable Stereo Matching Model for Reliable Depth Estimation ICCV

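Quick read: This paper addresses how a general stereo matching model can generalize across resolutions and disparity ranges, a goal that faces a trade-off between iterative local search methods and global matching architectures. Iterative methods do well on constrained benchmarks, but their core mechanism limits global consistency; global matching architectures are more robust in theory but have been impractical because of computational and memory costs. The key to the proposed S$^2$M$^2$ is a global matching architecture that uses a multi-resolution transformer for robust long-range correspondence and a novel loss function that concentrates probability on feasible matches, achieving state-of-the-art accuracy and high efficiency without cost volume filtering or deep refinement stacks.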

Link: https://arxiv.org/abs/2507.13229
Authors: Junhong Min,Youngpil Jeon,Jimin Kim,Minyong Choi
Affiliations: Samsung Electronics(三星电子)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 5 figures, ICCV accepted paper

Abstract:The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with S$^2$M$^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. S$^2$M$^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.

[CV-14] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

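Quick read: This paper addresses the lack of methods for detecting AI-generated video, especially generic content beyond deepfakes, in order to counter the misinformation, privacy, and security threats such content raises. The key is to extract features with pre-trained visual models; because these models were trained on large amounts of real visual content, their features carry inherent signals that separate real from generated videos. This enables effective detection without additional model training, and a simple linear classification layer on top of the extracted features improves performance further.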

Link: https://arxiv.org/abs/2507.13224
Authors: Keerthi Veeramachaneni,Praveen Tirupattur,Amrit Singh Bedi,Mubarak Shah
Affiliations: Georgia Institute of Technology (佐治亚理工学院); University of Central Florida (中佛罗里达大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.
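
The "simple linear classification layer on top of the extracted features" can be pictured as logistic regression trained on features from a frozen encoder. The sketch below uses hand-written two-dimensional toy features in place of real encoder outputs; the function names, learning rate, epoch count, and all numbers are assumptions for illustration and do not come from the paper.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed features with plain gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))        # predicted probability of "generated"
            for k in range(dim):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (generated) or 0 (real) for one feature vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy stand-ins for features from a frozen pre-trained encoder (made-up numbers).
real_feats = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.3]]
fake_feats = [[0.2, 0.9], [0.1, 0.8], [0.3, 1.0]]
X = real_feats + fake_feats
y = [0, 0, 0, 1, 1, 1]

w, b = train_linear_probe(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
```

Only the weights and bias are learned; the encoder that produced the features is never updated, which is what keeps this kind of probe cheap.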

[CV-15] Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

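Quick read: This paper addresses the limited performance of deep neural networks (DNNs) in the construction domain caused by insufficient data diversity and volume. The key is to use the generative AI platform Midjourney to create a dataset of 12,000 synthetic images from 3,000 different prompts designed for realism and diversity; after manual labeling, the images are used for DNN training.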

Link: https://arxiv.org/abs/2507.13221
Authors: Hongyang Zhao,Tianyu Liang,Sina Davari,Daeho Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This work was presented at ASCE International Conference on Computing in Civil Engineering (i3CE) 2024 and is currently under consideration for publication in ASCE proceedings

点击查看摘要

Abstract:While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI’s capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3,000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and the weaknesses of generative AI in addressing DNN training data scarcity.
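
The AP figures above depend on the IoU threshold that decides whether a predicted box counts as a match. As a minimal sketch (not the authors' evaluation code), the snippet below computes IoU for two axis-aligned boxes and lists the thresholds that "0.5 to 0.95" would cover if the usual COCO-style step of 0.05 is assumed; the abstract does not state the step, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Intersection is zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds():
    """Assumed COCO-style sweep: 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def is_match(pred_box, gt_box, threshold):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

A box that matches at 0.5 can fail at stricter thresholds, which is consistent with the averaged 0.5 to 0.95 AP (0.642) being lower than the AP at 0.5 (0.937).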

[CV-16] Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

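[Quick read]: This paper addresses the shortcomings of driving world models in long-horizon generation and in generalizing to challenging scenarios. The key to its solution is a set of simple design choices that need no additional supervision or sensor data (such as maps, depth, or multiple cameras). Trained on only 280 hours of video, the resulting 469M-parameter model performs well in difficult scenarios such as turning maneuvers and urban traffic and achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2507.13162
Authors: Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL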

Abstract:Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at this https URL.

[CV-17] SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

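[Quick read]: This paper addresses a limitation of vision-language navigation (VLN): the fixed knowledge bases and reasoning abilities of large language models (LLMs) make it hard to incorporate experiential knowledge, so agents lack an efficient capacity to evolve. The key to its solution is a self-evolving VLN framework (SE-VLN) with three core modules: a hierarchical memory module that turns successful and failed cases into reusable knowledge, a retrieval-augmented thought-based reasoning module that retrieves experience and supports multi-step decision-making, and a reflection module that enables continual evolution.

Link: https://arxiv.org/abs/2507.13152
Authors: Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: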

Abstract:Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevented them from fully incorporating experiential knowledge and thus resulted in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transform successful and failed cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with an increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

[CV-18] DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

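[Quick read]: This paper addresses the robustness, generalization, and efficiency challenges of learning-based monocular visual odometry (VO) in robotics. The key to its solution is to use the DINOv2 visual foundation model for sparse feature matching, with a salient keypoint detector tailored to DINOv2's coarse features and fine-grained geometric features added to make the representations more localizable. A transformer-based matcher and a differentiable pose estimation layer then provide precise camera motion estimation.

Link: https://arxiv.org/abs/2507.13145
Authors: Maulana Bisyir Azhari, David Hyunchul Shim
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L), July 2025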

Abstract:Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2’s coarse features. Furthermore, we complement DINOv2’s robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

[CV-19] RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images

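[Quick read]: This paper addresses tiny object detection in remote sensing (RS) images, a long-standing problem caused by the targets' extremely limited spatial information, weak feature representations, and dense distribution across complex backgrounds. The key to its solution is RS-TinyNet, a multi-stage feature fusion and enhancement model built on two designs, tiny object saliency modeling and feature integrity reconstruction, and realized through three step-wise feature enhancement modules: the multi-dimensional collaborative attention (MDCA) module, the auxiliary reversible branch (ARB), and the progressive fusion detection head (PFDH).

Link: https://arxiv.org/abs/2507.13120
Authors: Xiaozheng Jiang, Wei Zhang, Xuerui Mao
Affiliations: Beijing Institute of Technology; Beijing Institute of Technology (Zhuhai)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: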

Abstract:Detecting tiny objects in remote sensing (RS) imagery has been a long-standing challenge due to their extremely limited spatial information, weak feature representations, and dense distributions across complex backgrounds. Despite numerous efforts, mainstream detectors still underperform in such scenarios. To bridge this gap, we introduce RS-TinyNet, a multi-stage feature fusion and enhancement model explicitly tailored for RS tiny object detection in various RS scenarios. RS-TinyNet comes with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Guided by these principles, we design three step-wise feature enhancement modules. Among them, the multi-dimensional collaborative attention (MDCA) module employs multi-dimensional attention to enhance the saliency of tiny objects. Additionally, the auxiliary reversible branch (ARB) and a progressive fusion detection head (PFDH) module are introduced to preserve information flow and fuse multi-level features to bridge semantic gaps and retain structural detail. Comprehensive experiments on the public RS dataset AI-TOD show that our RS-TinyNet surpasses existing state-of-the-art (SOTA) detectors by 4.0% AP and 6.5% AP75. Evaluations on the DIOR benchmark dataset further validate its superior detection performance in diverse RS scenarios. These results demonstrate that the proposed multi-stage feature fusion strategy offers an effective and practical solution for tiny object detection in complex RS environments.

[CV-20] Leveraging Language Prior for Infrared Small Target Detection

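[Quick read]: This paper addresses infrared small target detection (IRSTD), which is difficult because targets are small and sparsely distributed. The key to its solution is a multimodal IRSTD framework that uses language priors to guide small target detection: language-guided attention weights strengthen the model's detection ability, combining textual information with image data to improve IRSTD performance.

Link: https://arxiv.org/abs/2507.13113
Authors: Pranav Singh, Pravendra Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: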

Abstract:IRSTD (InfraRed Small Target Detection) detects small targets in infrared blurry backgrounds and is essential for various applications. The detection task is challenging due to the small size of the targets and their sparse distribution in infrared small target datasets. Although existing IRSTD methods and datasets have led to significant advancements, they are limited by their reliance solely on the image modality. Recent advances in deep learning and large vision-language models have shown remarkable performance in various visual recognition tasks. In this work, we propose a novel multimodal IRSTD framework that incorporates language priors to guide small target detection. We leverage language-guided attention weights derived from the language prior to enhance the model’s ability for IRSTD, presenting a novel approach that combines textual information with image data to improve IRSTD capabilities. Utilizing the state-of-the-art GPT-4 vision model, we generate text descriptions that provide the locations of small targets in infrared images, employing careful prompt engineering to ensure improved accuracy. Due to the absence of multimodal IR datasets, existing IRSTD methods rely solely on image data. To address this shortcoming, we have curated a multimodal infrared dataset that includes both image and text modalities for small target detection, expanding upon the popular IRSTD-1k and NUDT-SIRST datasets. We validate the effectiveness of our approach through extensive experiments and comprehensive ablation studies. The results demonstrate significant improvements over the state-of-the-art method, with relative percentage differences of 9.74%, 13.02%, 1.25%, and 67.87% in IoU, nIoU, Pd, and Fa on the NUAA-SIRST subset, and 4.41%, 2.04%, 2.01%, and 113.43% on the IRSTD-1k subset of the LangIR dataset, respectively.

[CV-21] 3DKeyAD: High-Resolution 3D Point Cloud Anomaly Detection via Keypoint-Guided Point Clustering

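[Quick read]: This paper addresses the challenges that high-resolution 3D point clouds pose when detecting subtle structural anomalies in industrial inspection: high computational cost, sensitivity to spatial misalignment, and difficulty in capturing localized structural differences. The key to its solution is a registration-based anomaly detection framework that combines multi-prototype alignment with cluster-wise discrepancy analysis. Each test sample is registered to multiple normal prototypes for direct structural comparison, and a keypoint-guided clustering strategy chooses geometrically informative points as cluster centers, improving the accuracy and stability of local anomaly detection.

Link: https://arxiv.org/abs/2507.13110
Authors: Zi Wang, Katsuya Hotta, Koichiro Kamide, Yawen Zou, Chao Zhang, Jun Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: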

Abstract:High-resolution 3D point clouds are highly effective for detecting subtle structural anomalies in industrial inspection. However, their dense and irregular nature imposes significant challenges, including high computational cost, sensitivity to spatial misalignment, and difficulty in capturing localized structural differences. This paper introduces a registration-based anomaly detection framework that combines multi-prototype alignment with cluster-wise discrepancy analysis to enable precise 3D anomaly localization. Specifically, each test sample is first registered to multiple normal prototypes to enable direct structural comparison. To evaluate anomalies at a local level, clustering is performed over the point cloud, and similarity is computed between features from the test sample and the prototypes within each cluster. Rather than selecting cluster centroids randomly, a keypoint-guided strategy is employed, where geometrically informative points are chosen as centroids. This ensures that clusters are centered on feature-rich regions, enabling more meaningful and stable distance-based comparisons. Extensive experiments on the Real3D-AD benchmark demonstrate that the proposed method achieves state-of-the-art performance in both object-level and point-level anomaly detection, even using only raw features.
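
The cluster-wise comparison described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the test sample is already registered to a single prototype, uses raw 3D coordinates as the "features", takes keypoints as given, and scores each cluster by the mean nearest-neighbor distance. All function names are hypothetical.

```python
import math


def nearest_index(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def cluster_by_keypoints(points, keypoints):
    """Assign every point to its nearest keypoint, which acts as the centroid."""
    clusters = [[] for _ in keypoints]
    for idx, p in enumerate(points):
        clusters[nearest_index(p, keypoints)].append(idx)
    return clusters


def cluster_scores(test_pts, proto_pts, keypoints):
    """Per-cluster discrepancy between a registered test cloud and a prototype."""
    test_clusters = cluster_by_keypoints(test_pts, keypoints)
    proto_clusters = cluster_by_keypoints(proto_pts, keypoints)
    scores = []
    for t_idx, p_idx in zip(test_clusters, proto_clusters):
        if not t_idx or not p_idx:
            scores.append(0.0)  # nothing to compare in this cluster
            continue
        # Distance from each test point to its closest prototype point
        # within the same cluster, averaged over the cluster.
        dists = [min(math.dist(test_pts[i], proto_pts[j]) for j in p_idx)
                 for i in t_idx]
        scores.append(sum(dists) / len(dists))
    return scores
```

A cluster with a high score marks a region where the test sample deviates from the normal prototype; with several prototypes, one plausible choice would be to keep the minimum score per cluster, though the abstract does not specify the aggregation.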

[CV-22] R2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning

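[Quick read]: This paper addresses two core challenges that large-scale generative models face when continually learning new visual concepts: catastrophic forgetting and parameter expansion. The key to its solution is R^2MoE (Redundancy-Removal Mixture of Experts), a parameter-efficient framework. A mixture-of-experts design with routing distillation lets experts acquire concept-specific knowledge while preserving the gating network's routing capability, which mitigates catastrophic forgetting. A strategy for eliminating redundant layer-wise experts reuses previously learned experts to reduce the parameter count, and a hierarchical local attention-guided inference approach reduces interference between generated visual concepts.

Link: https://arxiv.org/abs/2507.13107
Authors: Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: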

Abstract:Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network’s routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8% reduction in forgetting rates and 63.3% fewer parameters on the CustomConcept 101 dataset. Our code is available at this https URL

[CV-23] Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction

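[Quick read]: This paper addresses the time cost of manual segmentation in fetal lung maturity assessment, which limits clinical applicability. The key to its solution is a fully automated lung maturity evaluation pipeline consisting of a deep learning-based fetal lung segmentation model and a model-fitting lung maturity assessment. A 3D nnU-Net trained on baseline frames achieves high segmentation accuracy, and the IVIM parameters quantified from its predictions show no differences from those obtained with manual segmentation.

Link: https://arxiv.org/abs/2507.13106
Authors: Zhennan Xiao, Katharine Brudkiewicz, Zhen Yuan, Rosalind Aughwane, Magdalena Sokolska, Joanna Chappell, Trevor Gaunt, Anna L. David, Andrew P. King, Andrew Melbourne
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: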

Abstract:Fetal lung maturity is a critical indicator for predicting neonatal outcomes and the need for post-natal intervention, especially for pregnancies affected by fetal growth restriction. Intra-voxel incoherent motion analysis has shown promising results for non-invasive assessment of fetal lung development, but its reliance on manual segmentation is time-consuming, thus limiting its clinical applicability. In this work, we present an automated lung maturity evaluation pipeline for diffusion-weighted magnetic resonance images that consists of a deep learning-based fetal lung segmentation model and a model-fitting lung maturity assessment. A 3D nnU-Net model was trained on manually segmented images selected from the baseline frames of 4D diffusion-weighted MRI scans. The segmentation model demonstrated robust performance, yielding a mean Dice coefficient of 82.14%. Next, voxel-wise model fitting was performed based on both the nnU-Net-predicted and manual lung segmentations to quantify IVIM parameters reflecting tissue microstructure and perfusion. The results suggested no differences between the two. Our work shows that a fully automated pipeline is possible for supporting fetal lung maturity assessment and clinical decision-making.
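
Two quantities in this abstract have simple closed forms. The sketch below shows the Dice coefficient for binary masks and, as an assumption, the commonly used bi-exponential IVIM signal model; the abstract does not say which IVIM variant the authors fit, and the parameter names (`f`, `d_star`, `d`) are illustrative.

```python
import math


def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 2.0 * intersection / total if total else 1.0


def ivim_signal(b, s0, f, d_star, d):
    """Assumed bi-exponential IVIM model.

    S(b) = S0 * (f * exp(-b * D*) + (1 - f) * exp(-b * D)), where f is the
    perfusion fraction, D* the pseudo-diffusion coefficient, and D the
    tissue diffusion coefficient.
    """
    return s0 * (f * math.exp(-b * d_star) + (1.0 - f) * math.exp(-b * d))
```

Voxel-wise fitting would estimate `f`, `d_star`, and `d` for every voxel inside the lung mask, which is why a small change in the segmentation could in principle shift the reported parameters.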

[CV-24] MUPAX: Multidimensional Problem Agnostic eXplainable AI

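[Quick read]: This paper addresses the lack of explainable AI (XAI) techniques that are simultaneously deterministic, model-agnostic, and guaranteed to converge. The key to its solution is MUPAX (Multidimensional Problem Agnostic eXplainable AI), a deterministic, model-agnostic technique with guaranteed convergence. Its measure-theoretic formulation performs structured perturbation analysis, attributing feature importance from inherent input patterns and eliminating spurious relationships, which makes it effective regardless of dimension across many data modalities and tasks.

Link: https://arxiv.org/abs/2507.13090
Authors: Vincenzo Dentamaro, Felice Franchini, Giuseppe Pirlo, Irina Voiculescu
Affiliations: University of Bari “Aldo Moro”; University of Oxford
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: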

Abstract:Robust XAI techniques should ideally be simultaneously deterministic, model-agnostic, and guaranteed to converge. We propose MULTIDIMENSIONAL PROBLEM AGNOSTIC EXPLAINABLE AI (MUPAX), a deterministic, model-agnostic explainability technique with guaranteed convergence. MUPAX's measure-theoretic formulation gives principled feature importance attribution through structured perturbation analysis that discovers inherent input patterns and eliminates spurious relationships. We evaluate MUPAX on an extensive range of data modalities and tasks: audio classification (1D), image classification (2D), volumetric medical image analysis (3D), and anatomical landmark detection, demonstrating dimension-agnostic effectiveness. The rigorous convergence guarantees extend to any loss function and arbitrary dimensions, making MUPAX applicable to virtually any problem context for AI. In contrast to other XAI methods, which typically decrease performance when masking, MUPAX not only preserves but actually enhances model accuracy by capturing only the most important patterns of the original data. Extensive benchmarking against the state of the art in XAI demonstrates MUPAX's ability to generate precise, consistent and understandable explanations, a crucial step towards explainable and trustworthy AI systems. The source code will be released upon publication.

[CV-25] GLAD: Generalizable Tuning for Vision-Language Models ICCV2025

[Quick read]: This paper addresses overfitting of pre-trained vision-language models in few-shot scenarios, and the reliance of existing prompt tuning methods on complex task-specific architectures and sensitive hyperparameter tuning. The key to its solution is GLAD (Generalizable LoRA tuning with RegulArized GraDient), a simpler and more general framework that combines LoRA (Low-Rank Adaptation) with a gradient-based regularization technique to improve stability and generalization under varying data distributions.

Link: https://arxiv.org/abs/2507.13089
Authors: Yuqi Peng, Pengfei Wang, Jianzhuang Liu, Shifeng Chen
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Northeastern University; The Hong Kong Polytechnic University; Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 workshop

Abstract:Pre-trained vision-language models, such as CLIP, show impressive zero-shot recognition ability and can be easily transferred to specific downstream tasks via prompt tuning, even with limited training data. However, existing prompt tuning methods face two main challenges: (1) In few-shot scenarios, data scarcity often leads to overfitting, making the model sensitive to changes in the input domain. (2) To mitigate overfitting, these methods typically rely on complex task-specific model architectures and sensitive hyperparameter tuning, severely restricting their general applicability. To address these issues, we propose a simpler and more general framework called GLAD (Generalizable LoRA tuning with RegulArized GraDient). We show that merely applying LoRA achieves performance in downstream tasks comparable to current state-of-the-art prompt-based methods. While LoRA is effective and easy to use, it remains susceptible to overfitting in few-shot learning scenarios. To mitigate this risk, we introduce a gradient-based regularization technique. This technique effectively steers the optimization trajectory, encouraging the model to find a more stable parameter region that is robust to variations in data distribution. Through extensive experiments conducted on 15 benchmark datasets, we demonstrate that GLAD outperforms previous tuning approaches in terms of base-to-novel class generalization, image domain generalization, and cross-dataset generalization. The code will be publicly available.
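
For readers unfamiliar with LoRA, the low-rank update it applies to a frozen weight matrix can be sketched in a few lines. This shows only the standard LoRA reparameterization, W' = W + (alpha / r) * B A; GLAD's gradient-based regularizer is not specified in the abstract and is not reproduced here. Matrices are plain lists of rows and the function names are illustrative.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    w: frozen pre-trained weight, shape d_out x d_in
    b: trainable matrix, shape d_out x r (typically initialized to zero)
    a: trainable matrix, shape r x d_in
    """
    r = len(a)  # LoRA rank
    scale = alpha / r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because only `a` and `b` are trained, the number of trainable parameters is r * (d_in + d_out) instead of d_in * d_out, which is what makes the approach attractive with limited training data.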

[CV-26] DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model

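[Quick read]: This paper addresses annotation variability in medical image segmentation, which stems from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods that produce a single deterministic segmentation often fail to capture annotator biases. The proposed solution, DiffOSeg, is a two-stage diffusion-based framework that achieves both consensus-driven segmentation (combining all experts' opinions) and preference-driven segmentation (reflecting each expert's individual assessment). Stage I establishes population consensus through a probabilistic consensus strategy, and Stage II captures expert-specific preferences via adaptive prompts.

Link: https://arxiv.org/abs/2507.13087
Authors: Han Zhang, Xiangde Luo, Yong Chen, Kang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: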

Abstract:Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective, either generating a probabilistic “gold standard” consensus or preserving expert-specific preferences, thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts’ opinions) and preference-driven (reflecting experts’ individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at this https URL.

[CV-27] Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection WACV2025

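[Quick read]: This paper addresses the learning conflicts in Open World Object Detection (OWOD) that arise because unknown objects lack ground-truth annotations, and the question of how to separate known from unknown objects without pseudo-labels. The key to its solution is a new model, Decoupled PROB. It introduces Early Termination of Objectness Prediction (ETOP), which stops objectness prediction at appropriate decoder layers to resolve the conflict between class and objectness predictions, and Task-Decoupled Query Initialization (TDQI), which efficiently extracts features of known and unknown objects to improve detection performance.

Link: https://arxiv.org/abs/2507.13085
Authors: Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
Affiliations: Honda R&D Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to WACV 2025 (Tucson, Arizona, USA), February 28-March 4 2025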

Abstract:Open World Object Detection (OWOD) is a challenging computer vision task that extends standard object detection by (1) detecting and classifying unknown objects without supervision, and (2) incrementally learning new object classes without forgetting previously learned ones. The absence of ground truths for unknown objects makes OWOD tasks particularly challenging. Many methods have addressed this by using pseudo-labels for unknown objects. The recently proposed Probabilistic Objectness transformer-based open-world detector (PROB) is a state-of-the-art model that does not require pseudo-labels for unknown objects, as it predicts probabilistic objectness. However, this method faces issues with learning conflicts between objectness and class predictions. To address this issue and further enhance performance, we propose a novel model, Decoupled PROB. Decoupled PROB introduces Early Termination of Objectness Prediction (ETOP) to stop objectness predictions at appropriate layers in the decoder, resolving the learning conflicts between class and objectness predictions in PROB. Additionally, we introduce Task-Decoupled Query Initialization (TDQI), which efficiently extracts features of known and unknown objects, thereby improving performance. TDQI is a query initialization method that combines query selection and learnable queries, and it is a module that can be easily integrated into existing DETR-based OWOD models. Extensive experiments on OWOD benchmarks demonstrate that Decoupled PROB surpasses all existing methods across several metrics, significantly improving performance. 
Related DOI: https://doi.org/10.1109/WACV61041.2025.00796

[CV-28] Channel-wise Motion Features for Efficient Motion Segmentation IROS2024

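[Quick read]: This paper addresses the need to detect all required objects accurately and in real time in safety-critical robotics applications such as autonomous driving. The core challenge is that existing motion segmentation models jointly use several subnetworks (depth, pose, optical flow, and scene flow estimation), which raises computational cost and hinders real-time performance. The key to its solution is a cost-volume-based motion feature representation, Channel-wise Motion Features, which extracts depth features of each instance and captures the scene's 3D motion information while requiring only a pose network. The method reaches about 4 times the frames per second (FPS) of state-of-the-art models on the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, with comparable accuracy and about 25% of the parameters.

Link: https://arxiv.org/abs/2507.13082
Authors: Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
Affiliations: Honda R&D Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to IROS 2024 (Abu Dhabi, UAE), October 14-18, 2024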

Abstract:For safety-critical robotics applications such as autonomous driving, it is important to detect all required objects accurately in real-time. Motion segmentation offers a solution by identifying dynamic objects from the scene in a class-agnostic manner. Recently, various motion segmentation models have been proposed, most of which jointly use subnetworks to estimate Depth, Pose, Optical Flow, and Scene Flow. As a result, the overall computational cost of the model increases, hindering real-time performance. In this paper, we propose a novel cost-volume-based motion feature representation, Channel-wise Motion Features. By extracting depth features of each instance in the feature map and capturing the scene’s 3D motion information, it offers enhanced efficiency. The only subnetwork used to build Channel-wise Motion Features is the Pose Network, and no others are required. Our method not only achieves about 4 times the FPS of state-of-the-art models in the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, but also demonstrates equivalent accuracy while reducing the parameters to about 25%.
Related DOI: https://doi.org/10.1109/IROS58592.2024.10802584

[CV-29] DASViT: Differentiable Architecture Search for Vision Transformer IJCNN

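[Quick read]: This paper addresses the drawbacks of discrete methods such as evolutionary algorithms in Vision Transformer (ViT) architecture search: difficulty discovering innovative designs, heavy computational cost, and long search times. The key to its solution is Differentiable Architecture Search for Vision Transformer (DASViT), which brings differentiable search to ViTs, filling a gap in that area, and uncovers new architectures that perform better with fewer parameters and FLOPs.

Link: https://arxiv.org/abs/2507.13079
Authors: Pengjin Wu, Ferrante Neri, Zhenhua Feng
Affiliations: University of Surrey; Jiangnan University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the International Joint Conference on Neural Networks (IJCNN) 2025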

Abstract:Designing effective neural networks is a cornerstone of deep learning, and Neural Architecture Search (NAS) has emerged as a powerful tool for automating this process. Among the existing NAS approaches, Differentiable Architecture Search (DARTS) has gained prominence for its efficiency and ease of use, inspiring numerous advancements. Since the rise of Vision Transformers (ViT), researchers have applied NAS to explore ViT architectures, often focusing on macro-level search spaces and relying on discrete methods like evolutionary algorithms. While these methods ensure reliability, they face challenges in discovering innovative architectural designs, demand extensive computational resources, and are time-intensive. To address these limitations, we introduce Differentiable Architecture Search for Vision Transformer (DASViT), which bridges the gap in differentiable search for ViTs and uncovers novel designs. Experiments show that DASViT delivers architectures that break traditional Transformer encoder designs, outperform ViT-B/16 on multiple datasets, and achieve superior efficiency with fewer parameters and FLOPs.

[CV-30] Label-Consistent Dataset Distillation with Detector-Guided Refinement

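[Quick read]: This paper addresses the suboptimal downstream performance of distilled surrogate datasets whose samples have inconsistent labels or insufficient structural detail. The key to its solution is a detector-guided dataset distillation framework that explicitly uses a pre-trained detector to identify and refine anomalous synthetic samples, ensuring label consistency and improving image quality. The detector flags images with label mismatches or low classification confidence; multiple candidates are then generated from the corresponding image prototype and label; and the best candidate is selected by jointly considering the detector's confidence score and its dissimilarity to existing qualified synthetic samples, which preserves both label accuracy and intra-class diversity.

Link: https://arxiv.org/abs/2507.13074
Authors: Yawen Zou, Guang Li, Zi Wang, Chunzhi Gu, Chao Zhang
Affiliations: University of Toyama; Hokkaido University; Niigata University; Toyohashi University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: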

Abstract:Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector’s confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
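
The candidate selection step can be pictured as scoring each regenerated image by both detector confidence and distance from the samples already accepted. The sketch below is only one plausible reading: the additive score, the weight `lam`, and the use of Euclidean distance between feature vectors are assumptions, since the abstract does not give the exact criterion.

```python
import math


def select_candidate(candidates, confidences, accepted, lam=1.0):
    """Return the index of the candidate with the best combined score.

    candidates:  feature vectors of regenerated images (hypothetical features)
    confidences: detector confidence for the intended label, one per candidate
    accepted:    feature vectors of already-qualified synthetic samples
    lam:         assumed trade-off weight between confidence and diversity
    """
    best_index, best_score = -1, -math.inf
    for i, (feat, conf) in enumerate(zip(candidates, confidences)):
        # Dissimilarity: distance to the closest accepted sample
        # (0.0 when nothing has been accepted yet).
        dissim = min((math.dist(feat, a) for a in accepted), default=0.0)
        score = conf + lam * dissim
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With `lam=0` the rule reduces to picking the most confident candidate; larger values favor candidates that differ from what the class already contains, which matches the stated goal of intra-class diversity.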

[CV-31] Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment Data Collection and Preliminary Analysis ITSC

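[Quick read]: This paper addresses accurate estimation of Traffic Movement Count (TMC), especially under poor lighting in harsh weather and at night, where conventional camera-based TMC methods are error-prone. The key to its solution is a dual-LiDAR system for 3D object detection and tracking: the 3D bounding box detections from the two LiDARs are used to classify and count vehicles, improving the accuracy and reliability of TMC estimation.

Link: https://arxiv.org/abs/2507.13073
Authors: Saswat Priyadarshi Nayak, Guoyuan Wu, Kanok Boriboonsomsin, Matthew Barth
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 Pages, 8 Figures. This paper has been accepted for publication at the 2025 IEEE ITSC. Copyright IEEE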

Abstract:Traffic Movement Count (TMC) at intersections is crucial for optimizing signal timings, assessing the performance of existing traffic control measures, and proposing efficient lane configurations to minimize delays, reduce congestion, and promote safety. Traditionally, methods such as manual counting, loop detectors, pneumatic road tubes, and camera-based recognition have been used for TMC estimation. Although generally reliable, camera-based TMC estimation is prone to inaccuracies under poor lighting conditions during harsh weather and nighttime. In contrast, Light Detection and Ranging (LiDAR) technology is gaining popularity in recent times due to reduced costs and its expanding use in 3D object detection, tracking, and related applications. This paper presents the authors’ endeavor to develop, deploy and evaluate a dual-LiDAR system at an intersection in the city of Rialto, California, for TMC estimation. The 3D bounding box detections from the two LiDARs are used to classify vehicle counts based on traffic directions, vehicle movements, and vehicle classes. This work discusses the estimated TMC results and provides insights into the observed trends and irregularities. Potential improvements are also discussed that could enhance not only TMC estimation, but also trajectory forecasting and intent prediction at intersections.

[CV-32] Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection

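[Quick read]: This paper addresses the difficulty Vision-Language Models (VLMs) have in adapting to unseen, complex wide-area scenes. The key to the solution is a Hierarchical Coresets Selection (HCS) mechanism that progressively refines the selected regions using a theoretically guaranteed importance function, which accounts for utility, representativeness, robustness, and synergy. Without additional fine-tuning, HCS lets VLMs rapidly understand unseen scenes at different scales from a minimal set of interpretable regions, while mitigating insufficient feature density.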

Link: https://arxiv.org/abs/2507.13061
Authors: Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng
Affiliations: Institute of Software, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adaptation to unseen complex wide-area scenes. To address the challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.

[CV-33] Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation

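[Quick read]: This paper addresses the fact that Masked AutoRegressive (MAR) models have traditionally underperformed standard AutoRegressive (AR) models in image generation, despite their efficient parallel decoding. The key to the solution is a refined MAR architecture: a more effective image tokenizer, plus an improved LLaMA architecture that uses bidirectional attention and 2D RoPE. The resulting model is named MaskGIL. It maintains high generation quality while needing far fewer inference steps (8, versus 256 for AR models).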

Link: https://arxiv.org/abs/2507.13032
Authors: Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Peng Gao
Affiliations: Shanghai AI Laboratory; Nanjing University; Shanghai Innovation Institute; The Chinese University of Hong Kong; Shanghai Jiao Tong University; University of Minnesota Twin Cities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 10 figures, 10 tables

Abstract:AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves a FID score of 3.71, matching state-of-the-art AR models in the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our codes and models are available at this https URL.

[CV-34] Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization

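[Quick read]: This paper addresses the reliance of deep-learning image manipulation localization (IML) methods on large-scale pixel-level annotated datasets, and the limited performance of weakly supervised methods caused by insufficient supervision signals. The key solution is scribble annotation supervision: the authors re-annotate mainstream IML datasets to build the first scribble-based IML (Sc-IML) dataset and propose a scribble-based weakly supervised IML framework. Its core components are self-supervised training with a structural consistency loss, a prior-aware feature modulation module (PFMM), a gated adaptive fusion module (GAFM), and a confidence-aware entropy minimization loss ($\mathcal{L}_{CEM}$), which together improve detection performance and prediction consistency in weakly annotated or unlabeled regions.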

Link: https://arxiv.org/abs/2507.13018
Authors: Songlin Li, Guofeng Yu, Zhiqing Guo, Yunfeng Diao, Dan Ma, Gaobo Yang, Liejun Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotated mainstream IML datasets with scribble labels and propose the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential tampered regions. Finally, we propose a confidence-aware entropy minimization loss ($\mathcal{L}_{CEM}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution.

[CV-35] Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

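[Quick read]: This paper addresses the insufficient efficiency and quality of sample selection in contrastive learning caused by noisy correspondence. Existing methods either rely on offline selection of a high-quality coreset, which is limited in cold-start scenarios, or on online selection based on real-time model predictions, which does not handle noisy correspondence sufficiently or efficiently. The key to the solution is a Differential-Informed Sample Selection (DISSect) method that identifies noisy correspondence more accurately by analyzing the difference between the correlations predicted by the current model and by a historical model, improving training efficiency and effectiveness.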

Link: https://arxiv.org/abs/2507.12998
Authors: Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection, as an efficient alternative paradigm, is an important direction for accelerating the training process. However, recent advances in sample selection either rely mostly on an oracle model to select a high-quality coreset offline, which is limited in cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered the noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: this https URL.
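
The core signal described in this abstract, the differential between the correlation predicted by the current model and by a historical model, can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name, the `keep_ratio` parameter, and in particular the selection direction are hypothetical. The abstract does not say which sign of the differential marks a noisy pair; the sketch assumes that a large rise in similarity signals late memorization of a mismatched pair and therefore keeps the samples with the smallest differential.

```python
def differential_select(current_sim, historical_sim, keep_ratio=0.5):
    """Return sorted indices of the samples kept for training.

    current_sim:    image-text similarity per sample from the current model.
    historical_sim: the same quantity from an earlier checkpoint.
    keep_ratio:     assumed fraction of the batch to keep (not from the paper).
    """
    # Differential between current and historical predicted correlation.
    diffs = [c - h for c, h in zip(current_sim, historical_sim)]
    keep = max(1, int(len(diffs) * keep_ratio))
    # Assumption: a small differential means the pair was learned early and is clean.
    order = sorted(range(len(diffs)), key=lambda i: diffs[i])
    return sorted(order[:keep])

current = [0.80, 0.75, 0.60, 0.90]
historical = [0.78, 0.30, 0.55, 0.40]
kept = differential_select(current, historical)
```

Here samples 1 and 3 show a large jump between checkpoints and are dropped under the stated assumption, while samples 0 and 2 are kept.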

[CV-36] Variance-Based Pruning for Accelerating and Compressing Trained Networks ICCV

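[Quick read]: This paper addresses the high training cost of large models and the latency, compute, and memory demands of deploying them, especially on resource-constrained hardware. The core challenge is preserving model performance after structured pruning without extensive retraining. The proposed solution, Variance-Based Pruning, gathers activation statistics to select the neurons to prune and integrates the mean activations back into the model, so that performance stays high after one-shot pruning and only a small amount of fine-tuning is needed to recover near-original accuracy.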

Link: https://arxiv.org/abs/2507.12988
Authors: Uranik Berisha, Jens Mehnert, Alexandru Paul Condurache
Affiliations: Automated Driving Research, Robert Bosch GmbH; Institute for Signal Processing, University of Lübeck
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IEEE/CVF International Conference on Computer Vision (ICCV) 2025

Abstract:Increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the provided performance of trained models after structured pruning and thereby avoiding extensive retraining remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are used to select neurons for pruning. Simultaneously the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44x.
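
One plausible reading of "gather activation statistics, select neurons, integrate the mean activations back into the model" is sketched below for a single two-layer MLP block. It is an assumed interpretation for illustration, not the authors' code: the function name and data layout are hypothetical, and folding the mean of each pruned unit into the following layer's bias is the assumption made here about how the means are reintegrated.

```python
from statistics import mean, pvariance

def prune_hidden_units(activations, w_out, b_out, num_prune):
    """One-shot pruning sketch for the hidden layer of an MLP block.

    activations: hidden activation vectors collected on calibration data.
    w_out:       output weights, w_out[o][h] connects hidden unit h to output o.
    b_out:       output biases.
    num_prune:   number of hidden units to remove.
    """
    num_hidden = len(activations[0])
    columns = [[row[h] for row in activations] for h in range(num_hidden)]
    means = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    # Units whose activation barely varies behave almost like constants.
    pruned = set(sorted(range(num_hidden), key=lambda h: variances[h])[:num_prune])
    kept = [h for h in range(num_hidden) if h not in pruned]
    # Assumed reintegration: fold each pruned unit's mean output into the bias.
    new_b = [b + sum(row[h] * means[h] for h in pruned) for row, b in zip(w_out, b_out)]
    new_w = [[row[h] for h in kept] for row in w_out]
    return kept, new_w, new_b

# Toy calibration set: hidden unit 0 is constant, so it has zero variance.
acts = [[1.0, 2.0, 0.0], [1.0, 4.0, 2.0], [1.0, 6.0, 4.0]]
w_out = [[0.5, 1.0, -1.0]]
b_out = [0.1]
kept, new_w, new_b = prune_hidden_units(acts, w_out, b_out, num_prune=1)
```

Because unit 0 always outputs 1.0, replacing it with its mean contribution in the bias leaves the block's output unchanged on the calibration data; for units with small but non-zero variance the replacement is only approximate, which is where the brief fine-tuning mentioned in the abstract would come in.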

[CV-37] WaveletInception Networks for Drive-by Vibration-Based Infrastructure Health Monitoring

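[Quick read]: This paper addresses how to use drive-by vibration response signals for accurate, localized, and automated infrastructure health monitoring. The key to the solution is a deep-learning framework, the WaveletInception-BiLSTM network, which combines a Learnable Wavelet Packet Transform (LWPT) with 1D Inception structures to extract multi-scale features, fuses operational-condition information through a Long Short-Term Memory (LSTM) layer, and uses a bidirectional LSTM (BiLSTM) network to capture bidirectional temporal relationships, yielding a high-resolution assessment of structural health.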

Link: https://arxiv.org/abs/2507.12969
Authors: Reza Riahi Samani, Alfredo Nunez, Bart De Schutter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a novel deep learning-based framework for infrastructure health monitoring using drive-by vibration response signals. Recognizing the importance of spectral and temporal information, we introduce the WaveletInception-BiLSTM network. The WaveletInception feature extractor utilizes a Learnable Wavelet Packet Transform (LWPT) as the stem for extracting vibration signal features, incorporating spectral information in the early network layers. This is followed by 1D Inception networks that extract multi-scale, high-level features at deeper layers. The extracted vibration signal features are then integrated with operational conditions via a Long Short-term Memory (LSTM) layer. The resulting feature extraction network effectively analyzes drive-by vibration signals across various measurement speeds without preprocessing and uses LSTM to capture interrelated temporal dependencies among different modes of information and to create feature vectors for health condition estimation. The estimator head is designed with a sequential modeling architecture using bidirectional LSTM (BiLSTM) networks, capturing bi-directional temporal relationships from drive-by measurements. This architecture allows for a high-resolution, beam-level assessment of infrastructure health conditions. A case study focusing on railway track stiffness estimation with simulated drive-by vibration signals shows that the model significantly outperforms state-of-the-art methods in estimating railway ballast and railpad stiffness parameters. Results underscore the potential of this approach for accurate, localized, and fully automated drive-by infrastructure health monitoring.

[CV-38] RGB Pre-Training Enhanced Unobservable Feature Latent Diffusion Model for Spectral Reconstruction

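[Quick read]: This paper addresses the reconstruction of hyperspectral images (HSIs) from the corresponding RGB images. The core challenge is estimating the unobservable feature, which carries important spectral information not captured by RGB imaging sensors. The key to the solution is effectively constructing the spectral-spatial joint distribution conditioned on the RGB image to complement the unobservable feature. To this end, the authors extend an RGB pre-trained latent diffusion model (RGB-LDM) into an unobservable feature LDM (ULDM): separating out the unobservable feature and drawing on the rich spatial knowledge in the RGB pre-trained model lets the ULDM focus on modeling spectral structure and learn the joint distribution in a compact latent space.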

Link: https://arxiv.org/abs/2507.12967
Authors: Keli Deng, Jie Nie, Yuntao Qian
Affiliations: Zhejiang University; Ocean University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spectral reconstruction (SR) is a crucial problem in image processing that requires reconstructing hyperspectral images (HSIs) from the corresponding RGB images. A key difficulty in SR is estimating the unobservable feature, which encapsulates significant spectral information not captured by RGB imaging sensors. The solution lies in effectively constructing the spectral-spatial joint distribution conditioned on the RGB image to complement the unobservable feature. Since HSIs share a similar spatial structure with the corresponding RGB images, it is rational to capitalize on the rich spatial knowledge in RGB pre-trained models for spectral-spatial joint distribution learning. To this end, we extend the RGB pre-trained latent diffusion model (RGB-LDM) to an unobservable feature LDM (ULDM) for SR. As the RGB-LDM and its corresponding spatial autoencoder (SpaAE) already excel in spatial knowledge, the ULDM can focus on modeling spectral structure. Moreover, separating the unobservable feature from the HSI reduces the redundant spectral information and empowers the ULDM to learn the joint distribution in a compact latent space. Specifically, we propose a two-stage pipeline consisting of spectral structure representation learning and spectral-spatial joint distribution learning to transform the RGB-LDM into the ULDM. In the first stage, a spectral unobservable feature autoencoder (SpeUAE) is trained to extract and compress the unobservable feature into a 3D manifold aligned with RGB space. In the second stage, the spectral and spatial structures are sequentially encoded by the SpeUAE and the SpaAE, respectively. The ULDM is then acquired to model the distribution of the coded unobservable feature with guidance from the corresponding RGB images. Experimental results on SR and downstream relighting tasks demonstrate that our proposed method achieves state-of-the-art performance.

[CV-39] Demographic-aware fine-grained classification of pediatric wrist fractures

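[Quick read]: This paper addresses limited diagnostic accuracy in wrist pathology recognition caused by small datasets. The key to the solution is a fine-grained recognition strategy combined with fusion of patient metadata and X-ray images, which improves diagnostic accuracy by 2% on a limited dataset and by over 10% on a larger fracture-focused dataset.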

Link: https://arxiv.org/abs/2507.12964
Authors: Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota
Affiliations: Norwegian University of Science & Technology (NTNU); Linnaeus University; Sukkur IBA University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.

[CV-40] FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

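[Quick read]: This paper addresses the generation of expressive facial animations from static images. Existing methods tend to produce artifacts in cross reenactment, struggle to capture subtle emotions, and handle multi-character animation poorly because driving features from different individuals interfere with one another. The key to the solution is FantasyPortrait, a diffusion-transformer-based framework with an expression-augmented learning strategy that uses implicit representations to capture identity-agnostic facial dynamics, improving the rendering of fine-grained emotions. A masked cross-attention mechanism provides independent yet coordinated expression generation for multiple characters and prevents feature interference.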

Link: https://arxiv.org/abs/2507.12956
Authors: Qiang Wang, Mengchao Wang, Fan Jiang, Yaqi Fan, Yonggang Qi, Mu Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model’s ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is this https URL.

[CV-41] cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration

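[Quick read]: This paper addresses the high computational cost of tuning regularization parameters in learning-based Deformable Image Registration (DIR) frameworks, where conventional methods must retrain for every regularization hyperparameter setting. The key to the solution is cIDIR, a framework based on Implicit Neural Representations (INRs) that conditions the registration process on the regularization hyperparameters: the model is trained over a prior distribution of these hyperparameters, which are then optimized using segmentation masks as observations. cIDIR also models a continuous and differentiable Deformation Vector Field (DVF), so advanced regularization techniques can be integrated through automatic differentiation.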

Link: https://arxiv.org/abs/2507.12953
Authors: Sidaty El Hadramy, Oumeymah Cherkaoui, Philippe C. Cattin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Regularization is essential in deformable image registration (DIR) to ensure that the estimated Deformation Vector Field (DVF) remains smooth, physically plausible, and anatomically consistent. However, fine-tuning regularization parameters in learning-based DIR frameworks is computationally expensive, often requiring multiple training iterations. To address this, we propose cIDIR, a novel DIR framework based on Implicit Neural Representations (INRs) that conditions the registration process on regularization hyperparameters. Unlike conventional methods that require retraining for each regularization hyperparameter setting, cIDIR is trained over a prior distribution of these hyperparameters, then optimized over the regularization hyperparameters by using the segmentation masks as an observation. Additionally, cIDIR models a continuous and differentiable DVF, enabling seamless integration of advanced regularization techniques via automatic differentiation. Evaluated on the DIR-LAB dataset, cIDIR achieves high accuracy and robustness across the dataset.

[CV-42] LoViC: Efficient Long Video Generation with Context Compression

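[Quick read]: This paper addresses the difficulty of scaling Diffusion Transformers (DiTs) to long-duration content in text-to-video generation, a challenge that stems from the quadratic complexity of self-attention. The key to the solution is the LoViC framework, built around FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. Using a single-query-token design based on the Q-Former architecture, FlexFormer supports variable-length inputs with linearly adjustable compression rates, and it encodes temporal context through position-aware mechanisms, supporting prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm.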

Link: https://arxiv.org/abs/2507.12952
Authors: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts (such as sparse attention and temporally autoregressive models) offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

[CV-43] Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications MICCAI

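[Quick read]: This paper addresses how to characterize and understand the relationships among the uncertainties of different input modalities (such as images and text) in Multimodal Large Language Models (MLLMs), and their clinical applications. The key to the solution is a Multimodal Uncertainty Propagation Model (MUPM), based on uncertainty propagation, that quantifies the uncertainties arising from image-only, text-only, and joint image-text variations and reveals how they relate, enabling robust uncertainty estimation and analysis in support of clinical applications.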

Link: https://arxiv.org/abs/2507.12945
Authors: Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C. Alexander, Rhodri Davies, Yipeng Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

Abstract:Multimodal large language models (MLLMs) can process and integrate information from multimodality sources, such as text and images. However, the interrelationship among input modalities, the uncertainties due to individual uni-modal data, and the potential clinical applications following such an uncertainty decomposition are yet to be fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we demonstrate that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such a transferability may be explained by the shared pretraining, comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, led to direct clinical applications in which uncertainties may be estimated and thus analysed robustly for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally the efficiency in multimodal data required for estimating the overall uncertainty and its ability to identify redundant factors, both of which are considered practical yet clinically useful applications with the proposed MUPMs. Codes are available at this https URL.

[CV-44] Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning ICCV2025

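[Quick read]: This paper addresses the dependence of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, in particular how to improve performance when cross-modal identity labels are unavailable. The key to the solution is a weakly supervised cross-modal person ReID method with a heterogeneous expert collaborative consistency learning framework that trains only on single-modal identity labels. A dedicated classification expert is trained independently for each modality and then acts as a heterogeneous predictor for the identities of samples from the other modality; a cross-modal relationship fusion mechanism improves prediction accuracy. Under this implicit supervision, the experts learn collaboratively and consistently, strengthening the model's ability to extract modality-invariant features.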

Link: https://arxiv.org/abs/2507.12942
Authors: Yafei Zhang, Lingqi Kong, Huafeng Li, Jie Wen
Affiliations: Kunming University of Science and Technology; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model’s ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.

[CV-45] A Deep-Learning Framework for Land-Sliding Classification from Remote Sensing Image

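[Quick read]: This paper addresses automatic landslide detection in satellite remote sensing images, in particular how to choose a deep learning architecture that optimizes performance while avoiding overfitting. The key to the solution is a deep-learning framework that combines online and offline data augmentation to handle imbalanced data, an EfficientNet_Large backbone to extract robust embedding features, and a post-processing support vector machine (SVM) classifier to balance and improve classification performance.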

Link: https://arxiv.org/abs/2507.12939
Authors: Hieu Tang, Truong Vo, Dong Pham, Toan Nguyen, Lam Pham, Truong Nguyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The use of satellite imagery combined with deep learning to support automatic landslide detection is becoming increasingly widespread. However, selecting an appropriate deep learning architecture to optimize performance while avoiding overfitting remains a critical challenge. To address these issues, we propose a deep-learning based framework for landslide detection from remote sensing images. The proposed framework presents an effective combination of online and offline data augmentation to tackle the imbalanced data, a backbone EfficientNet_Large deep learning model for extracting robust embedding features, and a post-processing SVM classifier to balance and enhance the classification performance. The proposed model achieved an F1-score of 0.8938 on the public test set of the Zindi challenge.

[CV-46] DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization ICCV2025

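[Quick read]: This paper addresses the high computational cost of deploying diffusion models in resource-constrained environments. Existing post-training quantization (PTQ) methods try to mitigate this by focusing on the iterative nature of diffusion models, but they often overlook outliers, which degrades performance at low bit-widths. The proposed DMQ combines Learned Equivalent Scaling (LES) with channel-wise Power-of-Two Scaling (PTS). LES optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations and reduce overall quantization error; an adaptive timestep weighting scheme prioritizes the critical denoising steps, and a voting algorithm makes the selection of PTS factors robust.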

Link: https://arxiv.org/abs/2507.12933
Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025

Abstract:Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at this https URL.
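
The identity behind equivalent scaling, and the appeal of power-of-two factors, can be shown with a small worked example. The sketch below is a simplification for illustration only: it merges the two ideas by applying power-of-two factors as an equivalent scaling, picks each factor from the calibration peak (an assumed heuristic), and omits the learned optimization, the timestep weighting, and the voting algorithm described in the abstract. All function names and the `target` parameter are hypothetical.

```python
import math

def power_of_two_scales(activations, target=1.0):
    """Pick a per-channel scale 2**k so that max |x| / 2**k <= target."""
    num_channels = len(activations[0])
    scales = []
    for c in range(num_channels):
        peak = max(abs(row[c]) for row in activations)
        # Assumed heuristic: smallest non-negative k that fits the peak.
        k = max(0, math.ceil(math.log2(peak / target))) if peak > 0 else 0
        scales.append(2.0 ** k)
    return scales

def apply_equivalent_scaling(x, weights, scales):
    """Divide activations and multiply weights channel by channel."""
    x_scaled = [v / s for v, s in zip(x, scales)]
    w_scaled = [w * s for w, s in zip(weights, scales)]
    return x_scaled, w_scaled

# Channel 1 carries outliers roughly 40x larger than channel 0.
calib = [[0.5, 12.0], [-0.8, -30.0]]
scales = power_of_two_scales(calib)
x, w = [0.5, 12.0], [2.0, 0.25]
x_s, w_s = apply_equivalent_scaling(x, w, scales)
before = sum(a * b for a, b in zip(x, w))
after = sum(a * b for a, b in zip(x_s, w_s))
```

The dot product is unchanged by the rescaling, while the outlier channel's activation range shrinks from 30 to under 1, moving the quantization difficulty from activations to weights. Restricting factors to powers of two keeps the rescaling exact in floating point and lets hardware implement it as a bit shift.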

[CV-47] Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

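[Quick read]: This paper addresses the information loss that 3D point cloud reconstruction causes in 3D scene understanding, such as omitted textureless planes or repetitive patterns, and distorted details on structurally complex objects caused by misalignment between images and point clouds. The key to its solution is the Argus framework, which fuses multi-view 2D images and camera poses into view-as-scene features. These interact with the 3D features to produce comprehensive, detailed 3D-aware scene embeddings, compensating for what reconstruction loses and helping large language models (LLMs) understand the 3D world.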

Link: https://arxiv.org/abs/2507.12916
Authors: Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu
Affiliations: Beihang University; Beijing Digital Native Digital City Research Center; The University of Georgia; University of Science and Technology Beijing; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by TNNLS2025

Abstract:Advancements in foundation models have made it possible to conduct applications in various downstream tasks. In particular, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) for tackling tasks of 3D scene understanding. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input (text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.

[CV-48] AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability

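[Quick read]: This paper addresses the practical use of monocular 3D pose estimation in sports analysis, which is hindered by a lack of realistic sports datasets and by unclear reliability on sports tasks. Its key contribution is AthleticsPose, a public dataset of "real" motions captured from 23 athletes performing various athletics events on an athletic field. A representative 3D pose estimation model trained on this dataset significantly outperforms a baseline trained on an imitated sports motion dataset, which shows the importance of training on authentic sports motion data.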

Link: https://arxiv.org/abs/2507.12905
Authors: Tomohiro Suzuki, Ryota Tanaka, Calvin Yeung, Keisuke Fujii
Affiliations: Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 5 tables

Abstract:Monocular 3D pose estimation is a promising, flexible alternative to costly motion capture systems for sports analysis. However, its practical application is hindered by two factors: a lack of realistic sports datasets and unclear reliability for sports tasks. To address these challenges, we introduce the AthleticsPose dataset, a new public dataset featuring “real” motions captured from 23 athletes performing various athletics events on an athletic field. Using this dataset, we trained a representative 3D pose estimation model and performed a comprehensive evaluation. Our results show that the model trained on AthleticsPose significantly outperforms a baseline model trained on an imitated sports motion dataset, reducing MPJPE by approximately 75%. These results show the importance of training on authentic sports motion data, as models based on imitated motions do not effectively transfer to real-world motions. Further analysis reveals that estimation accuracy is sensitive to camera view and subject scale. In case studies of kinematic indicators, the model demonstrated the potential to capture individual differences in knee angles but struggled with higher-speed metrics, such as knee-drive velocity, due to prediction biases. This work provides the research community with a valuable dataset and clarifies the potential and practical limitations of using monocular 3D pose estimation for sports motion analysis. Our dataset, code, and checkpoints are available at this https URL.
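
For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below is a minimal single-frame version; the function name and the toy coordinates are illustrative and are not taken from the AthleticsPose code or data.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints of one pose."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Toy two-joint pose; coordinates are made up (units follow the input, e.g. mm).
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
ground_truth = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error = mpjpe(predicted, ground_truth)
```

In practice the value is averaged over all frames of a test set, and many evaluation protocols first align the root joint of the prediction with the ground truth.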

[CV-49] Federated Learning for Commercial Image Sources WACV2023

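[Quick read]: This paper addresses data privacy and model performance in federated learning. It contributes a new image classification dataset and two new federated learning algorithms, Fed-Cyclic and Fed-Star. In Fed-Cyclic, weights are passed from client to client and updated along a cyclic topology. In Fed-Star, each client pre-aggregates the weights it receives from the other clients to handle statistical heterogeneity and shares its own weights over a star-like topology. Both algorithms outperform existing baselines on the new dataset.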

Link: https://arxiv.org/abs/2507.12903
Authors: Shreyansh Jain, Koteswar Rao Jerripothula
Affiliations: IIIT Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Published in the Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023 with DOI: https://doi.org/10.1109/WACV56688.2023.00647

Abstract:Federated Learning is a collaborative machine learning paradigm that enables multiple clients to learn a global model without exposing their data to each other. Consequently, it provides a secure learning platform with privacy-preserving capabilities. This paper introduces a new dataset containing 23,326 images collected from eight different commercial sources and classified into 31 categories, similar to the Office-31 dataset. To the best of our knowledge, this is the first image classification dataset specifically designed for Federated Learning. We also propose two new Federated Learning algorithms, namely Fed-Cyclic and Fed-Star. In Fed-Cyclic, a client receives weights from its previous client, updates them through local training, and passes them to the next client, thus forming a cyclic topology. In Fed-Star, a client receives weights from all other clients, updates its local weights through pre-aggregation (to address statistical heterogeneity) and local training, and sends its updated local weights to all other clients, thus forming a star-like topology. Our experiments reveal that both algorithms perform better than existing baselines on our newly introduced dataset.
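
The two topologies described in the abstract can be sketched with a scalar "model" per client. This is a toy illustration under stated assumptions, not the paper's implementation: `local_train` is a hypothetical single gradient step toward the client's data mean, and the pre-aggregation in `fed_star_round` is assumed to be a plain average over all clients' weights, which the abstract does not specify.

```python
def local_train(weight, data, lr=0.5):
    """Hypothetical local update: one gradient step on 0.5 * (w - mean(data))**2."""
    target = sum(data) / len(data)
    return weight - lr * (weight - target)

def fed_cyclic_round(weight, client_data):
    """Cyclic topology: each client trains on the weight handed over by the previous one."""
    for data in client_data:
        weight = local_train(weight, data)
    return weight

def fed_star_round(weights, client_data):
    """Star-like topology: every client sees all weights, pre-aggregates, then trains."""
    new_weights = []
    for data in client_data:
        pre_aggregated = sum(weights) / len(weights)  # assumed: plain average
        new_weights.append(local_train(pre_aggregated, data))
    return new_weights

clients = [[0.0, 2.0], [4.0], [10.0]]           # toy local datasets
cyclic_weight = fed_cyclic_round(0.0, clients)   # one pass around the ring
star_weights = fed_star_round([0.0, 0.0, 0.0], clients)
```

The cyclic pass is sequential, so the last client has the strongest influence on the returned weight, whereas the star-like round updates all clients in parallel from the same pre-aggregated starting point.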

[CV-50] LanePerf: a Performance Estimation Framework for Lane Detection ITSC2025

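[Quick read]: This paper addresses the drop in reliability of lane detection models under domain shift, and in particular how to estimate model performance efficiently when ground-truth labels are unavailable. The key to its solution is a new Lane Performance Estimation Framework (LanePerf), which combines a pretrained image encoder with a DeepSets-based architecture to integrate image and lane features. It handles zero-lane detection scenarios and large domain shifts, supporting label-free robustness assessment across environments.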

Link: https://arxiv.org/abs/2507.12894
Authors: Yin Wu, Daniel Slieter, Ahmed Abouelazm, Christian Hubschneider, J. Marius Zöllner
Affiliations: CARIAD SE; Karlsruhe Institute of Technology; FZI Research Center for Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE ITSC 2025

Abstract:Lane detection is a critical component of Advanced Driver-Assistance Systems (ADAS) and Automated Driving System (ADS), providing essential spatial information for lateral control. However, domain shifts often undermine model reliability when deployed in new environments. Ensuring the robustness and safety of lane detection models typically requires collecting and annotating target domain data, which is resource-intensive. Estimating model performance without ground-truth labels offers a promising alternative for efficient robustness assessment, yet remains underexplored in lane detection. While previous work has addressed performance estimation in image classification, these methods are not directly applicable to lane detection tasks. This paper first adapts five well-performing performance estimation methods from image classification to lane detection, building a baseline. Addressing the limitations of prior approaches that solely rely on softmax scores or lane features, we further propose a new Lane Performance Estimation Framework (LanePerf), which integrates image and lane features using a pretrained image encoder and a DeepSets-based architecture, effectively handling zero-lane detection scenarios and large domain-shift cases. Extensive experiments on the OpenLane dataset, covering diverse domain shifts (scenes, weather, hours), demonstrate that our LanePerf outperforms all baselines, achieving a lower MAE of 0.117 and a higher Spearman’s rank correlation coefficient of 0.727. These findings pave the way for robust, label-free performance estimation in ADAS, supporting more efficient testing and improved safety in challenging driving scenarios.
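
The two numbers reported above, MAE and Spearman's rank correlation, compare estimated performance scores against true ones. A minimal standard-library sketch of both follows; the helper names and the toy scores are illustrative and are not taken from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def mae(estimated, actual):
    """Mean absolute error between estimated and actual scores."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(estimated)

# Toy scores for three hypothetical domains.
estimated = [0.1, 0.4, 0.9]
actual = [0.2, 0.3, 0.8]
rho = spearman(estimated, actual)
error = mae(estimated, actual)
```

MAE measures how close the estimates are in absolute terms, while Spearman's coefficient only checks whether the estimator orders the domains correctly, so the two metrics are complementary.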

[CV-51] Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context

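[Quick read]: This paper addresses the limitations of conventional emotion recognition, which relies either on explicit signals (facial expressions, speech, or gestures) or on physiological signals. The former fail to capture deeper, implicit emotions, and the latter need complex sensors that interfere with natural behavior. The key to its solution is a camera-based, user-unaware emotion recognition approach that integrates gaze fixation patterns with environmental semantics and temporal information. Standard HD cameras unobtrusively capture the user's visual cues, from which the spatial, semantic, and temporal dimensions of gaze behavior are modeled, enabling real-time, continuous emotion recognition without specialized hardware or active user cooperation.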

Link: https://arxiv.org/abs/2507.12889
Authors: Mengke Song, Yuge Xie, Qi Cui, Luming Li, Xinyu Liu, Guotao Wang, Chenglizhao Chen, Shanchen Pang
Affiliations: China University of Petroleum (East China); Qingdao Institute of Software; College of Computer Science and Technology; Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software; Nanjing Forestry University; School of Marxism; College of Information Science and Technology; Qingdao University of Science and Technology; State Key Laboratory of Chemical Safety
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Emotion recognition, as a step toward mind reading, seeks to infer internal states from external […]. […] existing methods rely on explicit signals (such as facial expressions, speech, or gestures) that reflect only bodily responses and overlook the influence of environmental […]. […] cues are often voluntary, easy to mask, and insufficient for capturing deeper, implicit emotions. Physiological signal-based approaches offer more direct access to internal states but require complex sensors that compromise natural behavior and limit […]. Gaze-based methods typically rely on static fixation analysis and fail to capture the rich, dynamic interactions between gaze and the environment, and thus cannot uncover the deep connection between emotion and implicit […]. To address these limitations, we propose a novel camera-based, user-unaware emotion recognition approach that integrates gaze fixation patterns with environmental semantics and temporal […]. […] standard HD cameras, our method unobtrusively captures users’ eye appearance and head movements in natural settings, without the need for specialized hardware or active user […]. From these visual cues, the system estimates gaze trajectories over time and space, providing the basis for modeling the spatial, semantic, and temporal dimensions of gaze behavior. This allows us to capture the dynamic interplay between visual attention and the surrounding environment, revealing that emotions are not merely physiological responses but complex outcomes of human-environment […]. The proposed approach enables user-unaware, real-time, and continuous emotion recognition, offering high generalizability and low deployment cost.

[CV-52] From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

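[Quick read]: This paper addresses accurate head pose tracking without line of sight; conventional vision-based methods are limited by camera viewpoint and ambient lighting. The key to its solution is NeckSense, which uses multi-channel bio-impedance sensing with soft, dry electrodes in a lightweight necklace-style form factor and infers head pose from dynamic changes in the impedance of neck tissue. Its core innovation is a deep learning framework that integrates anatomical priors, such as joint constraints and natural head rotation ranges, to make head pose estimation more robust and accurate.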

Link: https://arxiv.org/abs/2507.12884
Authors: Mengxi Liu, Lala Shakti Swarup Ray, Sizhen Bian, Ko Watanabe, Ankur Bhatt, Joanna Sorysz, Russel Torah, Bo Zhou, Paul Lukowicz
Affiliations: DFKI Kaiserslautern; University of Southampton
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.

[CV-53] HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation

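[Quick read]: This paper addresses two problems in reasoning segmentation: the low perceptual resolution that results from visual encoders being pre-trained at lower resolutions, and the fact that simply interpolating positional embeddings to raise perceptual resolution yields only marginal gains at substantial computational cost. The key to its solution is HRSeg, whose two core innovations are the High-Resolution Perception (HRP) module, which processes high-resolution images through cropping and integrates local and global features, and the High-Resolution Enhancement (HRE) module, which enriches mask features with fine-grained information from high-resolution images and refines their alignment with text features.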

Link: https://arxiv.org/abs/2507.12883
Authors: Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji
Affiliations: Xiamen University; Shanghai University of Finance and Economics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg’s superior performance.

[CV-54] WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding

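[Quick read]: This paper addresses person re-identification in video surveillance, where traditional methods rely on visual data and are hampered by poor lighting, occlusion, and suboptimal angles. The key to its solution is WhoFi, which extracts biometric features from the Channel State Information (CSI) of Wi-Fi signals, processes them with a modular deep neural network featuring a Transformer-based encoder, and trains the network with an in-batch negative loss to learn robust, generalizable biometric signatures.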

Link: https://arxiv.org/abs/2507.12869
Authors: Danilo Avola, Daniele Pannone, Dario Montagnini, Emad Emam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
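
The abstract does not spell out the in-batch negative loss, so the sketch below shows one common formulation as an assumption: a cross-entropy over cosine similarities in which, for each query signature, the matching key in the batch is the positive and every other key acts as a negative. The function names, the temperature value, and the toy vectors are all illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def in_batch_negative_loss(queries, keys, temperature=0.1):
    """Assumed form: keys[i] is the positive for queries[i]; other keys are negatives."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, query in enumerate(q):
        logits = [dot(query, key) / temperature for key in k]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)

# Toy 2-D "signatures" for two identities.
signatures = [[1.0, 0.0], [0.0, 1.0]]
matched = in_batch_negative_loss(signatures, signatures)
mismatched = in_batch_negative_loss(signatures, signatures[::-1])
```

The loss is small when each query is closest to its own key and large when the pairing is swapped, which is what pushes signatures of the same person together and those of different people apart.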

[CV-55] SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation ICCV2025

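[Quick read]: This paper addresses a limitation of existing remote sensing instance segmentation methods: they are designed for closed-vocabulary prediction, so they struggle to recognize novel categories or generalize across datasets. It introduces open-vocabulary (OV) learning to this task and proposes a framework that integrates multi-granularity scene context. Its key components are Region-Aware Integration and Global Context Adaptation, which combine regional and global context to enhance both visual and textual representations, improving adaptation to the diverse landscapes, seasonal variations, and small or ambiguous objects found in remote sensing data.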

Link: https://arxiv.org/abs/2507.12857
Authors: Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Affiliations: Nanyang Technological University; MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics; Institute for Infocomm Research (I2R), A*STAR, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Most existing remote sensing instance segmentation approaches are designed for closed-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, which provides a robust solution for large-scale, real-world geospatial analysis. Our code is available at this https URL.

[CV-56] Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization

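[Quick read]: This paper addresses poor performance on unseen target domains in domain generalization (DG), and in particular the fact that CLIP struggles to focus on domain-invariant regions across domains. The key to its solution is an attention-refocusing scheme called Simulate, Refocus and Ensemble (SRE). It augments the source data to generate simulated target domains, aligns the attention maps of the source and simulated target domains through attention refocusing, and then uses ensemble learning to better capture domain-invariant attention maps, reducing the effect of domain shift.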

Link: https://arxiv.org/abs/2507.12851
Authors: Ziyi Wang, Zhi Gao, Jin Chen, Qingjie Zhao, Xinxiao Wu, Jiebo Luo
Affiliations: Beijing Institute of Technology; Shenzhen MSU-BIT University; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Domain generalization (DG) aims to learn a model from source domains and apply it to unseen target domains with out-of-distribution data. Owing to CLIP’s strong ability to encode semantic concepts, it has attracted increasing interest in domain generalization. However, CLIP often struggles to focus on task-relevant regions across domains, i.e., domain-invariant regions, resulting in suboptimal performance on unseen target domains. To address this challenge, we propose an attention-refocusing scheme, called Simulate, Refocus and Ensemble (SRE), which learns to reduce the domain shift by aligning the attention maps in CLIP via attention refocusing. SRE first simulates domain shifts by performing augmentation on the source data to generate simulated target domains. SRE then learns to reduce the domain shifts by refocusing the attention in CLIP between the source and simulated target domains. Finally, SRE utilizes ensemble learning to enhance the ability to capture domain-invariant attention maps between the source data and the simulated target data. Extensive experimental results on several datasets demonstrate that SRE generally achieves better results than state-of-the-art methods. The code is available at: this https URL.

[CV-57] SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

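[Quick read]: This paper addresses remote sensing image captioning (RSIC), i.e. automatically generating semantically meaningful descriptive text from satellite imagery. The key to its solution is a Transformer-based network architecture that integrates several techniques, namely Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, to capture the complex spatial relationships and semantic information in remote sensing images.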

Link: https://arxiv.org/abs/2507.12845
Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen
Affiliations: Austrian Institute of Technology; University of Technology of Troyes; Ho Chi Minh University of Technology; Ton Duc Thang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which three techniques, Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates its potential for real-life remote sensing image systems.

[CV-58] AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

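[Quick read]: This paper addresses the lack of fine-grained control and of reliable evaluation protocols in controllable captioning. The key to its solution is the AnyCap Project, which comprises three parts: AnyCapModel (ACM), a lightweight plug-and-play framework that improves the controllability of existing foundation models for omni-modal captioning without retraining the base model; AnyCapDataset (ACD), which covers multiple modalities and user-instruction types; and AnyCapEval, a benchmark that provides more reliable evaluation metrics by decoupling content accuracy from stylistic fidelity.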

Link: https://arxiv.org/abs/2507.12841
Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Affiliations: Tsinghua University; Shanghai AI Laboratory; Fudan University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o’s content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.

[CV-59] MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results

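[Quick read]: This paper addresses Small Multi-Object Tracking (SMOT), where detection and appearance-based association become unreliable when targets occupy only a few dozen pixels. The key to its solution is to use temporal information to overcome the limits of single-frame detection. The paper introduces the SMOT4SB dataset, the SO-HOTA evaluation metric, and the MVA2025 challenge to advance SMOT in UAV scenarios.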

Link: https://arxiv.org/abs/2507.12832
Authors: Yuki Kondo, Norimichi Ukita, Riku Kanayama, Yuki Yoshida, Takayuki Yamaguchi, Xiang Yu, Guang Liang, Xinyao Liu, Guan-Zhang Wang, Wei-Ta Chu, Bing-Cheng Chuang, Jia-Hua Lee, Pin-Tseng Kuo, I-Hsuan Chu, Yi-Shein Hsiao, Cheng-Han Wu, Po-Yi Wu, Jui-Chien Tsou, Hsuan-Chi Liu, Chun-Yi Lee, Yuan-Fu Yang, Kosuke Shigematsu, Asuka Shin, Ba Tran
Affiliations: Toyota Motor Corporation; Toyota Technological Institute; Iwate Prefecture Coastal Regional Development Bureau; Nanjing University; University of Science and Technology of China; National Cheng Kung University; National Tsing Hua University; National Taiwan University; National Yang Ming Chiao Tung University; National Institute of Technology, Oita College; Axelspace Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is the official challenge report for SMOT4SB and is published in the proceedings of MVA 2025 (19th International Conference on Machine Vision and Applications). Official challenge page: this https URL

Abstract:Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.

[CV-60] Feature-Enhanced TResNet for Fine-Grained Food Image Classification

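[Quick read]: This paper addresses fine-grained food image classification, where conventional convolutional neural networks (CNNs) struggle with food images that are similar in shape but differ in subtle details. The key to its solution is Feature-Enhanced TResNet (FE-TResNet), which builds on the TResNet model and integrates a Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) to strengthen feature extraction and capture the subtle features of fine-grained food images more accurately.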

Link: https://arxiv.org/abs/2507.12828
Authors: Lulu Liu, Zhiyong Xiao
Affiliations: Jiangnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Food is not only a core component of humans’ daily diets, but also an important carrier of cultural heritage and emotional bonds. With the development of technology, the need for accurate classification of food images has grown, which is crucial for a variety of application scenarios. However, existing Convolutional Neural Networks (CNNs) face significant challenges when dealing with fine-grained food images that are similar in shape but subtle in detail. To address this challenge, this study presents an innovative method for classifying food images, named Feature-Enhanced TResNet (FE-TResNet), specifically designed to address fine-grained food images and accurately capture subtle features within them. The FE-TResNet method is based on the TResNet model and integrates Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) technologies to enhance feature extraction capabilities. In experimental validation on Chinese food image datasets ChineseFoodNet and CNFOOD-241, the FE-TResNet method significantly improved classification accuracy, achieving rates of 81.37% and 80.29%, respectively, demonstrating its effectiveness and superiority in fine-grained food image classification.

[CV-61] FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval

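[Quick read]: This paper addresses insufficient fusion of the visual and textual modalities in composed image retrieval (CIR). Existing methods use either early or late fusion: early fusion over-focuses on explicitly mentioned textual details and neglects visual context, while late fusion struggles to capture fine-grained semantic alignment between image regions and text tokens. The key to its solution is FAR-Net, a multi-stage fusion framework with two modules. The Enhanced Semantic Alignment Module (ESAM) uses late fusion with cross-attention to capture fine-grained semantic relationships, and the Adaptive Reconciliation Module (ARM) uses early fusion with uncertainty embeddings to improve robustness and adaptability.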

Link: https://arxiv.org/abs/2507.12823
Authors: Jeong-Woo Park, Young-Eun Kim, Seong-Whan Lee
Affiliations: Korea University; Institute of Information & Communications Technology Planning & Evaluation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, 3 tables

Abstract:Composed image retrieval (CIR) is a vision-language task that retrieves a target image using a reference image and modification text, enabling intuitive specification of desired changes. While effectively fusing visual and textual modalities is crucial, existing methods typically adopt either early or late fusion. Early fusion tends to excessively focus on explicitly mentioned textual details and neglect visual context, whereas late fusion struggles to capture fine-grained semantic alignments between image regions and textual tokens. To address these issues, we propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation, integrating two complementary modules. The enhanced semantic alignment module (ESAM) employs late fusion with cross-attention to capture fine-grained semantic relationships, while the adaptive reconciliation module (ARM) applies early fusion with uncertainty embeddings to enhance robustness and adaptability. Experiments on CIRR and FashionIQ show consistent performance gains, improving Recall@1 by up to 2.4% and Recall@50 by 1.04% over existing state-of-the-art methods, empirically demonstrating that FAR-Net provides a robust and scalable solution to CIR tasks.
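
The Recall@K figures quoted above measure how often the target image appears among the top K retrieved candidates. A minimal sketch follows, assuming one target per query; the function name and the toy gallery identifiers are illustrative and are not taken from CIRR or FashionIQ.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears among the top-k ranked candidates."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

# Toy retrieval results for two queries, best match first.
ranked_lists = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "b"]
recall_1 = recall_at_k(ranked_lists, targets, 1)
recall_2 = recall_at_k(ranked_lists, targets, 2)
recall_3 = recall_at_k(ranked_lists, targets, 3)
```

Recall@K can only stay the same or grow as K increases, which is why gains at small K such as Recall@1 are usually the harder ones to obtain.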

[CV-62] MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval

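[Quick read]: This paper addresses insufficient cross-modal interaction and information loss in training-free zero-shot composed image retrieval (CIR). Sequential VLM-LLM pipelines process each modality independently and so lose information, while methods based on multimodal large language models (MLLMs) tend to apply only the changes indicated by the text and do not fully use the contextual visual information in the reference image. The key to its solution is multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), which guides the MLLM to balance explicit modifications against contextual visual cues and to generate two captions with different emphases. These are used for candidate filtering and then for multi-grained re-ranking, improving retrieval accuracy.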

Link: https://arxiv.org/abs/2507.12819
Authors: Jeong-Woo Park, Seong-Whan Lee
Affiliations: Korea University; Institute of Information & Communications Technology Planning & Evaluation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, 2025 IEEE International Conference on Systems, Man, and Cybernetics

Abstract:Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a composed query consisting of a reference image and a modification text. Among various CIR approaches, training-free zero-shot methods based on pre-trained models are cost-effective but still face notable limitations. For example, sequential VLM-LLM pipelines process each modality independently, which often results in information loss and limits cross-modal interaction. In contrast, methods based on multimodal large language models (MLLMs) often focus exclusively on applying changes indicated by the text, without fully utilizing the contextual visual information from the reference image. To address these issues, we propose multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), a training-free zero-shot CIR framework. MCoT-RE utilizes multi-faceted Chain-of-Thought to guide the MLLM to balance explicit modifications and contextual visual cues, generating two distinct captions: one focused on modification and the other integrating comprehensive visual-textual context. The first caption is used to filter candidate images. Subsequently, we combine these two captions and the reference image to perform multi-grained re-ranking. This two-stage approach facilitates precise retrieval by aligning with the textual modification instructions while preserving the visual context of the reference image. Through extensive experiments, MCoT-RE achieves state-of-the-art results among training-free methods, yielding improvements of up to 6.24% in Recall@10 on FashionIQ and 8.58% in Recall@1 on CIRR.

[CV-63] FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

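[Quick read]: This paper addresses the fragmented scene representations that arise in video question answering (VQA) when existing methods train on event-centric QA pairs, which limits generalization and higher-level reasoning. The key to the solution is FIQ (Fundamental Question Generation with the Integration of Question Embeddings), which generates QA pairs from descriptions extracted from videos. This enriches the training data with fundamental scene information so that the model better understands each video's primary context and gains in generalizability and reasoning ability.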

Link: https://arxiv.org/abs/2507.12816
Authors: Ju-Young Oh,Ho-Joong Kim,Seong-Whan Lee
Affiliations: Korea University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: SMC 2025

Abstract:Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (QA) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model’s capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates QA pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated QA pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods.

[CV-64] Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition

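[Quick read]: This paper addresses the performance degradation caused by imbalanced per-class sample counts in long-tailed settings, especially weak recognition of rare classes. The key to the solution is Sage, a semantic-guided method for fine-tuning pre-trained foundation models: an SG-Adapter injects semantic information from the text modality into the fine-tuning of the visual encoder, strengthening the alignment between the visual and textual modalities. In addition, because existing loss functions ignore inconsistent class-conditional distributions and thereby introduce prediction bias, the method adds a distribution mismatch-aware compensation factor to correct the resulting performance imbalance.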

Link: https://arxiv.org/abs/2507.12807
Authors: Yufei Peng,Yonggang Zhang,Yiu-ming Cheung
Affiliations: Hong Kong Baptist University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The variance in class-wise sample sizes within long-tailed scenarios often results in degraded performance in less frequent classes. Fortunately, foundation models, pre-trained on vast open-world datasets, demonstrate strong potential for this task due to their generalizable representation, which promotes the development of adaptive strategies on pre-trained models in long-tailed learning. Advanced fine-tuning methods typically adjust visual encoders while neglecting the semantics derived from the frozen text encoder, overlooking the visual and textual alignment. To strengthen this alignment, we propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage), which incorporates semantic guidance derived from textual modality into the visual fine-tuning process. Specifically, we introduce an SG-Adapter that integrates class descriptions as semantic guidance to guide the fine-tuning of the visual encoder. The introduced guidance is passed through the attention mechanism and enables the model to focus more on semantically relevant content, strengthening the alignment between the visual and textual modalities. Because the existing loss function neglects inconsistent class-conditional distributions, the resulting prediction bias makes the performance improvement smaller for tail classes than for head classes, even when the multi-modal alignment is enhanced. To address this challenge, we propose a novel distribution mismatch-aware compensation factor, which is specifically designed to rectify the prediction bias caused by the ignored inconsistent distribution based on our theoretical analysis, and is seamlessly integrated into the loss function. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed Sage in enhancing performance in long-tailed learning.

[CV-65] ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

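[Quick read]: This paper addresses precise synchronization between facial animation and the audio signal in audio-driven talking head generation, while reducing noise and computational cost. The key to the solution is a new method, ATL-Diff, with three core components: a Landmark Generation Module that converts audio into facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to the landmarks, and a 3D Identity Diffusion network that preserves identity characteristics.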

Link: https://arxiv.org/abs/2507.12804
Authors: Hoang-Son Vo,Quang-Vinh Nguyen,Seungwon Kim,Hyung-Jeong Yang,Soonja Yeom,Soo-Hyung Kim
Affiliations: Chonnam National University; University of Tasmania
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: this https URL

[CV-66] DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment

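[Quick read]: This paper addresses the limited practical applicability of document quality assessment, where existing methods struggle to produce accurate and robust quality scores. The key to the solution is the DeQA-Doc framework, which uses the vision-language capabilities of multimodal large language models (MLLMs) together with a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to the document domain, the authors adopt two complementary schemes for constructing soft labels without variance information, relax the resolution constraints to support high-resolution document images, and use ensemble methods to further improve performance.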

Link: https://arxiv.org/abs/2507.12796
Authors: Junjie Gao,Runze Liu,Yingzhe Peng,Shujian Yang,Jin Zhang,Kai Yang,Zhiyuan You
Affiliations: Ant Group; Southeast University; Shanghai Jiao Tong University; CUHK; MBZUAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. Also, we relax the resolution constraints to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Codes and model weights are available at this https URL.
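
To make the idea of a soft label built without variance information concrete, here is a minimal sketch. It assumes five quality levels centred at 1 to 5 and uses linear interpolation between the two nearest levels, so that the expected value of the label equals the original score. This is one plausible construction chosen for illustration; the abstract does not say what the paper's two solutions are, and the names `soft_label` and `expected_score` are hypothetical.

```python
LEVELS = (1.0, 2.0, 3.0, 4.0, 5.0)  # assumed level centres, not from the paper


def soft_label(score, levels=LEVELS):
    """Spread a continuous score over the two nearest discrete levels."""
    s = min(max(score, levels[0]), levels[-1])  # clamp into the level range
    probs = [0.0] * len(levels)
    for i in range(len(levels) - 1):
        low, high = levels[i], levels[i + 1]
        if low <= s <= high:
            weight = (s - low) / (high - low)
            probs[i] = 1.0 - weight
            probs[i + 1] = weight
            break
    return probs


def expected_score(probs, levels=LEVELS):
    """Decode a level distribution back into a continuous score."""
    return sum(p * c for p, c in zip(probs, levels))


label = soft_label(3.25)
print(label, expected_score(label))
```

The round trip through `expected_score` recovers the input score, which is the property a regression target of this kind needs.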

[CV-67] City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning

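[Quick read]: This paper addresses the limitations of existing large vision-language models (LVLMs) in outdoor large-scale scene understanding, including inadequate handling of multi-scale, multi-view, and multimodal data and ineffective fusion of 2D and 3D visual information. The key to the solution is to build SVM-City, the first multidomain perception dataset for outdoor scene understanding, and to introduce incomplete multimodal learning, which fuses multimodal data by constructing a joint probabilistic distribution space rather than applying explicit fusion operations directly.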

Link: https://arxiv.org/abs/2507.12795
Authors: Penglei Sun,Yaoxian Song,Xiangru Zhu,Xiang Liu,Qiang Wang,Yue Liu,Changqun Xia,Tiefeng Li,Yang Yang,Xiaowen Chu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University; Fudan University; Harbin Institute of Technology (Shenzhen); Terminus Technologies Co., Ltd.; Pengcheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named SVM-City, deriving from multi-Scale scenarios with multi-View and multi-Modal instruction tuning data. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named City-VLM. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than implementing directly explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs by 18.14% on average in question-answering tasks. Our method demonstrates pragmatic and generalization performance across multiple outdoor scenes.

[CV-68] Compact Vision Transformer by Reduction of Kernel Complexity

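[Quick read]: This paper addresses the tension between high computational cost and model performance in efficient vision transformers. The key to the solution is KCR-Transformer, a compact transformer block with a differentiable channel selection mechanism guided by a novel and sharp theoretical generalization bound. It selects input/output channels in the MLP layers of transformer blocks to reduce computational cost. With this generalization-aware channel pruning, KCR-Transformer reduces the FLOPs and parameter count of vision transformers while maintaining or improving prediction accuracy.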

Link: https://arxiv.org/abs/2507.12780
Authors: Yancheng Wang,Yingzhen Yang
Affiliations: Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of the vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformers achieve superior performance on various computer vision tasks, performing even better than the original models with fewer FLOPs and parameters.

[CV-69] Local Representative Token Guided Merging for Text-to-Image Generation

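[Quick read]: This paper addresses the slow generation of Stable Diffusion in text-to-image tasks, which stems from the quadratic complexity of attention operations. The key to the solution is local representative token guided merging (ReToM), a token merging strategy applicable to any attention mechanism in image generation. ReToM defines local boundaries as windows within the attention inputs and adjusts the window sizes. It computes similarity at a specific timestep to select the most representative token in each window, preserving salient local features while reducing computational overhead.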

Link: https://arxiv.org/abs/2507.12771
Authors: Min-Jeong Lee,Hee-Dong Kim,Seong-Whan Lee
Affiliations: Korea University; Department of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:Stable diffusion is an outstanding image generation model for text-to-image, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token for each window, chosen by computing similarity at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
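
The representative-token rule in the abstract (pick, per window, the token with the highest average similarity) can be sketched as below. This is a simplified illustration that assumes cosine similarity and fixed-size windows over a 1D token sequence; the paper works on attention inputs of an image model and adjusts window sizes, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def representative_indices(tokens, window):
    """Return, for each window, the index of its most representative token."""
    reps = []
    for start in range(0, len(tokens), window):
        members = range(start, min(start + window, len(tokens)))

        def avg_similarity(i):
            # Average similarity of token i to every token in its window.
            return sum(cosine(tokens[i], tokens[j]) for j in members) / len(members)

        reps.append(max(members, key=avg_similarity))
    return reps


toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
toy_reps = representative_indices(toy_tokens, window=3)
print(toy_reps)
```

In each toy window the middle token lies between the other two, so it has the highest average similarity and is selected.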

[CV-70] AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

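[Quick read]: This paper addresses the reliance of task-conditioned control on task-specific human demonstrations, which limits generalization and makes data acquisition expensive. The key to the solution is a task-agnostic action paradigm that decouples action execution from task-specific conditioning, improving scalability, efficiency, and cost-effectiveness. To handle the data collection challenges this paradigm creates, the authors introduce ATARA (Automated Task-Agnostic Random Actions), a framework that collects data over 30 times faster than human teleoperation, and propose the AnyPos model to cope with the distribution mismatch and irrelevant trajectories found in task-agnostic data.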

Link: https://arxiv.org/abs/2507.12768
Authors: Hengkai Tan,Yao Feng,Xinyi Mao,Shuhe Huang,Guodong Liu,Zhongkai Hao,Hang Su,Jun Zhu
Affiliations: Tsinghua University; Tsinghua-Bosch Joint ML Center; Dept. of Comp. Sci. and Tech.; Institute for AI; BNRist Center; THBI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm (such as low coverage density, behavioral redundancy, and safety risks), we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over 30× compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: this https URL

[CV-71] Continuous Marine Tracking via Autonomous UAV Handoff MICRO

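[Quick read]: This paper addresses continuous, real-time tracking of marine animals such as sharks in dynamic marine environments. The key to the solution is an onboard computer integrated with a stabilised RGB-D camera and a custom-trained OSTrack pipeline, combined with an inter-UAV handoff protocol based on high-confidence feature matching. The protocol transfers the tracking task seamlessly between drones, overcoming the battery limits of a single drone and extending operational coverage.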

Link: https://arxiv.org/abs/2507.12763
Authors: Heegyeong Kim(1),Alice James(1),Avishkar Seth(1),Endrowednes Kuantama(1),Jane Williamson(2),Yimeng Feng(1),Richard Han(1) ((1) School of Computing, Macquarie University, (2) School of Natural Sciences, Macquarie University)
Affiliations: Macquarie University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 6 pages, 5 figures, to be published in DroNet '25: Proceedings of the 10th Workshop on Micro Aerial Vehicle Networks, Systems, and Applications

Abstract:This paper introduces an autonomous UAV vision system for continuous, real-time tracking of marine animals, specifically sharks, in dynamic marine environments. The system integrates an onboard computer with a stabilised RGB-D camera and a custom-trained OSTrack pipeline, enabling visual identification under challenging lighting, occlusion, and sea-state conditions. A key innovation is the inter-UAV handoff protocol, which enables seamless transfer of tracking responsibilities between drones, extending operational coverage beyond single-drone battery limitations. Performance is evaluated on a curated shark dataset of 5,200 frames, achieving a tracking success rate of 81.9% during real-time flight control at 100 Hz, and robustness to occlusion, illumination variation, and background clutter. We present a seamless UAV handoff framework, where target transfer is attempted via high-confidence feature matching, achieving 82.9% target coverage. These results confirm the viability of coordinated UAV operations for extended marine tracking and lay the groundwork for scalable, autonomous monitoring.
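
The handoff step, where target transfer is attempted via high-confidence feature matching, might look roughly like the sketch below. It assumes the outgoing drone shares an appearance descriptor of the target and the incoming drone compares it with descriptors of its own detections. The cosine measure, the 0.8 threshold, and all names are assumptions for illustration, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attempt_handoff(target_descriptor, candidates, min_confidence=0.8):
    """Return (candidate_id, score) if a match is confident enough.

    candidates maps a detection id on the incoming drone to its descriptor.
    Returns (None, best_score) when no candidate clears the threshold, so
    the outgoing drone can keep tracking instead of handing off.
    """
    best_id, best_score = None, -1.0
    for candidate_id, descriptor in candidates.items():
        score = cosine(target_descriptor, descriptor)
        if score > best_score:
            best_id, best_score = candidate_id, score
    if best_score >= min_confidence:
        return best_id, best_score
    return None, best_score


accepted = attempt_handoff([1.0, 0.0], {"x": [0.99, 0.1], "y": [0.0, 1.0]})
rejected = attempt_handoff([1.0, 0.0], {"y": [0.0, 1.0]})
print(accepted, rejected)
```

Refusing the transfer when confidence is low is the design choice that matters here: a wrong handoff would silently switch the track to a different object.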

[CV-72] World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving

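[Quick read]: This paper addresses reliable anticipation of traffic accidents in autonomous driving systems. The main challenges are the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues caused by environmental disruptions or sensor deficiencies. The key to the solution is a comprehensive framework that combines generative scene augmentation with adaptive temporal reasoning. A video generation pipeline uses a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios that enrich coverage of edge cases and complex interactions, while a dynamic prediction model encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators to cope with incomplete data and transient visual noise.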

Link: https://arxiv.org/abs/2507.12762
Authors: Yanchen Guan,Haicheng Liao,Chengyue Wang,Xingcheng Liu,Jiaxun Zhang,Zhenning Li
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.

[CV-73] Think-Before-Draw: Decomposing Emotion Semantics Fine-Grained Controllable Expressive Talking Head Generation

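[Quick read]: This paper addresses the failure of emotional talking-head generation to produce natural emotional expression: reliance on predefined discrete emotion label texts oversimplifies the dynamic complexity of facial muscle movements. The key to the solution is the Think-Before-Draw framework, built on two core mechanisms. First, Chain-of-Thought (CoT) is introduced for in-depth semantic parsing, turning abstract emotion labels into physiologically grounded descriptions of facial muscle movements and thereby mapping high-level semantics to actionable motion features. Second, inspired by how artists paint portraits, a progressive guidance denoising strategy with a "global emotion localization, local muscle control" mechanism refines the micro-expression dynamics of the generated results.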

Link: https://arxiv.org/abs/2507.12761
Authors: Hanlei Shi,Leyuan Qu,Yu Liu,Di Gao,Yuhua Zheng,Taihao Li
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic this http URL the advancement of multimodal large language models, the driving signals for emotional talking-head generation has shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional this http URL study proposes the Think-Before-Draw framework to address two key challenges: (1) In-depth semantic parsing of emotions–by innovatively introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) Fine-grained expressiveness optimization–inspired by artists’ portrait painting process, a progressive guidance denoising strategy is proposed, employing a “global emotion localization–local muscle control” mechanism to refine micro-expression dynamics in generated this http URL experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model’s zero-shot generation capability.

[CV-74] Unified Medical Image Segmentation with State Space Modeling Snake ACM-MM2025

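[Quick read]: This paper addresses the challenges that multi-scale structural heterogeneity poses for unified medical image segmentation (UMIS). Conventional pixel-based methods lack object-level anatomical insight and inter-organ relational modeling, so they struggle with morphological complexity and feature conflicts. The key to the solution is the Mamba Snake framework, which enhances a deep snake network with state space modeling and frames multi-contour evolution as a hierarchical state space atlas, capturing macroscopic inter-organ topological relationships as well as microscopic contour refinements. It introduces a snake-specific vision state space module, the Mamba Evolution Block (MEB), and combines energy map shape priors with a dual-classification synergy mechanism for adaptive refinement of complex morphologies and better segmentation of microstructures.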

Link: https://arxiv.org/abs/2507.12760
Authors: Ruicheng Zhang,Haowei Guo,Kanghui Tian,Jun Zhou,Mingliang Yan,Zeyu Zhang,Shen Zhao
Affiliations: Sun Yat-sen University; Tsinghua Shenzhen International Graduate School; Beijing University of Posts and Telecommunications; The Australian National University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ACM MM 2025

Abstract:Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake’s superior performance, with an average Dice improvement of 3% over state-of-the-art methods.
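
The Dice figure reported above is the standard overlap measure between a predicted mask and a ground-truth mask, 2|A∩B| / (|A| + |B|). A minimal sketch, assuming binary masks flattened into lists of 0/1 values; the function name `dice_score` and the convention of returning 1.0 for two empty masks are choices made here, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```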

[CV-75] HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation

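[Quick read]: This paper addresses temporal consistency, spatial fidelity, and dynamic adaptability in video hairstyle transfer. The key to the solution is the HairShifter framework, which integrates an Image Hair Transfer (IHT) module for precise per-frame transformation with a Multi-Scale Gated SPADE Decoder that ensures seamless spatial blending and temporal coherence, maintaining hairstyle fidelity while keeping non-hair regions stable.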

Link: https://arxiv.org/abs/2507.12758
Authors: Wangzheng Shi,Yinglin Zheng,Yuxin Lin,Jianmin Bao,Ming Zeng,Dong Chen
Affiliations: Xiamen University; Microsoft Research Asia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel “Anchor Frame + Animation” framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates an Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.

[CV-76] Domain-Enhanced Dual-Branch Model for Efficient and Interpretable Accident Anticipation

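[Quick read]: This paper addresses the accuracy and computational efficiency of traffic accident anticipation systems, which support timely intervention and loss prevention in autonomous driving. The key to the solution is a dual-branch architecture that integrates visual information from dashcam videos with structured textual data derived from accident reports. Features are aggregated through large models (GPT-4o, Long-CLIP) and complemented by targeted prompt engineering, enabling seamless integration of multimodal inputs and generation of actionable feedback.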

Link: https://arxiv.org/abs/2507.12755
Authors: Yanchen Guan,Haicheng Liao,Chengyue Wang,Bonan Wang,Jiaxun Zhang,Jia Hu,Zhenning Li
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Developing a precise and computationally efficient traffic accident anticipation system is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework employing a dual-branch architecture that effectively integrates visual information from dashcam videos with structured textual data derived from accident reports. Furthermore, we introduce a feature aggregation method that facilitates seamless integration of multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies to produce actionable feedback and standardized accident archives. Comprehensive evaluations conducted on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, thus establishing a new benchmark for state-of-the-art performance in traffic accident anticipation.

[CV-77] Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning

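[Quick read]: This paper addresses uneven data quality and widespread redundancy in the data used to train deep models, aiming to improve training efficiency and model performance through a data-centric approach. The key to the solution is a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency, using supervision from pretrained multimodal foundation models to capture training dynamics and filter out uninformative samples.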

Link: https://arxiv.org/abs/2507.12750
Authors: Suorong Yang,Peijia Li,Yujie Liu,Zhiming Xu,Peng Ye,Wanli Ouyang,Furao Shen,Dongzhan Zhou
Affiliations: not listed
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
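
A selection step that combines task-driven difficulty with cross-modality consistency could be sketched as follows. The abstract does not give the scoring rule, so everything here is an assumption for illustration: per-sample loss stands in for difficulty, an image-text similarity from a pretrained multimodal model stands in for consistency, both are min-max normalized, and `alpha` and `keep_ratio` are hypothetical knobs.

```python
def normalize(values):
    """Min-max normalize a list to [0, 1]; constant lists map to 0.5."""
    low, high = min(values), max(values)
    if high == low:
        return [0.5] * len(values)
    return [(v - low) / (high - low) for v in values]


def select_samples(losses, consistency, keep_ratio=0.5, alpha=0.5):
    """Return sorted indices of the samples kept for the next epoch.

    losses: current per-sample training loss (difficulty proxy).
    consistency: per-sample cross-modal agreement score (assumed given).
    """
    if len(losses) != len(consistency):
        raise ValueError("inputs must have the same length")
    if not losses:
        return []
    difficulty = normalize(losses)
    agreement = normalize(consistency)
    scores = [alpha * d + (1.0 - alpha) * a
              for d, a in zip(difficulty, agreement)]
    keep = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


kept = select_samples(losses=[2.0, 0.1, 1.5, 0.2],
                      consistency=[0.9, 0.8, 0.1, 0.2])
print(kept)
```

Because the losses change as training proceeds, calling this once per epoch makes the kept subset dynamic rather than fixed up front.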

[CV-78] Transformer-based Spatial Grounding: A Comprehensive Survey

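[Quick read]: This paper addresses the lack of a comprehensive survey of spatial grounding covering current methods, dataset usage, evaluation metrics, and industrial applicability. The key to the solution is a systematic review of transformer-based spatial grounding approaches from 2018 to 2025 that analyzes dominant model architectures, common datasets, and evaluation metrics and summarizes key methodological trends and best practices, giving researchers and practitioners structured guidance.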

Link: https://arxiv.org/abs/2507.12739
Authors: Ijazul Haq,Muhammad Saqib,Yingjie Zhang
Affiliations: South China University of Technology; Shien-Ming Wu School of Intelligent Manufacturing; University of Engineering & Technology; Zirak.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.

[CV-79] A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique

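[Quick read]: This paper addresses how to protect privacy when images are used for both model training and testing while preserving the accuracy of semantic segmentation models. The key to the solution is a domain-adaptation technique applied to the embedding structure of the Vision Transformer (ViT), which lets a semantic segmentation model trained and tested on perceptually encrypted images reach almost the same accuracy as a model without encryption.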

Link: https://arxiv.org/abs/2507.12730
Authors: Homare Sueyoshi,Kiyoshi Nishikawa,Hitoshi Kiya
Affiliations: Tokyo Metropolitan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 4 pages, 5 figures, 1 table. Accepted to GCCE 2025

Abstract:We propose a privacy-preserving semantic-segmentation method for applying perceptual encryption to images used for model training in addition to test images. This method also provides almost the same accuracy as models without any encryption. The above performance is achieved using a domain-adaptation technique on the embedding structure of the Vision Transformer (ViT). The effectiveness of the proposed method was experimentally confirmed in terms of the accuracy of semantic segmentation when using a powerful semantic-segmentation model with ViT called Segmentation Transformer.

[CV-80] SOD-YOLO: Enhancing YOLO-Based Detection of Small Objects in UAV Imagery

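Quick read: This paper addresses the challenge of small object detection within object detection. The key to the solution is SOD-YOLO, an enhanced YOLOv8-based model that integrates an ASF mechanism in the neck to strengthen multi-scale feature fusion, adds a small object detection layer (named P2) to provide higher-resolution feature maps, and uses Soft-NMS to refine confidence scores and retain true positives. These changes significantly improve detection performance.

Link: https://arxiv.org/abs/2507.12727
Authors: Peijun Wang, Jinhua Zhao
Affiliations: Central China Normal University; University of Wollongong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: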

Abstract:Small object detection remains a challenging problem in the field of object detection. To address this challenge, we propose an enhanced YOLOv8-based model, SOD-YOLO. This model integrates an ASF mechanism in the neck to enhance multi-scale feature fusion, adds a Small Object Detection Layer (named P2) to provide higher-resolution feature maps for better small object detection, and employs Soft-NMS to refine confidence scores and retain true positives. Experimental results demonstrate that SOD-YOLO significantly improves detection performance, achieving a 36.1% increase in mAP_50:95 and 20.6% increase in mAP_50 on the VisDrone2019-DET dataset compared to the baseline model. These enhancements make SOD-YOLO a practical and efficient solution for small object detection in UAV imagery. Our source code, hyper-parameters, and model weights are available at this https URL.
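
The Soft-NMS step mentioned in the abstract can be illustrated with a minimal sketch. This is the generic Gaussian variant of Soft-NMS, not code from the SOD-YOLO repository; the function names, the `sigma` value, and the score threshold are assumptions chosen for illustration.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them.

    Returns (index, rescored_confidence) pairs in selection order.
    sigma and score_threshold are illustrative defaults (assumptions).
    """
    candidates = [[i, float(s)] for i, s in enumerate(scores)]
    kept = []
    while candidates:
        # Pick the remaining box with the highest (possibly decayed) score.
        best = max(range(len(candidates)), key=lambda k: candidates[k][1])
        idx, score = candidates.pop(best)
        if score < score_threshold:
            break  # every remaining box scores even lower
        kept.append((idx, score))
        # Decay the rest according to their overlap with the selected box.
        for cand in candidates:
            overlap = iou(boxes[idx], boxes[cand[0]])
            cand[1] *= math.exp(-(overlap ** 2) / sigma)
    return kept
```

Unlike hard NMS, a box that overlaps a stronger detection is kept with a reduced score, which is why the technique tends to help when small objects are densely packed.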

[CV-81] NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement ICCV2025

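Quick read: This paper addresses the challenges of 3D modeling and reconstruction of plant leaves, in particular those caused by the diversity of leaf shapes and their flexible deformation. The key to the solution is NeuraLeaf, a neural parametric model that disentangles leaf geometry into a 2D base shape and a 3D deformation, learns the base shapes from rich 2D leaf image datasets, and simultaneously learns textures aligned with the geometry. To model the 3D deformation, the authors also propose a skeleton-free skinning model and build a new 3D leaf dataset, DeformLeaf.

Link: https://arxiv.org/abs/2507.12714
Authors: Yang Yang, Dongni Mao, Hiroaki Santo, Yasuyuki Matsushita, Fumio Okura
Affiliations: The University of Osaka; Microsoft Research Asia – Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: IEEE/CVF International Conference on Computer Vision (ICCV 2025), Project: this https URL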

Abstract:We develop a neural parametric model for 3D leaves for plant modeling and reconstruction that are essential for agriculture and computer graphics. While neural parametric models are actively studied for humans and animals, plant leaves present unique challenges due to their diverse shapes and flexible deformation. To this problem, we introduce a neural parametric model for leaves, NeuraLeaf. Capitalizing on the fact that flattened leaf shapes can be approximated as a 2D plane, NeuraLeaf disentangles the leaves’ geometry into their 2D base shapes and 3D deformations. This representation allows learning from rich sources of 2D leaf image datasets for the base shapes, and also has the advantage of simultaneously learning textures aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and create a newly captured 3D leaf dataset called DeformLeaf. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and dataset are available at this https URL.

[CV-82] FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation via Kolmogorov-Arnold Enhanced Spatial Attention Networks

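Quick read: This paper aims to solve the challenge faced by automated structural defect segmentation in civil infrastructure: achieving high accuracy while maintaining the computational efficiency needed for real-time deployment. The key to the solution is the FORTRESS architecture, which balances accuracy and speed by combining depthwise separable convolutions with adaptive Kolmogorov-Arnold Network (TiKAN) integration. The architecture includes three key innovations: a systematic depthwise separable convolution framework, adaptive TiKAN integration that applies function-composition transformations only when computationally beneficial, and multi-scale attention fusion of spatial, channel, and KAN-enhanced features across decoder levels.

Link: https://arxiv.org/abs/2507.12675
Authors: Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Affiliations: Canizaro Livingston Gulf States Center for Environmental Informatics, the University of New Orleans; Graz University of Technology; US Army Corps of Engineers, Engineer Research and Development Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: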

Abstract:Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by using a special method that combines depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1-score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U-KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at URL: this https URL.
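
The parameter savings of depthwise separable convolutions, and the headline percentages in the abstract, can be checked with a few lines of arithmetic. The sketch below uses the textbook weight counts (bias terms ignored); the 64-to-128-channel layer is a hypothetical example, not a layer from FORTRESS, so its ratio will not match the paper's reported 3.6x per-layer figure, which depends on the actual layer configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (1.0 - after / before)


# Hypothetical layer: 64 -> 128 channels with a 3 x 3 kernel.
std = standard_conv_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep, std / sep)

# Figures quoted in the abstract: 31M -> 2.9M parameters, 13.7 -> 1.17 GFLOPs.
print(percent_reduction(31.0, 2.9))
print(percent_reduction(13.7, 1.17))
```

Both quoted reductions round to the 91% stated in the abstract.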

[CV-83] Integrated Oculomics and Lipidomics Reveal Microvascular Metabolic Signatures Associated with Cardiovascular Health in a Healthy Cohort

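Quick read: This paper addresses the difficulty current cardiovascular disease (CVD) risk stratification methods have in detecting early, subclinical changes. Its key contribution is to integrate, for the first time, retinal microvascular traits with comprehensive serum lipidomic profiles as potential indicators of CVD risk. The study introduces an imaging-omics framework that combines retinal microvascular traits extracted by deep-learning-based image processing with serum lipidomic data obtained by ultra-high-performance liquid chromatography electrospray ionization high-resolution mass spectrometry (UHPLC ESI HRMS), revealing asymptomatic biomarkers beyond the conventional lipid panel. Through large-scale, covariate-adjusted, and stratified correlation analysis, the study provides an important basis for identifying early indicators of disease.

Link: https://arxiv.org/abs/2507.12663
Authors: Inamullah, Ernesto Elias Vidal Rosas, Imran Razzak, Shoaib Jameel
Affiliations: University of Southampton; MBZUAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: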

Abstract:Cardiovascular disease (CVD) remains the leading global cause of mortality, yet current risk stratification methods often fail to detect early, subclinical changes. Previous studies have generally not integrated retinal microvasculature characteristics with comprehensive serum lipidomic profiles as potential indicators of CVD risk. In this study, an innovative imaging omics framework was introduced, combining retinal microvascular traits derived through deep learning based image processing with serum lipidomic data to highlight asymptomatic biomarkers of cardiovascular risk beyond the conventional lipid panel. This represents the first large scale, covariate adjusted and stratified correlation analysis conducted in a healthy population, which is essential for identifying early indicators of disease. Retinal phenotypes were quantified using automated image analysis tools, while serum lipid profiling was performed by Ultra High Performance Liquid Chromatography Electrospray ionization High resolution mass spectrometry (UHPLC ESI HRMS). Strong, age- and sex-independent correlations were established, particularly between average artery width, vessel density, and lipid subclasses such as triacylglycerols (TAGs), diacylglycerols (DAGs), and ceramides (Cers). These associations suggest a converging mechanism of microvascular remodeling under metabolic stress. By linking detailed vascular structural phenotypes to specific lipid species, this study fills a critical gap in the understanding of early CVD pathogenesis. This integration not only offers a novel perspective on microvascular metabolic associations but also presents a significant opportunity for the identification of robust, non-invasive biomarkers. Ultimately, these findings may support improved early detection, targeted prevention, and personalized approaches in cardiovascular healthcare. 

[CV-84] Reconstruct Inpaint Finetune: Dynamic Novel-view Synthesis from Monocular Videos

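Quick read: This paper addresses novel-view synthesis of dynamic scenes from monocular videos. Existing methods either rely on costly test-time optimization of 4D representations or fail to preserve scene geometry when trained in a feed-forward manner. The solution rests on three core insights: (1) pixels visible in both the input and target views can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel view; (2) hidden pixels in the novel view can be "inpainted" with a feed-forward 2D video diffusion model; and (3) the proposed video inpainting diffusion model (CogNVS) can be trained with self-supervision from 2D videos and applied zero-shot to new test videos via test-time finetuning.

Link: https://arxiv.org/abs/2507.12646
Authors: Kaihua Chen, Tarasha Khurana, Deva Ramanan
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL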

Abstract:We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be “inpainted” with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

[CV-85] Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection

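Quick read: This paper aims to solve the limited labeled data and long-tail distribution in Human-Object Interaction Detection (HOID) that result from the large number of interaction categories. The key to the solution is anticipating HOI-specific cues at the encoder stage to obtain a stronger scene interpretation. To this end, the authors propose Funnel-HOI, a top-down framework that first probes for objects (well-defined concepts) and then for the actions associated with them (abstract concepts). A novel asymmetric co-attention mechanism uses multimodal information (incorporating zero-shot capabilities) to produce stronger interaction representations at the encoder level, and a new loss function better accounts for object-action relatedness and regulates the misclassification penalty.

Link: https://arxiv.org/abs/2507.12628
Authors: Sandipan Sarma, Agney Talwarr, Arijit Sur
Affiliations: Indian Institute of Technology, Guwahati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures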

Abstract:Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. Since there could be an exponential number of object-action combinations, labeled data is limited, leading to a long-tail distribution problem. Recently, zero-shot learning emerged as a solution, with end-to-end transformer-based object detectors adapted for HOID becoming successful frameworks. However, their primary focus is designing improved decoders for learning entangled or disentangled interpretations of interactions. We advocate that HOI-specific cues must be anticipated at the encoder stage itself to obtain a stronger scene interpretation. Consequently, we build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. We first probe an image for the presence of objects (well-defined concepts) and then probe for actions (abstract concepts) associated with them. A novel asymmetric co-attention mechanism mines these cues utilizing multimodal information (incorporating zero-shot capabilities) and yields stronger interaction representations at the encoder level. Furthermore, a novel loss is devised that considers object-action relatedness and regulates misclassification penalty better than existing loss functions for guiding the interaction classifier. Extensive experiments on the HICO-DET and V-COCO datasets across fully-supervised and six zero-shot settings reveal our state-of-the-art performance, with up to 12.4% and 8.4% gains for unseen and rare HOI categories, respectively.

[CV-86] Predicting Soccer Penalty Kick Direction Using Human Action Recognition

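Quick read: This paper aims to solve the limited application of action anticipation in real-world sports scenarios, caused in particular by the lack of suitable annotated datasets. The key to the solution is a manually annotated dataset of soccer penalty kicks, together with a deep learning classifier that combines Human Action Recognition (HAR)-based feature embeddings with contextual metadata to predict shot direction. Evaluated on twenty-two backbone models across seven architecture families, the method reaches up to 63.9% accuracy in predicting shot direction (left or right), outperforming the decisions of real goalkeepers and validating the dataset and the method.

Link: https://arxiv.org/abs/2507.12617
Authors: David Freire-Obregón, Oliverio J. Santana, Javier Lorenzo-Navarro, Daniel Hernández-Sosa, Modesto Castrillón-Santana
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 23rd International Conference on Image Analysis and Processing (ICIAP 2025)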

Abstract:Action anticipation has become a prominent topic in Human Action Recognition (HAR). However, its application to real-world sports scenarios remains limited by the availability of suitable annotated datasets. This work presents a novel dataset of manually annotated soccer penalty kicks to predict shot direction based on pre-kick player movements. We propose a deep learning classifier to benchmark this dataset that integrates HAR-based feature embeddings with contextual metadata. We evaluate twenty-two backbone models across seven architecture families (MViTv2, MViTv1, SlowFast, Slow, X3D, I3D, C2D), achieving up to 63.9% accuracy in predicting shot direction (left or right), outperforming the real goalkeepers’ decisions. These results demonstrate the dataset’s value for anticipatory action recognition and validate our model’s potential as a generalizable approach for sports-based predictive tasks.

[CV-87] MS-DGCNN: A Multi-Scale Fusion Dynamic Graph Neural Network with Biological Knowledge Integration for LiDAR Tree Species Classification

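Quick read: This paper addresses tree species classification from terrestrial LiDAR point clouds, a task made challenging by the complex multi-scale geometric structures in forest environments. The key to the solution is MS-DGCNN++, a hierarchical multi-scale fusion dynamic graph convolutional network that performs semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation, thereby capturing the semantic relationships within the tree structure. The method uses scale-specific feature engineering in place of conventional parallel processing, yielding semantically differentiated representations aligned with the natural tree structure.

Link: https://arxiv.org/abs/2507.12602
Authors: Said Ohamouddou, Abdellatif El Afia, Hanaa El Afia, Raddouane Chiheb
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: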

Abstract:Tree species classification from terrestrial LiDAR point clouds is challenging because of the complex multi-scale geometric structures in forest environments. Existing approaches using multi-scale dynamic graph convolutional neural networks (MS-DGCNN) employ parallel multi-scale processing, which fails to capture the semantic relationships between the hierarchical levels of the tree architecture. We present MS-DGCNN++, a hierarchical multiscale fusion dynamic graph convolutional network that uses semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation. Our method employs scale-specific feature engineering, including standard geometric features for the local scale, normalized relative vectors for the branch scale, and distance information for the canopy scale. This hierarchical approach replaces uniform parallel processing with semantically differentiated representations that are aligned with the natural tree structure. Under the same proposed tree species data augmentation strategy for all experiments, MS-DGCNN++ achieved an accuracy of 94.96% on STPCTLS, outperforming DGCNN, MS-DGCNN, and the state-of-the-art model PPT. On FOR-species20K, it achieves 67.25% accuracy (6.1% improvement compared to MS-DGCNN). For standard 3D object recognition, our method outperformed DGCNN and MS-DGCNN with overall accuracies of 93.15% on ModelNet40 and 94.05% on ModelNet10. With lower parameters and reduced complexity compared to state-of-the-art transformer approaches, our method is suitable for resource-constrained applications while maintaining a competitive accuracy. Beyond tree classification, the method generalizes to standard 3D object recognition, establishing it as a versatile solution for diverse point cloud processing applications. The implementation code is publicly available at this https URL.
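
One of the scale-specific features named in the abstract, normalized relative vectors over a k-nearest-neighbor graph, might look roughly like the sketch below. This is an assumption-laden illustration of the general idea only: the function names and the brute-force neighbor search are hypothetical, and the paper's actual feature definitions and neighborhood sizes are not given in the abstract.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force, excluding i)."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return [j for _, j in dists[:k]]


def normalized_relative_vectors(points, i, k):
    """Unit direction vectors from points[i] to each of its k nearest neighbors."""
    p = points[i]
    features = []
    for j in knn_indices(points, i, k):
        diff = [qc - pc for qc, pc in zip(points[j], p)]
        norm = math.sqrt(sum(d * d for d in diff))
        # Guard against duplicate points, which have zero-length offsets.
        features.append([d / norm if norm > 0 else 0.0 for d in diff])
    return features
```

Normalizing removes the neighbor distance and keeps only direction, which is one plausible reason such a feature would suit branch-level structure while raw distances are reserved for the canopy scale.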

[CV-88] HairFormer: Transformer-Based Dynamic Neural Hair Simulation

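Quick read: This paper addresses the challenge of simulating hair dynamics across arbitrary hairstyles, body shapes, and motions. The key to the solution is a two-stage neural method based on Transformer architectures. In the first stage, a Transformer-powered static network predicts the static draped shape of any hairstyle, resolving hair-body penetrations while preserving hair fidelity. In the second stage, a dynamic network with a novel cross-attention mechanism fuses static hair features with kinematic input to generate expressive dynamics and complex secondary motions, and supports efficient fine-tuning on challenging motion sequences.

Link: https://arxiv.org/abs/2507.12600
Authors: Joy Xiaoji Zhang, Jingsen Zhu, Hanyu Chen, Steve Marschner
Affiliations: Cornell University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: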

Abstract:Simulating hair dynamics that generalize across arbitrary hairstyles, body shapes, and motions is a critical challenge. Our novel two-stage neural solution is the first to leverage Transformer-based architectures for such a broad generalization. We propose a Transformer-powered static network that predicts static draped shapes for any hairstyle, effectively resolving hair-body penetrations and preserving hair fidelity. Subsequently, a dynamic network with a novel cross-attention mechanism fuses static hair features with kinematic input to generate expressive dynamics and complex secondary motions. This dynamic network also allows for efficient fine-tuning of challenging motion sequences, such as abrupt head movements. Our method offers real-time inference for both static single-frame drapes and dynamic drapes over pose sequences. Our method demonstrates high-fidelity and generalizable dynamic hair across various styles, guided by physics-informed losses, and can resolve penetrations even for complex, unseen long hairstyles, highlighting its broad generalization.

[CV-89] CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling ICCV2025

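Quick read: This paper addresses the difficulty of analyzing radiologists' eye movements during CT reading, and the fact that existing scanpath prediction models handle only 2D inputs and cannot accommodate 3D CT data. The key to the solution is CT-ScanGaze, the first publicly available CT eye gaze dataset, together with CT-Searcher, a novel 3D scanpath predictor designed to process CT volumes and generate radiologist-like 3D fixation sequences. To improve model performance, the authors also build a pretraining pipeline that converts existing 2D gaze data into 3D gaze data.

Link: https://arxiv.org/abs/2507.12591
Authors: Trong-Thang Pham, Akash Awasthi, Saba Khan, Esteban Duran Marti, Tien-Phat Nguyen, Khoa Vo, Minh Tran, Ngoc Son Nguyen, Cuong Tran Van, Yuki Ikebe, Anh Totti Nguyen, Anh Nguyen, Zhigang Deng, Carol C. Wu, Hien Van Nguyen, Ngan Le
Affiliations: University of Arkansas; University of Houston; University of Science VNU-HCM; FPT Software; Auburn University; University of Liverpool; MD Anderson Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025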

Abstract:Understanding radiologists’ eye movement during Computed Tomography (CT) reading is crucial for developing effective interpretable computer-aided diagnosis systems. However, CT research in this area has been limited by the lack of publicly available eye-tracking datasets and the three-dimensional complexity of CT volumes. To address these challenges, we present the first publicly available eye gaze dataset on CT, called CT-ScanGaze. Then, we introduce CT-Searcher, a novel 3D scanpath predictor designed specifically to process CT volumes and generate radiologist-like 3D fixation sequences, overcoming the limitations of current scanpath predictors that only handle 2D inputs. Since deep learning models benefit from a pretraining step, we develop a pipeline that converts existing 2D gaze datasets into 3D gaze data to pretrain CT-Searcher. Through both qualitative and quantitative evaluations on CT-ScanGaze, we demonstrate the effectiveness of our approach and provide a comprehensive assessment framework for 3D scanpath prediction in medical imaging.

[CV-90] Best Practices for Large-Scale Pixel-Wise Crop Mapping and Transfer Learning Workflows

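Quick read: This paper aims to optimize workflows for large-scale, pixel-wise crop mapping, covering both conventional supervised learning and emerging transfer learning techniques. The key to the solution is a systematic evaluation and comparison of multiple satellite image preprocessing methods and supervised classification models, an exploration of how training sample size and variable combinations affect performance, and the identification of the best transfer learning techniques for different degrees of domain shift. The study finds that fine-scale interval preprocessing combined with Transformer models performs best in both supervised and transferable workflows, and that transfer learning markedly improves workflow adaptability, offering a viable alternative especially when labeled samples are limited.

Link: https://arxiv.org/abs/2507.12590
Authors: Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: A review article. 41 pages, 22 figures. Preprint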

Abstract:Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal supervised crop mapping workflows, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of best methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. RF offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with UDA being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. Repository: Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows

[CV-91] MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

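Quick read: This paper addresses the shortcomings of current Vision-Language Models (VLMs) in 3D spatial reasoning, in particular their poor performance at predicting how a scene changes after an egocentric motion. The key to the solution is the MindJourney framework, which couples a VLM with a controllable world model based on video diffusion. Without any fine-tuning, the VLM iteratively sketches a concise camera trajectory while the world model synthesizes the corresponding views, enabling interactive exploration and reasoning over multi-view evidence and significantly improving 3D spatial reasoning.

Link: https://arxiv.org/abs/2507.12508
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Affiliations: UMass Amherst; JHU; HKUST; Microsoft Research; Harvard
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project Page: this https URL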

Abstract:Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 8% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.

[CV-92] Physically Based Neural LiDAR Resimulation ITSC2025

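Quick read: This paper aims to solve the insufficient modeling of sensor characteristics in LiDAR simulation, in particular rolling shutter effects, laser power variations, and intensity falloff. The key to the solution is to model these LiDAR-specific sensor characteristics explicitly, achieving more accurate LiDAR simulation than existing techniques. Quantitative and qualitative comparisons and ablation studies confirm the effectiveness of the method and the importance of each component.

Link: https://arxiv.org/abs/2507.12489
Authors: Richard Marcus, Marc Stamminger
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: Accepted at ITSC 2025, Gold Coast Australia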

Abstract:Methods for Novel View Synthesis (NVS) have recently found traction in the field of LiDAR simulation and large-scale 3D scene reconstruction. While solutions for faster rendering or handling dynamic scenes have been proposed, LiDAR specific effects remain insufficiently addressed. By explicitly modeling sensor characteristics such as rolling shutter, laser power variations, and intensity falloff, our method achieves more accurate LiDAR simulation compared to existing techniques. We demonstrate the effectiveness of our approach through quantitative and qualitative comparisons with state-of-the-art methods, as well as ablation studies that highlight the importance of each sensor model component. Beyond that, we show that our approach exhibits advanced resimulation capabilities, such as generating high resolution LiDAR scans in the camera perspective. Our code and the resulting dataset are available at this https URL.

[CV-93] Predicting 3D Rigid Body Dynamics with Deep Residual Network

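Quick read: This paper addresses the prediction of the dynamics of interacting 3D rigid bodies, aiming to use a deep learning model to accurately simulate complex physical behavior such as linear and angular motion, elastic collisions, fluid friction, gravitational effects, and damping. The key to the solution is a framework that combines a 3D physics simulator with a deep residual network, whose stacked residual blocks capture and model complex 3D dynamics. The model reaches mean squared errors of 0.015 for position and 0.022 for orientation predictions, a 25% improvement over baseline methods.

Link: https://arxiv.org/abs/2407.18798
Authors: Abiodun Finbarrs Oketunji
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: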

Abstract:This study investigates the application of deep residual networks for predicting the dynamics of interacting three-dimensional rigid bodies. We present a framework combining a 3D physics simulator implemented in C++ with a deep learning model constructed using PyTorch. The simulator generates training data encompassing linear and angular motion, elastic collisions, fluid friction, gravitational effects, and damping. Our deep residual network, consisting of an input layer, multiple residual blocks, and an output layer, is designed to handle the complexities of 3D dynamics. We evaluate the network’s performance using a dataset of 10,000 simulated scenarios, each involving 3-5 interacting rigid bodies. The model achieves a mean squared error of 0.015 for position predictions and 0.022 for orientation predictions, representing a 25% improvement over baseline methods. Our results demonstrate the network’s ability to capture intricate physical interactions, with particular success in predicting elastic collisions and rotational dynamics. This work significantly contributes to physics-informed machine learning by showcasing the immense potential of deep residual networks in modeling complex 3D physical systems. We discuss our approach’s limitations and propose future directions for improving generalization to more diverse object shapes and materials.

[CV-94] SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution

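Quick read: This paper aims to solve the fusion of high-spatial-resolution multispectral images (HR-MSI) with low-spatial-resolution hyperspectral images (LR-HSI), recovering fine spatial structures without sacrificing spectral fidelity. Conventional methods usually depend on point spread function (PSF) calibration or on high-resolution hyperspectral images (HR-HSI) as supervision, which are hard to obtain in the real world. The proposed SpectraLift framework instead achieves fully self-supervised fusion using only the spectral response function (SRF) of the multispectral image. Its key idea is to train a lightweight per-pixel multi-layer perceptron (MLP) with a synthetic low-spatial-resolution multispectral image (LR-MSI) as input, the LR-HSI as output, and an ℓ₁ spectral reconstruction loss, and then to map the HR-MSI to a high-resolution hyperspectral estimate at inference.

Link: https://arxiv.org/abs/2507.13339
Authors: Ritik Shah, Marco F. Duarte
Affiliations: University of Massachusetts
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: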

Abstract:High-spatial-resolution hyperspectral images (HSI) are essential for applications such as remote sensing and medical imaging, yet HSI sensors inherently trade spatial detail for spectral richness. Fusing high-spatial-resolution multispectral images (HR-MSI) with low-spatial-resolution hyperspectral images (LR-HSI) is a promising route to recover fine spatial structures without sacrificing spectral fidelity. Most state-of-the-art methods for HSI-MSI fusion demand point spread function (PSF) calibration or ground truth high resolution HSI (HR-HSI), both of which are impractical to obtain in real world settings. We present SpectraLift, a fully self-supervised framework that fuses LR-HSI and HR-MSI inputs using only the MSI’s Spectral Response Function (SRF). SpectraLift trains a lightweight per-pixel multi-layer perceptron (MLP) network using (i) a synthetic low-spatial-resolution multispectral image (LR-MSI) obtained by applying the SRF to the LR-HSI as input, (ii) the LR-HSI as the output, and (iii) an $\ell_1$ spectral reconstruction loss between the estimated and true LR-HSI as the optimization objective. At inference, SpectraLift uses the trained network to map the HR-MSI pixel-wise into a HR-HSI estimate. SpectraLift converges in minutes, is agnostic to spatial blur and resolution, and outperforms state-of-the-art methods on PSNR, SAM, SSIM, and RMSE benchmarks.
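
The self-supervised training setup described in the abstract can be sketched as follows: the SRF projects each LR-HSI pixel to a synthetic LR-MSI pixel, and the resulting (LR-MSI, LR-HSI) pairs supervise the per-pixel network with an ℓ₁ loss. The sketch shows only the data construction and the loss, with hypothetical function names and a toy SRF; the MLP itself and the authors' actual implementation are not reproduced here.

```python
def apply_srf(hsi_pixel, srf):
    """Project one hyperspectral pixel (B bands) to b multispectral bands.

    `srf` is a b x B matrix; each row weights the hyperspectral bands that
    one multispectral band integrates over.
    """
    return [sum(w * v for w, v in zip(row, hsi_pixel)) for row in srf]


def l1_loss(pred, target):
    """Mean absolute error between two spectra of equal length."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def build_training_pairs(lr_hsi_pixels, srf):
    """(input, target) pairs: synthetic LR-MSI pixel in, original LR-HSI pixel out."""
    return [(apply_srf(pixel, srf), pixel) for pixel in lr_hsi_pixels]


# Toy example (assumed values): 4 hyperspectral bands, 2 multispectral bands.
toy_srf = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]
toy_lr_hsi = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.1, 0.9, 0.9]]
pairs = build_training_pairs(toy_lr_hsi, toy_srf)
```

Because the mapping is learned per pixel, the same trained network can be applied to HR-MSI pixels at inference without any knowledge of the spatial blur, which is consistent with the abstract's claim of being agnostic to spatial blur and resolution.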

[CV-95] fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting

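Quick read: This paper addresses generation efficiency in healthy tissue inpainting, in particular fast, high-quality inpainting of 3D medical images. The key to the solution is to combine denoising diffusion probabilistic models (DDPMs) with generative adversarial networks (GANs), adopt a variance-preserving noise schedule, and choose suitable reconstruction losses, achieving accurate 3D inpainting in few time steps. The authors also apply the approach to a 3D wavelet diffusion model (WDM3D) that has no GAN component, further improving computational efficiency. The resulting fastWDM3D model uses only two time steps, greatly increasing processing speed while maintaining strong performance metrics.

Link: https://arxiv.org/abs/2507.13146
Authors: Alicia Durrer, Florentin Bieder, Paul Friedrich, Bjoern Menze, Philippe C. Cattin, Florian Kofler
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Philippe C. Cattin and Florian Kofler: equal contribution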

Abstract:Healthy tissue inpainting has significant applications, including the generation of pseudo-healthy baselines for tumor growth models and the facilitation of image registration. In previous editions of the BraTS Local Synthesis of Healthy Brain Tissue via Inpainting Challenge, denoising diffusion probabilistic models (DDPMs) demonstrated qualitatively convincing results but suffered from low sampling speed. To mitigate this limitation, we adapted a 2D image generation approach, combining DDPMs with generative adversarial networks (GANs) and employing a variance-preserving noise schedule, for the task of 3D inpainting. Our experiments showed that the variance-preserving noise schedule and the selected reconstruction losses can be effectively utilized for high-quality 3D inpainting in a few time steps without requiring adversarial training. We applied our findings to a different architecture, a 3D wavelet diffusion model (WDM3D) that does not include a GAN component. The resulting model, denoted as fastWDM3D, obtained a SSIM of 0.8571, a MSE of 0.0079, and a PSNR of 22.26 on the BraTS inpainting test set. Remarkably, it achieved these scores using only two time steps, completing the 3D inpainting process in 1.81 s per image. When compared to other DDPMs used for healthy brain tissue inpainting, our model is up to 800x faster while still achieving superior performance metrics. Our proposed method, fastWDM3D, represents a promising approach for fast and accurate healthy tissue inpainting. Our code is available at this https URL.
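
For readers unfamiliar with the term, a variance-preserving forward process mixes signal and noise as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, so the squared coefficients always sum to one. The sketch below shows this standard DDPM formulation with a linear beta schedule; the schedule endpoints and function names are illustrative assumptions, and the specific schedule used by fastWDM3D is not stated in the abstract.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoints are assumed, common DDPM defaults)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), i.e. the remaining signal fraction."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def vp_coefficients(alpha_bar_t):
    """Signal and noise scales; their squares sum to 1 (variance preserving)."""
    return math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)


def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t from the forward process for a flat list of voxel values x0."""
    signal, noise = vp_coefficients(alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]
```

If the input has unit variance, so does every noised sample, which keeps the network's input scale stable across time steps.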

[CV-96] From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation MICCAI2025

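[Quick read]: This paper addresses accurate segmentation of the orbital bones in facial CT images, particularly in regions with ambiguous boundaries and thin structures (such as the orbital medial wall and orbital floor), where existing methods often produce disconnected or under-segmented results. The key idea is a framework that corrects segmentation results using the consensus of multiple diffusion model outputs: a conditional Bernoulli diffusion model, trained on diverse annotation patterns per image, generates several plausible segmentations, and a consensus-driven correction based on position proximity, consensus level, and gradient direction similarity then improves recall in ambiguous regions while preserving the continuity of thin structures.

Link: https://arxiv.org/abs/2507.12985
Authors: Jinseo An,Min Jin Lee,Kyu Won Shim,Helen Hong
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted at MICCAI 2025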

Abstract:Accurate segmentation of orbital bones in facial computed tomography (CT) images is essential for the creation of customized implants for reconstruction of defected orbital bones, particularly challenging due to the ambiguous boundaries and thin structures such as the orbital medial wall and orbital floor. In these ambiguous regions, existing segmentation approaches often output disconnected or under-segmented results. We propose a novel framework that corrects segmentation results by leveraging consensus from multiple diffusion model outputs. Our approach employs a conditional Bernoulli diffusion model trained on diverse annotation patterns per image to generate multiple plausible segmentations, followed by a consensus-driven correction that incorporates position proximity, consensus level, and gradient direction similarity to correct challenging regions. Experimental results demonstrate that our method outperforms existing methods, significantly improving recall in ambiguous regions while preserving the continuity of thin structures. Furthermore, our method automates the manual process of segmentation result correction and can be applied to image-guided surgical planning and surgery.
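The "consensus level" ingredient of the correction step can be pictured with a small Python sketch. It only computes, per pixel, the fraction of sampled binary masks that agree on foreground, and then thresholds that fraction. The position-proximity and gradient-direction terms named in the abstract are not modelled, and the function names and the 0.5 threshold are assumptions made for illustration.

```python
def consensus_map(masks):
    """Fraction of sampled masks that mark each pixel as foreground."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]


def majority_mask(masks, threshold=0.5):
    """Keep pixels whose consensus level reaches the (assumed) threshold."""
    return [[1 if level >= threshold else 0 for level in row]
            for row in consensus_map(masks)]


# Three hypothetical segmentations sampled for the same 2 x 3 image patch.
samples = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
]
levels = consensus_map(samples)
fused = majority_mask(samples)
```

Pixels with intermediate consensus (here 2/3) are the ones a correction rule would need to examine most closely, since the samples disagree there.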

[CV-97] Improving Diagnostic Accuracy of Pigmented Skin Lesions With CNNs: an Application on the DermaMNIST Dataset

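[Quick read]: This paper addresses multi-class classification of pigmented skin lesions, with the aim of improving diagnostic accuracy for skin cancers such as melanoma. The key idea is to tune ResNet-50 and EfficientNetV2L models using transfer learning and different layer configurations, reaching classification performance on the DermaMNIST dataset that matches or surpasses existing methods. The results suggest that convolutional neural networks (CNNs) can meaningfully improve diagnostic accuracy in biomedical image analysis.

Link: https://arxiv.org/abs/2507.12961
Authors: Nerma Kadric,Amila Akagic,Medina Kapo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: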

Abstract:Pigmented skin lesions represent localized areas of increased melanin and can indicate serious conditions like melanoma, a major contributor to skin cancer mortality. The MedMNIST v2 dataset, inspired by MNIST, was recently introduced to advance research in biomedical imaging and includes DermaMNIST, a dataset for classifying pigmented lesions based on the HAM10000 dataset. This study assesses ResNet-50 and EfficientNetV2L models for multi-class classification using DermaMNIST, employing transfer learning and various layer configurations. One configuration achieves results that match or surpass existing methods. This study suggests that convolutional neural networks (CNNs) can drive progress in biomedical image analysis, significantly enhancing diagnostic accuracy.

[CV-98] Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion

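[Quick read]: This paper addresses the challenges of coronary artery segmentation: small vessel size, complex morphology, and low contrast with surrounding tissue. The key idea is a parallel encoding framework built on vision foundation models (VFMs), in which a vision transformer (ViT) encoder captures global structural features and a convolutional neural network (CNN) encoder extracts local details. A cross-branch variational fusion (CVF) module fuses the two feature streams adaptively, and an evidential-learning uncertainty refinement (EUR) module improves segmentation accuracy and robustness.

Link: https://arxiv.org/abs/2507.12938
Authors: Caixia Dong,Duwei Dai,Xinyi Han,Fan Liu,Xu Yang,Zongfang Li,Songhua Xu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: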

Abstract:Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at this https URL.

[CV-99] Tensor-Tensor Products, Group Representations, and Semidefinite Programming

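[Quick read]: This paper generalizes positive semidefiniteness and semidefinite programming (SDP) from linear algebra to third-order tensors. The key idea is a connection between the choice of the matrix M in the \star_M-product and the representation theory of an underlying group action. Within this framework, third-order tensors equipped with the \star_M-product are a natural setting for studying invariant semidefinite programs, with applications to characterizing certain nonnegative quadratic forms and to low-rank tensor completion.

Link: https://arxiv.org/abs/2507.12729
Authors: Alex Dunbar,Elizabeth Newman
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Representation Theory (math.RT)
Comments: 34 Pages, 7 figures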

Abstract:The \star_M -family of tensor-tensor products is a framework which generalizes many properties from linear algebra to third order tensors. Here, we investigate positive semidefiniteness and semidefinite programming under the \star_M -product. Critical to our investigation is a connection between the choice of matrix M in the \star_M -product and the representation theory of an underlying group action. Using this framework, third order tensors equipped with the \star_M -product are a natural setting for the study of invariant semidefinite programs. As applications of the M-SDP framework, we provide a characterization of certain nonnegative quadratic forms and solve low-rank tensor completion problems.
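For readers unfamiliar with the \star_M-product, the following Python sketch may help. It assumes the usual definition from the tensor-tensor product literature: transform both tensors along the third mode with an invertible matrix M, multiply matching frontal slices as ordinary matrices, and transform back with the inverse of M. Tensors are nested lists indexed as `T[i][j][k]`; all function names and the example matrix are hypothetical, and none of this is taken from the paper's code.

```python
def mode3(T, M):
    """Mode-3 product: apply matrix M along every tube T[i][j][:]."""
    return [[[sum(M[k][l] * tube[l] for l in range(len(tube)))
              for k in range(len(M))]
             for tube in row]
            for row in T]


def facewise(A, B):
    """Multiply matching frontal slices: C[:, :, k] = A[:, :, k] @ B[:, :, k]."""
    inner, depth = len(B), len(A[0][0])
    return [[[sum(A[i][p][k] * B[p][j][k] for p in range(inner))
              for k in range(depth)]
             for j in range(len(B[0]))]
            for i in range(len(A))]


def star_m(A, B, M, M_inv):
    """A star_M B: transform with M, multiply facewise, transform back."""
    return mode3(facewise(mode3(A, M), mode3(B, M)), M_inv)


def identity_tensor(n, M_inv):
    """Tensor whose transformed frontal slices are all n x n identity matrices."""
    depth = len(M_inv)
    eye_hat = [[[1.0 if i == j else 0.0] * depth for j in range(n)]
               for i in range(n)]
    return mode3(eye_hat, M_inv)


M = [[1.0, 1.0], [1.0, -1.0]]        # illustrative invertible transform
M_inv = [[0.5, 0.5], [0.5, -0.5]]    # its inverse
A = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
I = identity_tensor(2, M_inv)
product = star_m(A, I, M, M_inv)     # multiplying by the identity tensor
```

Because every operation in the transformed domain is an ordinary matrix operation on frontal slices, matrix notions such as positive semidefiniteness can be lifted slice by slice, which is the kind of generalization the abstract describes.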

[CV-100] Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images

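[Quick read]: This paper addresses the difficulty of preserving fine-grained detail at high resolution in medical image synthesis, which limits the diagnostic value of generated images. The key idea is Pixel Perfect MegaMed, described as the first vision-language foundation model to generate medical images at 1024x1024 resolution. The model uses a multi-scale transformer architecture designed to preserve both global anatomical context and local image detail, together with vision-language alignment techniques tailored to medical terminology and imaging modalities, to map textual descriptions to visual representations at this resolution.

Link: https://arxiv.org/abs/2507.12698
Authors: Zahra TehraniNasab,Amar Kumar,Tal Arbel
Affiliations: McGill University; MILA-Quebec AI Institute
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: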

Abstract:Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution details required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) or Variational Auto Encoder (VAEs) have shown great promise for high-resolution image generation but struggle with preserving fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize images at resolutions of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation, enabling the preservation of both global anatomical context and local image-level details. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes. Our code is accessible through the project website - this https URL.

[CV-101] TRIQA: Image Quality Assessment by Contrastive Pretraining on Ordered Distortion Triplets

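[Quick read]: This paper addresses the difficulty of training no-reference image quality assessment (NR-IQA) models when subjectively labeled data is scarce. The key idea is to build a custom dataset from a limited number of reference content images and to introduce a no-reference IQA model that combines content and quality features. The quality-aware model is trained with contrastive triplet-based learning, which allows training with fewer samples while generalizing well across public datasets.

Link: https://arxiv.org/abs/2507.12687
Authors: Rajesh Sureddi,Saman Zadtootaghaj,Nabajeet Barman,Alan C. Bovik
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages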

Abstract:Image Quality Assessment (IQA) models aim to predict perceptual image quality in alignment with human judgments. No-Reference (NR) IQA remains particularly challenging due to the absence of a reference image. While deep learning has significantly advanced this field, a major hurdle in developing NR-IQA models is the limited availability of subjectively labeled data. Most existing deep learning-based NR-IQA approaches rely on pre-training on large-scale datasets before fine-tuning for IQA tasks. To further advance progress in this area, we propose a novel approach that constructs a custom dataset using a limited number of reference content images and introduces a no-reference IQA model that incorporates both content and quality features for perceptual quality prediction. Specifically, we train a quality-aware model using contrastive triplet-based learning, enabling efficient training with fewer samples while achieving strong generalization performance across publicly available datasets. Our repository is available at this https URL.
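The triplet-based contrastive training mentioned in the abstract can be sketched with the standard margin triplet loss. The abstract does not say how triplets are built, so this example assumes an ordering by distortion strength: the anchor is the feature vector of a reference image, the positive a mildly distorted version, and the negative a heavily distorted one. The function names, the Euclidean distance, and the margin of 1.0 are illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


reference = [0.0, 0.0]   # hypothetical quality features of a pristine image
mild = [0.0, 1.0]        # lightly distorted version
severe = [3.0, 4.0]      # heavily distorted version

well_ordered = triplet_loss(reference, mild, severe)   # ordering satisfied
mis_ordered = triplet_loss(reference, severe, mild)    # ordering violated
```

A loss of this shape needs only the relative order of distortion levels, not subjective quality scores, which fits the abstract's motivation of limited labeled data.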

[CV-102] InSight: AI Mobile Screening Tool for Multiple Eye Disease Detection using Multimodal Fusion

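[Quick read]: This paper addresses the limited access to early screening for common eye diseases (age-related macular degeneration, glaucoma, diabetic retinopathy, diabetic macular edema, and pathological myopia) in low- and middle-income countries and other resource-limited settings. The key idea is the InSight app, which combines patient clinical metadata with fundus images and diagnoses the five diseases through a three-stage pipeline. The diagnosis model rests on three techniques: a multimodal fusion technique (MetaFusion), a pretraining method that uses supervised and self-supervised loss functions, and a multitask model, which together improve diagnostic accuracy and computational efficiency.

Link: https://arxiv.org/abs/2507.12669
Authors: Ananya Raghu,Anisha Raghu,Alice S. Tang,Yannis M. Paulus,Tyson N. Kim,Tomiko T. Oskotsky
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: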

Abstract:Background/Objectives: Age-related macular degeneration, glaucoma, diabetic retinopathy (DR), diabetic macular edema, and pathological myopia affect hundreds of millions of people worldwide. Early screening for these diseases is essential, yet access to medical care remains limited in low- and middle-income countries as well as in resource-limited settings. We develop InSight, an AI-based app that combines patient metadata with fundus images for accurate diagnosis of five common eye diseases to improve accessibility of screenings. Methods: InSight features a three-stage pipeline: real-time image quality assessment, disease diagnosis model, and a DR grading model to assess severity. Our disease diagnosis model incorporates three key innovations: (a) Multimodal fusion technique (MetaFusion) combining clinical metadata and images; (b) Pretraining method leveraging supervised and self-supervised loss functions; and (c) Multitask model to simultaneously predict 5 diseases. We make use of BRSET (lab-captured images) and mBRSET (smartphone-captured images) datasets, both of which also contain clinical metadata for model training/evaluation. Results: Trained on a dataset of BRSET and mBRSET images, the image quality checker achieves near-100% accuracy in filtering out low-quality fundus images. The multimodal pretrained disease diagnosis model outperforms models using only images by 6% in balanced accuracy for BRSET and 4% for mBRSET. Conclusions: The InSight pipeline demonstrates robustness across varied image conditions and has high diagnostic accuracy across all five diseases, generalizing to both smartphone and lab captured images. The multitask model contributes to the lightweight nature of the pipeline, making it five times computationally efficient compared to having five individual models corresponding to each disease.
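The results above are reported in balanced accuracy, which is the mean of per-class recall and is less flattering than plain accuracy when one class dominates, as is typical in disease screening. A minimal Python sketch of the metric, using made-up labels rather than any data from the paper:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical screening set: 8 healthy eyes (0) and 2 diseased eyes (1).
labels = [0] * 8 + [1] * 2
always_healthy = [0] * 10          # a classifier that never flags disease

plain = accuracy(labels, always_healthy)
balanced = balanced_accuracy(labels, always_healthy)
```

In this toy case the trivial classifier scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is why the latter is the more informative figure for imbalanced screening data.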

[CV-103] Pathology-Guided Virtual Staining Metric for Evaluation and Training

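[Quick read]: This paper addresses limitations in how virtual staining is evaluated: existing methods rely mainly on full-reference image quality assessment (FR-IQA) metrics designed for natural images, which often fail to capture pathology-relevant features. The key idea is PaPIS (Pathology-Aware Perceptual Image Similarity), a new FR-IQA metric tailored to virtual staining evaluation. It uses deep learning features trained on cell morphology segmentation and incorporates Retinex-inspired feature decomposition to better reflect histological perceptual quality.

Link: https://arxiv.org/abs/2507.12624
Authors: Qiankai Wang,James E.D. Tweel,Parsin Haji Reza,Anita Layton
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 19 pages, 10 figures. Intended for submission to the Journal of Imaging Informatics in Medicine (JIIM)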

Abstract:Virtual staining has emerged as a powerful alternative to traditional histopathological staining techniques, enabling rapid, reagent-free image transformations. However, existing evaluation methods predominantly rely on full-reference image quality assessment (FR-IQA) metrics such as structural similarity, which are originally designed for natural images and often fail to capture pathology-relevant features. Expert pathology reviews have also been used, but they are inherently subjective and time-consuming. In this study, we introduce PaPIS (Pathology-Aware Perceptual Image Similarity), a novel FR-IQA metric specifically tailored for virtual staining evaluation. PaPIS leverages deep learning-based features trained on cell morphology segmentation and incorporates Retinex-inspired feature decomposition to better reflect histological perceptual quality. Comparative experiments demonstrate that PaPIS more accurately aligns with pathology-relevant visual cues and distinguishes subtle cellular structures that traditional and existing perceptual metrics tend to overlook. Furthermore, integrating PaPIS as a guiding loss function in a virtual staining model leads to improved histological fidelity. This work highlights the critical need for pathology-aware evaluation frameworks to advance the development and clinical readiness of virtual staining technologies.

Artificial Intelligence

[AI-0] Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

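[Quick read]: This paper addresses the high cost of real-world data collection when learning visuomotor robot policies, whose performance depends heavily on the number of training demonstrations. The key idea is to use existing or low-cost data, such as public robot datasets and data of humans playing with objects, to reduce the amount of data that must be collected on the target robot. First, optic flow serves as an embodiment-agnostic action representation for training a World Model (WM) on multi-embodiment datasets, which is then finetuned on a small amount of data from the target embodiment. Second, Latent Policy Steering (LPS) improves the output of a behavior-cloned policy by searching the WM's latent space for better action sequences.

Link: https://arxiv.org/abs/2507.13340
Authors: Yiqi Wang,Mrinal Verghese,Jeff Schneider
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: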

Abstract:Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-embodiment dataset across different robots or a cost-effective human dataset from play.

[AI-1] FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

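[Quick read]: This paper evaluates how frontier AI models perform on problems close to real research, and in particular whether they reach human-expert or superhuman reasoning and problem-solving ability. The key contribution is the FormulaOne benchmark, which sits at the intersection of graph theory, logic, and algorithms, is of commercial interest, and is generated from Monadic Second-Order (MSO) logic on graphs, reflecting frontier problems in theoretical computer science. State-of-the-art models such as OpenAI's o3 solve less than 1% of the questions, which indicates a substantial gap to expert-level understanding in some domains.

Link: https://arxiv.org/abs/2507.13337
Authors: Gal Beniamini,Yuval Dor,Alon Vinnikov,Shir Granot Peled,Or Weinstein,Or Sharir,Noam Wies,Tomer Nussbaum,Ido Ben Shaul,Tomer Zekharya,Yoav Levine,Shai Shalev-Shwartz,Amnon Shashua
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic (math.LO)
Comments: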

Abstract:Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human – or superhuman – expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI’s o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory fewshot examples – highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework. 
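To give a feel for what an MSO-definable graph problem looks like, here is a brute-force Python sketch for one classic example, the dominating set property: "there exists a vertex set S such that every vertex is in S or adjacent to a vertex in S". The quantifier over a set of vertices is what makes the formula second-order. This property is chosen purely for illustration and is not claimed to appear in FormulaOne; the function names are hypothetical, and the exhaustive search is exponential, unlike the efficient algorithms the benchmark asks for.

```python
from itertools import combinations


def is_dominating(vertices, edges, subset):
    """Check: every vertex is in `subset` or has a neighbour in it."""
    chosen = set(subset)
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return all(v in chosen or bool(neighbours[v] & chosen) for v in vertices)


def min_dominating_set_size(vertices, edges):
    """Smallest |S| satisfying the formula, by trying all subsets in size order."""
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_dominating(vertices, edges, subset):
                return size


# Path graph 0 - 1 - 2 - 3.
path_vertices = [0, 1, 2, 3]
path_edges = [(0, 1), (1, 2), (2, 3)]
smallest = min_dominating_set_size(path_vertices, path_edges)
```

The search always terminates with a value because the full vertex set is trivially dominating.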

[AI-2] Towards Formal Verification of LLM-Generated Code from Natural Language Prompts

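[Quick read]: This paper addresses the problem that code generated by large language models (LLMs) often contains errors that users struggle to detect, which harms the experience of using AI code assistants and the feasibility of natural language programming. The key idea is a formal query language that expresses the user's intent in a formally defined but natural-language-like way that the user can confirm matches what they meant; the query is then used to verify that the LLM-generated code matches that intent.

Link: https://arxiv.org/abs/2507.13290
Authors: Aaron Councilman,David Fu,Aryan Gupta,Chengxiao Wang,David Grove,Yu-Xiong Wang,Vikram Adve
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: 31 pages, 9 figures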

Abstract:In the past few years LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code based on it. However, LLMs often generate incorrect code that users need to fix and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness to LLM generated code; such guarantees could improve the experience of using AI Code Assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge we propose to incorporate a formal query language that can represent a user’s intent in a formally defined but natural language-like manner that a user can confirm matches their intent. Then, using such a query we propose to verify LLM generated code to ensure it matches the user’s intent. We implement these ideas in our system, Astrogator, for the Ansible programming language which includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter which is used for the verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.

[AI-3] Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour

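[Quick read]: This paper addresses how to train a quadruped robot for autonomous navigation and obstacle avoidance, with the goal of a robotic guide dog simulation capable of path following and obstacle avoidance, and long-term potential for real-world assistance to guide dogs and visually impaired people. The key idea is a comparison of three reinforcement learning algorithms (Proximal Policy Optimization, Deep Q-Network, and Q-learning) on a simulated quadruped in custom environments designed for fair evaluation, focusing on sensor inputs, collision frequency, reward signals, and learning progression. Proximal Policy Optimization outperformed the other two algorithms across all metrics.

Link: https://arxiv.org/abs/2507.13277
Authors: Emma M. A. Harrison
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: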

Abstract:Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical ‘pets’, including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.
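Of the three algorithms compared, tabular Q-learning is the simplest to write down. The Python sketch below shows the standard one-step update, Q(s, a) ← Q(s, a) + α·(r + γ·max Q(s', ·) − Q(s, a)). The state and action names, the reward, and the values of α and γ are invented for illustration and do not come from the study's environments.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step; mutates `q` and returns the new value."""
    best_next = max(q[next_state].values())
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Hypothetical two-state corridor: moving "forward" from `start` reaches `near_goal`.
q_table = {
    "start": {"forward": 0.0, "turn": 0.0},
    "near_goal": {"forward": 2.0, "turn": 1.0},
}
new_value = q_update(q_table, "start", "forward", reward=1.0, next_state="near_goal")
```

DQN replaces the table with a neural network, and PPO optimizes a policy directly rather than a value table, which is one reason the three can behave very differently on the same navigation task.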

[AI-4] Voxtral

[Quick read]: This paper addresses the challenge of building multimodal audio chat models that understand both speech and text. The key contribution is two models, Voxtral Mini and Voxtral Small, which achieve state-of-the-art results on a range of audio benchmarks while retaining strong text capabilities; a 32K context window lets them handle audio files up to 40 minutes long and long multi-turn conversations. The authors also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia.

Link: https://arxiv.org/abs/2507.13264
Authors: Alexander H. Liu,Andy Ehrenberg,Andy Lo,Clément Denoix,Corentin Barreau,Guillaume Lample,Jean-Malo Delignon,Khyathi Raghavi Chandu,Patrick von Platen,Pavankumar Reddy Muddireddy,Sanchit Gandhi,Soham Ghosh,Srijan Mishra,Thomas Foubert,Abhinav Rastogi,Adam Yang,Albert Q. Jiang,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devendra Singh Chaplot,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gabrielle Berrada,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jason Rute,Jean-Hadrien Chabran,Jessica Chudnovsky,Joachim Studnia,Joep Barmentlo,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Karmesh Yadav,Kartik Khandelwal,Kush Jain,Lélio Renard Lavaud,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Matthieu Dinot,Maxime Darrin,Maximilian Augustin,Mickaël Seznec,Neha Gupta,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Rémi Delacourt,Romain Sauvestre,Roman Soletskyi,Sagar Vaze,Sandeep Subramanian,Saurabh Garg,Shashwat Dalal,Siddharth Gandhi,Sumukh Aithal,Szymon Antoniak,Teven Le Scao,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Tom Bewley,Valeriia Nemychnikova,Victor Paltz
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 17 pages

Abstract:We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.

[AI-5] Merge Kernel for Bayesian Optimization on Permutation Space AAAI-26

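[Quick read]: This paper addresses the inefficiency of Bayesian Optimization (BO) on permutation spaces, where the current state-of-the-art approach relies on the Mallows kernel, an Ω(n²) representation. The key idea is a framework for generating kernel functions on permutation space from sorting algorithms. In this framework the Mallows kernel is the special case derived from bubble sort, and the proposed Merge Kernel, built from merge sort, reduces the complexity to Θ(n log n) while still capturing meaningful permutation distances. Three lightweight, task-agnostic descriptors further improve robustness and right-invariance without sacrificing compactness.

Link: https://arxiv.org/abs/2507.13263
Authors: Zikai Xie,Linjiang Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, submitted to AAAI-26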

Abstract:Bayesian Optimization (BO) algorithm is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an \Omega(n^2) representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel constructed from merge sort, which replaces the quadratic complexity with \Theta(n\log n) to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
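The baseline in this abstract, the Mallows kernel, has a short standard definition: k(p, q) = exp(−λ·d(p, q)), where d is the Kendall tau distance, that is, the number of element pairs the two permutations order differently. The Python sketch below enumerates all pairs explicitly, which is the quadratic cost the abstract refers to. It does not implement the proposed Merge Kernel, whose construction is not specified in the abstract; the function names and the default λ are illustrative.

```python
import math
from itertools import combinations


def kendall_tau_distance(p, q):
    """Number of item pairs that permutations p and q order differently."""
    pos_p = {item: i for i, item in enumerate(p)}
    pos_q = {item: i for i, item in enumerate(q)}
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def mallows_kernel(p, q, lam=1.0):
    """Mallows kernel exp(-lam * Kendall tau distance)."""
    return math.exp(-lam * kendall_tau_distance(p, q))


identity = [0, 1, 2]
swapped = [1, 0, 2]      # one adjacent swap away
reversed_ = [2, 1, 0]    # every pair inverted

near = mallows_kernel(identity, swapped)
far = mallows_kernel(identity, reversed_)
```

Each discordant pair corresponds to one adjacent swap that bubble sort would perform, which is the link between the Mallows kernel and bubble sort that the abstract draws on.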

[AI-6] Higher-Order Pattern Unification Modulo Similarity Relations

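[Quick read]: This paper addresses efficient reasoning and computation for decision-making tasks that involve abstract functions and predicates and that must handle fuzzy equivalences. The key idea is to integrate higher-order patterns with fuzzy equivalences expressed through similarity relations based on the minimum T-norm. The authors propose a unification algorithm for higher-order patterns modulo these similarity relations, prove its termination, soundness, and completeness, and show that it computes a most general unifier with the highest degree of approximation.

Link: https://arxiv.org/abs/2507.13208
Authors: Besik Dundua,Temur Kutsia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Logic (math.LO)
Comments: 23 pages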

Abstract:The combination of higher-order theories and fuzzy logic can be useful in decision-making tasks that involve reasoning across abstract functions and predicates, where exact matches are often rare or unnecessary. Developing efficient reasoning and computational techniques for such a combined formalism presents a significant challenge. In this paper, we adopt a more straightforward approach aiming at integrating two well-established and computationally well-behaved components: higher-order patterns on one side and fuzzy equivalences expressed through similarity relations based on minimum T-norm on the other. We propose a unification algorithm for higher-order patterns modulo these similarity relations and prove its termination, soundness, and completeness. This unification problem, like its crisp counterpart, is unitary. The algorithm computes a most general unifier with the highest degree of approximation when the given terms are unifiable.
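The role of the minimum T-norm can be seen in a much simpler setting than the paper's. The Python sketch below compares two ground first-order terms whose function symbols are related by a similarity relation, and takes the minimum of the symbol-level degrees as the degree to which the terms match. It handles no variables, no λ-abstraction, and no unifier computation, so it is only a toy illustration of how approximation degrees combine, not the authors' algorithm. The term encoding (tuples for applications, strings for constants) and the similarity values are assumptions.

```python
def symbol_similarity(sim, a, b):
    """Degree to which two symbols are similar; identical symbols give 1.0."""
    if a == b:
        return 1.0
    return sim.get((a, b), sim.get((b, a), 0.0))


def term_similarity(sim, s, t):
    """Combine symbol degrees over two ground terms with the minimum T-norm."""
    s = s if isinstance(s, tuple) else (s,)
    t = t if isinstance(t, tuple) else (t,)
    if len(s) != len(t):          # different arity: no match at any degree
        return 0.0
    degree = symbol_similarity(sim, s[0], t[0])
    for left, right in zip(s[1:], t[1:]):
        degree = min(degree, term_similarity(sim, left, right))
    return degree


# Hypothetical similarity relation on symbols.
similar = {("f", "g"): 0.8, ("a", "b"): 0.6}

# f(a, c) versus g(b, c)
degree = term_similarity(similar, ("f", "a", "c"), ("g", "b", "c"))
```

Under the minimum T-norm the weakest link determines the overall degree, so a unifier "with the highest degree of approximation" is one that keeps this minimum as large as possible.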

[AI-7] Black Box Deployed – Functional Criteria for Artificial Moral Agents in the LLM Era

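[Quick read]: This paper addresses whether traditional ethical evaluation criteria still apply to large language models (LLMs), whose stochastic outputs and opaque internal states undermine earlier frameworks that assumed transparent architectures. The key contribution is a set of ten functional criteria for evaluating LLM-based artificial moral agents (AMAs): moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. The criteria are intended to steer AMAs toward better alignment and beneficial societal integration.

Link: https://arxiv.org/abs/2507.13175
Authors: Matthew E. Brophy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 42 pages. Supplementary material included at end of article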

Abstract:The advancement of powerful yet opaque large language models (LLMs) necessitates a fundamental revision of the philosophical criteria used to evaluate artificial moral agents (AMAs). Pre-LLM frameworks often relied on the assumption of transparent architectures, which LLMs defy due to their stochastic outputs and opaque internal states. This paper argues that traditional ethical criteria are pragmatically obsolete for LLMs due to this mismatch. Engaging with core themes in the philosophy of technology, this paper proffers a revised set of ten functional criteria to evaluate LLM-based artificial moral agents: moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. These guideposts, applied to what we term “SMA-LLS” (Simulating Moral Agency through Large Language Systems), aim to steer AMAs toward greater alignment and beneficial societal integration in the coming years. We illustrate these criteria using hypothetical scenarios involving an autonomous public bus (APB) to demonstrate their practical applicability in morally salient contexts.

[AI-8] Aligning Humans and Robots via Reinforcement Learning from Implicit Human Feedback

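Quick read: This paper addresses two problems: conventional reinforcement learning (RL) struggles to learn effective policies under sparse rewards, and existing reinforcement learning from human feedback (RLHF) methods rely on explicit feedback mechanisms that burden users cognitively and disrupt interaction. The key to its solution is a reinforcement learning from implicit human feedback (RLIHF) framework that uses error-related potentials (ErrPs) in non-invasive electroencephalography (EEG) signals as continuous, implicit feedback, requiring no active user intervention and enabling more natural human-robot interaction and effective policy learning.

Link: https://arxiv.org/abs/2507.13171
Authors: Suzie Kim,Hye-Bin Shin,Seong-Whan Lee
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: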

Abstract:Conventional reinforcement learning (RL) approaches often struggle to learn effective policies under sparse reward conditions, necessitating the manual design of complex, task-specific reward functions. To address this limitation, reinforcement learning from human feedback (RLHF) has emerged as a promising strategy that complements hand-crafted rewards with human-derived evaluation signals. However, most existing RLHF methods depend on explicit feedback mechanisms such as button presses or preference labels, which disrupt the natural interaction process and impose a substantial cognitive load on the user. We propose a novel reinforcement learning from implicit human feedback (RLIHF) framework that utilizes non-invasive electroencephalography (EEG) signals, specifically error-related potentials (ErrPs), to provide continuous, implicit feedback without requiring explicit user intervention. The proposed method adopts a pre-trained decoder to transform raw EEG signals into probabilistic reward components, enabling effective policy learning even in the presence of sparse external rewards. We evaluate our approach in a simulation environment built on the MuJoCo physics engine, using a Kinova Gen2 robotic arm to perform a complex pick-and-place task that requires avoiding obstacles while manipulating target objects. The results show that agents trained with decoded EEG feedback achieve performance comparable to those trained with dense, manually designed rewards. These findings validate the potential of using implicit neural feedback for scalable and human-aligned reinforcement learning in interactive robotics.
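
One way to picture a "probabilistic reward component" is the following Python sketch. It is an assumption-laden illustration, not the paper's reward design: the function `combined_reward`, the parameter `weight`, and the linear mapping from error probability to reward are all hypothetical. It assumes the decoder outputs the probability that the user perceived the robot's last action as an error.

```python
def combined_reward(env_reward, p_error, weight=1.0):
    """Add an implicit-feedback term to a (possibly sparse) environment reward.

    p_error is the decoder's estimated probability, in [0, 1], that an
    error-related potential was present. The mapping below is an assumed
    shaping choice: p_error = 0 gives +1, p_error = 0.5 gives 0, p_error = 1 gives -1.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    implicit_term = 1.0 - 2.0 * p_error
    return env_reward + weight * implicit_term


# A step with no external reward still receives a learning signal.
step_reward = combined_reward(env_reward=0.0, p_error=0.25)
```

Under this assumed mapping, an uncertain decoder output (probability 0.5) contributes nothing, so the agent falls back on the sparse external reward.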

[AI-9] SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks

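Quick read: This paper addresses the vulnerability of deepfake audio detection methods to anti-forensic (AF) attacks, particularly attacks produced with generative adversarial networks (GANs). The key to its solution is SHIELD, a collaborative learning method that integrates an auxiliary generative model, called the defense (DF) generative model, to learn jointly from input and output and thereby expose AF signatures. A triplet model additionally uses auxiliary generative models to capture correlations between real and AF-attacked audio, strengthening the defense against generative AF attacks.

Link: https://arxiv.org/abs/2507.13170
Authors: Kutub Uddin,Awais Khan,Muhammad Umar Farooq,Khalid Malik
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: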

Abstract:Audio plays a crucial role in applications like speaker verification, voice-enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti-forensic (AF) attacks, particularly those attacked using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture correlations for real and AF attacked audios with real-generated and attacked-generated audios using auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF significantly reduces the average detection accuracy from 95.49% to 59.77% for ASVspoof2019, from 99.44% to 38.45% for In-the-Wild, and from 98.41% to 51.18% for HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves an average accuracy of 98.13%, 98.58%, and 99.57% in match, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In-the-Wild, and HalfTruth datasets, respectively.

[AI-10] Prompt Injection 2.0: Hybrid AI Threats

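Quick read: This paper addresses the new security threats posed by Prompt Injection 2.0, especially hybrid attacks that combine prompt injection with traditional web vulnerabilities such as Cross-Site Scripting (XSS) and Cross-Site Request Forgery (CSRF). The key to its solution is a security architecture that combines prompt isolation, runtime security, and privilege separation with novel threat detection capabilities to counter modern threats such as AI-enhanced attacks and multi-agent infections.

Link: https://arxiv.org/abs/2507.13169
Authors: Jeremy McHugh,Kristina Šekrst,Jon Cefalu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: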

Abstract:Prompt injection attacks, where malicious input is designed to manipulate AI systems into ignoring their original instructions and following unauthorized commands instead, were first discovered by Preamble, Inc. in May 2022 and responsibly disclosed to OpenAI. Over the last three years, these attacks have continued to pose a critical security threat to LLM-integrated systems. The emergence of agentic AI systems, where LLMs autonomously perform multistep tasks through tools and coordination with other agents, has fundamentally transformed the threat landscape. Modern prompt injection attacks can now combine with traditional cybersecurity exploits to create hybrid threats that systematically evade traditional security controls. This paper presents a comprehensive analysis of Prompt Injection 2.0, examining how prompt injections integrate with Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and other web security vulnerabilities to bypass traditional security measures. We build upon Preamble’s foundational research and mitigation technologies, evaluating them against contemporary threats, including AI worms, multi-agent infections, and hybrid cyber-AI attacks. Our analysis incorporates recent benchmarks that demonstrate how traditional web application firewalls, XSS filters, and CSRF tokens fail against AI-enhanced attacks. We also present architectural solutions that combine prompt isolation, runtime security, and privilege separation with novel threat detection capabilities.

[AI-11] Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data

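Quick read: This paper addresses global traffic congestion using a machine learning-based traffic flow prediction model. The study analyzes traffic data collected from July to November 2022 on a 7.24 km westbound section of California Highway 78 connecting "Melrose Dr" and "El-Camino Real", with data collection intervals ranging from 30 seconds to 15 minutes. It applies Multiple Linear Regression (MLR) and Random Forest (RF), evaluates them with R^2, MAE, and RMSE, and finds that both models perform best with 10-minute data collection intervals.

Link: https://arxiv.org/abs/2507.13112
Authors: Junseong Lee,Jaegwan Cho,Yoonju Cho,Seoyoon Choi,Yejin Shin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: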

Abstract:The study “Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data” presents a machine learning-based traffic flow prediction model to address global traffic congestion issues. The research utilized 30-second interval traffic data from California Highway 78 over a five-month period from July to November 2022, analyzing a 7.24 km westbound section connecting “Melrose Dr” and “El-Camino Real” in the San Diego area. The study employed Multiple Linear Regression (MLR) and Random Forest (RF) algorithms, analyzing data collection intervals ranging from 30 seconds to 15 minutes. Using R^2, MAE, and RMSE as performance metrics, the analysis revealed that both MLR and RF models performed optimally with 10-minute data collection intervals. These findings are expected to contribute to future traffic congestion solutions and efficient traffic management.
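
The evaluation pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names (`aggregate`, `mae`, `rmse`, `r2`) are hypothetical, the toy numbers are made up for the example, and the paper's actual preprocessing is not specified in the abstract. It shows how 30-second counts can be summed into coarser intervals (a factor of 20 gives 10 minutes) and how the three reported metrics are defined.

```python
import math


def aggregate(counts, factor):
    """Sum consecutive 30-second counts into coarser intervals; drops an incomplete tail."""
    return [sum(counts[i:i + factor])
            for i in range(0, len(counts) - factor + 1, factor)]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
scores = (r2(actual, predicted), mae(actual, predicted), rmse(actual, predicted))
```

Repeating the same scoring for each aggregation factor is what allows the collection intervals to be compared on equal terms.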

[AI-12] GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

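Quick read: This paper addresses the limited ability of learning-based 6-DOF grasping methods to generalize across robot embodiments and real-world settings. The key to its solution is GraspGen, a framework that models object-centric grasp generation as an iterative diffusion process, uses a DiffusionTransformer architecture to improve grasp generation, and pairs it with an efficient discriminator that scores and filters sampled grasps, improving grasp performance and reliability.

Link: https://arxiv.org/abs/2507.13097
Authors: Adithyavairavan Murali,Balakumar Sundaralingam,Yu-Wei Chao,Wentao Yuan,Jun Yamada,Mark Carlson,Fabio Ramos,Stan Birchfield,Dieter Fox,Clemens Eppner
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: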

Abstract:Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.

[AI-13] Exploiting Constraint Reasoning to Build Graphical Explanations for Mixed-Integer Linear Programming

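Quick read: This paper addresses how to provide contrastive explanations for decisions formalised as mixed-integer linear programs (MILPs), in order to make AI systems more trustworthy. The key to its solution is X-MILP, which encodes a user's query about an MILP solution as additional constraints, computes the Irreducible Infeasible Subsystem (IIS) of the resulting constraint set to determine the reasons that answer the query, and builds a "graph of reasons" showing the structure among those reasons.

Link: https://arxiv.org/abs/2507.13007
Authors: Roger Xavier Lera-Leri,Filippo Bistaffa,Athina Georgara,Juan Antonio Rodriguez-Aguilar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: To appear in Lecture Notes in Artificial Intelligence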

Abstract:Following the recent push for trustworthy AI, there has been an increasing interest in developing contrastive explanation techniques for optimisation, especially concerning the solution of specific decision-making processes formalised as MILPs. Along these lines, we propose X-MILP, a domain-agnostic approach for building contrastive explanations for MILPs based on constraint reasoning techniques. First, we show how to encode the queries a user makes about the solution of an MILP problem as additional constraints. Then, we determine the reasons that constitute the answer to the user’s query by computing the Irreducible Infeasible Subsystem (IIS) of the newly obtained set of constraints. Finally, we represent our explanation as a “graph of reasons” constructed from the IIS, which helps the user understand the structure among the reasons that answer their query. We test our method on instances of well-known optimisation problems to evaluate the empirical hardness of computing explanations.
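
The idea of turning a query into constraints and extracting an IIS can be illustrated with a toy Python sketch. It uses the classic deletion filter, a standard way to find one irreducible infeasible subset; the abstract does not say which IIS method X-MILP relies on, so this choice is an assumption. The constraint names, the single-variable bound representation, and the functions `feasible` and `deletion_filter` are hypothetical.

```python
INF = float("inf")


def feasible(constraints):
    """Each constraint is (name, lower, upper) on one variable x; feasible if bounds overlap."""
    lower = max((c[1] for c in constraints), default=-INF)
    upper = min((c[2] for c in constraints), default=INF)
    return lower <= upper


def deletion_filter(constraints):
    """Return one irreducible infeasible subset of an infeasible constraint list.

    Try dropping each constraint in turn: if the rest is still infeasible,
    the constraint is not needed for the conflict and is removed for good.
    """
    if feasible(constraints):
        raise ValueError("constraint set is feasible; no IIS exists")
    core = list(constraints)
    for candidate in constraints:
        trial = [c for c in core if c[0] != candidate[0]]
        if not feasible(trial):
            core = trial
    return core


# Original model constraints plus the user's query "why is x not at least 12?"
# encoded as an extra constraint (all values are made up).
model = [("capacity", -INF, 10), ("demand", 5, INF), ("budget", -INF, 8)]
query = ("user_query", 12, INF)
reasons = [name for name, _, _ in deletion_filter(model + [query])]
```

Here the filter reports that the query conflicts with the budget constraint. The capacity constraint alone would also conflict with the query, which shows that an infeasible system can contain several IISs and that a deletion filter returns only one of them.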

[AI-14] SMART: Relation-Aware Learning of Geometric Representations for Knowledge Graphs

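Quick read: This paper addresses the fact that knowledge graph embedding (KGE) models typically represent relations with elementary geometric transformations (EGTs) without accounting for relation specificity. Existing methods use a single or composite geometric transformation to represent all relations and so fail to exploit how well different relations fit different transformations. The key to its solution is a framework that evaluates how well each relation fits each geometric transformation, learns a relation-specific EGT in a low-dimensional vector space through an attention mechanism, and uses the learned correlation between relations and EGTs for relation embeddings in a high-dimensional vector space.

Link: https://arxiv.org/abs/2507.13001
Authors: Kossi Amouzouvi,Bowen Song,Andrea Coletta,Luigi Bellomarini,Jens Lehmann,Sahar Vahdati
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: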

Abstract:Knowledge graph representation learning approaches provide a mapping between symbolic knowledge in the form of triples in a knowledge graph (KG) and their feature vectors. Knowledge graph embedding (KGE) models often represent relations in a KG as geometric transformations. Most state-of-the-art (SOTA) KGE models are derived from elementary geometric transformations (EGTs), such as translation, scaling, rotation, and reflection, or their combinations. These geometric transformations enable the models to effectively preserve specific structural and relational patterns of the KG. However, the current use of EGTs by KGEs remains insufficient without considering relation-specific transformations. Although recent models attempted to address this problem by ensembling SOTA baseline models in different ways, only a single or composite version of geometric transformations is used by such baselines to represent all the relations. In this paper, we propose a framework that evaluates how well each relation fits with different geometric transformations. Based on this ranking, the model can: (1) assign the best-matching transformation to each relation, or (2) use majority voting to choose one transformation type to apply across all relations. That is, the model learns a single relation-specific EGT in low dimensional vector space through an attention mechanism. Furthermore, we use the correlation between relations and EGTs, which are learned in a low dimension, for relation embeddings in a high dimensional vector space. The effectiveness of our models is demonstrated through comprehensive evaluations on three benchmark KGs as well as a real-world financial KG, witnessing a performance comparable to leading models.
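
The two selection modes mentioned in the abstract can be sketched as follows. This is a minimal Python illustration under assumptions: the fit scores are made-up numbers, the relation names are invented, and the functions `attention_weights`, `best_per_relation`, and `majority_transformation` are hypothetical. In particular, the softmax here only stands in for an attention mechanism and is not the paper's model.

```python
import math
from collections import Counter


def attention_weights(fit_scores):
    """Softmax over one relation's fit scores, giving a weight per transformation."""
    top = max(fit_scores.values())
    exps = {name: math.exp(score - top) for name, score in fit_scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def best_per_relation(scores):
    """Mode (1): assign each relation its best-matching transformation."""
    return {relation: max(fits, key=fits.get) for relation, fits in scores.items()}


def majority_transformation(scores):
    """Mode (2): pick the transformation that wins for the most relations."""
    winners = best_per_relation(scores).values()
    return Counter(winners).most_common(1)[0][0]


# Made-up fit scores for three relations and three EGTs.
scores = {
    "born_in": {"translation": 0.9, "rotation": 0.4, "reflection": 0.1},
    "sibling_of": {"translation": 0.2, "rotation": 0.3, "reflection": 0.8},
    "located_in": {"translation": 0.7, "rotation": 0.5, "reflection": 0.2},
}
assignment = best_per_relation(scores)
shared = majority_transformation(scores)
```

In this toy data the symmetric-looking relation prefers reflection while the other two prefer translation, so mode (2) would apply translation to all three.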

[AI-15] A Translation of Probabilistic Event Calculus into Markov Decision Processes

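Quick read: This paper addresses the lack of goal-directed reasoning in Probabilistic Event Calculus (PEC), which has no mechanism for goal-based decision making and planning. The key to its solution is a formal translation of PEC domains into Markov Decision Processes (MDPs), which introduces the concept of "action-taking situations" to preserve PEC's flexible action semantics. The translation lets the algorithms and theoretical tools developed for MDPs be applied to PEC's interpretable narrative domains, extending PEC's capabilities while maintaining interpretability.

Link: https://arxiv.org/abs/2507.12989
Authors: Lyris Xu,Fabio Aurelio D’Asaro,Luke Dickens
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: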

Abstract:Probabilistic Event Calculus (PEC) is a logical framework for reasoning about actions and their effects in uncertain environments, which enables the representation of probabilistic narratives and computation of temporal projections. The PEC formalism offers significant advantages in interpretability and expressiveness for narrative reasoning. However, it lacks mechanisms for goal-directed reasoning. This paper bridges this gap by developing a formal translation of PEC domains into Markov Decision Processes (MDPs), introducing the concept of “action-taking situations” to preserve PEC’s flexible action semantics. The resulting PEC-MDP formalism enables the extensive collection of algorithms and theoretical tools developed for MDPs to be applied to PEC’s interpretable narrative domains. We demonstrate how the translation supports both temporal reasoning tasks and objective-driven planning, with methods for mapping learned policies back into human-readable PEC representations, maintaining interpretability while extending PEC’s capabilities.

[AI-16] A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints

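Quick read: This paper addresses the challenges of training Generative Adversarial Networks (GANs) in federated learning settings, including data heterogeneity, multi-domain datasets, and device heterogeneity, without sharing raw data. The key to its solution is to combine KLD-weighted Clustered Federated Learning, which handles data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning, which handles device heterogeneity under strict data sharing constraints, so that no labels or raw data, real or synthetic, are ever shared.

Link: https://arxiv.org/abs/2507.12979
Authors: Youssef Tawfilis,Hossam Amer,Minar El-Aasser,Tallal Elshabrawy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: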

Abstract:Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI – particularly Generative Adversarial Networks (GANs) – has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices – such as IoT devices and edge devices – with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints – ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics: it achieves 1.1x – 2.2x higher image generation scores and an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), with much lower latency compared to several benchmarks. Find our code at this https URL.

[AI-17] MC2A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration

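Quick read: This paper addresses the high computational cost that limits sampling-based algorithms such as Markov Chain Monte Carlo (MCMC) on large-scale problems and in real-world applications; existing MCMC accelerators fall short in hardware flexibility or system-level efficiency. It proposes MC²A, whose key idea is algorithm-hardware co-design: it extends the processor performance roofline model to analyze MCMC workload diversity and derive the optimal balance between compute, sampling, and memory parameters; designs a parametrized hardware accelerator architecture with flexible and efficient support for MCMC kernels; and introduces a novel Gumbel sampler that eliminates exponential and normalization operations.

Link: https://arxiv.org/abs/2507.12935
Authors: Shirui Zhao,Jun Yin,Lingyun Yao,Martin Andraud,Wannes Meert,Marian Verhelst
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 14 pages, 15 figures, IEEE journal paper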

Abstract:An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces MC²A, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, MC²A analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, MC²A proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of MC²A is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, MC²A achieves an overall 307.6×, 1.4×, 2.0×, and 84.2× speedup compared to the CPU, GPU, TPU, and state-of-the-art MCMC accelerator, respectively. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
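
Why a Gumbel-based sampler can skip exponentiation and normalization is easiest to see from the well-known Gumbel-max trick, sketched below in Python. This is a software illustration of the general trick only; the abstract does not describe the accelerator's sampler in detail, and the function `gumbel_max_sample` is a hypothetical name. The point is that sampling works directly on unnormalized log-weights: add independent Gumbel noise to each and take the argmax.

```python
import math
import random


def gumbel_max_sample(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights[i]).

    No exp() of the weights and no normalizing sum is needed: each log-weight
    gets independent Gumbel(0, 1) noise and the largest perturbed value wins.
    """
    best_index, best_value = -1, float("-inf")
    for index, log_weight in enumerate(log_weights):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() never returns 1.0
            u = rng.random()
        noise = -math.log(-math.log(u))
        value = log_weight + noise
        if value > best_value:
            best_index, best_value = index, value
    return best_index


rng = random.Random(0)
# Unnormalized weights 7 : 2 : 1, given only as logarithms.
log_weights = [math.log(7), math.log(2), math.log(1)]
draws = [gumbel_max_sample(log_weights, rng) for _ in range(20000)]
frequencies = [draws.count(i) / len(draws) for i in range(3)]
```

The empirical frequencies should land near 0.7, 0.2, and 0.1, matching the normalized distribution even though it was never computed.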

[AI-18] An ultra-low-power CGRA for accelerating Transformers at the edge

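Quick read: This paper addresses the computational demands that make Transformer models hard to deploy on low-power edge devices. The key to its solution is an ultra-low-power Coarse-Grained Reconfigurable Array (CGRA) architecture designed to accelerate General Matrix Multiplication (GEMM) operations in Transformer models. The architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) that optimize LOAD/STORE operations, reducing memory bandwidth demands and improving data reuse. A switchless mesh torus interconnect further reduces power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching.

Link: https://arxiv.org/abs/2507.12904
Authors: Rohit Prasad
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: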

Abstract:Transformers have revolutionized deep learning with applications in natural language processing, computer vision, and beyond. However, their computational demands make it challenging to deploy them on low-power edge devices. This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture specifically designed to accelerate General Matrix Multiplication (GEMM) operations in transformer models tailored for the energy and resource constraints of edge applications. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations, reducing memory bandwidth demands and enhancing data reuse. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching. Through its heterogeneous array design and efficient dataflow, this CGRA architecture addresses the unique computational needs of transformers, offering a scalable pathway to deploy sophisticated machine learning models on edge devices.
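
The data-reuse argument behind a small PE array can be illustrated with a tiled GEMM in plain Python. This is a software analogy only, not the CGRA's dataflow: the function `tiled_gemm` is hypothetical, and the tile size of 4 is chosen merely to echo the 4 x 4 PE array. Each tile of the output is accumulated from small blocks of the inputs, so a block loaded once is used for several multiply-accumulate operations.

```python
def tiled_gemm(a, b, tile=4):
    """Compute C = A x B block by block, using tile x tile output blocks."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # output tile row
        for j0 in range(0, m, tile):      # output tile column
            for k0 in range(0, k, tile):  # block along the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


product = tiled_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The result is identical to an untiled matrix product; only the order of the multiply-accumulate operations changes.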

[AI-19] VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks

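Quick read: This paper asks whether the gains in mathematical reasoning of large language models (LLMs) trained with reinforcement learning (RL) reflect true reasoning ability or merely overfitting to benchmark-specific patterns. The key to its solution is VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into symbolic templates and requires models to solve multiple instantiations of each, enforcing consistent reasoning across structurally equivalent variants and thereby mitigating benchmark contamination and improving evaluation robustness.

Link: https://arxiv.org/abs/2507.12885
Authors: Jian Yao,Ran Cheng,Kay Chen Tan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: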

Abstract:Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, benchmark contamination arises from the public availability of test problems, increasing the risk of data leakage. Second, evaluation fragility stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce VAR-MATH, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0% on AMC23 and 58.3% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
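
The multi-instance idea can be sketched in Python as follows. This is an illustrative reading of the abstract, not the benchmark's code: the template, the two toy "solvers", and the functions `make_instances` and `consistent` are hypothetical, and the all-instances-must-be-correct scoring rule is an assumption about how consistency is enforced.

```python
def make_instances(template, answer_fn, values):
    """Instantiate a symbolic template with several concrete parameter values."""
    return [(template.format(n=n), answer_fn(n)) for n in values]


def consistent(solver, instances):
    """Credit a template only if every instantiation is answered correctly (assumed rule)."""
    return all(solver(question) == answer for question, answer in instances)


TEMPLATE = "What is the sum of the first {n} positive integers?"


def true_answer(n):
    return n * (n + 1) // 2


def memorizing_solver(question):
    """Stands in for a model that memorized the original n = 10 problem."""
    return 55


def reasoning_solver(question):
    """Stands in for a model that actually uses the number in the question."""
    n = int(question.split("first ")[1].split(" ")[0])
    return n * (n + 1) // 2


instances = make_instances(TEMPLATE, true_answer, [10, 7, 23])
results = (consistent(memorizing_solver, instances),
           consistent(reasoning_solver, instances))
```

The memorizing solver gets the original instance right but fails the template as a whole, which is the kind of gap a single-instance benchmark cannot reveal.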

[AI-20] Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework

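Quick read: This paper addresses the safety risks posed by frontier AI systems' ability to manipulate human behaviour, in particular the risk that an internally deployed AI system undermines human oversight by manipulating employees. The key to its solution is a safety case framework built on three core lines of argument (inability, control, and trustworthiness) for systematically assessing and mitigating manipulation risk, giving AI companies directly applicable evidence requirements, evaluation methodologies, and implementation considerations.

Link: https://arxiv.org/abs/2507.12872
Authors: Rishane Dassanayake,Mario Demetroudi,James Walpole,Lindley Lentati,Jason R. Brown,Edward James Young
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: 24 pages (14 pages main text, 4 pages bibliography, 6 pages appendices), 3 figures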

Abstract:Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.

[AI-21] Generative Multi-Target Cross-Domain Recommendation

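Quick read: This paper addresses Multi-Target Cross-Domain Recommendation (MTCDR), in particular non-overlapped recommendation scenarios where traditional methods are limited by their reliance on domain-shared entities to fuse and transfer knowledge. The key to its solution is GMC, a generative paradigm-based approach that uses semantically quantized discrete item identifiers as a medium for integrating multi-domain knowledge within a unified generative model, formulates item recommendation as a next-token generation task, and adds a domain-aware contrastive loss and domain-specific fine-tuning to improve performance.

Link: https://arxiv.org/abs/2507.12871
Authors: Jinqiu Jin,Yang Zhang,Junwei Pan,Fuli Feng,Hua Lu,Haijie Gu,Xiangnan He
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: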

Abstract:Recently, there has been a surge of interest in Multi-Target Cross-Domain Recommendation (MTCDR), which aims to enhance recommendation performance across multiple domains simultaneously. Existing MTCDR methods primarily rely on domain-shared entities (e.g., users or items) to fuse and transfer cross-domain knowledge, which may be unavailable in non-overlapped recommendation scenarios. Some studies model user preferences and item features as domain-sharable semantic representations, which can be utilized to tackle the MTCDR task. Nevertheless, they often require extensive auxiliary data for pre-training. Developing more effective solutions for MTCDR remains an important area for further exploration. Inspired by recent advancements in generative recommendation, this paper introduces GMC, a generative paradigm-based approach for multi-target cross-domain recommendation. The core idea of GMC is to leverage semantically quantized discrete item identifiers as a medium for integrating multi-domain knowledge within a unified generative model. GMC first employs an item tokenizer to generate domain-shared semantic identifiers for each item, and then formulates item recommendation as a next-token generation task by training a domain-unified sequence-to-sequence model. To further leverage the domain information to enhance performance, we incorporate a domain-aware contrastive loss into the semantic identifier learning, and perform domain-specific fine-tuning on the unified recommender. Extensive experiments on five public datasets demonstrate the effectiveness of GMC compared to a range of baseline methods.

[AI-22] Information-Theoretic Aggregation of Ethical Attributes in Simulated-Command

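Quick read: This paper addresses how to dynamically weight ethical attributes during generative simulations, so that agents can make sound judgements when facing decision options with ethical implications. The key to its solution is to move human judgement outside the simulation decision cycle: the human commander designs the ethical metric space in advance, the simulated environment explores that space autonomously, and after the simulation completes the human commander is given a few options and makes the final choice using human judgement.

Link: https://arxiv.org/abs/2507.12862
Authors: Hussein Abbass,Taylan Akay,Harrison Tolley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: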

Abstract:In the age of AI, human commanders need to use the computational powers available in today’s environment to simulate a very large number of scenarios. Within each scenario, situations occur where different decision design options could have ethical consequences. Making these decisions reliant on human judgement is both counter-productive to the aim of exploring a very large number of scenarios in a timely manner and infeasible when considering the workload needed to involve humans in each of these choices. In this paper, we move human judgement outside the simulation decision cycle. Basically, the human will design the ethical metric space, leaving it to the simulated environment to explore the space. When the simulation completes its testing cycles, the testing environment will come back to the human commander with a few options to select from. The human commander will then exercise human judgement to select the most appropriate course of action, which will then get executed accordingly. We assume that the problem of designing metrics that are sufficiently granular to assess the ethical implications of decisions is solved. Subsequently, the fundamental problem we look at in this paper is how to weight ethical decisions during the running of these simulations; that is, how to dynamically weight the ethical attributes when agents are faced with decision options with ethical implications during generative simulations. The multi-criteria decision making literature has started to look at nearby problems, where the concept of entropy has been used to determine the weights during aggregation. We draw from that literature different approaches to automatically calculate the weights for ethical attributes during simulation-based testing and evaluation.
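
One entropy-based scheme from the multi-criteria decision making literature is the entropy weight method, sketched below in Python. Whether the paper uses this exact variant is not stated in the abstract, so treat it as an assumed example; the function `entropy_weights` and the toy scores are hypothetical. The idea is that an attribute whose values barely differ across options carries little information and receives a small weight.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method for a table of options (rows) by attributes (columns).

    All scores must be non-negative and each column must have a positive sum.
    Returns one weight per attribute; the weights sum to 1.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows < 2:
        raise ValueError("need at least two options to compare")
    divergences = []
    for j in range(cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total <= 0 or min(column) < 0:
            raise ValueError("scores must be non-negative with a positive column sum")
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(rows)  # normalize entropy to [0, 1]
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread < 1e-12:
        return [1.0 / cols] * cols  # no attribute discriminates: fall back to equal weights
    return [d / spread for d in divergences]


# Two decision options scored on two ethical attributes (made-up numbers).
# The first attribute is identical for both options, so it should get no weight.
weights = entropy_weights([[1.0, 1.0], [1.0, 3.0]])
```

Because the weights are recomputed from whatever options are currently on the table, the scheme can be rerun at each decision point during a simulation without human input.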

[AI-23] Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

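Quick read: This paper addresses the performance limits of supervised fine-tuning (SFT) when training large language models and continuous control policies, and how to connect SFT more closely with reinforcement learning (RL) theory. The key to its solution is importance weighted supervised fine-tuning (iw-SFT), which optimizes a tighter lower bound on the RL objective in the sparse reward setting, can improve performance over SFT on curated data, and can be generalized to training with quality-scored data.

Link: https://arxiv.org/abs/2507.12856
Authors: Chongli Qin,Jost Tobias Springenberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: See project website for details and code at: this https URL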

Abstract:Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which gives support to its often observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL, as it (i) optimizes a tighter bound on the RL objective and (ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.

[AI-24] Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering

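[Quick read]: This paper addresses Long-term Active Embodied Question Answering (LA-EQA), in which a robot operating over long periods must combine past experience with exploration of its current environment to answer complex, temporally grounded questions. Conventional large-model-based approaches perform poorly on this task, limited mainly by finite context windows, the lack of persistent memory, and an inability to combine memory recall with active exploration. The key solution is a structured memory system inspired by the "mind palace" method from cognitive science: it encodes episodic experiences as scene-graph-based world instances, builds a reasoning and planning algorithm for targeted memory retrieval and guided navigation, and introduces a value-of-information-based stopping criterion to balance exploration against recall.

Link: https://arxiv.org/abs/2507.12846
Authors: Muhammad Fadhil Ginting,Dong-Ki Kim,Xiangyun Meng,Andrzej Reinke,Bandi Jai Krishna,Navid Kayhani,Oriana Peltzer,David D. Fan,Amirreza Shaban,Sung-Kyun Kim,Mykel J. Kochenderfer,Ali-akbar Agha-mohammadi,Shayegan Omidshafiei
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: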

Abstract:As robots become increasingly capable of operating over extended periods – spanning days, weeks, and even months – they are expected to accumulate knowledge of their environments and leverage this experience to assist humans more effectively. This paper studies the problem of Long-term Active Embodied Question Answering (LA-EQA), a new task in which a robot must both recall past experiences and actively explore its environment to answer complex, temporally-grounded questions. Unlike traditional EQA settings, which typically focus either on understanding the present environment alone or on recalling a single past observation, LA-EQA challenges an agent to reason over past, present, and possible future states, deciding when to explore, when to consult its memory, and when to stop gathering observations and provide a final answer. Standard EQA approaches based on large models struggle in this setting due to limited context windows, absence of persistent memory, and an inability to combine memory recall with active exploration. To address this, we propose a structured memory system for robots, inspired by the mind palace method from cognitive science. Our method encodes episodic experiences as scene-graph-based world instances, forming a reasoning and planning algorithm that enables targeted memory retrieval and guided navigation. To balance the exploration-recall trade-off, we introduce value-of-information-based stopping criteria that determines when the agent has gathered sufficient information. We evaluate our method on real-world experiments and introduce a new benchmark that spans popular simulation environments and actual industrial sites. Our approach significantly outperforms state-of-the-art baselines, yielding substantial gains in both answer accuracy and exploration efficiency.

[AI-25] Assessing adaptive world models in machines with novel games

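[Quick read]: This paper addresses the overly narrow way world models are currently evaluated in artificial intelligence (AI): existing work mostly examines static representations learned from large data corpora rather than how efficiently and effectively a model builds and refines internal representations through interaction and exploration in a new environment. The key to its proposal is a new evaluation framework built on carefully designed "novel games" whose underlying structure has genuine, deep, and continually refreshed novelty, used to challenge and assess an agent's ability to construct world models rapidly.

Link: https://arxiv.org/abs/2507.12821
Authors: Lance Ying,Katherine M. Collins,Prafull Sharma,Cedric Colas,Kaiya Ivy Zhao,Adrian Weller,Zenna Tavares,Phillip Isola,Samuel J. Gershman,Jacob D. Andreas,Thomas L. Griffiths,Francois Chollet,Kelsey R. Allen,Joshua B. Tenenbaum
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 4 figures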

Abstract:Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on a massive corpora of data, instead of the efficiency and efficacy of models in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures – we refer to this kind of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent’s ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of the human-like rapid adaptation and robust generalization – a critical component of artificial general intelligence.

[AI-26] FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction

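[Quick read]: This paper targets the challenges of long-term time series prediction, including non-stationarity, multi-scale periodicity, and transient dynamics, while also improving robustness to data noise. The key to its solution is FLDmamba (Fourier and Laplace Transform Decomposition Mamba), a framework that combines the strengths of the Fourier and Laplace transforms to capture multi-scale periodicity and transient dynamics in time series and to strengthen the model's resistance to noise.

Link: https://arxiv.org/abs/2507.12803
Authors: Qianru Zhang,Chenglei Yu,Haixin Wang,Yudong Yan,Yuansheng Cao,Siu-Ming Yiu,Tailin Wu,Hongzhi Yin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages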

Abstract:Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. Meanwhile, they are susceptible to data noise issues in time series. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture both multi-scale periodicity, transient dynamics within time series data, and improve the robustness of the model to the data noise issue. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible via the following URL: this https URL.

[AI-27] Imitating Mistakes in a Learning Companion AI Agent for Online Peer Learning

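[Quick read]: This paper addresses the practical constraints of conventional peer learning, such as matching partners in time, place, and proficiency level, by developing an AI agent that acts as a learning companion and makes peer learning available anytime and anywhere. The key to the approach is the assumption that peers at the same proficiency level as the learner make the same mistakes as the learner; the authors take English composition as a concrete case to validate it, aiming at a learning environment that offers effective peer interaction.

Link: https://arxiv.org/abs/2507.12801
Authors: Sosui Moribe,Taketoshi Ushiama
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This is the preprint version of the paper published in IMCOM 2025, IEEE Xplore (DOI: https://doi.org/10.1109/IMCOM64595.2025.10857528 )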

Abstract:In recent years, peer learning has gained attention as a method that promotes spontaneous thinking among learners, and its effectiveness has been confirmed by numerous studies. This study aims to develop an AI Agent as a learning companion that enables peer learning anytime and anywhere. However, peer learning between humans has various limitations, and it is not always effective. Effective peer learning requires companions at the same proficiency levels. In this study, we assume that a learner’s peers with the same proficiency level as the learner make the same mistakes as the learner does and focus on English composition as a specific example to validate this approach.

[AI-28] Autonomy for Older Adult-Agent Interaction

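[Quick read]: This paper addresses how artificial intelligence (AI) agents that support older adults' caregiving can better match older adults' autonomy preferences. It examines four core dimensions (decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy) and proposes three research directions: addressing social responsibility autonomy, operationalizing agent autonomy from the task perspective, and developing autonomy measures.

Link: https://arxiv.org/abs/2507.12767
Authors: Jiaxin An
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages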

Abstract:As the global population ages, artificial intelligence (AI)-powered agents have emerged as potential tools to support older adults’ caregiving. Prior research has explored agent autonomy by identifying key interaction stages in task processes and defining the agent’s role at each stage. However, ensuring that agents align with older adults’ autonomy preferences remains a critical challenge. Drawing on interdisciplinary conceptualizations of autonomy, this paper examines four key dimensions of autonomy for older adults: decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy. This paper then proposes the following research directions: (1) Addressing social responsibility autonomy, which concerns the ethical and social implications of agent use in communal settings; (2) Operationalizing agent autonomy from the task perspective; and (3) Developing autonomy measures.

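[AI-29] Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

[Quick read]: This paper addresses the trade-off between efficient compression and downstream task performance in Audio Coding for Machines (ACoM). Conventional codecs pursue high-fidelity reconstruction and overlook what machine tasks need. This work proposes an efficient ACoM method whose key is to combine task-specific loss guidance with residual vector quantization (RVQ) losses, reaching ultra-low bitrates (below 200 bps) while preserving downstream model performance.

Link: https://arxiv.org/abs/2507.12701
Authors: Anastasia Kuznetsova,Inseon Jang,Wootaek Lim,Minje Kim
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: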

Abstract:Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
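Residual vector quantization and the resulting bitrate can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's tokenizer: the codebooks are hand-written rather than learned, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`, `bitrate_bps`) and the 50 frames-per-second rate are hypothetical.

```python
import math

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(code, vec))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

def bitrate_bps(codebooks, frames_per_second):
    """Bits per frame (sum of log2 codebook sizes) times the frame rate."""
    return sum(math.log2(len(cb)) for cb in codebooks) * frames_per_second

# Toy example: two stages with two entries each (1 bit per stage).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]
codes = rvq_encode(codebooks, [1.2, 0.8])
approx = rvq_decode(codebooks, codes)
rate = bitrate_bps(codebooks, frames_per_second=50)
```

The bitrate formula shows why a handful of small codebooks at a low frame rate lands in the sub-200 bps range the abstract mentions: here 2 bits per frame at 50 frames per second gives 100 bps.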

[AI-30] Benchmarking Deception Probes via Black-to-White Performance Boosts

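[Quick read]: This paper addresses the detection of deceptive responses from AI assistants. The core challenge is to assess how effective existing deception detectors (called "deception probes") are in practice and how robust they are to evasion strategies by a deceptive assistant. The key to its approach is to compare white-box monitoring (with access to token-level probe activations) against black-box monitoring (without such access), and to benchmark a probe by how much the white-box monitor outperforms the black-box one.

Link: https://arxiv.org/abs/2507.12691
Authors: Avi Parrack,Carlo Leonardo Attubato,Stefan Heimersheim
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. 37 pages, 10 figures, 7 tables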

Abstract:AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called “deception probes”) have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it’s unclear how effective these probes are at detecting deception in practice, nor whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
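The black-to-white performance boost can be pictured as the gap between two monitors scored on the same labeled responses. The sketch below assumes AUROC as the performance measure, which the abstract does not specify; the function names and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a deceptive example outscores an honest one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def black_to_white_boost(white_scores, black_scores, labels):
    """How much the white-box monitor beats the black-box monitor."""
    return auroc(white_scores, labels) - auroc(black_scores, labels)

# Toy data: label 1 = deceptive response, 0 = honest response.
labels = [1, 1, 0, 0]
black = [0.6, 0.4, 0.5, 0.3]   # suspicion scores without probe access
white = [0.9, 0.8, 0.2, 0.1]   # suspicion scores with probe activations
boost = black_to_white_boost(white, black, labels)
```

A boost near zero would mean the probe adds little beyond what a monitor can already infer from the text alone.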

[AI-31] Data Transformation Strategies to Remove Heterogeneity

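[Quick read]: This paper addresses data heterogeneity, which stems from various conflicting factors and complicates the use of data. The key to its solution is data transformation: choosing a suitable transformation technique preserves crucial data details, which improves AI learning efficiency and adapts inputs to the formats different AI models expect. The survey systematically categorizes strategies for heterogeneity caused by differences in data formats and highlights the inherent challenges of each.

Link: https://arxiv.org/abs/2507.12677
Authors: Sangbong Yoo,Jaeyoung Lee,Chanyoung Yoon,Geonyeong Son,Hyein Hong,Seongbum Seo,Soobin Yim,Chanyoung Jung,Jungsoo Park,Misuk Kim,Yun Jang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: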

Abstract:Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.

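[AI-32] ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle

[Quick read]: This paper addresses how to make code generated by Large Language Models (LLMs) resemble that of real students: imperfect, iterative, and stylistically diverse. The key is a systematic study that uses timestamped student submissions to design low- and high-resolution experiments, modeling student progress and evaluating code outputs along semantic, functional, and stylistic dimensions. Fine-tuning improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations; the study stresses context-aware generation, temporal modeling, and multi-dimensional evaluation for modeling realistic student code.

Link: https://arxiv.org/abs/2507.12674
Authors: Mihran Miroyan,Rose Niousha,Joseph E. Gonzalez,Gireeja Ranade,Narges Norouzi
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: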

Abstract:Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate student-like code like real students - imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based “student-like” code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at this https URL.

[AI-33] Fly Fail Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models

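[Quick read]: This paper addresses how static rules and content in game design translate into dynamic player behavior, which modern generative systems struggle to capture when they inspect only a game's code or assets. The key to its solution is an automated design iteration framework that pairs a Reinforcement Learning (RL) agent with a Large Multimodal Model (LMM): the RL agent playtests the game to produce behavioral data, and the LMM revises the game based on that data, iteratively refining the game mechanics.

Link: https://arxiv.org/abs/2507.12666
Authors: Alex Zook,Josef Spjut,Jonathan Tremblay
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at Reinforcement Learning and Video Games workshop this https URL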

Abstract:Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game’s code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate results that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.
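The playtest-then-revise loop can be sketched with stand-ins for both components. Everything here is hypothetical: `playtest` replaces the RL player with a toy formula, `revise` replaces the LMM designer with a one-line rule, and the `gap` parameter is an invented game setting. Only the loop structure mirrors the abstract.

```python
def playtest(config):
    """Stand-in for the RL player: returns numerical play metrics.

    Toy assumption: wider gaps make the level easier to pass.
    """
    return {"success_rate": min(1.0, config["gap"] / 10.0)}

def revise(config, metrics, goal):
    """Stand-in for the LMM designer: nudges the config toward the goal."""
    new_config = dict(config)
    if metrics["success_rate"] < goal:
        new_config["gap"] += 1   # too hard, so widen the gap
    elif metrics["success_rate"] > goal:
        new_config["gap"] -= 1   # too easy, so narrow the gap
    return new_config

def design_loop(config, goal, max_iters=20):
    """Alternate playtesting and revision until the play metric meets the goal."""
    for _ in range(max_iters):
        metrics = playtest(config)
        if abs(metrics["success_rate"] - goal) < 1e-9:
            break
        config = revise(config, metrics, goal)
    return config

final_config = design_loop({"gap": 2}, goal=0.5)
```

In the framework described above, the metrics could also include an image strip of recent frames, and the revision step would be a model call rather than a fixed rule.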

[AI-34] Single Conversation Methodology: A Human-Centered Protocol for AI-Assisted Software Development

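[Quick read]: This paper addresses the passive reliance on large language models (LLMs) in current software development. The key to its solution is the Single Conversation Methodology (SCM): a structured, persistent development dialogue in which all stages of a project, from requirements to architecture and implementation, unfold within a single long-context conversation, grounded in cognitive clarity, traceability, modularity, and documentation.

Link: https://arxiv.org/abs/2507.12665
Authors: Salvador D. Escobedo
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Style reviewed by a LLM for improving clarity and English syntax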

Abstract:We propose the Single Conversation Methodology (SCM), a novel and pragmatic approach to software development using large language models (LLMs). In contrast to ad hoc interactions with generative AI, SCM emphasizes a structured and persistent development dialogue, where all stages of a project - from requirements to architecture and implementation - unfold within a single, long-context conversation. The methodology is grounded on principles of cognitive clarity, traceability, modularity, and documentation. We define its phases, best practices, and philosophical stance, while arguing that SCM offers a necessary correction to the passive reliance on LLMs prevalent in current practices. We aim to reassert the active role of the developer as architect and supervisor of the intelligent tool.

[AI-35] Improving physics-informed neural network extrapolation via transfer learning and adaptive activation functions ICANN2025

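[Quick read]: This paper targets two weaknesses of physics-informed neural networks (PINNs): poor extrapolation outside the training domain and high sensitivity to the choice of activation functions (AFs). The key to its solution is a transfer learning (TL) method that improves extrapolation by using a small number of carefully selected collocation points within an extended training domain, together with an adaptive activation function formed as a linear combination of standard AFs, which improves robustness and accuracy.

Link: https://arxiv.org/abs/2507.12659
Authors: Athanasios Papastathopoulos-Katsaros,Alexandra Stavrianidi,Zhandong Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 18 pages, 16 figures, 7 tables Accepted to ICANN 2025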

Abstract:Physics-Informed Neural Networks (PINNs) are deep learning models that incorporate the governing physical laws of a system into the learning process, making them well-suited for solving complex scientific and engineering problems. Recently, PINNs have gained widespread attention as a powerful framework for combining physical principles with data-driven modeling to improve prediction accuracy. Despite their successes, however, PINNs often exhibit poor extrapolation performance outside the training domain and are highly sensitive to the choice of activation functions (AFs). In this paper, we introduce a transfer learning (TL) method to improve the extrapolation capability of PINNs. Our approach applies transfer learning (TL) within an extended training domain, using only a small number of carefully selected collocation points. Additionally, we propose an adaptive AF that takes the form of a linear combination of standard AFs, which improves both the robustness and accuracy of the model. Through a series of experiments, we demonstrate that our method achieves an average of 40% reduction in relative L2 error and an average of 50% reduction in mean absolute error in the extrapolation domain, all without a significant increase in computational cost. The code is available at this https URL .
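An adaptive activation built as a linear combination of standard activations can be written down directly. This is a minimal scalar sketch: the choice of tanh, sine, and ReLU as the basis and the name `adaptive_activation` are assumptions, and in a real PINN the coefficients would be trainable parameters updated together with the network weights.

```python
import math

def adaptive_activation(x, coeffs):
    """f(x) = a_tanh * tanh(x) + a_sin * sin(x) + a_relu * relu(x).

    coeffs: (a_tanh, a_sin, a_relu), fixed here for illustration;
    they would be learned during training.
    """
    a_tanh, a_sin, a_relu = coeffs
    return a_tanh * math.tanh(x) + a_sin * math.sin(x) + a_relu * max(0.0, x)

# Setting one coefficient to 1 and the rest to 0 recovers a standard AF.
y_tanh_only = adaptive_activation(1.0, (1.0, 0.0, 0.0))
y_mixed = adaptive_activation(1.0, (0.5, 0.3, 0.2))
```

Because each standard activation is a special case of the combination, the network can fall back to whichever one suits the problem, which is one plausible reading of the reduced sensitivity the abstract reports.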

[AI-36] VLMgineer: Vision Language Models as Robotic Toolsmiths

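[Quick read]: This paper asks whether foundation models can automatically design physical tools, and use them effectively, to complete everyday manipulation tasks. The core shift is from robot-intelligence research that relies on optimizing controllers toward improving problem-solving through the design of the tool itself. The key to its solution is the VLMgineer framework, which combines the code generation abilities of vision language models (VLMs) with evolutionary search to iteratively co-design physical tools and the action plans that operate them, solving tasks more effectively and innovatively on a diverse benchmark of everyday manipulation scenarios.

Link: https://arxiv.org/abs/2507.12644
Authors: George Jiayuan Gao,Tianyu Li,Junyao Shi,Yihan Li,Zizhe Zhang,Nadia Figueroa,Dinesh Jayaraman
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Website: this https URL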

Abstract:Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today’s research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool’s design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today’s foundation models, we investigate whether these models can provide useful priors to automatically design and effectively wield such tools? We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.

[AI-37] QSpark: Towards Reliable Qiskit Code Generation

[Quick read]: This paper addresses flawed Qiskit code produced by current Large Language Models (LLMs) such as Granite-20B-Code and StarCoder. The authors fine-tune a 32 B-parameter model with two reinforcement learning methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. The key is that these RL-tuned models produce more correct code on the Qiskit HumanEval benchmark, beating all general-purpose baselines.

Link: https://arxiv.org/abs/2507.12642
Authors: Kiana Kheiri,Aamna Aamir,Andriy Miranskyy,Chen Ding
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:

Abstract:Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned a 32 B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29% Pass@1 (≈ +10 pp over Granite-8B-QK) and GRPO hits 49%, both beating all general-purpose baselines; on the original HumanEval they score 65.90% and 63.00%. GRPO excels on basic tasks (42/54), ORPO on intermediate ones (41/68), and neither solves the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.
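For readers unfamiliar with Pass@1, the metric is usually computed with the unbiased pass@k estimator introduced alongside the original HumanEval benchmark. The sketch below shows that standard estimator; whether this paper samples several completions per task or just one is not stated in the abstract, so the sampling setup is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples passes,
    given that c of the n generated samples passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# One task with 10 samples of which 3 pass: pass@1 is 3/10.
single_task = pass_at_k(10, 3, 1)

# A benchmark score is the mean over tasks (toy data: (n, c) per task).
tasks = [(10, 3), (10, 0), (10, 10)]
benchmark = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
```

With a single sample per task (n = k = 1) the estimator reduces to the fraction of tasks solved, which is how subset counts such as 42/54 (about 77.8%) and 41/68 (about 60.3%) can be read.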

[AI-38] BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training

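[Quick read]: This paper addresses startup overhead in Large Language Model (LLM) training, that is, the delay before a training job begins execution. Startup overhead matters most for industrial-scale LLMs, where failures are more frequent and multiple teams work in iterative update-debug cycles. The study analyzes the components of startup cost, quantifies its direct impact, and examines how it scales with job size. The key to its solution is Bootseer, a system-level optimization framework targeting three main startup bottlenecks: container image loading, runtime dependency installation, and model checkpoint resumption. Bootseer introduces three techniques (hot block record-and-prefetch, dependency snapshotting, and striped HDFS-FUSE), which together cut startup overhead by 50% on real LLM training workloads.

Link: https://arxiv.org/abs/2507.12619
Authors: Rui Li,Xiaoyun Zhi,Jinxin Chi,Menghan Yu,Lixin Huang,Jia Zhu,Weilun Zhang,Xing Ma,Wenjia Liu,Zhicheng Zhu,Daowen Luo,Zuquan Song,Xin Yin,Chao Xiang,Shuguang Wang,Wencong Xiao,Gene Cooperman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 18 pages, 14 figures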

Abstract:Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.

[AI-39] Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning

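[Quick read]: This paper addresses how to optimize the mixture proportions of training data when fine-tuning Large Language Models (LLMs); current strategies are mostly manual or heuristic. The key to its solution is TASKPGM, a principled and scalable mixture-optimization framework that minimizes an energy function over a Markov Random Field (MRF). It models relationships between tasks with behavioral divergences (such as Jensen Shannon Divergence and Pointwise Mutual Information), yields a closed-form solution under simplex constraints, and balances representativeness and diversity among tasks.

Link: https://arxiv.org/abs/2507.12612
Authors: Prateek Chanda,Saral Sureka,Parth Pratim Chatterjee,Krishnateja Killamsetty,Nikhil Shivakumar Nayak,Ganesh Ramakrishnan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9, 8 tables, 7 figures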

Abstract:The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic driven process, with practitioners often relying on uniform or size based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single task finetuned models. Our method yields a closed form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.
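One of the behavioral divergences named in the abstract, the Jensen Shannon Divergence, is simple to compute between two predictive distributions. This sketch covers only that building block under assumed inputs (two probability vectors over the same vocabulary); it does not reproduce TASKPGM's energy function or its closed-form solution.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions from two single-task finetuned models.
task_a = [0.7, 0.2, 0.1]
task_b = [0.1, 0.2, 0.7]
similar = jensen_shannon_divergence(task_a, task_a)
different = jensen_shannon_divergence(task_a, task_b)
```

The divergence is zero for identical distributions and at most ln 2 for disjoint ones, so pairwise values between tasks can be compared on a common scale.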

[AI-40] A Survey of Explainable Reinforcement Learning: Targets Methods and Needs

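[Quick read]: This paper addresses the explainability of reinforcement learning (RL), that is, how to understand and explain the actions of an agent trained by RL. The key contribution is an intuitive taxonomy based on two questions, "What" and "How", used to organize and review the methods of eXplainable Reinforcement Learning (XRL), giving the field a framework and research directions.

Link: https://arxiv.org/abs/2507.12599
Authors: Léo Saulières
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 69 pages, 19 figures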

Abstract:The success of recent Artificial Intelligence (AI) models has been accompanied by the opacity of their internal mechanisms, due notably to the use of deep neural networks. In order to understand these internal mechanisms and explain the output of these AI models, a set of methods have been proposed, grouped under the domain of eXplainable AI (XAI). This paper focuses on a sub-domain of XAI, called eXplainable Reinforcement Learning (XRL), which aims to explain the actions of an agent that has learned by reinforcement learning. We propose an intuitive taxonomy based on two questions “What” and “How”. The first question focuses on the target that the method explains, while the second relates to the way the explanation is provided. We use this taxonomy to provide a state-of-the-art review of over 250 papers. In addition, we present a set of domains close to XRL, which we believe should get attention from the community. Finally, we identify some needs for the field of XRL.

[AI-41] Assay2Mol: large language model-based drug design using BioAssay context

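[Quick read]: This paper addresses the difficulty of exploiting molecule screening assay data in biochemistry because of its unstructured format, which holds back new drug discovery. The key to its solution is Assay2Mol, a large language model-based workflow that retrieves existing assay records involving targets similar to the new target and generates candidate molecules via in-context learning with the retrieved screening data, making early-stage drug discovery more efficient.

Link: https://arxiv.org/abs/2507.12574
Authors: Yifan Deng,Spencer S. Ericksen,Anthony Gitter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 23 pages, 10 figures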

Abstract:Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offer rich information for new drug discovery campaigns but has been untapped because of that unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.

[AI-42] Safeguarding Federated Learning-based Road Condition Classification

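[Quick read]: This paper addresses Targeted Label Flipping Attacks (TLFAs) against Federated Learning (FL) in camera-based Road Condition Classification (RCC) systems. In a TLFA, malicious clients tamper with training data labels, degrading model inference and potentially causing vehicles to misjudge dangerous road conditions. The key solution is FLARE, a defensive mechanism that uses neuron-wise analysis of the output layer to mitigate TLFA effects, improving the security and robustness of FL-RCC systems.

Link: https://arxiv.org/abs/2507.12568
Authors: Sheng Liu,Panos Papadimitratos
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Conference on Communications and Network Security (CNS) 2025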

Abstract:Federated Learning (FL) has emerged as a promising solution for privacy-preserving autonomous driving, specifically camera-based Road Condition Classification (RCC) systems, harnessing distributed sensing, computing, and communication resources on board vehicles without sharing sensitive image data. However, the collaborative nature of FL-RCC frameworks introduces new vulnerabilities: Targeted Label Flipping Attacks (TLFAs), in which malicious clients (vehicles) deliberately alter their training data labels to compromise the learned model inference performance. Such attacks can, e.g., cause a vehicle to mis-classify slippery, dangerous road conditions as pristine and exceed recommended speed. However, TLFAs for FL-based RCC systems are largely missing. We address this challenge with a threefold contribution: 1) we disclose the vulnerability of existing FL-RCC systems to TLFAs; 2) we introduce a novel label-distance-based metric to precisely quantify the safety risks posed by TLFAs; and 3) we propose FLARE, a defensive mechanism leveraging neuron-wise analysis of the output layer to mitigate TLFA effects. Extensive experiments across three RCC tasks, four evaluation metrics, six baselines, and three deep learning models demonstrate both the severity of TLFAs on FL-RCC systems and the effectiveness of FLARE in mitigating the attack impact.

[AI-43] Can Mental Imagery Improve the Thinking Capabilities of AI Systems?

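Quick read: This paper addresses the shortcomings of existing models in autonomous action and independent reasoning, as well as the fact that input data is usually supplied as explicit queries without making full use of sensory data that has already been acquired. The key to its solution is integrating mental imagery into a machine thinking framework: a cognitive thinking unit, supported by an input data unit, a needs unit, and a mental imagery unit, represents data as natural language sentences or drawn sketches to support information transfer and decision making.

Link: https://arxiv.org/abs/2507.12555
Authors: Slimane Larabi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures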

Abstract:Although existing models can interact with humans and provide satisfactory responses, they lack the ability to act autonomously or engage in independent reasoning. Furthermore, input data in these models is typically provided as explicit queries, even when some sensory data is already acquired. In addition, AI agents, which are computational entities designed to perform tasks and make decisions autonomously based on their programming, data inputs, and learned knowledge, have shown significant progress. However, they struggle with integrating knowledge across multiple domains, unlike humans. Mental imagery plays a fundamental role in the brain’s thinking process, which involves performing tasks based on internal multisensory data, planned actions, needs, and reasoning capabilities. In this paper, we investigate how to integrate mental imagery into a machine thinking framework and how this could be beneficial in initiating the thinking process. Our proposed machine thinking framework integrates a Cognitive thinking unit supported by three auxiliary units: the Input Data Unit, the Needs Unit, and the Mental Imagery Unit. Within this framework, data is represented as natural language sentences or drawn sketches, serving both informative and decision-making purposes. We conducted validation tests for this framework, and the results are presented and discussed.

[AI-44] Transforming Football Data into Object-centric Event Logs with Spatial Context Information

Quick read: This paper addresses the scarcity of real-world object-centric event logs and the question of how team sports data can be used for object-centric process mining. The key to its solution is a framework that transforms football data into object-centric event logs with a spatial dimension, providing the first example of object-centric event logs in football analytics.

Link: https://arxiv.org/abs/2507.12504
Authors: Vito Chan,Lennart Ebert,Paul-Julius Hillmann,Christoffer Rubensson,Stephan A. Fahrenkrog-Petersen,Jan Mendling
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted for the 3rd Workshop on Object-centric processes from A to Z (OBJECTS 2025), co-located with BPM 2025

Abstract:Object-centric event logs expand the conventional single-case notion event log by considering multiple objects, allowing for the analysis of more complex and realistic process behavior. However, the number of real-world object-centric event logs remains limited, and further studies are needed to test their usefulness. The increasing availability of data from team sports can facilitate object-centric process mining, leveraging both real-world data and suitable use cases. In this paper, we present a framework for transforming football (soccer) data into an object-centric event log, further enhanced with a spatial dimension. We demonstrate the effectiveness of our framework by generating object-centric event logs based on real-world football data and discuss the results for varying process representations. With our paper, we provide the first example for object-centric event logs in football analytics. Future work should consider variant analysis and filtering techniques to better handle variability

[AI-45] FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making ICML2025

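Quick read: This paper addresses open-ended task solving in embodied environments, in particular generalizing to and executing tasks without reward signals. The key to its solution is the FOUNDER framework, which combines the generalizable knowledge embedded in foundation models (FMs) with the dynamics modeling capabilities of world models (WMs). It learns a function that maps FM representations into the WM state space, thereby inferring the agent's physical state from external observations, and uses the predicted temporal distance to the goal state as an informative reward signal to guide the learning of a goal-conditioned policy.

Link: https://arxiv.org/abs/2507.12496
Authors: Yucen Wang,Rui Yu,Shenghua Wan,Le Gan,De-Chuan Zhan
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by Forty-Second International Conference on Machine Learning (ICML 2025)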

Abstract:Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent’s physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is this https URL.

[AI-46] MR-LDM – The Merge-Reactive Longitudinal Decision Model: Game Theoretic Human Decision Modeling for Interactive Sim Agents

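Quick read: This paper addresses how to make driver behavior in simulation more realistic in highway merging scenarios, in support of autonomous vehicle development. The key to its solution is a game-theoretic tactical decision model with improved payoff functions and lag actions, coupled with an underlying dynamics model to form a unified decision and dynamics model that captures merging interactions and simulates more realistic interactions in an interpretable way.

Link: https://arxiv.org/abs/2507.12494
Authors: Dustin Holley,Jovin D’sa,Hossein Nourkhiz Mahjoub,Gibran Ali
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 8 pages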

Abstract:Enhancing simulation environments to replicate real-world driver behavior, i.e., more humanlike sim agents, is essential for developing autonomous vehicle technology. In the context of highway merging, previous works have studied the operational-level yielding dynamics of lag vehicles in response to a merging car at highway on-ramps. Other works focusing on tactical decision modeling generally consider limited action sets or utilize payoff functions with large parameter sets and limited payoff bounds. In this work, we aim to improve the simulation of the highway merge scenario by targeting a game theoretic model for tactical decision-making with improved payoff functions and lag actions. We couple this with an underlying dynamics model to have a unified decision and dynamics model that can capture merging interactions and simulate more realistic interactions in an explainable and interpretable fashion. The proposed model demonstrated good reproducibility of complex interactions when validated on a real-world dataset. The model was finally integrated into a high fidelity simulation environment and confirmed to have adequate computation time efficiency for use in large-scale simulations to support autonomous vehicle development.

[AI-47] On multiagent online problems with predictions

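Quick read: This paper studies how well competitive algorithms can perform when they use predictions in a multiagent setting; the core question is what the best achievable competitive ratios are under different levels of predictor quality. The key to its solution is a two-predictor framework in which each agent uses one predictor for its own future behavior and another for the behavior of the other players. Using a multiagent ski-rental problem as an illustration, the paper shows how agents can pool resources to obtain a group license and, for the case of perfect predictions about the other players, proposes an algorithm with better robustness properties.

Link: https://arxiv.org/abs/2507.12486
Authors: Gabriel Istrate,Cosmin Bonchis,Victor Bogdan
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.11873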

Abstract:We study the power of (competitive) algorithms with predictions in a multiagent setting. We introduce a two predictor framework, that assumes that agents use one predictor for their future (self) behavior, and one for the behavior of the other players. The main problem we are concerned with is understanding what are the best competitive ratios that can be achieved by employing such predictors, under various assumptions on predictor quality. As an illustration of our framework, we introduce and analyze a multiagent version of the ski-rental problem. In this problem agents can collaborate by pooling resources to get a group license for some asset. If the license price is not met then agents have to rent the asset individually for the day at a unit price. Otherwise the license becomes available forever to everyone at no extra cost. In the particular case of perfect other predictions the algorithm that follows the self predictor is optimal but not robust to mispredictions of agent’s future behavior; we give an algorithm with better robustness properties and benchmark it.
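
For readers unfamiliar with the ski-rental problem that the abstract generalizes, the following is a minimal sketch of the classic single-agent version only. It is not the paper's multiagent algorithm, and all function names are hypothetical. It assumes a unit rental price per day, an integer purchase price, and at least one day of use. The break-even rule rents until the rental spend would reach the purchase price and then buys, while the prediction-following rule trusts a forecast of the number of days outright.

```python
def break_even_cost(ski_days: int, buy_price: int) -> int:
    """Cost of renting for buy_price - 1 days, then buying on day buy_price."""
    if ski_days < buy_price:
        return ski_days  # rented every day at unit price, never bought
    return (buy_price - 1) + buy_price  # rentals so far plus the purchase


def offline_optimum(ski_days: int, buy_price: int) -> int:
    """Best cost when the number of days is known in advance."""
    return min(ski_days, buy_price)


def follow_prediction_cost(ski_days: int, predicted_days: int, buy_price: int) -> int:
    """Trust the forecast blindly: buy on day 1 if it says buying pays off, else rent."""
    if predicted_days >= buy_price:
        return buy_price
    return ski_days  # rents every day, however long the season turns out to be


# Break-even never pays more than (2 * buy_price - 1), so its cost stays
# within a factor (2 - 1 / buy_price) of the offline optimum.
season, price = 100, 10
ratio_break_even = break_even_cost(season, price) / offline_optimum(season, price)

# A bad forecast (2 days predicted, 100 days observed) makes blind trust costly.
ratio_bad_forecast = follow_prediction_cost(season, 2, price) / offline_optimum(season, price)
```

With a correct forecast the prediction-following rule matches the offline optimum, but its cost grows without bound as the forecast error grows, which is the lack of robustness the abstract refers to.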

[AI-48] AI-Powered Math Tutoring: Platform for Personalized and Adaptive Education

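Quick read: This paper addresses limitations of current AI tutoring systems in mathematics education: they tend to offer only reactive help and lack the ability to encourage deep reflection or to incorporate structured pedagogical strategies. The key to its solution is a novel multi-agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool-assisted learning.

Link: https://arxiv.org/abs/2507.12484
Authors: Jarosław A. Chudziak,Adam Kostka
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages, 5 figures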

Abstract:The growing ubiquity of artificial intelligence (AI), in particular large language models (LLMs), has profoundly altered the way in which learners gain knowledge and interact with learning material, with many claiming that AI positively influences their learning achievements. Despite this advancement, current AI tutoring systems face limitations associated with their reactive nature, often providing direct answers without encouraging deep reflection or incorporating structured pedagogical tools and strategies. This limitation is most apparent in the field of mathematics, in which AI tutoring systems remain underdeveloped. This research addresses the question: How can AI tutoring systems move beyond providing reactive assistance to enable structured, individualized, and tool-assisted learning experiences? We introduce a novel multi-agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool-assisted learning processes. This system allows students to learn new topics while identifying and targeting their weaknesses, revise for exams effectively, and practice on an unlimited number of personalized exercises. This article contributes to the field of artificial intelligence in education by introducing a novel platform that brings together pedagogical agents and AI-driven components, augmenting the field with modular and effective systems for teaching mathematics.

[AI-49] Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Memory-Driven Code Understanding

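Quick read: This paper addresses limitations of Large Language Models (LLMs) in code generation and software automation, namely limited inference-time context and the lack of explicit code structure reasoning. The key to its solution is the Kodezi Chronos architecture, which uses a multi-level embedding memory engine that combines vector and graph-based indexing with continuous code-aware retrieval to reason efficiently and accurately over ultra-long contexts spanning entire codebases, histories, and documentation, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions.

Link: https://arxiv.org/abs/2507.12482
Authors: Ishraq Khan,Assad Chowdary,Sharoz Haseeb,Urvish Patel
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures, 7 tables, IEEE Conference format, Q4 2025 model release, Q1 2026 Kodezi OS deployment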

Abstract:Large Language Models (LLMs) have advanced code generation and software automation, but are fundamentally constrained by limited inference-time context and lack of explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation architecture for autonomous code understanding, debugging, and maintenance, designed to operate across ultra-long contexts comprising entire codebases, histories, and documentation, all without fixed window limits. Kodezi Chronos leverages a multi-level embedding memory engine, combining vector and graph-based indexing with continuous code-aware retrieval. This enables efficient and accurate reasoning over millions of lines of code, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions. Our evaluation introduces a novel Multi Random Retrieval benchmark, specifically tailored to the software engineering domain. Unlike classical retrieval benchmarks, this method requires the model to resolve arbitrarily distant and obfuscated associations across code artifacts, simulating realistic tasks such as variable tracing, dependency migration, and semantic bug localization. Chronos outperforms prior LLMs and code models, demonstrating a 23% improvement in real-world bug detection and reducing debugging cycles by up to 40% compared to traditional sequence-based approaches. By natively interfacing with IDEs and CI/CD workflows, Chronos enables seamless, autonomous software maintenance, elevating code reliability and productivity while reducing manual effort. These results mark a critical advance toward self-sustaining, continuously optimized software ecosystems.

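[AI-50] LLM-Powered Quantum Code Transpilation

Quick read: This paper addresses interoperability between Quantum SDKs (QSDKs) for different quantum computing platforms and the cross-platform development of hybrid quantum-classical software systems. Traditional rule-based transpilers that convert code between QSDKs are costly to design and maintain, depend on deep expertise, and rely on rigid mappings between source and destination code. The key idea is to use Large Language Models (LLMs) as programming language-agnostic transpilers that draw on pretrained knowledge and contextual reasoning to convert quantum programs between QSDKs automatically while preserving functional equivalence, removing the need for manually defined transformation rules and offering a scalable approach to quantum software portability.

Link: https://arxiv.org/abs/2507.12480
Authors: Nazanin Siavash,Armin Moin
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: IEEE International Conference on Quantum Computing and Engineering (QCE) 2025 - Extended Abstract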

Abstract:There exist various Software Development Kits (SDKs) tailored to different quantum computing platforms. These are known as Quantum SDKs (QSDKs). Examples include but are not limited to Qiskit, Cirq, and PennyLane. However, this diversity presents significant challenges for interoperability and cross-platform development of hybrid quantum-classical software systems. Traditional rule-based transpilers for translating code between QSDKs are time-consuming to design and maintain, requiring deep expertise and rigid mappings in the source and destination code. In this study, we explore the use of Large Language Models (LLMs) as a flexible and automated solution. Leveraging their pretrained knowledge and contextual reasoning capabilities, we position LLMs as programming language-agnostic transpilers capable of converting quantum programs from one QSDK to another while preserving functional equivalence. Our approach eliminates the need for manually defined transformation rules and offers a scalable solution to quantum software portability. This work represents a step toward enabling intelligent, general-purpose transpilation in the quantum computing ecosystem.

[AI-51] A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys

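Quick read: This paper aims to address the fact that traditional image anomaly detection methods, such as manual visual inspection, become impractical as the data volume of astronomical imaging surveys grows rapidly. The key to its solution is a machine-learning-based semi-supervised pipeline that combines a vision transformer (ViT) trained via self-supervised learning (SSL) with a k-Nearest Neighbor (kNN) classifier to identify poor-quality exposures in regions of low extinction efficiently and accurately.

Link: https://arxiv.org/abs/2507.12784
Authors: Yufeng Luo,Adam D. Myers,Alex Drlica-Wagner,Dario Dematties,Salma Borchani,Frank Valdes,Arjun Dey,David Schlegel,Rongpu Zhou,DESI Legacy Imaging Surveys Team
Affiliation: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 21 pages, 12 figures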

Abstract:As the data volume of astronomical imaging surveys rapidly increases, traditional methods for image anomaly detection, such as visual inspection by human experts, are becoming impractical. We introduce a machine-learning-based approach to detect poor-quality exposures in large imaging surveys, with a focus on the DECam Legacy Survey (DECaLS) in regions of low extinction (i.e., E(B-V) < 0.04). Our semi-supervised pipeline integrates a vision transformer (ViT), trained via self-supervised learning (SSL), with a k-Nearest Neighbor (kNN) classifier. We train and validate our pipeline using a small set of labeled exposures observed by surveys with the Dark Energy Camera (DECam). A clustering-space analysis of where our pipeline places images labeled in "good" and "bad" categories suggests that our approach can efficiently and accurately determine the quality of exposures. Applied to new imaging being reduced for DECaLS Data Release 11, our pipeline identifies 780 problematic exposures, which we subsequently verify through visual inspection. Being highly efficient and adaptable, our method offers a scalable solution for quality control in other large imaging surveys.
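
The classification stage described in the abstract, a kNN vote over learned embeddings, can be illustrated with a minimal sketch. This is not the authors' pipeline: the two-dimensional vectors below are made-up stand-ins for ViT embeddings, and the function name `knn_predict`, the choice of Euclidean distance, and `k=3` are assumptions for illustration only.

```python
import math
from collections import Counter


def knn_predict(query, embeddings, labels, k=3):
    """Label a query embedding by majority vote among its k nearest labeled neighbors."""
    # Rank labeled examples by Euclidean distance to the query.
    ranked = sorted(range(len(embeddings)), key=lambda i: math.dist(query, embeddings[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical embeddings of a small labeled set of exposures.
labeled_embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (2.0, 2.1), (2.2, 1.9)]
labeled_quality = ["good", "good", "good", "bad", "bad"]

near_good_cluster = knn_predict((0.1, 0.1), labeled_embeddings, labeled_quality)
near_bad_cluster = knn_predict((2.1, 2.0), labeled_embeddings, labeled_quality)
```

Because only the small labeled set is needed at this stage, the expensive representation learning can remain self-supervised, which is what makes the overall pipeline semi-supervised.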

[AI-52] Achieving Robust Channel Estimation Neural Networks by Designed Training Data

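Quick read: This paper addresses the performance degradation of data-driven neural networks on new, unseen channels in cognitive communications, which arises from the time-varying nature of wireless channels combined with low-latency requirements and limited computing resources. The key to its solution is designing offline-trained neural networks that remain robust on unknown channels without relying on actual channel information. To this end, the authors propose design criteria for generating synthetic training datasets that ensure the trained network reaches a certain mean squared error (MSE) on new and previously unseen channels, so that it can be deployed in practice without prior channel information or parameter updates.

Link: https://arxiv.org/abs/2507.12630
Authors: Dianxin Luan,John Thompson
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Cognitive Communications and Networking (TCCN)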

Abstract:Channel estimation is crucial in cognitive communications, as it enables intelligent spectrum sensing and adaptive transmission by providing accurate information about the current channel state. However, in many papers neural networks are frequently tested by training and testing on one example channel or similar channels. This is because data-driven methods often degrade on new data which they are not trained on, as they cannot extrapolate their training knowledge. This is despite the fact physical channels are often assumed to be time-variant. However, due to the low latency requirements and limited computing resources, neural networks may not have enough time and computing resources to execute online training to fine-tune the parameters. This motivates us to design offline-trained neural networks that can perform robustly over wireless channels, but without any actual channel information being known at design time. In this paper, we propose design criteria to generate synthetic training datasets for neural networks, which guarantee that after training the resulting networks achieve a certain mean squared error (MSE) on new and previously unseen channels. Therefore, neural network solutions require no prior channel information or parameters update for real-world implementations. Based on the proposed design criteria, we further propose a benchmark design which ensures intelligent operation for different channel profiles. To demonstrate general applicability, we use neural networks with different levels of complexity to show that the generalization achieved appears to be independent of neural network architecture. From simulations, neural networks achieve robust generalization to wireless channels with both fixed channel profiles and variable delay spreads.

[AI-53] Sporadic Federated Learning Approach in Quantum Environment to Tackle Quantum Noise

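Quick read: This paper addresses inadequate training performance in Quantum Federated Learning (QFL) caused by heterogeneous quantum noise, which arises because modern quantum devices exhibit different noise levels due to variations in hardware quality and sensitivity to quantum decoherence. The key to its solution is a new framework, SpoQFL, which uses sporadic learning to mitigate quantum noise heterogeneity in distributed quantum systems. Its core mechanism dynamically adjusts training strategies according to noise fluctuations, improving model robustness, convergence stability, and overall learning efficiency.

Link: https://arxiv.org/abs/2507.12492
Authors: Ratun Rahman,Atit Pokharel,Dinh C. Nguyen
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: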

Abstract:Quantum Federated Learning (QFL) is an emerging paradigm that combines quantum computing and federated learning (FL) to enable decentralized model training while maintaining data privacy over quantum networks. However, quantum noise remains a significant barrier in QFL, since modern quantum devices experience heterogeneous noise levels due to variances in hardware quality and sensitivity to quantum decoherence, resulting in inadequate training performance. To address this issue, we propose SpoQFL, a novel QFL framework that leverages sporadic learning to mitigate quantum noise heterogeneity in distributed quantum systems. SpoQFL dynamically adjusts training strategies based on noise fluctuations, enhancing model robustness, convergence stability, and overall learning efficiency. Extensive experiments on real-world datasets demonstrate that SpoQFL significantly outperforms conventional QFL approaches, achieving superior training performance and more stable convergence.

[AI-54] Quantum Transfer Learning to Boost Dementia Detection

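Quick read: This paper addresses the computational and performance limits that classical machine learning and deep learning methods face in dementia prediction with high-dimensional biomedical data and large-scale datasets. The key to its solution is quantum transfer learning (QTL), which is used to enhance the performance of a weak classical deep learning model on a binary dementia detection task; the paper also examines the effect of noise on the QTL approach to assess its reliability and robustness.

Link: https://arxiv.org/abs/2507.12485
Authors: Sounak Bhowmik,Talita Perciano,Himanshu Thapliyal
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: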

Abstract:Dementia is a devastating condition with profound implications for individuals, families, and healthcare systems. Early and accurate detection of dementia is critical for timely intervention and improved patient outcomes. While classical machine learning and deep learning approaches have been explored extensively for dementia prediction, these solutions often struggle with high-dimensional biomedical data and large-scale datasets, quickly reaching computational and performance limitations. To address this challenge, quantum machine learning (QML) has emerged as a promising paradigm, offering faster training and advanced pattern recognition capabilities. This work aims to demonstrate the potential of quantum transfer learning (QTL) to enhance the performance of a weak classical deep learning model applied to a binary classification task for dementia detection. Besides, we show the effect of noise on the QTL-based approach, investigating the reliability and robustness of this method. Using the OASIS 2 dataset, we show how quantum techniques can transform a suboptimal classical model into a more effective solution for biomedical image classification, highlighting their potential impact on advancing healthcare technology.

[AI-55] Coarse Addition and the St. Petersburg Paradox: A Heuristic Perspective

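Quick read: This paper addresses the St. Petersburg paradox, a longstanding problem in decision theory. Traditional solutions usually introduce auxiliary assumptions such as diminishing marginal utility, temporal discounting, or extended number systems, but these may not match how people actually process numerical information. The paper proposes an alternative whose key idea is to partition the outcome space coarsely and define a modified addition operation on that partition. In this model, exact numerical values are grouped into perceptual categories, and each value is replaced by a representative element of its group before being added, so that repeated additions eventually stop affecting the result, a behavior called inertial stabilization. The approach offers a plausible framework for describing how agents with limited cognitive precision handle divergent reward structures.

Link: https://arxiv.org/abs/2507.12475
Authors: Takashi Izumo
Affiliation: Unknown
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 16 pages, no figure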

Abstract:The St. Petersburg paradox presents a longstanding challenge in decision theory. It describes a game whose expected value is infinite, yet for which no rational finite stake can be determined. Traditional solutions introduce auxiliary assumptions, such as diminishing marginal utility, temporal discounting, or extended number systems. These methods often involve mathematical refinements that may not correspond to how people actually perceive or process numerical information. This paper explores an alternative approach based on a modified operation of addition defined over coarse partitions of the outcome space. In this model, exact numerical values are grouped into perceptual categories, and each value is replaced by a representative element of its group before being added. This method allows for a phenomenon where repeated additions eventually cease to affect the outcome, a behavior described as inertial stabilization. Although this is not intended as a definitive resolution of the paradox, the proposed framework offers a plausible way to represent how agents with limited cognitive precision might handle divergent reward structures. We demonstrate that the St. Petersburg series can become inert under this coarse addition for a suitably constructed partition. The approach may also have broader applications in behavioral modeling and the study of machine reasoning under perceptual limitations.
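
The inertial stabilization described in the abstract can be made concrete with a small sketch. The partition used here is an assumption chosen for illustration, not the one constructed in the paper: non-negative integers are binned into the ranges [2^k, 2^(k+1)) and each bin is represented by its lower edge. The names `representative`, `coarse_add`, and `coarse_partial_sums` are hypothetical. In the standard formulation of the game, the k-th outcome pays 2^k with probability 2^(-k), so every term of the expected-value series equals 1.

```python
def representative(value: int) -> int:
    """Map a non-negative integer to the lower edge of its bin [2^k, 2^(k+1))."""
    if value <= 0:
        return 0
    return 1 << (value.bit_length() - 1)


def coarse_add(a: int, b: int) -> int:
    """Replace both operands by their representatives, add, then coarsen the result."""
    return representative(representative(a) + representative(b))


def coarse_partial_sums(terms):
    """Accumulate terms with coarse addition and record every partial sum."""
    total = 0
    sums = []
    for term in terms:
        total = coarse_add(total, term)
        sums.append(total)
    return sums


# Every term of the St. Petersburg expected-value series is 2^(-k) * 2^k = 1.
st_petersburg_terms = [1] * 6
partial_sums = coarse_partial_sums(st_petersburg_terms)
```

Under ordinary addition the partial sums grow without bound, whereas under this coarse addition they stop changing once the running total reaches 2, because adding 1 no longer moves the total into the next bin.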

[AI-56] Implementation and Analysis of GPU Algorithms for Vecchia Approximation

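Quick read: This paper addresses the high computational cost and long runtimes of Gaussian Processes on large-scale data. The key to its solution is a GPU-accelerated Vecchia approximation; by tuning the implementation to the memory type used, it achieves faster runtimes and better predictive accuracy than the existing multi-core and GPU-accelerated software it is compared against.

Link: https://arxiv.org/abs/2407.02740
Authors: Zachary James,Joseph Guinness
Affiliation: Unknown
Categories: Computation (stat.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: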

Abstract:Gaussian Processes have become an indispensable part of the spatial statistician’s toolbox but are unsuitable for analyzing large datasets because of the significant time and memory needed to fit the associated model exactly. Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms. While multi-core software has been developed for Vecchia Approximation, such as the GpGp R package, software designed to run on graphics processing units (GPU) is lacking, despite the tremendous success GPUs have had in statistics and machine learning. We compare three different ways to implement Vecchia Approximation on a GPU: two of which are similar to methods used for other Gaussian Process approximations and one that is new. The impact of memory type on performance is investigated and the final method is optimized accordingly. We show that our new method outperforms the other two and then present it in the GpGpU R package. We compare GpGpU to existing multi-core and GPU-accelerated software by fitting Gaussian Process models on various datasets, including a large spatial-temporal dataset of n > 10^6 points collected from an earth-observing satellite. Our results show that GpGpU achieves faster runtimes and better predictive accuracy.

Machine Learning

[LG-0] Training Transformers with Enforced Lipschitz Constants

Link: https://arxiv.org/abs/2507.13338
Authors: Laker Newhouse,R. Preston Hess,Franz Cesista,Andrii Zahorodnii,Jeremy Bernstein,Phillip Isola
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods – weight decay and spectral normalization – allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon’s update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.

[LG-1] GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

Link: https://arxiv.org/abs/2507.13323
Authors: Kyeongjin Ahn,Sungwon Han,Seungeon Lee,Donghyun Ahn,Hyoshin Kim,Jungwon Kim,Jihee Kim,Sangyoon Park,Meeyoung Cha
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 13 figures, 7 tables

Abstract:Socio-economic indicators like regional GDP, population, and education levels are crucial to shaping policy decisions and fostering sustainable development. This research introduces GeoReg, a regression model that integrates diverse data sources, including satellite imagery and web-based geospatial information, to estimate these indicators even for data-scarce regions such as developing countries. Our approach leverages the prior knowledge of a large language model (LLM) to address the scarcity of labeled data, with the LLM functioning as a data engineer by extracting informative features to enable effective estimation in few-shot settings. Specifically, our model obtains contextual relationships between data features and the target indicator, categorizing their correlations as positive, negative, mixed, or irrelevant. These features are then fed into the linear estimator with tailored weight constraints for each category. To capture nonlinear patterns, the model also identifies meaningful feature interactions and integrates them, along with nonlinear transformations. Experiments across three countries at different stages of development demonstrate that our model outperforms baselines in estimating socio-economic indicators, even for low-income countries with limited data availability.
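
One way to picture a linear estimator with per-category weight constraints is projected gradient descent, sketched below. The abstract does not specify the constraints or the optimizer, so everything here is an assumption for illustration: "positive" features get non-negative weights, "negative" features get non-positive weights, "irrelevant" features are fixed at zero, and "mixed" features are left unconstrained. The function names and the toy data are hypothetical.

```python
def project(weight: float, category: str) -> float:
    """Clamp one weight to the range allowed by its (assumed) correlation category."""
    if category == "positive":
        return max(weight, 0.0)
    if category == "negative":
        return min(weight, 0.0)
    if category == "irrelevant":
        return 0.0
    return weight  # "mixed": no sign restriction


def fit_constrained_linear(rows, targets, categories, lr=0.1, steps=2000):
    """Minimize mean squared error by gradient steps followed by projection."""
    n, d = len(rows), len(categories)
    weights = [0.0] * d
    for _ in range(steps):
        residuals = [
            sum(w * x for w, x in zip(weights, row)) - y
            for row, y in zip(rows, targets)
        ]
        grads = [
            2.0 / n * sum(r * row[j] for r, row in zip(residuals, rows))
            for j in range(d)
        ]
        weights = [
            project(w - lr * g, c) for w, g, c in zip(weights, grads, categories)
        ]
    return weights


# Toy few-shot data generated as y = 2 * x0 - 1 * x1, with x2 carrying no signal.
rows = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [2, 1, 1]]
targets = [2, -1, 1, 3]
categories = ["positive", "negative", "irrelevant"]
weights = fit_constrained_linear(rows, targets, categories)
```

With only four samples, the constraints act as prior knowledge: the third weight cannot absorb noise at all, and the first two cannot take the wrong sign.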

[LG-2] Boosting Team Modeling through Tempo-Relational Representation Learning

Link: https://arxiv.org/abs/2507.13305
Authors: Vincenzo Marco De Luca,Giovanna Varni,Andrea Passerini
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Team modeling remains a fundamental challenge at the intersection of Artificial Intelligence and the Social Sciences. Social Science research emphasizes the need to jointly model dynamics and relations, while practical applications demand unified models capable of inferring multiple team constructs simultaneously, providing interpretable insights and actionable recommendations to enhance team performance. However, existing works do not meet these practical demands. To bridge this gap, we present TRENN, a novel tempo-relational architecture that integrates: (i) an automatic temporal graph extractor, (ii) a tempo-relational encoder, (iii) a decoder for team construct prediction, and (iv) two complementary explainability modules. TRENN jointly captures relational and temporal team dynamics, providing a solid foundation for MT-TRENN, which extends TRENN by replacing the decoder with a multi-task head, enabling the model to learn shared Social Embeddings and simultaneously predict multiple team constructs, including Emergent Leadership, Leadership Style, and Teamwork components. Experimental results demonstrate that our approach significantly outperforms approaches that rely exclusively on temporal or relational information. Additionally, experimental evaluation has shown that the explainability modules integrated in MT-TRENN yield interpretable insights and actionable suggestions to support team improvement. These capabilities make our approach particularly well-suited for Human-Centered AI applications, such as intelligent decision-support systems in high-stakes collaborative environments.

[LG-3] Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets

链接: https://arxiv.org/abs/2507.13250
作者: Maria Margarida Mascarenhas,Jilles De Blauwe,Mikael Amelin,Hussain Kazmi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Both Maria Margarida Mascarenhas and Jilles De Blauwe contributed equally to the paper

点击查看摘要

Abstract:Accurate short-term electricity price forecasting is crucial for strategically scheduling demand and generation bids in day-ahead markets. While data-driven techniques have shown considerable prowess in achieving high forecast accuracy in recent years, they rely heavily on the quality of input covariates. In this paper, we investigate whether asynchronously published prices as a result of differing gate closure times (GCTs) in some bidding zones can improve forecasting accuracy in other markets with later GCTs. Using a state-of-the-art ensemble of models, we show significant improvements of 22% and 9% in forecast accuracy in the Belgian (BE) and Swedish bidding zones (SE3) respectively, when including price data from interconnected markets with earlier GCT (Germany-Luxembourg, Austria, and Switzerland). This improvement holds for both general as well as extreme market conditions. Our analysis also yields further important insights: frequent model recalibration is necessary for maximum accuracy but comes at substantial additional computational costs, and using data from more markets does not always lead to better performance - a fact we delve deeper into with interpretability analysis of the forecast models. Overall, these findings provide valuable guidance for market participants and decision-makers aiming to optimize bidding strategies within increasingly interconnected and volatile European energy markets.

[LG-4] Computational-Statistical Tradeoffs from NP-hardness

链接: https://arxiv.org/abs/2507.13222
作者: Guy Blanc,Caleb Koch,Carmen Strassle,Li-Yang Tan
类目: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: To appear at FOCS 2025

点击查看摘要

Abstract:A central question in computer science and statistics is whether efficient algorithms can achieve the information-theoretic limits of statistical problems. Many computational-statistical tradeoffs have been shown under average-case assumptions, but since statistical problems are average-case in nature, it has been a challenge to base them on standard worst-case assumptions. In PAC learning where such tradeoffs were first studied, the question is whether computational efficiency can come at the cost of using more samples than information-theoretically necessary. We base such tradeoffs on $\mathsf{NP}$-hardness and obtain: $\circ$ Sharp computational-statistical tradeoffs assuming $\mathsf{NP}$ requires exponential time: For every polynomial $p(n)$, there is an $n$-variate class $C$ with VC dimension 1 such that the sample complexity of time-efficiently learning $C$ is $\Theta(p(n))$. $\circ$ A characterization of $\mathsf{RP}$ vs. $\mathsf{NP}$ in terms of learning: $\mathsf{RP} = \mathsf{NP}$ iff every $\mathsf{NP}$-enumerable class is learnable with $O(\mathrm{VCdim}(C))$ samples in polynomial time. The forward implication has been known since (Pitt and Valiant, 1988); we prove the reverse implication. Notably, all our lower bounds hold against improper learners. These are the first $\mathsf{NP}$-hardness results for improperly learning a subclass of polynomial-size circuits, circumventing formal barriers of Applebaum, Barak, and Xiao (2008).

[LG-5] MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling ECML2025 ALT

链接: https://arxiv.org/abs/2507.13207
作者: Etienne Le Naour,Tahar Nabil,Ghislain Agoua
类目: Machine Learning (cs.LG)
*备注: 10th Workshop on Advanced Analytics and Learning on Temporal Data (AALTD), ECML 2025

点击查看摘要

Abstract:Recent years have witnessed a growing interest for time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.

[LG-6] GradNetOT: Learning Optimal Transport Maps with GradNets

链接: https://arxiv.org/abs/2507.13191
作者: Shreyas Chaudhari,Srinivasa Pranav,José M. F. Moura
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Monotone gradient functions play a central role in solving the Monge formulation of the optimal transport problem, which arises in modern applications ranging from fluid dynamics to robot swarm control. When the transport cost is the squared Euclidean distance, Brenier’s theorem guarantees that the unique optimal map is the gradient of a convex function, namely a monotone gradient map, and it satisfies a Monge-Ampère equation. In [arXiv:2301.10862] [arXiv:2404.07361], we proposed Monotone Gradient Networks (mGradNets), neural networks that directly parameterize the space of monotone gradient maps. In this work, we leverage mGradNets to directly learn the optimal transport mapping by minimizing a training loss function defined using the Monge-Ampère equation. We empirically show that the structural bias of mGradNets facilitates the learning of optimal transport maps and employ our method for a robot swarm control problem.

[LG-7] Spectral Bellman Method: Unifying Representation and Exploration in RL

链接: https://arxiv.org/abs/2507.13181
作者: Ofir Nabati,Bo Dai,Shie Mannor,Guy Tennenholtz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The effect of representation has been demonstrated in reinforcement learning, from both theoretical and empirical successes. However, the existing representation learning mainly induced from model learning aspects, misaligning with our RL tasks. This work introduces Spectral Bellman Representation, a novel framework derived from the Inherent Bellman Error (IBE) condition, which aligns with the fundamental structure of Bellman updates across a space of possible value functions, therefore, directly towards value-based RL. Our key insight is the discovery of a fundamental spectral relationship: under the zero-IBE condition, the transformation of a distribution of value functions by the Bellman operator is intrinsically linked to the feature covariance structure. This spectral connection yields a new, theoretically-grounded objective for learning state-action features that inherently capture this Bellman-aligned covariance. Our method requires a simple modification to existing algorithms. We demonstrate that our learned representations enable structured exploration, by aligning feature covariance with Bellman dynamics, and improve overall performance, particularly in challenging hard-exploration and long-horizon credit assignment tasks. Our framework naturally extends to powerful multi-step Bellman operators, further broadening its impact. Spectral Bellman Representation offers a principled and effective path toward learning more powerful and structurally sound representations for value-based reinforcement learning.

[LG-8] NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech

链接: https://arxiv.org/abs/2507.13155
作者: Maksim Borisov,Egor Spirin,Daria Diatlova
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at this https URL.

[LG-9] NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation

链接: https://arxiv.org/abs/2507.13133
作者: Yuanxin Zhuang,Dazhong Shen,Ying Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.

[LG-10] Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces

链接: https://arxiv.org/abs/2507.13092
作者: Hyo-Jeong Jang,Hye-Bin Shin,Seong-Whan Lee
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) is a fundamental modality for cognitive state monitoring in brain-computer interfaces (BCIs). However, it is highly susceptible to intrinsic signal errors and human-induced labeling errors, which lead to label noise and ultimately degrade model performance. To enhance EEG learning, multimodal knowledge distillation (KD) has been explored to transfer knowledge from visual models with rich representations to EEG-based models. Nevertheless, KD faces two key challenges: modality gap and soft label misalignment. The former arises from the heterogeneous nature of EEG and visual feature spaces, while the latter stems from label inconsistencies that create discrepancies between ground truth labels and distillation targets. This paper addresses semantic uncertainty caused by ambiguous features and weakly defined labels. We propose a novel cross-modal knowledge distillation framework that mitigates both modality and label inconsistencies. It aligns feature semantics through a prototype-based similarity module and introduces a task-specific distillation head to resolve label-induced inconsistency in supervision. Experimental results demonstrate that our approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset. These findings highlight the potential of our framework for BCI applications.

[LG-11] On statistical learning of graphs

链接: https://arxiv.org/abs/2507.13054
作者: Vittorio Cipriani,Valentino Delle Rose,Luca San Mauro,Giovanni Solda
类目: Machine Learning (cs.LG); Logic (math.LO)
*备注:

点击查看摘要

Abstract:We study PAC and online learnability of hypothesis classes formed by copies of a countably infinite graph G, where each copy is induced by permuting G’s vertices. This corresponds to learning a graph’s labeling, knowing its structure and label set. We consider classes where permutations move only finitely many vertices. Our main result shows that PAC learnability of all such finite-support copies implies online learnability of the full isomorphism type of G, and is equivalent to the condition of automorphic triviality. We also characterize graphs where copies induced by swapping two vertices are not learnable, using a relaxation of the extension property of the infinite random graph. Finally, we show that, for all G and k > 2, learnability for k-vertex permutations is equivalent to that for 2-vertex permutations, yielding a four-class partition of infinite graphs, whose complexity we also determine using tools coming from both descriptive set theory and computability theory.

[LG-12] The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

链接: https://arxiv.org/abs/2507.13043
作者: Lefei Shen,Mouxiang Chen,Han Fu,Xiaoxue Ren,Xiaoyun Joy Wang,Jianling Sun,Zhuo Li,Chenghao Liu
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at this https URL.

[LG-13] Confidence-Filtered Relevance (CFR): An Interpretable and Uncertainty-Aware Machine Learning Framework for Naturalness Assessment in Satellite Imagery

链接: https://arxiv.org/abs/2507.13034
作者: Ahmed Emam,Ribana Roscher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Protected natural areas play a vital role in ecological balance and ecosystem services. Monitoring these regions at scale using satellite imagery and machine learning is promising, but current methods often lack interpretability and uncertainty-awareness, and do not address how uncertainty affects naturalness assessment. In contrast, we propose Confidence-Filtered Relevance (CFR), a data-centric framework that combines LRP Attention Rollout with Deep Deterministic Uncertainty (DDU) estimation to analyze how model uncertainty influences the interpretability of relevance heatmaps. CFR partitions the dataset into subsets based on uncertainty thresholds, enabling systematic analysis of how uncertainty shapes the explanations of naturalness in satellite imagery. Applied to the AnthroProtect dataset, CFR assigned higher relevance to shrublands, forests, and wetlands, aligning with other research on naturalness assessment. Moreover, our analysis shows that as uncertainty increases, the interpretability of these relevance heatmaps declines and their entropy grows, indicating less selective and more ambiguous attributions. CFR provides a data-centric approach to assess the relevance of patterns to naturalness in satellite imagery based on their associated certainty.
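The abstract reports that heatmap entropy grows as uncertainty increases, but it does not say how the entropy is computed. A minimal sketch, assuming the heatmap is normalized into a probability distribution over pixels and scored with Shannon entropy; the function name `heatmap_entropy` and the use of absolute relevance values are illustrative assumptions, not the authors' definition:

```python
import math


def heatmap_entropy(relevance):
    """Shannon entropy (in nats) of a 2D relevance heatmap.

    Assumption: absolute relevance values are normalized to sum to 1,
    so a diffuse heatmap scores high and a sharply focused one scores low.
    """
    flat = [abs(v) for row in relevance for v in row]
    total = sum(flat)
    if total == 0:
        return 0.0  # an all-zero heatmap carries no attribution at all
    # Zero-mass pixels contribute nothing to the sum (0 * log 0 := 0).
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)


uniform = [[1.0, 1.0], [1.0, 1.0]]  # relevance spread evenly over all pixels
peaked = [[1.0, 0.0], [0.0, 0.0]]   # all relevance on a single pixel
print(heatmap_entropy(uniform), heatmap_entropy(peaked))
```

Under this assumed definition, a less selective attribution map is closer to the uniform case and therefore has higher entropy.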

[LG-14] Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers

链接: https://arxiv.org/abs/2507.13022
作者: Luis Basora,Louison Bocquet-Nouaille,Elinirina Robinson,Serge Le Gonidec
类目: Machine Learning (cs.LG)
*备注: 53 pages, 16 figures

点击查看摘要

Abstract:In the context of the health monitoring for the next generation of reusable space launchers, we outline a first step toward developing an onboard fault detection and diagnostic capability for the electrical system that controls the engine valves. Unlike existing approaches in the literature, our solution is designed to meet a broader range of key requirements. This includes estimating confidence levels for predictions, detecting out-of-distribution (OOD) cases, and controlling false alarms. The proposed solution is based on a temporal convolutional autoencoder to automatically extract low-dimensional features from raw sensor data. Fault detection and diagnosis are respectively carried out using a binary and a multiclass classifier trained on the autoencoder latent and residual spaces. The classifiers are histogram-based gradient boosting models calibrated to output probabilities that can be interpreted as confidence levels. A relatively simple technique, based on inductive conformal anomaly detection, is used to identify OOD data. We leverage other simple yet effective techniques, such as cumulative sum control chart (CUSUM) to limit the false alarms, and threshold moving to address class imbalance in fault detection. The proposed framework is highly configurable and has been evaluated on simulated data, covering both nominal and anomalous operational scenarios. The results indicate that our solution is a promising first step, though testing with real data will be necessary to ensure that it achieves the required maturity level for operational use.
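The abstract names CUSUM as the technique used to limit false alarms but gives no configuration. The following is a minimal sketch of a standard one-sided CUSUM applied to a stream of per-sample fault scores; the function name `cusum_alarms` and the parameters `target`, `slack`, and `threshold` are hypothetical choices for illustration, not values from the paper:

```python
def cusum_alarms(scores, target, slack, threshold):
    """One-sided (upper) CUSUM over a sequence of fault scores.

    target:    expected score under nominal operation (assumed known)
    slack:     allowance subtracted at each step so small deviations decay
    threshold: alarm level for the cumulative statistic
    Returns the indices at which an alarm is raised.
    """
    s = 0.0
    alarms = []
    for t, x in enumerate(scores):
        # Accumulate only the excess over target + slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            alarms.append(t)
            s = 0.0  # restart the statistic after raising an alarm
    return alarms


# A single isolated spike does not trigger an alarm, a sustained shift does.
print(cusum_alarms([0.1, 0.9, 0.1, 0.1], target=0.1, slack=0.1, threshold=1.0))
print(cusum_alarms([0.9, 0.9, 0.9], target=0.1, slack=0.1, threshold=1.0))
```

The point of the cumulative statistic is that an alarm needs persistent evidence, which is why an isolated high score from the classifier is absorbed rather than reported.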

[LG-15] FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient

链接: https://arxiv.org/abs/2507.12983
作者: ShanBin Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Fairness has emerged as one of the key challenges in federated learning. In horizontal federated settings, data heterogeneity often leads to substantial performance disparities across clients, raising concerns about equitable model behavior. To address this issue, we propose FedGA, a fairness-aware federated learning algorithm. We first employ the Gini coefficient to measure the performance disparity among clients. Based on this, we establish a relationship between the Gini coefficient G and the update scale of the global model U_s, and use this relationship to adaptively determine the timing of fairness intervention. Subsequently, we dynamically adjust the aggregation weights according to the system’s real-time fairness status, enabling the global model to better incorporate information from clients with relatively poor performance. We conduct extensive experiments on the Office-Caltech-10, CIFAR-10, and Synthetic datasets. The results show that FedGA effectively improves fairness metrics such as variance and the Gini coefficient, while maintaining strong overall performance, demonstrating the effectiveness of our approach.
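As a rough illustration of the disparity measure mentioned in the abstract, the sketch below computes the standard Gini coefficient over per-client accuracies. It shows only the textbook formula; how FedGA relates G to the update scale U_s or adjusts aggregation weights is not described in the abstract and is not reproduced here. The function name `gini` and the sample accuracies are illustrative assumptions:

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal).

    Uses the sorted-rank form: G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the values in ascending order and i starts at 1.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no clients or no mass: treat as no measurable disparity
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical per-client test accuracies after one communication round.
client_accuracy = [0.91, 0.88, 0.52, 0.47]
print(gini(client_accuracy))
```

A value near 0 means clients perform similarly; larger values indicate that performance is concentrated in a few clients.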

[LG-16] A Spectral Interpretation of Redundancy in a Graph Reservoir ICANN2025

链接: https://arxiv.org/abs/2507.12963
作者: Anna Bison,Alessandro Sperduti
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at the 3rd International Workshop on Reservoir Computing (RC 2025) at ICANN 2025

点击查看摘要

Abstract:Reservoir computing has been successfully applied to graphs as a preprocessing method to improve the training efficiency of Graph Neural Networks (GNNs). However, a common issue that arises when repeatedly applying layer operators on graphs is over-smoothing, which consists in the convergence of graph signals toward low-frequency components of the graph Laplacian. This work revisits the definition of the reservoir in the Multiresolution Reservoir Graph Neural Network (MRGNN), a spectral reservoir model, and proposes a variant based on a Fairing algorithm originally introduced in the field of surface design in computer graphics. This algorithm provides a pass-band spectral filter that allows smoothing without shrinkage, and it can be adapted to the graph setting through the Laplacian operator. Given its spectral formulation, this method naturally connects to GNN architectures for tasks where smoothing, when properly controlled, can be beneficial, such as graph classification. The core contribution of the paper lies in the theoretical analysis of the algorithm from a random walks perspective. In particular, it shows how tuning the spectral coefficients can be interpreted as modulating the contribution of redundant random walks. Exploratory experiments based on the MRGNN architecture illustrate the potential of this approach and suggest promising directions for future research.

[LG-17] Insights into a radiology-specialised multimodal large language model with sparse autoencoders ICML2025

链接: https://arxiv.org/abs/2507.12950
作者: Kenza Bouzid,Shruthi Bannur,Daniel Coelho de Castro,Anton Schwaighofer,Javier Alvarez-Valle,Stephanie L. Hyland
类目: Machine Learning (cs.LG)
*备注: Actionable Interpretability Workshop at ICML 2025. 24 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: this https URL.

[LG-18] From a Mixed-Policy Perspective: Improving Differentiable Automatic Post-editing Optimization

链接: https://arxiv.org/abs/2507.12931
作者: Hongze Tan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper introduces two novel modifications to the Differentiable Automatic Post-editing Optimization (DAPO) algorithm, approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_\phi$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{on}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO’s. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.

[LG-19] Trace Reconstruction with Language Models

链接: https://arxiv.org/abs/2507.12927
作者: Franziska Weindel,Michael Girsch,Reinhard Heckel
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error.

[LG-20] Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI

链接: https://arxiv.org/abs/2507.12913
作者: Chenrui Zhu,Louenas Bounia,Vu Linh Nguyen,Sébastien Destercke,Arthur Hoarau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in machine learning have emphasized the need for transparency in model predictions, particularly as interpretability diminishes when using increasingly complex architectures. In this paper, we propose leveraging prediction uncertainty as a complementary approach to classical explainability methods. Specifically, we distinguish between aleatoric (data-related) and epistemic (model-related) uncertainty to guide the selection of appropriate explanations. Epistemic uncertainty serves as a rejection criterion for unreliable explanations and, in itself, provides insight into insufficient training (a new form of explanation). Aleatoric uncertainty informs the choice between feature-importance explanations and counterfactual explanations. This leverages a framework of explainability methods driven by uncertainty quantification and disentanglement. Our experiments demonstrate the impact of this uncertainty-aware approach on the robustness and attainability of explanations in both traditional machine learning and deep learning scenarios.

[LG-21] LaViPlan : Language-Guided Visual Path Planning with RLVR

链接: https://arxiv.org/abs/2507.12911
作者: Hayeon Oh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) scenarios in autonomous driving refer to situations that deviate from the training domain, often leading to unexpected and potentially hazardous behavior from planners that lack prior exposure to such cases. Recently, Vision-Language Models (VLMs) have been introduced into autonomous driving research for their promising generalization capabilities in OOD settings. Early studies demonstrated that VLMs could recognize OOD scenarios and generate user-level decisions such as “go straight” or “turn right.” However, a new challenge has emerged due to the misalignment between the VLM’s high-level decisions or visual reasoning expressed in language, and the low-level predicted trajectories interpreted as actions. In this paper, we propose LaViPlan, a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to optimize VLMs using planning-oriented metrics. This approach addresses the vision-language-action misalignment observed in existing VLMs fine-tuned via supervised learning, which can recognize driving scenarios but often produce context-unaware decisions. Experimental results demonstrate that our method improves situational awareness and decision-making under OOD conditions, highlighting its potential to mitigate the misalignment issue. This work introduces a promising post-training paradigm for VLM agents in the context of autonomous driving.

[LG-22] Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services

Link: https://arxiv.org/abs/2507.12908
Authors: Jiadong Chen, Hengyu Ye, Fuxin Jiang, Xiao He, Tieying Zhang, Jianjun Chen, Xiaofeng Gao
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 11 figures

Abstract:Workload forecasting is pivotal in cloud service applications, such as auto-scaling and scheduling, with profound implications for operational efficiency. Although Transformer-based forecasting models have demonstrated remarkable success in general tasks, their computational efficiency often falls short of the stringent requirements in large-scale cloud environments. Given that most workload series exhibit complicated periodic patterns, addressing these challenges in the frequency domain offers substantial advantages. To this end, we propose Fremer, an efficient and effective deep forecasting model. Fremer fulfills three critical requirements: it demonstrates superior efficiency, outperforming most Transformer-based forecasting models; it achieves exceptional accuracy, surpassing all state-of-the-art (SOTA) models in workload forecasting; and it exhibits robust performance for multi-period series. Furthermore, we collect and open-source four high-quality, open-source workload datasets derived from ByteDance’s cloud services, encompassing workload data from thousands of computing instances. Extensive experiments on both our proprietary datasets and public benchmarks demonstrate that Fremer consistently outperforms baseline models, achieving average improvements of 5.5% in MSE, 4.7% in MAE, and 8.6% in SMAPE over SOTA models, while simultaneously reducing parameter scale and computational costs. Additionally, in a proactive auto-scaling test based on Kubernetes, Fremer improves average latency by 18.78% and reduces resource consumption by 2.35%, underscoring its practical efficacy in real-world applications.
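
The abstract's motivation, that periodic workload series are easier to handle in the frequency domain, can be illustrated with a small sketch. This is not Fremer's architecture, which the abstract does not specify. The function `dominant_period`, the synthetic `workload` series, and the naive DFT are all assumptions made for illustration.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant (DC) component
    best_k, best_power = 1, -1.0
    # Naive DFT over positive frequencies k = 1 .. n/2 (adequate for short series).
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Hypothetical workload: a 24-sample cycle plus a weaker 12-sample harmonic.
workload = [10 + 5 * math.sin(2 * math.pi * t / 24)
            + 1.5 * math.sin(2 * math.pi * t / 12) for t in range(240)]
period = dominant_period(workload)
```

In the time domain the two cycles overlap, while in the spectrum each one shows up as a separate peak, which is why multi-period series are a natural fit for frequency-domain models.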

[LG-23] Learning to Reject Low-Quality Explanations via User Feedback

Link: https://arxiv.org/abs/2507.12900
Authors: Luca Stradiotti, Dario Pesenti, Stefano Teso, Jesse Davis
Categories: Machine Learning (cs.LG)

Abstract:Machine Learning predictors are increasingly being employed in high-stakes applications such as credit scoring. Explanations help users unpack the reasons behind their predictions, but are not always "high quality". That is, end-users may have difficulty interpreting or believing them, which can complicate trust assessment and downstream decision-making. We argue that classifiers should have the option to refuse handling inputs whose predictions cannot be explained properly and introduce a framework for learning to reject low-quality explanations (LtX) in which predictors are equipped with a rejector that evaluates the quality of explanations. In this problem setting, the key challenges are how to properly define and assess explanation quality and how to design a suitable rejector. Focusing on popular attribution techniques, we introduce ULER (User-centric Low-quality Explanation Rejector), which learns a simple rejector from human ratings and per-feature relevance judgments to mirror human judgments of explanation quality. Our experiments show that ULER outperforms both state-of-the-art and explanation-aware learning to reject strategies at LtX on eight classification and regression benchmarks and on a new human-annotated dataset, which we will publicly release to support future research.

[LG-24] Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Link: https://arxiv.org/abs/2507.12898
Authors: Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

[LG-25] Autonomous Resource Management in Microservice Systems via Reinforcement Learning

Link: https://arxiv.org/abs/2507.12879
Authors: Yujun Zou, Nia Qi, Yingnan Deng, Zhihao Xue, Ming Gong, Wuyang Zhang
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:This paper proposes a reinforcement learning-based method for microservice resource scheduling and optimization, aiming to address issues such as uneven resource allocation, high latency, and insufficient throughput in traditional microservice architectures. In microservice systems, as the number of services and the load increase, efficiently scheduling and allocating resources such as computing power, memory, and storage becomes a critical research challenge. To address this, the paper employs an intelligent scheduling algorithm based on reinforcement learning. Through the interaction between the agent and the environment, the resource allocation strategy is continuously optimized. In the experiments, the paper considers different resource conditions and load scenarios, evaluating the proposed method across multiple dimensions, including response time, throughput, resource utilization, and cost efficiency. The experimental results show that the reinforcement learning-based scheduling method significantly improves system response speed and throughput under low load and high concurrency conditions, while also optimizing resource utilization and reducing energy consumption. Under multi-dimensional resource conditions, the proposed method can consider multiple objectives and achieve optimized resource scheduling. Compared to traditional static resource allocation methods, the reinforcement learning model demonstrates stronger adaptability and optimization capability. It can adjust resource allocation strategies in real time, thereby maintaining good system performance in dynamically changing load and resource environments.

[LG-26] Topology-Aware Activation Functions in Neural Networks

Link: https://arxiv.org/abs/2507.12874
Authors: Pavel Snopov, Oleg R. Musin
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to ESANN 2025. Published in the ESANN 2025 proceedings

Abstract:This study explores novel activation functions that enhance the ability of neural networks to manipulate data topology during training. Building on the limitations of traditional activation functions like $\mathrm{ReLU}$, we propose $\mathrm{SmoothSplit}$ and $\mathrm{ParametricSplit}$, which introduce topology "cutting" capabilities. These functions enable networks to transform complex data manifolds effectively, improving performance in scenarios with low-dimensional layers. Through experiments on synthetic and real-world datasets, we demonstrate that $\mathrm{ParametricSplit}$ outperforms traditional activations in low-dimensional settings while maintaining competitive performance in higher-dimensional ones. Our findings highlight the potential of topology-aware activation functions in advancing neural network architectures. The code is available via this https URL.

[LG-27] An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System

Link: https://arxiv.org/abs/2507.12873
Authors: Danilo Avola, Giancarlo Crocetti, Gian Luca Foresti, Daniele Pannone, Claudio Piciarelli, Amedeo Ranaldi
Categories: Machine Learning (cs.LG)

Abstract:This work explores the feasibility of biometric authentication using EEG signals acquired through in-ear devices, commonly referred to as ear-EEG. Traditional EEG-based biometric systems, while secure, often suffer from low usability due to cumbersome scalp-based electrode setups. In this study, we propose a novel and practical framework leveraging ear-EEG signals as a user-friendly alternative for everyday biometric authentication. The system extracts an original combination of temporal and spectral features from ear-EEG signals and feeds them into a fully connected deep neural network for subject identification. Experimental results on the only currently available ear-EEG dataset suitable for different purposes, including biometric authentication, demonstrate promising performance, with an average accuracy of 82% in a subject identification scenario. These findings confirm the potential of ear-EEG as a viable and deployable direction for next-generation real-world biometric systems.

[LG-28] Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations

Link: https://arxiv.org/abs/2507.12854
Authors: Danilo Avola, Andrea Bernardini, Francesco Danese, Mario Lezoche, Maurizio Mancini, Daniele Pannone, Amedeo Ranaldi
Categories: Machine Learning (cs.LG)

Abstract:Wi-Fi sensing is gaining momentum as a non-intrusive and privacy-preserving alternative to vision-based systems for human identification. However, person identification through wireless signals, particularly without user motion, remains largely unexplored. Most prior wireless-based approaches rely on movement patterns, such as walking gait, to extract biometric cues. In contrast, we propose a transformer-based method that identifies individuals from Channel State Information (CSI) recorded while the subject remains stationary. CSI captures fine-grained amplitude and phase distortions induced by the unique interaction between the human body and the radio signal. To support evaluation, we introduce a dataset acquired with ESP32 devices in a controlled indoor environment, featuring six participants observed across multiple orientations. A tailored preprocessing pipeline, including outlier removal, smoothing, and phase calibration, enhances signal quality. Our dual-branch transformer architecture processes amplitude and phase modalities separately and achieves 99.82% classification accuracy, outperforming convolutional and multilayer perceptron baselines. These results demonstrate the discriminative potential of CSI perturbations, highlighting their capacity to encode biometric traits in a consistent manner. They further confirm the viability of passive, device-free person identification using low-cost commodity Wi-Fi hardware in real-world settings.

[LG-29] A Kernel Distribution Closeness Testing

Link: https://arxiv.org/abs/2507.12843
Authors: Zhijian Zhou, Liuhua Peng, Xunye Tian, Feng Liu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The distribution closeness testing (DCT) assesses whether the distance between a distribution pair is at least $\epsilon$-far. Existing DCT methods mainly measure discrepancies between a distribution pair defined on discrete one-dimensional spaces (e.g., using total variation), which limits their applications to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measurement of the distributional discrepancy between two complex distributions, into DCT scenarios. However, we find that MMD's value can be the same for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), making MMD less informative when assessing the closeness levels for multiple distribution pairs. To mitigate the issue, we design a new measurement of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we finally propose the NAMMD-based DCT to assess the closeness levels of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power compared to MMD-based DCT, with bounded type-I error, which is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images). Furthermore, we also apply the proposed NAMMD for addressing the two-sample testing problem and find NAMMD-based two-sample test has higher test power than the MMD-based two-sample test in both theory and experiments.
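
For readers unfamiliar with MMD, the quantity that NAMMD rescales, here is a minimal sketch of the standard biased estimate of squared MMD with a Gaussian kernel. It covers plain MMD only, because the abstract does not give the NAMMD normalisation. The bandwidth, sample sizes, and helper names are assumptions chosen for illustration.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

rng = random.Random(0)

def sample(mean, n=100):
    """Draw n points from a 2-D Gaussian with the given mean and unit variance."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]

same = mmd2_biased(sample(0.0), sample(0.0))  # both samples from one distribution
far = mmd2_biased(sample(0.0), sample(2.0))   # samples from shifted distributions
```

The estimate stays close to zero when both samples come from the same distribution and grows as the distributions move apart. A closeness test then asks whether that value exceeds a chosen $\epsilon$.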

[LG-30] Bridging the Gap: Leveraging Retrieval-Augmented Generation to Better Understand Public Concerns about Vaccines

Link: https://arxiv.org/abs/2507.12840
Authors: Muhammad Javed, Sedigh Khademi Habibabadi, Christopher Palmer, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Vaccine hesitancy threatens public health, leading to delayed or rejected vaccines. Social media is a vital source for understanding public concerns, and traditional methods like topic modelling often struggle to capture nuanced opinions. Though trained for query answering, Large Language Models (LLMs) often miss current events and community concerns. Additionally, hallucinations in LLMs can compromise public health communication. To address these limitations, we developed a tool (VaxPulse Query Corner) using the Retrieval Augmented Generation technique. It addresses complex queries about public vaccine concerns on various online platforms, aiding public health administrators and stakeholders in understanding public concerns and implementing targeted interventions to boost vaccine confidence. Analysing 35,103 Shingrix social media posts, it achieved answer faithfulness (0.96) and relevance (0.94).

[LG-31] Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability

Link: https://arxiv.org/abs/2507.12837
Authors: Kaiqi Jiang, Jeremy Cohen, Yuanzhi Li
Categories: Machine Learning (cs.LG)

Abstract:The study of Neural Tangent Kernels (NTKs) in deep learning has drawn increasing attention in recent years. NTKs typically actively change during training and are related to feature learning. In parallel, recent work on Gradient Descent (GD) has found a phenomenon called Edge of Stability (EoS), in which the largest eigenvalue of the NTK oscillates around a value inversely proportional to the step size. However, although follow-up works have explored the underlying mechanism of such eigenvalue behavior in depth, the understanding of the behavior of the NTK eigenvectors during EoS is still missing. This paper examines the dynamics of NTK eigenvectors during EoS in detail. Across different architectures, we observe that larger learning rates cause the leading eigenvectors of the final NTK, as well as the full NTK matrix, to have greater alignment with the training target. We then study the underlying mechanism of this phenomenon and provide a theoretical analysis for a two-layer linear network. Our study enhances the understanding of GD training dynamics in deep learning.
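
The statement that the leading eigenvalue hovers around a value inversely proportional to the step size has a classical one-dimensional analogue, sketched below. For gradient descent on the quadratic $L(x) = \tfrac{1}{2}\lambda x^2$, the iterates contract only when $\lambda < 2/\eta$. Treating $2/\eta$ as the relevant constant is an assumption, since the abstract states only the inverse proportionality, and `run_gd` is a hypothetical helper.

```python
def run_gd(curvature, step_size, steps=50, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * curvature * x**2 and return |x| at the end."""
    x = x0
    for _ in range(steps):
        x -= step_size * curvature * x  # the gradient of L is curvature * x
    return abs(x)

step_size = 0.1
threshold = 2.0 / step_size  # stability boundary for this quadratic

below = run_gd(curvature=0.9 * threshold, step_size=step_size)  # contracts toward 0
above = run_gd(curvature=1.1 * threshold, step_size=step_size)  # oscillates and grows
```

Each update multiplies `x` by `1 - step_size * curvature`, so crossing the threshold makes that factor exceed one in magnitude with a negative sign. The iterates then flip sign and grow, which is the basic instability behind the Edge of Stability regime.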

[LG-32] Autoregressive Speech Enhancement via Acoustic Tokens

Link: https://arxiv.org/abs/2507.12825
Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures

Abstract:In speech processing pipelines, improving the quality and intelligibility of real-world recordings is crucial. While supervised regression is the primary method for speech enhancement, audio tokenization is emerging as a promising alternative for a smooth integration with other modalities. However, research on speech enhancement using discrete representations is still limited. Previous work has mainly focused on semantic tokens, which tend to discard key acoustic details such as speaker identity. Additionally, these studies typically employ non-autoregressive models, assuming conditional independence of outputs and overlooking the potential improvements offered by autoregressive modeling. To address these gaps we: 1) conduct a comprehensive study of the performance of acoustic tokens for speech enhancement, including the effect of bitrate and noise strength; 2) introduce a novel transducer-based autoregressive architecture specifically designed for this task. Experiments on VoiceBank and Libri1Mix datasets show that acoustic tokens outperform semantic tokens in terms of preserving speaker identity, and that our autoregressive approach can further improve performance. Nevertheless, we observe that discrete representations still fall short compared to continuous ones, highlighting the need for further research in this area.

[LG-33] From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2507.12815
Authors: Gaurav Chaudhary, Laxmidhar Behera
Categories: Machine Learning (cs.LG)

Abstract:Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network’s embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.
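
The two-step recipe in the abstract, fitting a predictor to a frozen target network on expert transitions and then scoring every transition by its prediction error, can be sketched as follows. This is a toy illustration, not the ReLOAD implementation. The hard-coded `TARGET_W` weights, the linear predictor, the 2-D transition features, and the sign convention (reward taken as the negative error) are all assumptions.

```python
import math
import random

# Frozen target network: fixed weights standing in for a randomly initialised,
# never-trained network (hypothetical values).
TARGET_W = [[1.0, -0.5], [0.8, 0.6], [-0.7, 1.2], [0.3, 0.9]]

def target_embed(x):
    return [math.tanh(w[0] * x[0] + w[1] * x[1]) for w in TARGET_W]

def predict(params, x):
    return [w[0] * x[0] + w[1] * x[1] + w[2] for w in params]

def prediction_error(params, x):
    """Squared distance between predictor output and frozen target embedding."""
    return sum((p - t) ** 2 for p, t in zip(predict(params, x), target_embed(x)))

def train_predictor(expert_inputs, epochs=50, lr=0.05):
    """Fit a linear predictor to the target embeddings with plain SGD."""
    params = [[0.0, 0.0, 0.0] for _ in TARGET_W]
    for _ in range(epochs):
        for x in expert_inputs:
            target = target_embed(x)
            pred = predict(params, x)
            for w, p, t in zip(params, pred, target):
                grad = 2 * (p - t)  # derivative of the squared error w.r.t. the output
                w[0] -= lr * grad * x[0]
                w[1] -= lr * grad * x[1]
                w[2] -= lr * grad
    return params

rng = random.Random(0)
expert = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(200)]
non_expert = [[rng.gauss(3.0, 0.3), rng.gauss(3.0, 0.3)] for _ in range(50)]

params = train_predictor(expert)
expert_err = sum(prediction_error(params, x) for x in expert) / len(expert)
non_expert_err = sum(prediction_error(params, x) for x in non_expert) / len(non_expert)

def reward(x):
    """Assumed convention: a lower prediction error gives a higher reward."""
    return -prediction_error(params, x)
```

Because the predictor only ever sees expert-like inputs, its error stays small near the expert distribution and grows away from it. That gap is what lets the error act as a reward signal without handcrafted annotations.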

[LG-34] RONOM: Reduced-Order Neural Operator Modeling

Link: https://arxiv.org/abs/2507.12814
Authors: Sven Dummer, Dongwei Ye, Christoph Brune
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)

Abstract:Time-dependent partial differential equations are ubiquitous in physics-based modeling, but they remain computationally intensive in many-query scenarios, such as real-time forecasting, optimal control, and uncertainty quantification. Reduced-order modeling (ROM) addresses these challenges by constructing a low-dimensional surrogate model but relies on a fixed discretization, which limits flexibility across varying meshes during evaluation. Operator learning approaches, such as neural operators, offer an alternative by parameterizing mappings between infinite-dimensional function spaces, enabling adaptation to data across different resolutions. Whereas ROM provides rigorous numerical error estimates, neural operator learning largely focuses on discretization convergence and invariance without quantifying the error between the infinite-dimensional and the discretized operators. This work introduces the reduced-order neural operator modeling (RONOM) framework, which bridges concepts from ROM and operator learning. We establish a discretization error bound analogous to those in ROM, and get insights into RONOM’s discretization convergence and discretization robustness. Moreover, two numerical examples are presented that compare RONOM to existing neural operators for solving partial differential equations. The results demonstrate that RONOM using standard vector-to-vector neural networks achieves comparable performance in input generalization and superior performance in both spatial super-resolution and discretization robustness, while also offering novel insights into temporal super-resolution scenarios.

[LG-35] Multi-Channel Graph Neural Network for Financial Risk Prediction of NEEQ Enterprises

Link: https://arxiv.org/abs/2507.12787
Authors: Jianyu Zhu
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures. Submitted for conference review

Abstract:With the continuous evolution of China’s multi-level capital market, the National Equities Exchange and Quotations (NEEQ), also known as the “New Third Board,” has become a critical financing platform for small and medium-sized enterprises (SMEs). However, due to their limited scale and financial resilience, many NEEQ-listed companies face elevated risks of financial distress. To address this issue, we propose a multi-channel deep learning framework that integrates structured financial indicators, textual disclosures, and enterprise relationship data for comprehensive financial risk prediction. Specifically, we design a Triple-Channel Graph Isomorphism Network (GIN) that processes numeric, textual, and graph-based inputs separately. These modality-specific representations are fused using an attention-based mechanism followed by a gating unit to enhance robustness and prediction accuracy. Experimental results on data from 7,731 real-world NEEQ companies demonstrate that our model significantly outperforms traditional machine learning methods and single-modality baselines in terms of AUC, Precision, Recall, and F1 Score. This work provides theoretical and practical insights into risk modeling for SMEs and offers a data-driven tool to support financial regulators and investors.

[LG-36] Sample-Constrained Black Box Optimization for Audio Personalization AAAI2024

Link: https://arxiv.org/abs/2507.12773
Authors: Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Published in AAAI 2024

Abstract:We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter $h^*$, which applied to any music or speech, will maximize the user's satisfaction. This is a black-box optimization problem since the user's satisfaction function is unknown. Substantive work has been done on this topic where the key idea is to play audio samples to the user, each shaped by a different filter $h_i$, and query the user for their satisfaction scores $f(h_i)$. A family of "surrogate" functions is then designed to fit these scores and the optimization method gradually refines these functions to arrive at the filter $\hat{h}^*$ that maximizes satisfaction. In certain applications, we observe that a second type of querying is possible where users can tell us the individual elements $h^*[j]$ of the optimal filter $h^*$. Consider an analogy from cooking where the goal is to cook a recipe that maximizes user satisfaction. A user can be asked to score various cooked recipes (e.g., tofu fried rice) or to score individual ingredients (say, salt, sugar, rice, chicken, etc.). Given a budget of $B$ queries, where a query can be of either type, our goal is to find the recipe that will maximize this user's satisfaction. Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization and solutions can benefit other applications beyond audio personalization.

[LG-37] Layer Separation Deep Learning Model with Auxiliary Variables for Partial Differential Equations

Link: https://arxiv.org/abs/2507.12766
Authors: Yaru Liu, Yiqi Gu
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we propose a new optimization framework, the layer separation (LySep) model, to improve the deep learning-based methods in solving partial differential equations. Due to the highly non-convex nature of the loss function in deep learning, existing optimization algorithms often converge to suboptimal local minima or suffer from gradient explosion or vanishing, resulting in poor performance. To address these issues, we introduce auxiliary variables to separate the layers of deep neural networks. Specifically, the output and its derivatives of each layer are represented by auxiliary variables, effectively decomposing the deep architecture into a series of shallow architectures. New loss functions with auxiliary variables are established, in which only variables from two neighboring layers are coupled. Corresponding algorithms based on alternating directions are developed, where many variables can be updated optimally in closed forms. Moreover, we provide theoretical analyses demonstrating the consistency between the LySep model and the original deep model. High-dimensional numerical results validate our theory and demonstrate the advantages of LySep in minimizing loss and reducing solution error.

[LG-38] From SGD to Spectra: A Theory of Neural Network Weight Dynamics

Link: https://arxiv.org/abs/2507.12709
Authors: Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula
Categories: Machine Learning (cs.LG)

Abstract:Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear. We develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed "bulk+tail" spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.

[LG-39] PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform RECSYS2025

Link: https://arxiv.org/abs/2507.12704
Authors: Xiangyi Chen, Kousik Rajesh, Matthew Lawhon, Zelun Wang, Hanyu Li, Haomiao Li, Saurabh Vishwas Joshi, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, Jiajing Xu, Charles Rosenberg
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: RecSys 2025

Abstract:User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.

[LG-40] Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective

Link: https://arxiv.org/abs/2507.12652
Authors: Kai Malcolm, César Uribe, Momona Yamagami
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: 23 pages, 7 figures

Abstract:Invasive and non-invasive neural interfaces hold promise as high-bandwidth input devices for next-generation technologies. However, neural signals inherently encode sensitive information about an individual’s identity and health, making data sharing for decoder training a critical privacy challenge. Federated learning (FL), a distributed, privacy-preserving learning framework, presents a promising solution, but it remains unexplored in closed-loop adaptive neural interfaces. Here, we introduce FL-based neural decoding and systematically evaluate its performance and privacy using high-dimensional electromyography signals in both open- and closed-loop scenarios. In open-loop simulations, FL significantly outperformed local learning baselines, demonstrating its potential for high-performance, privacy-conscious neural decoding. In contrast, closed-loop user studies required adapting FL methods to accommodate single-user, real-time interactions, a scenario not supported by standard FL. This modification resulted in local learning decoders surpassing the adapted FL approach in closed-loop performance, yet local learning still carried higher privacy risks. Our findings highlight a critical performance-privacy tradeoff in real-time adaptive applications and indicate the need for FL methods specifically designed for co-adaptive, single-user applications.

[LG-41] Reasoning-Finetuning Repurposes Latent Representations in Base Models ICML2025

Link: https://arxiv.org/abs/2507.12638
Authors: Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 6 figures. ICML 2025 Workshop on Actionable Interpretability

Abstract:Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models’ enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B’s residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.

[LG-42] Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?

Link: https://arxiv.org/abs/2507.12604
Authors: Antoni Zajko, Katarzyna Woźnica
Categories: Machine Learning (cs.LG)

Abstract:Effectively representing heterogeneous tabular datasets for meta-learning purposes is still an open problem. Previous approaches rely on representations that are intended to be universal. This paper proposes two novel methods for tabular representation learning tailored to a specific meta-task - warm-starting Bayesian Hyperparameter Optimization. Both follow the specific requirement formulated by ourselves that enforces representations to capture the properties of landmarkers. The first approach involves deep metric learning, while the second one is based on landmarkers reconstruction. We evaluate the proposed encoders in two ways. Next to the gain in the target meta-task, we also use the degree of fulfillment of the proposed requirement as the evaluation metric. Experiments demonstrate that while the proposed encoders can effectively learn representations aligned with landmarkers, they may not directly translate to significant performance gains in the meta-task of HPO warm-starting.

[LG-43] Second-Order Bounds for [0,1]-Valued Regression via Betting Loss

Link: https://arxiv.org/abs/2507.12584
Authors: Yinan Li, Kwang-Sung Jun
Categories: Machine Learning (cs.LG)

Abstract:We consider the [0,1]-valued regression problem in the i.i.d. setting. In a related problem called cost-sensitive classification, \citet{foster21efficient} have shown that the log loss minimizer achieves an improved generalization bound compared to that of the squared loss minimizer in the sense that the bound scales with the cost of the best classifier, which can be arbitrarily small depending on the problem at hand. Such a result is often called a first-order bound. For [0,1]-valued regression, we first show that the log loss minimizer leads to a similar first-order bound. We then ask if there exists a loss function that achieves a variance-dependent bound (also known as a second-order bound), which is a strict improvement upon first-order bounds. We answer this question in the affirmative by proposing a novel loss function called the betting loss. Our result is “variance-adaptive” in the sense that the bound is attained without any knowledge about the variance, which is in contrast to modeling label (or reward) variance or the label distribution itself explicitly as part of the function class such as distributional reinforcement learning.

[LG-44] Ranking Vectors Clustering: Theory and Applications

Link: https://arxiv.org/abs/2507.12583
Authors: Ali Fattahi, Ali Eshragh, Babak Aslani, Meysam Rabiee
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Applications (stat.AP); Methodology (stat.ME)

Abstract:We study the problem of clustering ranking vectors, where each vector represents preferences as an ordered list of distinct integers. Specifically, we focus on the k-centroids ranking vectors clustering problem (KRC), which aims to partition a set of ranking vectors into k clusters and identify the centroid of each cluster. Unlike classical k-means clustering (KMC), KRC constrains both the observations and centroids to be ranking vectors. We establish the NP-hardness of KRC and characterize its feasible set. For the single-cluster case, we derive a closed-form analytical solution for the optimal centroid, which can be computed in linear time. To address the computational challenges of KRC, we develop an efficient approximation algorithm, KRCA, which iteratively refines initial solutions from KMC, referred to as the baseline solution. Additionally, we introduce a branch-and-bound (BnB) algorithm for efficient cluster reconstruction within KRCA, leveraging a decision tree framework to reduce computational time while incorporating a controlling parameter to balance solution quality and efficiency. We establish theoretical error bounds for KRCA and BnB. Through extensive numerical experiments on synthetic and real-world datasets, we demonstrate that KRCA consistently outperforms baseline solutions, delivering significant improvements in solution quality with fast computational times. This work highlights the practical significance of KRC for personalization and large-scale decision making, offering methodological advancements and insights that can be built upon in future studies.

[LG-45] Deep Bilinear Koopman Model for Real-Time Vehicle Control in Frenet Frame

Link: https://arxiv.org/abs/2507.12578
Authors: Mohammad Abtahi, Farhang Motallebi Araghi, Navid Mojahed, Shima Nazari
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 8 figures. This manuscript is under review with IEEE Transactions on Intelligent Vehicles

Abstract:Accurate modeling and control of autonomous vehicles remain a fundamental challenge due to the nonlinear and coupled nature of vehicle dynamics. While Koopman operator theory offers a framework for deploying powerful linear control techniques, learning a finite-dimensional invariant subspace for high-fidelity modeling continues to be an open problem. This paper presents a deep Koopman approach for modeling and control of vehicle dynamics within the curvilinear Frenet frame. The proposed framework uses a deep neural network architecture to simultaneously learn the Koopman operator and its associated invariant subspace from the data. Input-state bilinear interactions are captured by the algorithm while preserving convexity, which makes it suitable for real-time model predictive control (MPC) application. A multi-step prediction loss is utilized during training to ensure long-horizon prediction capability. To further enhance real-time trajectory tracking performance, the model is integrated with a cumulative error regulator (CER) module, which compensates for model mismatch by mitigating accumulated prediction errors. Closed-loop performance is evaluated through hardware-in-the-loop (HIL) experiments using a CarSim RT model as the target plant, with real-time validation conducted on a dSPACE SCALEXIO system. The proposed controller achieved significant reductions in tracking error relative to baseline controllers, confirming its suitability for real-time implementation in embedded autonomous vehicle systems.

[LG-46] IncA-DES: An incremental and adaptive dynamic ensemble selection approach using online K-d tree neighborhood search for data streams with concept drift

Link: https://arxiv.org/abs/2507.12573
Authors: Eduardo V. L. Barboza, Paulo R. Lisboa de Almeida, Alceu de Souza Britto Jr., Robert Sabourin, Rafael M. O. Cruz
Categories: Machine Learning (cs.LG)
Comments: Preprint of article published in Information Fusion

Abstract:Data streams pose challenges not usually encountered in batch-based ML. One of them is concept drift, which is characterized by a change in data distribution over time. Among the many approaches explored in the literature, the fusion of classifiers has shown good results and is receiving growing attention. Dynamic selection (DS) methods, because the ensemble is instance-based, seem to be an efficient choice under drifting scenarios. However, some attention must be paid to adapting such methods for concept drift. Training must be designed to create local experts, and the commonly used neighborhood-search DS may become prohibitive with the continuous arrival of data. In this work, we propose IncA-DES, which employs a training strategy that promotes the generation of local experts under the assumption that different regions of the feature space become available with time. Additionally, the fusion of a concept drift detector supports the maintenance of information and adaptation to a new concept. An overlap-based classification filter is also employed in order to avoid using the DS method when there is a consensus in the neighborhood, a strategy that we argue every DS method should employ, as it was shown to make them more applicable and quicker. Moreover, aiming to reduce the processing time of the kNN, we propose an Online K-d tree algorithm, which can quickly remove instances without becoming inconsistent and deals with unbalancing concerns that may occur in data streams. Experimental results showed that the proposed framework achieved the best average accuracy compared to seven state-of-the-art methods across different levels of label availability and had the smallest processing time among the most accurate methods. Additionally, the fusion with the Online K-d tree improved processing time with a negligible loss in accuracy. We have made our framework available in an online repository.

[LG-47] Evaluation of Neural Surrogates for Physical Modelling Synthesis of Nonlinear Elastic Plates

Link: https://arxiv.org/abs/2507.12563
Authors: Carlos De La Vega Martin, Rodrigo Diaz Fernandez, Mark Sandler
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Physical modelling synthesis aims to generate audio from physical simulations of vibrating structures. Thin elastic plates are a common model for drum membranes. Traditional numerical methods like finite differences and finite elements offer high accuracy but are computationally demanding, limiting their use in real-time audio applications. This paper presents a comparative analysis of neural network-based approaches for solving the vibration of nonlinear elastic plates. We evaluate several state-of-the-art models, trained on short sequences, for prediction of long sequences in an autoregressive fashion. We show some of the limitations of these models, and why it is not enough to look at the prediction error in the time domain. We discuss the implications for real-time audio synthesis and propose future directions for improving neural approaches to model nonlinear vibration.

[LG-48] Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases

Link: https://arxiv.org/abs/2507.12562
Authors: Md. Tanvir Alam, Md. Ahasanul Alam, Md Mahmudur Rahman, Md. Mosaddek Khan
Categories: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Relational databases (RDBs) are ubiquitous in enterprise and real-world applications. Flattening the database poses challenges for deep learning models that rely on fixed-size input representations to capture relational semantics from the structured nature of relational data. Graph neural networks (GNNs) have been proposed to address this, but they often oversimplify relational structures by modeling all the tuples as monolithic nodes and ignoring intra-tuple associations. In this work, we propose a novel hypergraph-based framework, that we call rel-HNN, which models each unique attribute-value pair as a node and each tuple as a hyperedge, enabling the capture of fine-grained intra-tuple relationships. Our approach learns explicit multi-level representations across attribute-value, tuple, and table levels. To address the scalability challenges posed by large RDBs, we further introduce a split-parallel training algorithm that leverages multi-GPU execution for efficient hypergraph learning. Extensive experiments on real-world and benchmark datasets demonstrate that rel-HNN significantly outperforms existing methods in both classification and regression tasks. Moreover, our split-parallel training achieves substantial speedups – up to 3.18x for learning on relational data and up to 2.94x for hypergraph learning – compared to conventional single-GPU execution.

[LG-49] The Serial Scaling Hypothesis

Link: https://arxiv.org/abs/2507.12549
Authors: Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Comments: 28 pages (13 pages main text + appendices & references), 8 figures, equal-contribution first authors

Abstract:While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These “inherently serial” problems (from mathematical reasoning to physical simulations to sequential decision-making) require dependent computational steps that cannot be parallelized. Drawing from complexity theory, we formalize this distinction and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. We argue that recognizing the serial nature of computation holds profound implications for machine learning, model design, and hardware development. As AI tackles increasingly complex reasoning, deliberately scaling serial computation, not just parallel computation, is essential for continued progress.

[LG-50] ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving ICCV2025

Link: https://arxiv.org/abs/2507.12499
Authors: Yuhang Lu, Jiadong Tu, Yuexin Ma, Xinge Zhu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted by ICCV2025

Abstract:End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: this http URL_page/realad

[LG-51] Optimal Empirical Risk Minimization under Temporal Distribution Shifts

Link: https://arxiv.org/abs/2507.13287
Authors: Yujin Jeong, Ramesh Johari, Dominik Rothenhäusler, Emily Fox
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:Temporal distribution shifts pose a key challenge for machine learning models trained and deployed in dynamically evolving environments. This paper introduces RIDER (RIsk minimization under Dynamically Evolving Regimes) which derives optimally-weighted empirical risk minimization procedures under temporal distribution shifts. Our approach is theoretically grounded in the random distribution shift model, where random shifts arise as a superposition of numerous unpredictable changes in the data-generating process. We show that common weighting schemes, such as pooling all data, exponentially weighting data, and using only the most recent data, emerge naturally as special cases in our framework. We demonstrate that RIDER consistently improves out-of-sample predictive performance when applied as a fine-tuning step on the Yearbook dataset, across a range of benchmark methods in Wild-Time. Moreover, we show that RIDER outperforms standard weighting strategies in two other real-world tasks: predicting stock market volatility and forecasting ride durations in NYC taxi data.
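
The three weighting schemes the abstract names as special cases (pooling all data, exponential weighting, and using only the most recent data) can be sketched as weight vectors over time periods. This is only an illustration of those baselines, not RIDER's optimal weights, which the abstract does not specify. The function names, the `decay` and `window` parameters, and the example losses are all hypothetical.

```python
def pooled_weights(n_periods):
    """Pool all data: every time period gets the same weight."""
    return [1.0 / n_periods] * n_periods


def exponential_weights(n_periods, decay):
    """Geometric down-weighting of older periods (index n_periods - 1 is newest).

    `decay` in (0, 1] is an assumed tuning parameter, not taken from the paper.
    """
    raw = [decay ** (n_periods - 1 - t) for t in range(n_periods)]
    total = sum(raw)
    return [w / total for w in raw]


def recent_only_weights(n_periods, window=1):
    """Use only the `window` most recent periods, weighted equally."""
    raw = [1.0 if t >= n_periods - window else 0.0 for t in range(n_periods)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_risk(period_losses, weights):
    """Weighted empirical risk: a weighted sum of per-period average losses."""
    return sum(w * loss for w, loss in zip(weights, period_losses))


if __name__ == "__main__":
    # Hypothetical average losses of one candidate model on three periods,
    # ordered from oldest to newest.
    losses = [0.9, 0.5, 0.2]
    for name, w in [
        ("pooled", pooled_weights(3)),
        ("exponential", exponential_weights(3, decay=0.5)),
        ("recent only", recent_only_weights(3)),
    ]:
        print(name, [round(x, 3) for x in w], round(weighted_risk(losses, w), 3))
```

Minimizing `weighted_risk` over candidate models with a given weight vector is weighted empirical risk minimization; the schemes differ only in how much they trust older periods.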

[LG-52] Stochastic Weakly Convex Optimization Under Heavy-Tailed Noises

Link: https://arxiv.org/abs/2507.13283
Authors: Tianxi Zhu, Yi Xu, Xiangyang Ji
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:An increasing number of studies have focused on stochastic first-order methods (SFOMs) under heavy-tailed gradient noises, which have been observed in the training of practical deep learning models. In this paper, we focus on two types of gradient noises: one is sub-Weibull noise, and the other is noise under the assumption that it has a bounded $p$-th central moment ($p$-BCM) with $p \in (1, 2]$. The latter is more challenging due to the occurrence of infinite variance when $p \in (1, 2)$. Under these two gradient noise assumptions, the in-expectation and high-probability convergence of SFOMs have been extensively studied in the contexts of convex optimization and standard smooth optimization. However, for weakly convex objectives (a class that includes all Lipschitz-continuous convex objectives and smooth objectives), our understanding of the in-expectation and high-probability convergence of SFOMs under these two types of noises remains incomplete. We investigate the high-probability convergence of the vanilla stochastic subgradient descent (SsGD) method under sub-Weibull noises, as well as the high-probability and in-expectation convergence of clipped SsGD under the $p$-BCM noises. Both analyses are conducted in the context of weakly convex optimization. For weakly convex objectives that may be non-convex and non-smooth, our results demonstrate that the theoretical dependence of vanilla SsGD on the failure probability and number of iterations under sub-Weibull noises does not degrade compared to the case of smooth objectives. Under $p$-BCM noises, our findings indicate that the non-smoothness and non-convexity of weakly convex objectives do not impact the theoretical dependence of clipped SGD on the failure probability relative to the smooth case; however, the sample complexity we derived is worse than a well-known lower bound for smooth optimization.

[LG-53] The carbon cost of materials discovery: Can machine learning really accelerate the discovery of new photovoltaics?

Link: https://arxiv.org/abs/2507.13246
Authors: Matthew Walker, Keith T. Butler
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Computational screening has become a powerful complement to experimental efforts in the discovery of high-performance photovoltaic (PV) materials. Most workflows rely on density functional theory (DFT) to estimate electronic and optical properties relevant to solar energy conversion. Although more efficient than laboratory-based methods, DFT calculations still entail substantial computational and environmental costs. Machine learning (ML) models have recently gained attention as surrogates for DFT, offering drastic reductions in resource use with competitive predictive performance. In this study, we reproduce a canonical DFT-based workflow to estimate the maximum efficiency limit and progressively replace its components with ML surrogates. By quantifying the CO$_2$ emissions associated with each computational strategy, we evaluate the trade-offs between predictive efficacy and environmental cost. Our results reveal multiple hybrid ML/DFT strategies that optimize different points along the accuracy–emissions front. We find that direct prediction of scalar quantities, such as maximum efficiency, is significantly more tractable than using predicted absorption spectra as an intermediate step. Interestingly, ML models trained on DFT data can outperform DFT workflows using alternative exchange–correlation functionals in screening applications, highlighting the consistency and utility of data-driven approaches. We also assess strategies to improve ML-driven screening through expanded datasets and improved model architectures tailored to PV-relevant features. This work provides a quantitative framework for building low-emission, high-throughput discovery pipelines.

[LG-54] Relation-Aware Slicing in Cross-Domain Alignment

Link: https://arxiv.org/abs/2507.13194
Authors: Dhruv Sarkar, Aprameyo Chakrabartty, Anish Chakrabarty, Swagatam Das
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The Sliced Gromov-Wasserstein (SGW) distance, aiming to relieve the computational cost of solving a non-convex quadratic program that is the Gromov-Wasserstein distance, utilizes projecting directions sampled uniformly from unit hyperspheres. This slicing mechanism incurs unnecessary computational costs due to uninformative directions, which also affects the representative power of the distance. However, finding a more appropriate distribution over the projecting directions (slicing distribution) is often an optimization problem in itself that comes with its own computational cost. In addition, with more intricate distributions, the sampling itself may be expensive. As a remedy, we propose an optimization-free slicing distribution that provides fast sampling for the Monte Carlo approximation. We do so by introducing the Relation-Aware Projecting Direction (RAPD), effectively capturing the pairwise association of each of two pairs of random vectors, each following their ambient law. This enables us to derive the Relation-Aware Slicing Distribution (RASD), a location-scale law corresponding to sampled RAPDs. Finally, we introduce the RASGW distance and its variants, e.g., IWRASGW (Importance Weighted RASGW), which overcome the shortcomings experienced by SGW. We theoretically analyze its properties and substantiate its empirical prowess using extensive experiments on various alignment tasks.
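
For context, the baseline slicing step that the abstract describes (projecting directions sampled uniformly from the unit hypersphere) can be sketched as below. This illustrates only the uniform sampling and 1D projection used by plain slicing, under the standard fact that a normalized Gaussian vector is uniform on the sphere. It does not implement RAPD, RASD, or the Gromov-Wasserstein cost, and all names are hypothetical.

```python
import math
import random


def sample_unit_direction(dim, rng):
    """Draw a direction uniformly on the unit hypersphere in `dim` dimensions.

    A vector of i.i.d. standard normals is rotation invariant, so dividing it
    by its Euclidean norm gives a uniform point on the sphere.
    """
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # a zero vector is essentially impossible; retry if it occurs
            return [x / norm for x in v]


def project(points, direction):
    """Project each point onto the direction, giving one scalar per point."""
    return [sum(p * d for p, d in zip(point, direction)) for point in points]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    points = [[0.0, 1.0, 2.0], [1.0, 0.0, -1.0], [2.0, 2.0, 0.0]]
    for _ in range(3):
        theta = sample_unit_direction(3, rng)
        # Sorted 1D projections are what a sliced distance compares across domains.
        print(sorted(round(x, 3) for x in project(points, theta)))
```

Many of these uniformly drawn directions may carry little information about the data, which is the inefficiency the abstract attributes to uniform slicing.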

[LG-55] Search for Z/2 eigenfunctions on the sphere using machine learning

Link: https://arxiv.org/abs/2507.13122
Authors: Andriy Haydys, Willem Adriaan Salm
Categories: Differential Geometry (math.DG); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 14 pages, 12 pictures

Abstract:We use machine learning to search for examples of Z/2 eigenfunctions on the 2-sphere. For this we created a multivalued version of a feedforward deep neural network, and we implemented it using the JAX library. We found Z/2 eigenfunctions for three cases: in the first two cases we fixed the branch points at the vertices of a tetrahedron and of a cube, respectively. In the third case, we allowed the AI to move the branch points around and, in the end, it positioned the branch points at the vertices of a squashed tetrahedron.

[LG-56] Unsupervised Ground Metric Learning

Link: https://arxiv.org/abs/2507.13094
Authors: Janis Auffenberg, Jonas Bresch, Oleh Melnyk, Gabriele Steidl
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 10 figures, 1 table

Abstract:Data classification without access to labeled samples remains a challenging problem. It usually depends on an appropriately chosen distance between features, a topic addressed in metric learning. Recently, Huizing, Cantini and Peyré proposed to simultaneously learn optimal transport (OT) cost matrices between samples and features of the dataset. This leads to the task of finding positive eigenvectors of a certain nonlinear function that maps cost matrices to OT distances. With this basic idea in mind, we consider both the algorithmic and the modeling part of unsupervised metric learning. First, we examine appropriate algorithms and their convergence. In particular, we propose to use the stochastic random function iteration algorithm and prove that it converges linearly in our setting, although our operators are not paracontractive, as was required for convergence so far. Second, we ask the natural question of whether the OT distance can be replaced by other distances. We show how Mahalanobis-like distances fit into our considerations. Further, we examine an approach via graph Laplacians. In contrast to the previous settings, here we only have to deal with functions that are linear in the sought matrices, so that simple algorithms from linear algebra can be applied.

[LG-57] (Exhaustive) Symbolic Regression and model selection by minimum description length

Link: https://arxiv.org/abs/2507.13033
Authors: Harry Desmond
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures; Invited review for the Royal Society Philosophical Transactions A special issue “Symbolic regression in the physical sciences”

Abstract:Symbolic regression is the machine learning method for learning functions from data. After a brief overview of the symbolic regression landscape, I will describe the two main challenges that traditional algorithms face: they have an unknown (and likely significant) probability of failing to find any given good function, and they suffer from ambiguity and poorly-justified assumptions in their function-selection procedure. To address these I propose an exhaustive search and model selection by the minimum description length principle, which allows accuracy and complexity to be directly traded off by measuring each in units of information. I showcase the resulting publicly available Exhaustive Symbolic Regression algorithm on three open problems in astrophysics: the expansion history of the universe, the effective behaviour of gravity in galaxies and the potential of the inflaton field. In each case the algorithm identifies many functions superior to the literature standards. This general purpose methodology should find widespread utility in science and beyond.

[LG-58] When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values

Link: https://arxiv.org/abs/2507.13024
Authors: Christophe Muller (PREMEDICAL), Erwan Scornet (LPSM), Julie Josse (PREMEDICAL)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Predicting a response with partially missing inputs remains a challenging task even in parametric models, since parameter estimation in itself is not sufficient to predict on partially observed inputs. Several works study prediction in linear models. In this paper, we focus on logistic models, which present their own difficulties. From a theoretical perspective, we prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities in various missing data scenarios (MCAR, MAR and MNAR). Empirically, we thoroughly compare various methods (constant and iterative imputations, complete case analysis, PbP, and an EM algorithm) across classification, probability estimation, calibration, and parameter inference. Our analysis provides a comprehensive view on the logistic regression with missing values. It reveals that mean imputation can be used as baseline for low sample sizes, and improved performance is obtained via nonlinear multiple iterative imputation techniques with the labels (MICE.RF.Y). For large sample sizes, PbP is the best method for Gaussian mixtures, and we recommend MICE.RF.Y in presence of nonlinear features.
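
The Pattern-by-Pattern idea described above (one logistic model per missingness pattern, each fitted on the features observed under that pattern) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: missing values are encoded as `None`, the optimizer is plain gradient descent with an arbitrary learning rate and step count, and all function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def missingness_pattern(row):
    """Tuple of booleans, True where a feature is missing (encoded as None)."""
    return tuple(value is None for value in row)


def observed_values(row):
    """Keep only the features that are present in this row."""
    return [value for value in row if value is not None]


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Gradient descent on the mean log loss; returns [bias, w_1, ..., w_d]."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def fit_pattern_by_pattern(rows, labels):
    """Group rows by missingness pattern and fit one logistic model per group."""
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        X, y = groups[missingness_pattern(row)]
        X.append(observed_values(row))
        y.append(label)
    return {pattern: fit_logistic(X, y) for pattern, (X, y) in groups.items()}


def predict_proba(models, row):
    """Predict with the model trained on this row's own missingness pattern."""
    w = models[missingness_pattern(row)]
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], observed_values(row)))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Toy data: the first feature is missing in half of the rows.
    rows = [[-2.0, 1.0], [-1.0, 0.5], [1.0, -0.5], [2.0, 1.0],
            [None, -1.0], [None, -2.0], [None, 1.0], [None, 2.0]]
    labels = [0, 0, 1, 1, 0, 0, 1, 1]
    models = fit_pattern_by_pattern(rows, labels)
    print(sorted(models))                       # the two patterns seen in training
    print(predict_proba(models, [None, -1.5]))  # uses the "first feature missing" model
```

A pattern never seen during training has no model in this sketch, and the number of possible patterns grows exponentially with the number of features, so each per-pattern model may see few samples.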

[LG-59] Investigating Forecasting Models for Pandemic Infections Using Heterogeneous Data Sources: A 2-year Study with COVID-19

Link: https://arxiv.org/abs/2507.12966
Authors: Zacharias Komodromos, Kleanthis Malialis, Panayiotis Kolios
Categories: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
Comments: Keywords: epidemiology, pandemic forecasting, COVID-19, infections, machine learning. Accepted: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2025

Abstract:Emerging in December 2019, the COVID-19 pandemic caused widespread health, economic, and social disruptions. Rapid global transmission overwhelmed healthcare systems, resulting in high infection rates, hospitalisations, and fatalities. To minimise the spread, governments implemented several non-pharmaceutical interventions like lockdowns and travel restrictions. While effective in controlling transmission, these measures also posed significant economic and societal challenges. Although the WHO declared COVID-19 no longer a global health emergency in May 2023, its impact persists, shaping public health strategies. The vast amount of data collected during the pandemic offers valuable insights into disease dynamics, transmission, and intervention effectiveness. Leveraging these insights can improve forecasting models, enhancing preparedness and response to future outbreaks while mitigating their social and economic impact. This paper presents a large-scale case study on COVID-19 forecasting in Cyprus, utilising a two-year dataset that integrates epidemiological data, vaccination records, policy measures, and weather conditions. We analyse infection trends, assess forecasting performance, and examine the influence of external factors on disease dynamics. The insights gained contribute to improved pandemic preparedness and response strategies.

[LG-60] Bayesian Modeling and Estimation of Linear Time-Variant Systems using Neural Networks and Gaussian Processes

Link: https://arxiv.org/abs/2507.12878
Authors: Yaniv Shulman
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The identification of Linear Time-Variant (LTV) systems from input-output data is a fundamental yet challenging ill-posed inverse problem. This work introduces a unified Bayesian framework that models the system’s impulse response, $h(t, \tau)$, as a stochastic process. We decompose the response into a posterior mean and a random fluctuation term, a formulation that provides a principled approach for quantifying uncertainty and naturally defines a new, useful system class we term Linear Time-Invariant in Expectation (LTIE). To perform inference, we leverage modern machine learning techniques, including Bayesian neural networks and Gaussian Processes, using scalable variational inference. We demonstrate through a series of experiments that our framework can robustly infer the properties of an LTI system from a single noisy observation, show superior data efficiency compared to classical methods in a simulated ambient noise tomography problem, and successfully track a continuously varying LTV impulse response by using a structured Gaussian Process prior. This work provides a flexible and robust methodology for uncertainty-aware system identification in dynamic environments.

[LG-61] Self Balancing Neural Network: A Novel Method to Estimate Average Treatment Effect

Link: https://arxiv.org/abs/2507.12818
Authors: Atomsa Gemechu Abdisa, Yingchun Zhou, Yuqi Qiu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In observational studies, confounding variables affect both treatment and outcome. Moreover, instrumental variables also influence the treatment assignment mechanism. This situation sets the study apart from a standard randomized controlled trial, where the treatment assignment is random. Due to this situation, the estimated average treatment effect becomes biased. To address this issue, a standard approach is to incorporate the estimated propensity score when estimating the average treatment effect. However, these methods incur the risk of misspecification in propensity score models. To solve this issue, a novel method called the “Self balancing neural network” (Sbnet), which lets the model itself obtain its pseudo propensity score from the balancing net, is proposed in this study. The proposed method estimates the average treatment effect by using the balancing net as a key part of the feedforward neural network. This formulation resolves the estimation of the average treatment effect in one step. Moreover, the multi-pseudo propensity score framework, which is estimated from the diversified balancing net and used for the estimation of the average treatment effect, is presented. Finally, the proposed methods are compared with state-of-the-art methods on three simulation setups and real-world datasets. It has been shown that the proposed self-balancing neural network shows better performance than state-of-the-art methods.

[LG-62] Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights

Link: https://arxiv.org/abs/2507.12686
Authors: Krishnakumar Balasubramanian, Nathan Ross
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

Abstract:We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with randomly initialized weights that have finite-order moments. Specifically, we establish Gaussian approximation bounds in the Wasserstein-1 norm between the FDDs and their Gaussian limit, assuming a Lipschitz activation function and allowing the layer widths to grow to infinity at arbitrary relative rates. In the special case where all widths are proportional to a common scale parameter n and there are L-1 hidden layers, we obtain convergence rates of order n^{-(1/6)^{L-1} + \epsilon}, for any \epsilon > 0.

[LG-63] Physics constrained learning of stochastic characteristics

Link: https://arxiv.org/abs/2507.12661
Authors: Pardha Sai Krishna Ala, Ameya Salvi, Venkat Krovi, Matthias Schmid
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 6 pages, 6 figures

Abstract:Accurate state estimation requires careful consideration of uncertainty surrounding the process and measurement models; these characteristics are usually not well-known and need an experienced designer to select the covariance matrices. An error in the selection of covariance matrices could impact the accuracy of the estimation algorithm and may sometimes cause the filter to diverge. Identifying noise characteristics has long been a challenging problem due to uncertainty surrounding noise sources and difficulties in systematic noise modeling. Most existing approaches try identifying unknown covariance matrices through an optimization algorithm involving innovation sequences. In recent years, learning approaches have been utilized to determine the stochastic characteristics of process and measurement models. We present a learning-based methodology with different loss functions to identify noise characteristics and test these approaches’ performance for real-time vehicle state estimation.

[LG-64] Distributional Reinforcement Learning on Path-dependent Options

Link: https://arxiv.org/abs/2507.12657
Authors: Ahmet Umur Özsoy
Categories: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)

Abstract:We reinterpret and propose a framework for pricing path-dependent financial derivatives by estimating the full distribution of payoffs using Distributional Reinforcement Learning (DistRL). Unlike traditional methods that focus on expected option value, our approach models the entire conditional distribution of payoffs, allowing for risk-aware pricing, tail-risk estimation, and enhanced uncertainty quantification. We demonstrate the efficacy of this method on Asian options, using quantile-based value function approximators.
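
To make "the entire distribution of payoffs" concrete, here is a minimal sketch that is not the paper's DistRL method: it swaps in plain Monte Carlo simulation to sample discounted payoffs of an arithmetic-average Asian call, then reads off quantiles. The geometric Brownian motion model, all parameter values, and the helper names (`asian_call_payoffs`, `empirical_quantile`, `pinball_loss`) are assumptions chosen for illustration. The pinball loss is included because it is the standard objective that quantile-based approximators minimise.

```python
import math
import random


def asian_call_payoffs(s0, strike, rate, sigma, maturity, n_steps, n_paths, seed=0):
    """Discounted payoffs of an arithmetic-average Asian call under GBM (assumed model)."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    drift = (rate - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    discount = math.exp(-rate * maturity)
    payoffs = []
    for _path in range(n_paths):
        log_price, running_sum = math.log(s0), 0.0
        for _step in range(n_steps):
            log_price += drift + vol * rng.gauss(0.0, 1.0)
            running_sum += math.exp(log_price)
        average_price = running_sum / n_steps
        payoffs.append(discount * max(average_price - strike, 0.0))
    return payoffs


def empirical_quantile(samples, tau):
    """Order statistic at rank ceil(tau * n)."""
    ordered = sorted(samples)
    index = max(0, math.ceil(tau * len(ordered)) - 1)
    return ordered[index]


def pinball_loss(samples, estimate, tau):
    """Average quantile (pinball) loss of a candidate tau-quantile estimate."""
    total = 0.0
    for x in samples:
        diff = x - estimate
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(samples)


# Hypothetical contract and market parameters.
payoffs = asian_call_payoffs(s0=100.0, strike=100.0, rate=0.05, sigma=0.2,
                             maturity=1.0, n_steps=50, n_paths=5000)
taus = [0.1, 0.5, 0.9, 0.99]
quantiles = [empirical_quantile(payoffs, tau) for tau in taus]
price = sum(payoffs) / len(payoffs)  # the usual expectation-based price
```

The mean collapses the sample to a single price, while the quantile list keeps the tail information that risk-aware pricing relies on.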

[LG-65] A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

Link: https://arxiv.org/abs/2507.12645
Authors: Mohammed Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive patient assessment, especially in synchronous monitoring. Despite advances in multi-sensor fusion, a critical gap remains in developing unified architectures that effectively process and extract features from fundamentally different physiological signals. Another challenge is the inherent class imbalance in many biomedical datasets, often causing biased performance in traditional methods. This study addresses these issues by proposing a novel and unified deep learning framework that achieves state-of-the-art performance across different signal types. Our method integrates a ResNet-based CNN with an attention mechanism, enhanced by a novel data augmentation strategy: time-domain concatenation of multiple augmented variants of each signal to generate richer representations. Unlike prior work, we scientifically increase signal complexity to achieve future-reaching capabilities, which resulted in the best predictions compared to the state of the art. Preprocessing steps included wavelet denoising, baseline removal, and standardization. Class imbalance was effectively managed through the combined use of this advanced data augmentation and the Focal Loss function. Regularization techniques were applied during training to ensure generalization. We rigorously evaluated the proposed architecture on three benchmark datasets: UCI Seizure EEG, MIT-BIH Arrhythmia, and PTB Diagnostic ECG. It achieved accuracies of 99.96%, 99.78%, and 100%, respectively, demonstrating robustness across diverse signal types and clinical contexts. Finally, the architecture requires ~130 MB of memory and processes each sample in ~10 ms, suggesting suitability for deployment on low-end or wearable devices.
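
The augmentation idea in this abstract, concatenating several augmented variants of a signal along the time axis, can be sketched as follows. The abstract does not say which augmentations are used, so the three transforms here (`jitter`, `scale`, `circular_shift`), their parameters, and the choice to keep the original segment first are all assumptions for illustration, not the authors' pipeline.

```python
import math
import random


def jitter(signal, rng, noise_std=0.05):
    """Add small Gaussian noise to every sample (assumed augmentation)."""
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def scale(signal, rng, low=0.9, high=1.1):
    """Multiply the whole signal by one random gain (assumed augmentation)."""
    factor = rng.uniform(low, high)
    return [factor * x for x in signal]


def circular_shift(signal, rng, max_shift=10):
    """Rotate the signal by a random number of samples (assumed augmentation)."""
    shift = rng.randint(-max_shift, max_shift) % len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


def concatenate_augmented_variants(signal, augmentations, seed=0, keep_original=True):
    """Build one longer signal: [original | variant 1 | variant 2 | ...]."""
    rng = random.Random(seed)
    segments = [list(signal)] if keep_original else []
    segments += [augment(list(signal), rng) for augment in augmentations]
    return [x for segment in segments for x in segment]


# Toy input: 100 samples of a sine wave standing in for an ECG/EEG window.
signal = [math.sin(2.0 * math.pi * i / 25.0) for i in range(100)]
augmented = concatenate_augmented_variants(signal, [jitter, scale, circular_shift])
```

With three augmentations and the original kept, a 100-sample window becomes a 400-sample input, so a downstream model would need to accept the longer sequence.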

[LG-66] Complex non-backtracking matrix for directed graphs

Link: https://arxiv.org/abs/2507.12503
Authors: Keishi Sando, Hideitsu Hino
Categories: Combinatorics (math.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Graph representation matrices are essential tools in graph data analysis. Recently, Hermitian adjacency matrices have been proposed to investigate directed graph structures. Previous studies have demonstrated that these matrices can extract valuable information for clustering. In this paper, we propose the complex non-backtracking matrix that integrates the properties of the Hermitian adjacency matrix and the non-backtracking matrix. The proposed matrix has similar properties with the non-backtracking matrix of undirected graphs. We reveal relationships between the complex non-backtracking matrix and the Hermitian adjacency matrix. Also, we provide intriguing insights that this matrix representation holds cluster information, particularly for sparse directed graphs.

[LG-67] Differentially Private Conformal Prediction via Quantile Binary Search

Link: https://arxiv.org/abs/2507.12497
Authors: Ogonnaya M. Romanus, Roberto Molinari
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)

Abstract:Most Differentially Private (DP) approaches focus on limiting privacy leakage from learners based on the data that they are trained on; fewer approaches consider leakage when procedures involve a calibration dataset, which is common in uncertainty quantification methods such as Conformal Prediction (CP). Since there is a limited number of approaches in this direction, in this work we deliver a general DP approach for CP that we call Private Conformity via Quantile Search (P-COQS). The proposed approach adapts an existing randomized binary search algorithm for computing DP quantiles in the calibration phase of CP, thereby guaranteeing privacy of the consequent prediction sets. This, however, comes at the price of slightly under-covering with respect to the desired (1 - \alpha)-level when using finite-sample calibration sets (although broad empirical results show that P-COQS generally targets the required level in the considered cases). Confirming properties of the adapted algorithm and quantifying the approximate coverage guarantees of the consequent CP, we conduct extensive experiments to examine the effects of privacy noise, sample size and significance level on the performance of our approach compared to existing alternatives. In addition, we empirically evaluate our approach on several benchmark datasets, including CIFAR-10, ImageNet and CoronaHack. Our results suggest that the proposed method is robust to privacy noise and performs favorably with respect to the current DP alternative in terms of empirical coverage, efficiency, and informativeness. Specifically, the results indicate that P-COQS produces smaller conformal prediction sets while simultaneously targeting the desired coverage and privacy guarantees in all these experimental settings.

[LG-68] The Generalist Brain Module: Module Repetition in Neural Networks in Light of the Minicolumn Hypothesis

Link: https://arxiv.org/abs/2507.12473
Authors: Mia-Katrin Kvalsund, Mikkel Elle Lepperød
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:While modern AI continues to advance, the biological brain remains the pinnacle of neural networks in its robustness, adaptability, and efficiency. This review explores an AI architectural path inspired by the brain’s structure, particularly the minicolumn hypothesis, which views the neocortex as a distributed system of repeated modules - a structure we connect to collective intelligence (CI). Despite existing work, there is a lack of comprehensive reviews connecting the cortical column to the architectures of repeated neural modules. This review aims to fill that gap by synthesizing historical, theoretical, and methodological perspectives on neural module repetition. We distinguish between architectural repetition - reusing structure - and parameter-shared module repetition, where the same functional unit is repeated across a network. The latter exhibits key CI properties such as robustness, adaptability, and generalization. Evidence suggests that the repeated module tends to converge toward a generalist module: simple, flexible problem solvers capable of handling many roles in the ensemble. This generalist tendency may offer solutions to longstanding challenges in modern AI: improved energy efficiency during training through simplicity and scalability, and robust embodied control via generalization. While empirical results suggest such systems can generalize to out-of-distribution problems, theoretical results are still lacking. Overall, architectures featuring module repetition remain an emerging and unexplored architectural strategy, with significant untapped potential for both efficiency, robustness, and adaptiveness. We believe that a system that adopts the benefits of CI, while adhering to architectural and functional principles of the minicolumns, could challenge the modern AI problems of scalability, energy consumption, and democratization.

[LG-69] Refining Coarse-Grained Molecular Topologies: A Bayesian Optimization Approach

Link: https://arxiv.org/abs/2501.02707
Authors: Pranoy Ray, Adam P. Generale, Nikhith Vankireddy, Yuichiro Asoma, Masataka Nakauchi, Haein Lee, Katsuhisa Yoshida, Yoshishige Okuno, Surya R. Kalidindi
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Molecular Dynamics (MD) simulations are essential for accurately predicting the physical and chemical properties of large molecular systems across various pressure and temperature ensembles. However, the high computational costs associated with All-Atom (AA) MD simulations have led to the development of Coarse-Grained Molecular Dynamics (CGMD), providing a lower-dimensional compression of the AA structure into representative CG beads, offering reduced computational expense at the cost of predictive accuracy. Existing CGMD methods, such as CG-Martini (calibrated against experimental data), aim to generate an embedding of a topology that sufficiently generalizes across a range of structures. Detrimentally, in attempting to specify parameterization with applicability across molecular classes, it is unable to specialize to domain-specific applications, where sufficient accuracy and computational speed are critical. This work presents a novel approach to optimize derived results from CGMD simulations by refining the general-purpose Martini3 topologies - specifically the bonded interaction parameters within a given coarse-grained mapping - for domain-specific applications using Bayesian Optimization methodologies. We have developed and validated a CG potential applicable to any degree of polymerization, representing a significant advancement in the field. Our optimized CG potential, based on the Martini3 framework, aims to achieve accuracy comparable to AAMD while maintaining the computational efficiency of CGMD. This approach bridges the gap between efficiency and accuracy in multiscale molecular simulations, potentially enabling more rapid and cost-effective molecular discovery across various scientific and technological domains.

Information Retrieval

[IR-0] SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation RECSYS2025

Link: https://arxiv.org/abs/2507.13336
Authors: Weizhi Zhang, Liangwei Yang, Zihe Song, Henrry Peng Zou, Ke Xu, Yuanjie Zhu, Philip S. Yu
Categories: Information Retrieval (cs.IR)
Comments: Accepted in RecSys 2025. arXiv admin note: substantial text overlap with arXiv:2404.15954

Abstract:Recommender systems (RecSys) are essential for online platforms, providing personalized suggestions to users within a vast sea of information. Self-supervised graph learning seeks to harness high-order collaborative filtering signals through unsupervised augmentation on the user-item bipartite graph, primarily leveraging a multi-task learning framework that includes both supervised recommendation loss and self-supervised contrastive loss. However, this separate design introduces additional graph convolution processes and creates inconsistencies in gradient directions due to disparate losses, resulting in prolonged training times and sub-optimal performance. In this study, we introduce a unified framework of Supervised Graph Contrastive Learning for recommendation (SGCL) to address these issues. SGCL uniquely combines the training of recommendation and unsupervised contrastive losses into a cohesive supervised contrastive learning loss, aligning both tasks within a single optimization direction for exceptionally fast training. Extensive experiments on three real-world datasets show that SGCL outperforms state-of-the-art methods, achieving superior accuracy and efficiency.

[IR-1] Efficiently Constructing Sparse Navigable Graphs

Link: https://arxiv.org/abs/2507.13296
Authors: Alex Conway, Laxman Dhulipala, Martin Farach-Colton, Rob Johnson, Ben Landrum, Christopher Musco, Yarin Shechter, Torsten Suel, Richard Wen
Categories: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Information Retrieval (cs.IR)

Abstract:Graph-based nearest neighbor search methods have seen a surge of popularity in recent years, offering state-of-the-art performance across a wide variety of applications. Central to these methods is the task of constructing a sparse navigable search graph for a given dataset endowed with a distance function. Unfortunately, doing so is computationally expensive, so heuristics are universally used in practice. In this work, we initiate the study of fast algorithms with provable guarantees for search graph construction. For a dataset with n data points, the problem of constructing an optimally sparse navigable graph can be framed as n separate but highly correlated minimum set cover instances. This yields a naive O(n^3) time greedy algorithm that returns a navigable graph whose sparsity is at most O(\log n) higher than optimal. We improve significantly on this baseline, taking advantage of correlation between the set cover instances to leverage techniques from streaming and sublinear-time set cover algorithms. Combined with problem-specific pre-processing techniques, we present an \tilde{O}(n^2) time algorithm for constructing an O(\log n)-approximate sparsest navigable graph under any distance function. The runtime of our method is optimal up to logarithmic factors under the Strong Exponential Time Hypothesis via a reduction from Monochromatic Closest Pair. Moreover, we prove that, as with general set cover, obtaining better than an O(\log n)-approximation is NP-hard, despite the significant additional structure present in the navigable graph problem. Finally, we show that our techniques can also beat cubic time for the closely related and practically important problems of constructing \alpha-shortcut reachable and \tau-monotonic graphs, which are also used for nearest neighbor search. For such graphs, we obtain \tilde{O}(n^{2.5}) time or better algorithms.
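
The set-cover framing in this abstract can be illustrated with a small sketch of the naive greedy baseline, not the paper's faster algorithm. It assumes the usual definition of navigability: for every source s and target t, some out-neighbor u of s is strictly closer to t than s is. For a fixed s, each candidate u then "covers" the targets it makes progress towards, and greedy set cover picks out-neighbors until every target is covered. The function names, the Euclidean distance, and the requirement that points be distinct are assumptions made for this example, and the code makes no attempt to match the running times quoted above.

```python
import math
import random


def greedy_navigable_graph(points):
    """Pick out-neighbors for each source by greedy set cover over targets."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    out_neighbors = []
    for s in range(n):
        # Candidate u covers target t when stepping s -> u gets strictly closer to t.
        covers = {
            u: {t for t in range(n) if t != s and dist[u][t] < dist[s][t]}
            for u in range(n) if u != s
        }
        uncovered = set(range(n)) - {s}
        chosen = []
        while uncovered:
            best = max(covers, key=lambda u: len(covers[u] & uncovered))
            gained = covers[best] & uncovered
            if not gained:
                raise ValueError("duplicate points: no neighbor makes progress")
            chosen.append(best)
            uncovered -= gained
        out_neighbors.append(chosen)
    return out_neighbors, dist


def is_navigable(out_neighbors, dist):
    """Check the navigability condition for every ordered pair (s, t)."""
    n = len(out_neighbors)
    return all(
        any(dist[u][t] < dist[s][t] for u in out_neighbors[s])
        for s in range(n) for t in range(n) if s != t
    )


def greedy_route(out_neighbors, dist, source, target):
    """Greedy search: always hop to the out-neighbor closest to the target."""
    path = [source]
    while path[-1] != target:
        current = path[-1]
        nxt = min(out_neighbors[current], key=lambda u: dist[u][target])
        if dist[nxt][target] >= dist[current][target]:
            raise RuntimeError("stuck: graph is not navigable")
        path.append(nxt)
    return path


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(25)]  # toy 2-D dataset
graph, dist = greedy_navigable_graph(points)
route = greedy_route(graph, dist, source=0, target=24)
```

Because every hop strictly decreases the distance to the target, greedy routing on a navigable graph always terminates at the target, which is what makes sparsity the only remaining quantity to optimise.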

[IR-2] Machine-Readable Ads: Accessibility and Trust Patterns for AI Web Agents interacting with Online Advertisements

Link: https://arxiv.org/abs/2507.12844
Authors: Joel Nitu, Heidrun Mühle, Andreas Stöckl
Categories: Information Retrieval (cs.IR)

Abstract:Autonomous multimodal language models are rapidly evolving into web agents that can browse, click, and purchase items on behalf of users, posing a threat to display advertising designed for human eyes. Yet little is known about how these agents interact with ads or which design principles ensure reliable engagement. To address this, we ran a controlled experiment using a faithful clone of the news site this http URL, seeded with diverse ads: static banners, GIFs, carousels, videos, cookie dialogues, and paywalls. We ran 300 initial trials plus follow-ups using the Document Object Model (DOM)-centric Browser Use framework with GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and the pixel-based OpenAI Operator, across 10 realistic user tasks. Our results show these agents display severe satisficing: they never scroll beyond two viewports and ignore purely visual calls to action, clicking banners only when semantic button overlays or off-screen text labels are present. Critically, when sweepstake participation required a purchase, GPT-4o and Claude 3.7 Sonnet subscribed in 100% of trials, and Gemini 2.0 Flash in 70%, revealing gaps in cost-benefit analysis. We identified five actionable design principles (semantic overlays, hidden labels, top-left placement, static frames, and dialogue replacement) that make human-centric creatives machine-detectable without harming user experience. We also evaluated agent trustworthiness through “behavior patterns” such as cookie consent handling and subscription choices, highlighting model-specific risk boundaries and the urgent need for robust trust evaluation frameworks in real-world advertising.

